AI Trained on Old Scientific Papers Makes Discoveries Humans Missed

Scientists used machine learning to reveal new scientific knowledge hidden in old research papers.

By Madeleine Gregory | 10 July 2019, 4:25am


Using just the language in millions of old scientific papers, a machine learning algorithm was able to make completely new scientific discoveries.

In a study published in Nature on July 3, researchers from the Lawrence Berkeley National Laboratory used an algorithm called Word2vec to sift through scientific papers for connections humans had missed. Their algorithm then spit out predictions for possible thermoelectric materials, which convert heat to electricity and are used in many heating and cooling applications.

The algorithm didn’t know the definition of thermoelectric, though. It received no training in materials science. Using only word associations, the algorithm was able to provide candidates for future thermoelectric materials, some of which may be better than those we currently use.

“It can read any paper on material science, so can make connections that no scientists could,” researcher Anubhav Jain said. “Sometimes it does what a researcher would do; other times it makes these cross-discipline associations.”

To train the algorithm, the researchers assessed the language in 3.3 million abstracts related to materials science, ending up with a vocabulary of about 500,000 words. They fed the abstracts to Word2vec, which used machine learning to analyze relationships between words.

“The way that this Word2vec algorithm works is that you train a neural network model to remove each word and predict what the words next to it will be,” Jain said. “By training a neural network on a word, you get representations of words that can actually confer knowledge.”
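Jain's description resembles the skip-gram setup of Word2vec, in which each word is paired with its neighbors to form training examples. The article does not say which Word2vec variant or window size the researchers used, so the sketch below is only an illustration: the function name `skipgram_pairs`, the `window` parameter, and the toy sentence are assumptions, not details from the study.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each word and its neighbors.

    `window` is the number of positions to look at on each side of the
    center word. This only builds the training pairs; it does not train
    a neural network.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, not taken from the study's corpus.
tokens = "bismuth telluride is a thermoelectric material".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A network trained to predict the context word from the center word ends up with one vector per vocabulary word, and words that appear in similar contexts tend to get similar vectors.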

Using just the words found in scientific abstracts, the algorithm was able to understand concepts such as the periodic table and the chemical structure of molecules. The algorithm linked words that were found close together, creating vectors of related words that helped define concepts. In some cases, words were linked to thermoelectric concepts but had never been written about as thermoelectric in any abstract the researchers surveyed. This gap in knowledge is hard to catch with a human eye, but easy for an algorithm to spot.
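One common way to turn word vectors into a list of candidates is to rank words by cosine similarity to a target word and drop the ones already known. The sketch below assumes that approach; the article does not spell out the exact scoring the researchers used. The three-dimensional vectors and the names `material_a`, `material_b`, and `material_c` are made up for illustration and are not values from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(embeddings, target, candidates, known):
    """Rank candidates by similarity to `target`, skipping known ones."""
    scored = [
        (name, cosine(embeddings[name], embeddings[target]))
        for name in candidates
        if name not in known
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up toy vectors; real Word2vec vectors have far more dimensions.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.3],
    "material_a": [0.8, 0.2, 0.4],
    "material_b": [0.1, 0.9, 0.2],
    "material_c": [0.9, 0.1, 0.2],
}

# Hypothetical: material_c is already described as thermoelectric in the
# abstracts, so it is not a new prediction.
known = {"material_c"}

ranking = rank_candidates(
    embeddings, "thermoelectric",
    ["material_a", "material_b", "material_c"], known,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here `material_a` ranks first because its vector points in nearly the same direction as the vector for "thermoelectric", even though the two words are never assumed to appear together.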

After showing the algorithm's capacity to predict future materials, the researchers took their work back in time, virtually. They scrapped recent data and tested the algorithm on old papers to see whether it could predict scientific discoveries before they happened. Once again, the algorithm worked.

In one experiment, researchers analyzed only papers published before 2009 and were able to predict one of the best modern-day thermoelectric materials four years before it was discovered in 2012.
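A retrospective test of this kind can be sketched as a split by publication year: train only on abstracts from before a cutoff, then check which predictions show up in later papers. The record fields (`year`, `text`), the helper names, and the toy abstracts below are hypothetical and stand in for whatever data format the researchers actually used.

```python
def split_by_year(abstracts, cutoff_year):
    """Split records into those before the cutoff and those from it onward."""
    train = [a for a in abstracts if a["year"] < cutoff_year]
    held_out = [a for a in abstracts if a["year"] >= cutoff_year]
    return train, held_out


def confirmed_later(predictions, held_out, keyword="thermoelectric"):
    """Return predictions that co-occur with `keyword` in a held-out abstract."""
    hits = []
    for name in predictions:
        for record in held_out:
            words = set(record["text"].lower().split())
            if name in words and keyword in words:
                hits.append(name)
                break  # one confirming abstract is enough
    return hits


# Hypothetical toy records, not real abstracts.
abstracts = [
    {"year": 2005, "text": "material_a has an unusual band structure"},
    {"year": 2008, "text": "material_b is a stable oxide"},
    {"year": 2012, "text": "material_a shows strong thermoelectric performance"},
]

train, held_out = split_by_year(abstracts, cutoff_year=2009)

# Pretend a model trained only on `train` produced these predictions.
predictions = ["material_a", "material_b"]
print(confirmed_later(predictions, held_out))
```

Keeping the later abstracts out of training is what makes the check meaningful: the model cannot have seen the discovery it is being asked to predict.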

This new application of machine learning goes beyond materials science. Because it’s not trained on a specific scientific dataset, you could easily apply it to other disciplines, retraining it on literature of whatever subject you wanted. Vahe Tshitoyan, the lead author on the study, says other researchers have already reached out, wanting to learn more. 

“This algorithm is unsupervised and it builds its own connections,” Tshitoyan said. “You could use this for things like medical research or drug discovery. The information is out there. We just haven’t made these connections yet because you can’t read every article.”

This article originally appeared on VICE US.

Source: VICE US

Judith Chao Andrade

Passionate about knowledge, about sharing it, and about learning from everything around me, I enjoy learning and taking part in activities. I am currently learning programming, but I am fascinated by topics related to special materials, curiosities, humor, events, social networks ... My greatest interest, I would say, is never losing my curiosity, so if you have a plan in mind, just suggest it!
