“In my opinion, Nature should not have published this paper from Google at all, because it violates the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. … Google decided not to share the data used to train the model, or even the model's results. The only data shared were the stable crystals the model ultimately identified, which makes the work difficult to reproduce. … I think it is important for companies like Google to participate in the scientific process, but they must also adhere to the same rigorous standards. By any standard, work that cannot be verified cannot be considered science.”

Shyue Ping Ong, Professor at UC San Diego and founder of the Materials Project

By Liu Miao and Meng Sheng (Institute of Physics, Chinese Academy of Sciences / Songshan Lake Materials Laboratory)

Giants focus on "AI + Materials Science"

At the end of November 2023, Google DeepMind published a major paper in Nature announcing GNoME (Graph Networks for Materials Exploration), a graph-network model for materials discovery trained with active learning. Combining this model with high-throughput first-principles calculations, they identified more than 380,000 thermodynamically stable crystalline materials, a haul DeepMind likened to "adding 800 years of intellectual accumulation to mankind", greatly accelerating the discovery of new materials (Figure 1). [1]

Figure 1. Google DeepMind released the GNoME dataset and model in Nature.

In December 2023, a few days after Google published GNoME, Microsoft released MatterGen, a generative AI model for materials science that can propose new material structures on demand, conditioned on desired material properties.
Microsoft's president took to social media to promote the model, commenting: "The MatterGen model we developed can greatly improve the efficiency of on-demand research and development of new materials" (Figure 2). [2]

Figure 2. Microsoft's president comments on the company's AI material generation model.

In January 2024, Microsoft collaborated with the Pacific Northwest National Laboratory (PNNL) of the U.S. Department of Energy, using artificial intelligence and high-performance computing to screen an all-solid-state electrolyte candidate from 32 million inorganic materials, completing a closed loop from prediction to experiment. This technology can aid the development of next-generation battery materials (Figure 3). [3]

Figure 3. Microsoft scientists screened all-solid-state electrolyte materials from 32 million inorganic materials and verified them experimentally.

Materials science is undoubtedly an important science and a pillar of modern industry. From the Stone Age to the Bronze Age to the Iron Age, each stage of human civilization has been closely tied to materials. Ceramics contributed greatly to the prosperity of Chinese civilization; glass enabled the invention of optical instruments and laid the foundation for progress in cell biology and astronomy. It can be said that the history of human civilization is a history of the evolution of materials. Recently, artificial intelligence has advanced rapidly, and introducing AI methods into scientific research has become an important interdisciplinary direction. Besides Google and Microsoft, Meta and ByteDance have also recently invested in similar research and development.
Meta AI has cooperated with American universities to build the Open Catalyst Project, an industry-leading catalytic materials dataset, and OpenDAC, a dataset of adsorption in metal-organic frameworks. For a time, the technology giants have stirred up materials science with their respective technologies; inorganic materials science has become their new arena.

Detailed interpretation of the GNoME materials science dataset

How does artificial intelligence change materials research and development? The technology giants have converged on the same technical route: (1) obtain materials science data through theoretical calculations; (2) produce such data at massive scale through high-throughput computing; (3) feed the data into artificial intelligence models; (4) use the models to infer the properties of unknown materials. This convergence suggests the route is effective and has broad prospects. Will artificial intelligence change the way materials are developed in the future? The answer is yes, and data, algorithms, and computing power will be the core drivers of that change. Amid the overwhelming news and publicity, let us take the dataset released by Google as a starting point and examine its content and logic in detail.

1. Following the biopharmaceutical industry, materials science is the next field for artificial intelligence to enter.

A few years ago, AI stirred up biology and pharmaceuticals. Software and models from companies such as Schrödinger and Atomwise in the United States showed the pharmaceutical industry new opportunities, and screening candidate drug molecules at the atomic scale became an important part of major pharmaceutical companies' R&D pipelines. However, drug development cycles are long, costs are high, and approval procedures are strict, so some AI pharmaceutical companies have turned to materials science; Schrödinger, for example, has established a materials science division.
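The four-step route above (compute, scale up, train, infer) can be sketched as a toy active-learning loop. Everything here is an illustrative assumption rather than the GNoME implementation: `fake_dft_energy` stands in for a real first-principles calculation, candidates are one-dimensional, and the surrogate is a trivial nearest-neighbor predictor.

```python
import random

def fake_dft_energy(x):
    """Pretend high-throughput DFT: 'energy' of a 1-D candidate (min near 0.3)."""
    return (x - 0.3) ** 2

def surrogate_predict(train, x):
    """Minimal surrogate model: predict from the closest labeled point."""
    xi, yi = min(train, key=lambda p: abs(p[0] - x))
    return yi

def active_learning(candidates, n_rounds=5, batch=3):
    random.seed(0)
    # Steps 1-2: label an initial random batch by (simulated) DFT.
    train = [(x, fake_dft_energy(x)) for x in random.sample(candidates, batch)]
    for _ in range(n_rounds):
        labeled = {x for x, _ in train}
        pool = [x for x in candidates if x not in labeled]
        # Step 4: rank unknown candidates with the surrogate...
        pool.sort(key=lambda x: surrogate_predict(train, x))
        # ...then send the most promising ones back to "DFT" (steps 1-2 again).
        train += [(x, fake_dft_energy(x)) for x in pool[:batch]]
    best = min(train, key=lambda p: p[1])
    return best, train

candidates = [i / 100 for i in range(100)]
(best_x, best_e), train = active_learning(candidates)
```

The loop alternates between model-guided selection and expensive labeling, which is the essence of how GNoME's training data grew; a real pipeline would replace the toy pieces with crystal structures, a graph network, and actual DFT jobs.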
In essence, whether in biomedicine or materials science, the logic behind AI empowerment is the same: use artificial intelligence to build solvers and simulators for the interactions between atoms. The technology giants have realized that materials science and pharmaceuticals share this underlying logic. Everything is ready except the "data". Data is the booster for the takeoff of artificial intelligence: the size and quality of a dataset directly determine a model's predictive ability. Thanks to materials genome engineering and several materials science databases, the field now has high-quality data resources, and the preconditions for the rise of AI are in place.

2. Datasets are the foundation of the artificial intelligence edifice.

Artificial intelligence is highly dependent on data. The coverage of a dataset determines a model's generalization ability, and the consistency and comparability of the data determine its prediction accuracy. Among the three elements of AI (data, algorithms, and computing power), data carries the highest technical barriers. For example, for large language models such as GPT-3.5 and Llama 2, even when model code or weights are released, the training datasets are not published. Without excellent datasets, it is difficult for competitors to train excellent models. Algorithms, by contrast, have gradually lost their role as a technical barrier, and the chance of leading the industry on algorithms alone is slim.

3. Theoretical calculations have contributed greatly to materials science databases.
After decades of development, density functional theory (DFT) has matured and can produce highly standardized datasets in a short time. DFT efficiently computes the properties of a compound by approximately solving the quantum-mechanical equations for the electrons in the system, thereby linking the spatial arrangement of atoms to the compound's physical properties. By running hundreds or thousands of computing jobs simultaneously, massive datasets can be produced. The most widely used datasets in materials science today, such as the Materials Project [4] and OQMD [5], are all built on high-throughput DFT calculations. The GNoME dataset shows that Google has mastered this data production capability for materials science. With current materials research and development techniques, similar coverage and consistency could not be reached within years by accumulating experimental data alone.

4. Google's paper includes two parts, the GNoME model code and the dataset, and the dataset has very high coverage and accuracy.

The GNoME dataset is derived from the Materials Project and uses the same computational standards and workflows, so it can be used in conjunction with the Materials Project [4]. Google states that it has produced computational data for 2.2 million inorganic materials through high-throughput DFT, while continuously proposing new thermodynamically stable materials through active learning, ultimately identifying 380,000 stable inorganic compounds. This is undoubtedly a huge boost to the field of materials science.

5.
Although the GNoME dataset held by Google is very large, covering 2.2 million inorganic materials, the information published with the paper contains only a small part of it: the structures and thermodynamic stability of 380,000 inorganic compounds, plus the model code. Google has not disclosed the model parameters, so third parties cannot run the model's inference out of the box. Nor has Google released enough data for outsiders to train an effective model on it; Google thus remains the exclusive holder of the GNoME model. In future efforts to build large AI models, data is the moat: by not open-sourcing the complete data, Google secures its leading position in the industry. Even for the 380,000 disclosed compounds, key information such as formation energies has been withheld, so competitors cannot train effective models from the published data alone.

Data generation is the most time- and labor-intensive step, yet very few institutions, organizations, or researchers in the field dare to take up the challenge of producing standardized data. Most hope for a "free ride": they are full of expectations for data sharing while avoiding the data production problem themselves. A popular idea for solving this is data submission, which "stitches" isolated datasets together into a "grand unified" dataset. This is, without doubt, a way of integrating other people's data; it was highly praised several years ago, but no successful case has been seen so far. Some special projects of China's Ministry of Science and Technology, for example, have similar data submission mechanisms.
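Formation energy, mentioned above as withheld information, is the basic ingredient of stability screening. A minimal sketch of the criterion follows; the element reference energies and total energies are made-up numbers, and real workflows (Materials Project, GNoME) use the energy above the convex hull rather than this simplification.

```python
# Simplified stability screen: flag a compound as a candidate if its
# formation energy per atom is negative, i.e. it is lower in energy than
# the weighted sum of its elemental references. All numbers are illustrative.

element_energy = {"Li": -1.9, "O": -4.9, "Fe": -8.5}  # eV/atom, made up

def formation_energy_per_atom(total_energy, composition):
    """composition: {element: atom count}; total_energy: eV for the cell."""
    n_atoms = sum(composition.values())
    reference = sum(element_energy[el] * n for el, n in composition.items())
    return (total_energy - reference) / n_atoms

# Hypothetical DFT total energies for two candidate compounds:
candidates = {
    "Li2O":  ({"Li": 2, "O": 1}, -14.3),
    "LiFe3": ({"Li": 1, "Fe": 3}, -27.0),
}

stable = {name for name, (comp, e) in candidates.items()
          if formation_energy_per_atom(e, comp) < 0}
```

With these toy numbers only "Li2O" passes the screen; the point is the bookkeeping (compound energy versus elemental references), not the values themselves.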
The tech giants, undoubtedly, are clear-headed: they know they must take up the challenge and produce data themselves, and they are unlikely to disclose these valuable datasets fully and generously. This is reasonable, because the data may carry enormous commercial value. From another perspective, the long-term social benefits of open source and data submission may not be entirely positive.

6. The phase space of inorganic materials is huge, and humans have discovered only a small part of it.

The authors of this article analyzed the structural information of the 380,000 compounds in the paper and found that the element combinations of 30,345 materials (for example, "Zr-Ti-Se" and "Ni-Te") can be found in the Materials Project, accounting for 7.8%. In other words, within the chemical space already familiar to humans, Google found 30,345 thermodynamically stable materials; the great majority (92.2%) of the stable materials come from element combinations humans have not yet explored (for example, "Rh-Ac" and "Zn-Cs"). This implies that the unexplored chemical space still holds many undiscovered stable compounds, and the materials known to humans may be only the tip of the iceberg. However, most compounds in this unexplored space contain low-abundance elements, so the practical value of such materials is questionable. (Figure 4)

Figure 4. Detailed analysis of Google's GNoME dataset. The GNoME dataset claims 384,781 thermodynamically stable inorganic materials, dominated by ternary, quaternary, and quinary compounds. Most come from element combinations humans rarely touch, and most are metallic compounds.

7. The GNoME model samples a wider chemical space.

The dataset covers a wider structural space and chemical space, making it a more "broad-spectrum" dataset, which is very beneficial for the AI models built on it.
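The overlap analysis in point 6 can be sketched as a set comparison: reduce each compound to its sorted element combination (its "chemical system") and check which systems already appear in a reference set. The data below are illustrative toy entries, not the GNoME or Materials Project records.

```python
# Reduce compounds to chemical systems and measure overlap with a known set.

def chemical_system(elements):
    """Canonical system label, e.g. ('Zr','Ti','Se') -> 'Se-Ti-Zr'."""
    return "-".join(sorted(set(elements)))

# Toy stand-in for systems already present in a reference database:
known_systems = {chemical_system(s) for s in [
    ("Ti", "Se", "Zr"), ("Ni", "Te"), ("Li", "Fe", "O"),
]}

# Toy stand-in for newly predicted compounds:
new_compounds = [
    ("Zr", "Ti", "Se"),   # known system, new stable phase
    ("Rh", "Ac"),         # barely explored system
    ("Zn", "Cs"),         # barely explored system
    ("Ni", "Te"),         # known system
]

in_known = [c for c in new_compounds if chemical_system(c) in known_systems]
fraction_known = len(in_known) / len(new_compounds)
```

The article's 7.8% figure is exactly this kind of fraction, computed over 384,781 GNoME entries against the Materials Project's chemical systems.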
The essence of AI modeling is a kind of "averaging": AI is better at interpolation between data points than at extrapolation beyond them. When people measure the quality of an AI model, the usual metric is prediction accuracy; generalization ability is rarely mentioned, partly because it is difficult to quantify. Improving generalization requires larger and more widely sampled datasets. Compared with the AI models commonly developed on Materials Project data (such as CHGNet [6] and M3GNet [7]), the GNoME model rests on a "higher-grade" dataset foundation and is bound to have distinctive generalization ability.

8. The GNoME dataset is seriously biased, with metal materials accounting for more than 60%.

It is normal for alloy systems to harbor many unknown stable structures, because metal atoms readily combine through metallic bonding, lowering the energy of the system; this is a very common phenomenon. In real materials, however, these metal elements are likely to form alloy phases with randomly distributed atoms rather than the ordered intermetallic compounds in the GNoME dataset, so the latter are unlikely to be synthesized. (Figure 4 & Figure 5) In reality, if you randomly mix a few metal elements, you will most likely obtain a thermodynamically stable alloy; but does this count as discovering a new material? If so, alloy researchers discover thousands of new materials every day. For AI model training, though, these data are still of great value.

Figure 5. Statistics of element occurrence probabilities in the GNoME dataset (a) and the Materials Project (b). The GNoME data mainly explore low-abundance elements, a chemical space humans rarely visit, while the Materials Project covers more common chemical space.

9.
The element occurrence statistics of the GNoME dataset differ greatly from those of the Materials Project. Ionic compounds are scarce in GNoME, while metallic elements, especially low-abundance ones, appear frequently: Ho, Tb, Rh, Er, and the like occur often, whereas common elements such as O, P, and S occur less frequently. This further indicates that GNoME's sampling space is biased. (Figure 5)
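The element-occurrence statistic behind Figure 5 amounts to counting, over all formulas, how many compounds each element appears in. A minimal sketch with made-up formulas (the simple regex handles plain symbols like "Ho2Rh3", not parenthesized formulas):

```python
import re
from collections import Counter

def elements_of(formula):
    """Extract element symbols from a simple formula like 'Ho2Rh3'."""
    return re.findall(r"[A-Z][a-z]?", formula)

# Illustrative formulas, not entries from the GNoME dataset:
formulas = ["Ho2Rh3", "TbRh", "Er3Rh2", "Li2O", "FeS2"]

# Count each element once per compound it occurs in.
occurrence = Counter(el for f in formulas for el in set(elements_of(f)))
```

Normalizing these counts by the dataset size gives the per-element occurrence probabilities compared between GNoME and the Materials Project in the figure.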
10. In the GNoME dataset, besides the large share of multi-metallic compounds, doped structures also account for a large share, and such structures are difficult to synthesize in a controlled way. Figure 6 shows the atomic proportion of the least-abundant element in each compound; it can be seen that some of these thermodynamically stable compounds contain one element only at a very small fraction, i.e., they are effectively dilute doping structures.
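The "proportion of the smallest element" metric in Figure 6 is simply the atomic fraction of the least-abundant element in a composition; a tiny value flags a dopant-like structure. A minimal sketch with illustrative compositions (not taken from the GNoME data):

```python
def min_element_fraction(composition):
    """composition: {element: atom count} -> atomic fraction of rarest element."""
    total = sum(composition.values())
    return min(composition.values()) / total

bulk = {"Li": 2, "O": 1}     # ordinary compound: smallest fraction is 1/3
doped = {"Zn": 31, "Cs": 1}  # Cs is a dilute dopant here: fraction is 1/32
```

Sorting a dataset by this quantity separates ordinary stoichiometric compounds from dilute doping structures of the kind discussed above.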
11. All the advanced algorithms of vision and language models will find their place in materials science. Reinforcement learning, attention mechanisms, diffusion models, pre-trained models, multimodal techniques, generative algorithms, model alignment mechanisms, vector databases, and so on will sooner or later be introduced into materials science and yield corresponding tools.

The future is long, but full of hope

Google's GNoME dataset is a spark in the "AI + Materials Science" revolution. Although many details of the dataset have not been released, it shows beyond doubt that the chemical space humans have yet to explore still holds many unknown new materials. The release opens up many possibilities for the field: researchers around the world can explore these materials further and may use the data to build more AI applications and discover more new materials. It is not just a dataset; it is a roadmap pointing to countless innovations that could reshape the world. In the tide of "AI + Materials Science", data is paramount. Producing datasets, especially those that support an entire industry, may be a thankless task, but it is an unavoidable "tough battle".

Note: A condensed English version of this article was published in Materials Futures on February 28, 2024. DOI: 10.1088/2752-5724/ad2e0c URL: https://iopscience.iop.org/article/10.1088/2752-5724/ad2e0c

References
[1] A. Merchant, S. Batzner, S. S. Schoenholz, M. Aykol, G. Cheon, and E. D. Cubuk, "Scaling deep learning for materials discovery," Nature, vol. 624, no. 7990, pp. 80–85, Dec. 2023, doi: 10.1038/s41586-023-06735-9.
[2] C. Zeni et al., "MatterGen: a generative model for inorganic materials design," Dec. 2023, doi: 10.48550/arXiv.2312.03687.
[3] C. Chen et al., "Accelerating computational materials discovery with artificial intelligence and cloud high-performance computing: from large-scale screening to experimental validation," Jan. 2024. [Online]. Available: http://arxiv.org/abs/2401.04070
[4] A. Jain et al., "Commentary: The Materials Project: A materials genome approach to accelerating materials innovation," APL Materials, vol. 1, no. 1, 2013, doi: 10.1063/1.4812323.
[5] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, "Materials design and discovery with high-throughput density functional theory: The Open Quantum Materials Database (OQMD)," JOM, vol. 65, no. 11, pp. 1501–1509, Nov. 2013, doi: 10.1007/s11837-013-0755-4.
[6] B. Deng et al., "CHGNet as a pretrained universal neural network potential for charge-informed atomistic modeling," Nat. Mach. Intell., vol. 5, no. 9, pp. 1031–1041, Sep. 2023, doi: 10.1038/s42256-023-00716-3.
[7] C. Chen and S. P. Ong, "A universal graph deep learning interatomic potential for the periodic table," Nat. Comput. Sci., vol. 2, no. 11, pp. 718–728, Nov. 2022, doi: 10.1038/s43588-022-00349-3.

This article is supported by the Science Popularization China Starry Sky Project. Produced by the Department of Science Popularization, China Association for Science and Technology; presented by China Science and Technology Press Co., Ltd. and Beijing Zhongke Xinghe Culture Media Co., Ltd.
Copyright statement: Personal forwarding is welcome. No media or organization may reprint or excerpt without authorization. For reprint authorization, contact the "Fanpu" WeChat public account.