Another huge step forward! OpenAI o1 is here. How does it solve complex problems?

Produced by: Science Popularization China

Author: Wang Chen (PhD Candidate at Institute of Computing Technology, Chinese Academy of Sciences)

Producer: China Science Expo

Editor's note: To showcase the latest trends in intelligent technology, the China Science Popularization Frontier Technology Project has launched a series of articles on "Artificial Intelligence" that offer a glimpse into the latest progress in the field and respond to readers' concerns and curiosity. Let us explore together and welcome the intelligent era.

Over the past two years, OpenAI's ChatGPT has exploded in popularity around the world. Just as everyone was eagerly awaiting the release of GPT-5, in the early morning of September 13 OpenAI instead released OpenAI o1, a new reasoning model dedicated to solving complex problems.

(Image source: OpenAI official website)

How powerful is OpenAI o1? The competition results speak for themselves

Earlier this month, OpenAI CEO Sam Altman posted a photo of strawberries in his garden. Later, people familiar with the matter revealed that OpenAI would release a new AI model, internally codenamed Strawberry.

The Strawberry model's predecessor was Q*, a name that suggests a combination of two well-known artificial intelligence methods: Q-learning and A* search. Q* was reportedly so capable that researchers worried it could pose a potential threat to humanity, which is said to have been one of the key reasons behind OpenAI's earlier internal turmoil.

Photo of strawberries posted by Sam Altman

(Image source: Sam Altman's X (Twitter) account)

The newly released OpenAI o1 is that Strawberry model. Because it represents a significant advance on complex reasoning problems, OpenAI reset the version count to 1 and named the new model OpenAI o1. According to OpenAI, o1 can spend more time thinking before answering a question, much as humans do, which allows it to reason through harder problems in science, programming, and mathematics than earlier models could.

Compared with OpenAI's latest model GPT-4o, OpenAI o1 shows significant improvements on math competitions, programming competitions, and PhD-level science benchmarks, demonstrating its strength on complex reasoning tasks. It ranks in the 89th percentile on programming-competition problems (Codeforces), places among the top 500 students in the United States on the AIME qualifier for the USA Mathematical Olympiad, and exceeds human PhD-level accuracy on the GPQA benchmark of physics, biology, and chemistry questions.

Comparison between OpenAI o1 and GPT-4o in mathematics, programming, and scientific problems

(Image source: OpenAI official website)

OpenAI o1's secret weapon: reinforcement learning on chains of thought

The key to OpenAI o1's reasoning ability far exceeding GPT-4o's is that it uses reinforcement learning based on the chain of thought. Just as humans may think for a long time before answering a difficult question, OpenAI o1 produces a chain of thought when trying to solve a problem: it breaks the task down into simpler steps and solves them one by one, which is usually more accurate than having the model output an answer directly.

In fact, the chain of thought is not a new concept. Well before o1 was released, researchers had already found that chain-of-thought prompts can guide large language models to reason.

Example of chain-of-thought prompting in a large language model

(Image source: translated from Reference 2)

The figure above shows two input-output pairs for a large language model. In each input, the model is first given a worked question and answer about counting tennis balls, and is then asked a similar question about counting apples.

On the left is a direct question and answer, where the model gives an incorrect result. On the right is a question and answer using a chain of thought: the researchers expanded the tennis-ball example to show the model the reasoning behind its answer, and then asked it to work out the number of apples.

This time, the model reasoned its way to the correct number of apples. This technique of guiding a model to generate a series of intermediate reasoning steps is called a chain of thought. With a chain of thought, a large language model lays out its reasoning steps clearly and intuitively while solving a problem. This not only improves its accuracy on reasoning problems but also makes its answers more explainable, so the model is no longer a complete black box.
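As a concrete illustration, here is a minimal sketch in Python of how such a few-shot chain-of-thought prompt might be assembled. The `ask_llm` function is a hypothetical placeholder for whatever large-language-model API you happen to use, and the prompt text paraphrases the tennis-ball and apple example from Reference 2.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to a large language model API."""
    raise NotImplementedError("wire this up to an LLM of your choice")

# A worked example whose answer spells out the reasoning steps,
# not just the final number -- this is the "chain of thought".
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

# The new question is appended after the worked example, so the model
# is nudged to imitate the same step-by-step reasoning pattern.
new_question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and "
    "bought 6 more. How many apples do they have?\n"
    "A:"
)

prompt = cot_example + "\n" + new_question
print(prompt)               # inspect the assembled few-shot prompt
# answer = ask_llm(prompt)  # would return the model's step-by-step answer
```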

After the release of GPT-3, researchers discovered that this kind of prompting could be made even simpler. For a sufficiently capable large language model, you do not even need to provide a worked example like the tennis-ball one above; simply telling the model "Let's think step by step" is enough to improve its performance on complex reasoning problems.
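A minimal sketch of this zero-shot variant, again using the same hypothetical `ask_llm` placeholder as above: the only change is a single instruction appended to the question.

```python
question = (
    "The cafeteria had 23 apples. They used 20 to make lunch and "
    "bought 6 more. How many apples do they have?"
)

# Zero-shot chain of thought: no worked example at all, just an
# instruction that nudges the model into reasoning step by step.
zero_shot_cot_prompt = question + "\nLet's think step by step."

print(zero_shot_cot_prompt)
# answer = ask_llm(zero_shot_cot_prompt)  # same hypothetical call as above
```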

The approaches above all add guidance at the moment a question is asked. If chains of thought are so useful, could they instead be built into the model itself during training? This is exactly what OpenAI o1 attempts.

OpenAI o1's reinforcement learning and new Scaling Law

When answering a question, a GPT model is essentially performing text continuation: based on statistical patterns learned from a huge amount of training data, it estimates which continuation of the input text is most likely to be appropriate, one token at a time.
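The sketch below shows, in deliberately toy form, what "continuing text by probability" means: for a given context the model assigns a probability to every candidate next token, and the next token is sampled from that distribution. Real models compute these probabilities with a neural network over vocabularies of tens of thousands of tokens; here the "model" is just a hand-written table.

```python
import random

# A toy "language model": for each context, a hand-written probability
# table over possible next tokens.
toy_model = {
    "the cat sat on the": {"mat": 0.7, "sofa": 0.2, "moon": 0.1},
    "2 + 2 =": {"4": 0.9, "5": 0.05, "22": 0.05},
}

def next_token(context: str) -> str:
    """Sample the next token according to the model's probabilities."""
    probs = toy_model[context]
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(next_token("the cat sat on the"))  # most often "mat"
print(next_token("2 + 2 ="))             # most often "4"
```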

To make the large language model learn how to use chains of thought, rather than simply continuing text by probability, OpenAI o1 was trained with a machine learning method called reinforcement learning.

In reinforcement learning, the model learns by trial and error. During training it is not told what the standard answer is; instead, it receives feedback on how good its output was. When an output turns out well, the model becomes more likely to produce similar outputs in the future; when it turns out badly, the model learns to avoid it. After many rounds of trial and error, the model builds up its own set of judgment criteria from experience.
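OpenAI has not published the details of o1's reward setup; the toy sketch below only illustrates the general trial-and-error idea with the simplest possible example, a two-armed bandit whose learner sees rewards but is never shown a "standard answer".

```python
import random

# Two actions with unknown average rewards; the learner only observes
# the reward of the action it actually tries.
true_reward_prob = {"A": 0.3, "B": 0.8}

estimates = {"A": 0.0, "B": 0.0}   # learner's running value estimates
counts = {"A": 0, "B": 0}

for step in range(1000):
    # Mostly exploit the current best estimate, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(["A", "B"])
    else:
        action = max(estimates, key=estimates.get)

    # The environment returns a reward; no correct answer is ever given.
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0

    # Nudge the estimate for the tried action toward the observed reward.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # after many trials, B's estimate is clearly higher
```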

Because it does not require a standard answer, reinforcement learning is well suited to decision-making problems in complex environments, such as robot control, financial trading, and board games. In these domains we usually cannot specify the correct answer in advance; we only learn the outcome after acting, for example whether the robot falls over, whether the trade is profitable, or whether the game is won.

A famous example of reinforcement learning is AlphaGo, the Go-playing AI developed by DeepMind that defeated top professionals in 2016. In Go, the number of possible board positions exceeds the number of atoms in the observable universe, and even top players cannot be sure of the best move in every position. Because the game is so complex, the best move cannot be found by exhaustive enumeration, and before AlphaGo appeared many people believed artificial intelligence could not beat humans at Go.

AlphaGo was trained with reinforcement learning: it played games against itself and learned from the win or loss of each game. The version described in Reference 1, AlphaGo Zero, did not need humans to tell it which moves were right, nor did it study any human game records; after just a few days of self-play training it reached a level no human player could match.

When AlphaGo makes a decision, it first forms a rough judgment of the position, identifying the moves most likely to lead to a win; humans would call this feeling or intuition "board sense". Having narrowed the options down to the more promising moves, AlphaGo then searches ahead through the possible continuations of those moves and selects the best one.

AlphaGo's strength is therefore determined mainly by two factors: how well it judges positions, and how much computation it spends searching through possible moves. Reinforcement learning during training improves the first of these, the ability to judge positions.
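To make the division of labor between "judgment" and "calculation" concrete, here is a toy sketch (not AlphaGo's actual algorithm, which uses neural networks and Monte Carlo tree search): a rough value function scores positions, and a lookahead search spends extra computation to check how each candidate move actually plays out. The game here is an invented counting puzzle; deeper search costs more computation but chooses better moves.

```python
GOAL = 13          # toy puzzle: reach exactly 13
MOVES = [4, 5]     # each move adds 4 or 5

def rough_value(state: int) -> float:
    """Crude positional judgment ("board sense"): closer to the goal is better."""
    return -abs(GOAL - state)

def lookahead_value(state: int, depth: int) -> float:
    """Spend computation: search `depth` moves ahead and keep the best outcome."""
    if depth == 0 or state >= GOAL:
        return rough_value(state)
    return max(lookahead_value(state + m, depth - 1) for m in MOVES)

def choose_move(state: int, depth: int) -> int:
    """Pick the move whose looked-ahead value is highest."""
    return max(MOVES, key=lambda m: lookahead_value(state + m, depth - 1))

# With no real lookahead (depth 1) the player greedily takes the bigger
# step and overshoots to 14; with deeper search it lands exactly on 13.
for depth in (1, 4):
    state, path = 0, []
    while state < GOAL and len(path) < 10:
        m = choose_move(state, depth)
        path.append(m)
        state += m
    print(f"depth={depth}: moves={path}, final={state}")
```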

AlphaGo's self-play

(Image source: Reference 1)

During training, reinforcement learning taught OpenAI o1 to hone its chain of thought and refine the strategies it uses. The model learned to break difficult problems into simpler steps and to recognize and correct its own mistakes along the way, which greatly improved its reasoning ability.

Having internalized the chain of thought, OpenAI o1 no longer needs the user to prompt it to reason step by step. On the contrary, OpenAI recommends keeping prompts simple and direct when using o1 and avoiding explicit chain-of-thought prompts.
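For readers who want to try this themselves, here is a minimal sketch of calling the preview model with a plain, direct prompt, assuming the official `openai` Python SDK (version 1.x) and an `OPENAI_API_KEY` environment variable; model names and availability may of course change over time.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Per OpenAI's guidance for o1, the prompt is kept short and direct,
# with no "think step by step" style instructions added by the user.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "How many prime numbers are there below 100?"}
    ],
)

print(response.choices[0].message.content)
```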

In tests of OpenAI o1, researchers found that both spending more compute on reinforcement learning during training and allowing more thinking time at inference improve the model's performance, which mirrors the two factors behind AlphaGo's strength described above.

OpenAI o1's Scaling Law

(Image source: OpenAI)

In 2020, researchers at OpenAI identified scaling laws for large language models: performance improves predictably as model size, training-set size, and training compute increase.
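That 2020 finding is usually summarized as a set of power laws (this is the commonly cited form of the result, not a formula taken from this article's references): with the other factors held un-bottlenecked, the test loss L falls predictably as the parameter count N, dataset size D, or training compute C grows, with small positive exponents fit from experiments.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```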

OpenAI o1 demonstrates a new kind of scaling law: performance can also be improved by increasing the compute spent at inference time, which opens up new possibilities for the further development of large language models.

The OpenAI o1 series currently includes three models: o1, o1-preview, and o1-mini. They differ in size: o1 is the largest and has the strongest reasoning ability, while o1-mini is the smallest and the cheapest to use. Their performance on math-competition problems is shown in the figure below. o1-mini actually outperforms o1-preview on math competitions, but it does worse on tasks that require knowledge outside STEM (science, technology, engineering, and mathematics). For all three models, performance improves as the reasoning time increases.

Performance of different versions of OpenAI o1 models in math competitions

(Image source: OpenAI)

Will OpenAI o1 bring more safety issues?

The breakthrough represented by OpenAI o1 has undoubtedly pushed the capabilities of large language models further. OpenAI has proposed five stages on the road to artificial general intelligence (AGI): the first is AI that can converse with people, and the second is AI that can reason. ChatGPT achieved the goal of the first stage, and the arrival of OpenAI o1 brings us a step closer to the second.

As OpenAI o1 demonstrates such powerful reasoning, people cannot help asking, just as researchers once worried about Q*, whether it will bring new safety problems.

OpenAI's report points out that chains of thought also create new opportunities to improve model safety. During training, human values can be integrated into the model's chain of thought, allowing it to refuse to carry out harmful behavior. At the same time, the chain of thought lets us observe the model's thinking in a legible way, which further strengthens safety.

The future may be beyond imagination

At present, the preview and mini versions of OpenAI o1 are available to users, and practical features such as web browsing and file and image uploads are expected to be added later; how well it works in real-world scenarios will take further hands-on testing to judge. In short, OpenAI o1's progress in reasoning may mean we are one step closer to artificial general intelligence. Where artificial intelligence will go next, and whether it can make even greater contributions to human society, remains something to look forward to.

References:

1. Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270

2. https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
