Recently, DeepSeek-R1, an open-source large model with deep thinking and reasoning capabilities released by the Chinese company DeepSeek, has attracted worldwide attention. Before DeepSeek-R1, OpenAI's GPT-o1, Anthropic's Claude, and Google's Gemini all claimed deep thinking and reasoning capabilities, and these models have indeed performed impressively in a variety of tests by professionals and netizens. What particularly caught our interest was that Google DeepMind's dedicated math systems, AlphaProof and AlphaGeometry 2, scored 28/42 on International Mathematical Olympiad problems, a silver-medal-level result in a competition widely recognized as very difficult. Having been exposed to Olympiad mathematics as students ourselves, we know that anyone who wins a silver medal at that level has shown considerable mathematical talent from an early age and trained hard for years. It is no exaggeration to say that an AI reaching this level has strong reasoning ability. Since then, we have been curious how well these powerful AIs would do at physics.

On January 17, the Institute of Physics of the Chinese Academy of Sciences held the "Tianmu Cup" theoretical physics competition in Liyang, Jiangsu Province. Two days later, the release of DeepSeek-R1 shook the AI community, so it naturally became the first model we chose to test. The other models we tested were GPT-o1 from OpenAI and Claude-sonnet from Anthropic.

Here is how we tested:

1. The test for each model consists of 8 dialogue turns.
2. The first turn is an "opening statement" explaining the task, the question format, the required answer format, and so on. We manually confirm from the AI's reply that it has understood.
3. The 7 questions are then sent one by one; each question is sent only after the reply to the previous one is received, with no human feedback in between.
4. Each question consists of two parts: a text description and a description of the accompanying figure (questions 3, 5, and 7 have no figures).
5. The figure descriptions are plain text, generated by GPT-4o and manually proofread.
6. Every model receives exactly the same text materials (see attachment).

After this process we obtained, from each model, 7 pieces of TeX text corresponding to the answers to the 7 questions. Here is how we graded:

1. Manually adjust the TeX so that it compiles in Overleaf, and collect the compiled PDF files as the answer sheets.
2. Send the models' answers to the 7 questions to a grading group of 7 examiners.
3. The grading group is exactly the same as for the "Tianmu Cup" competition itself, and each grader marks the same question across all papers. For example, grader A marks question 1 for every human and AI answer, grader B marks question 2, and so on.
4. The grading group then totals the scores for all questions.

What are the results? See the table below.

[Results table]

Comments:

1. DeepSeek-R1 performed the best. It got full marks on the basic questions (the first three) and full marks on the sixth question, which no human contestant achieved.
Its low score on the seventh question seems to be because it did not understand what "prove" meant in the problem statement: it merely restated the conclusion to be proved, which earns no points. Looking at its thinking process, there are steps in there that would have earned marks, but they are not reflected in the final answer.

2. GPT-o1's total score is almost the same as DeepSeek's. Some miscalculations on the basic questions (questions 2 and 3) cost it points. Compared with DeepSeek, o1's answers are closer to a human style, so it scored slightly higher on the last question, which is mostly a proof.

3. Claude-sonnet can be said to have stumbled at the start: it made blunders on the first two questions and scored 0 on both, but its subsequent performance was very close to o1's, losing points in similar places.

4. Comparing the AI scores with the human contestants, DeepSeek-R1 would place in the top three (winning a special award), though there is still a large gap to the top human score of 125 points; GPT-o1 would place in the top five (special award), and Claude-sonnet in the top ten (excellence award).

Finally, some subjective impressions from grading. First, the AIs' reasoning is genuinely good: there was essentially no question they could not get started on, and in many cases they found the correct approach immediately. But unlike humans, once they have the correct approach they then make some very elementary mistakes. For example, R1's thinking process on the seventh question shows that it realized early on that it should use normal coordinates. Almost every contestant who got that far went on to find the correct normal coordinates (it is just a simple matrix diagonalization; a short sketch of this step is given at the end of this article), but R1 seemed to guess and retry repeatedly and in the end never obtained an expression for the normal coordinates.

Second, none of the AIs seem to understand what a "rigorous" proof actually is. They appear to believe that producing something that looks like an answer already counts as a proof.

Third, AIs, like humans, make many "accidental" mistakes. For example, in private trials before the official unified test, Claude-sonnet answered the first question correctly many times, yet it got it wrong in the official run. To be rigorous we should probably run each question multiple times and take the average, but that is rather troublesome...

Planning and production
Source: Institute of Physics, Chinese Academy of Sciences (ID: cas-iop)
Editor: Yang Yaping
Proofread by: Xu Lai and Lin Lin
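A note on the "simple matrix diagonalization" mentioned above. For small oscillations with mass matrix M and stiffness matrix K, the normal-mode frequencies and normal coordinates come from the generalized eigenvalue problem K v = w^2 M v. The sketch below only illustrates that standard step on a toy two-mass, three-spring chain; the matrices and values are placeholders, not the actual system from question 7 of the competition.

import numpy as np
from scipy.linalg import eigh

# Toy system (placeholder values): two equal masses between walls,
# coupled by three identical springs.
m, k = 1.0, 1.0
M = np.diag([m, m])                  # mass (kinetic-energy) matrix
K = np.array([[2*k, -k],
              [-k, 2*k]])            # stiffness (potential-energy) matrix

# Solve the generalized eigenvalue problem K v = w^2 M v.
# eigh returns eigenvalues in ascending order and eigenvectors V
# normalized so that V.T @ M @ V is the identity.
w2, V = eigh(K, M)
omega = np.sqrt(w2)                  # normal-mode frequencies

# Normal coordinates q = V.T @ M @ x decouple the equations of motion
# into independent oscillators q_i'' = -omega_i^2 q_i.
x = np.array([0.3, -0.1])            # an arbitrary displacement vector
q = V.T @ M @ x

print("normal-mode frequencies:", omega)   # here: 1.0 and sqrt(3)
print("normal coordinates of x:", q)

Once this diagonalization is done, each normal coordinate evolves as an independent harmonic oscillator; this is the routine step that human contestants completed and that R1, despite identifying the right approach, did not.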