We had the recently popular DeepSeek take on the competition problems set by the Institute of Physics, and the result was...

Recently, DeepSeek-R1, an open-source large model with deep thinking and reasoning capabilities released by the Chinese company DeepSeek, has attracted worldwide attention.

Before DeepSeek-R1, GPT-o1 from OpenAI, Claude from Anthropic, and Gemini from Google had all claimed deep thinking and reasoning capabilities, and these models have indeed performed impressively in a wide variety of tests by professionals and netizens alike.

What particularly caught our interest was that Google's dedicated model AlphaGeometry scored 28/42 at the International Mathematical Olympiad, a famously difficult competition, reaching the silver-medal level. Having taken part in mathematical olympiads ourselves as students, we know that anyone able to win a silver medal at the international level has shown considerable mathematical talent since childhood and has trained hard all along the way. It is no exaggeration to say that an AI reaching this level has strong reasoning ability. Ever since then, we have been curious about how well these powerful AIs would do at physics.

On January 17, the Institute of Physics of the Chinese Academy of Sciences held the "Tianmu Cup" theoretical physics competition in Liyang City, Jiangsu Province. Two days later, the release of DeepSeek-R1 took the AI community by storm, so it naturally became the first model we chose to test. The other models we tested were GPT-o1, released by OpenAI, and Claude-sonnet, released by Anthropic.

Here is how we tested:

1. The entire test consists of 8 rounds of dialogue (one opening statement plus 7 questions).

2. The first round of the conversation is the "opening statement": it explains the task to be completed, the format of the questions, the format for submitting answers, and so on. The AI's reply is checked manually to confirm that it has understood.

3. The 7 questions are then sent in sequence; each question is sent only after the reply to the previous one has been received, with no manual feedback in between.

4. Each question consists of two parts: a text description and a description of the figure (questions 3, 5, and 7 have no figures).

5. The figure descriptions are plain text, generated by GPT-4o and manually proofread.

6. The text material given to each large model is exactly the same (see attachment).
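For readers who would like to set up something similar, the dialogue protocol above can be sketched in a few lines of Python. This is only a rough illustration, not the script we actually used: it assumes an OpenAI-compatible chat API, and the endpoint, model name, and file names are placeholders.

```python
# A minimal sketch of the dialogue protocol above, NOT the actual test script.
# Assumes an OpenAI-compatible chat-completions endpoint; the base URL,
# model name, and file names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")
MODEL = "model-under-test"  # the model being evaluated

# Opening statement plus the 7 question texts (figure descriptions already
# merged into each question file).
opening = open("opening_statement.txt", encoding="utf-8").read()
questions = [open(f"question_{i}.txt", encoding="utf-8").read() for i in range(1, 8)]

# One continuous conversation: opening statement first, then the questions
# in order, each sent only after the previous reply, with no manual feedback.
messages = [{"role": "user", "content": opening}]
reply = client.chat.completions.create(model=MODEL, messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

for i, q in enumerate(questions, start=1):
    messages.append({"role": "user", "content": q})
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    # Save the LaTeX answer; it is later hand-fixed and compiled in Overleaf.
    with open(f"answer_{i}.tex", "w", encoding="utf-8") as f:
        f.write(answer)
```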

After the above process, we obtained 7 pieces of LaTeX source text from each large model, corresponding to the answers to the 7 questions. Here is how we marked the papers:

1. Manually adjust the LaTeX source so that it compiles in Overleaf, and collect the compiled PDF files as the answer sheets.

2. Send the answer sheets of the tested models for all 7 questions to a grading group of 7 examiners.

3. The grading group is exactly the same one that marked the "Tianmu Cup" competition, and each grader is responsible for the same questions as before. For example, Grader A marks the first question for all human and AI answers, Grader B marks the second question for all human and AI answers, and so on.

4. The grading group then totals the scores across all questions.

What are the results? See the table below.

Comments on the results:

1. DeepSeek-R1 performed the best. It got full marks on the basic questions (the first three), and also full marks on the sixth question, something no human contestant achieved. Its low score on the seventh question seems to come from failing to understand what "prove" meant in the problem statement: it merely restated the conclusion to be proved, which earns no points. Looking at its thinking process, there are intermediate steps that would have earned points, but they are not reflected in the final answer.

2. GPT-o1's total score is almost the same as DeepSeek's. Some miscalculations on the basic questions (questions 2 and 3) cost it points. Compared with DeepSeek, o1's answers are closer in style to a human's, so it scored slightly higher on the last question, which is mainly a proof.

3. Claude-sonnet can be said to have "stumbled at the start". It made blunders on the first two questions and scored 0 on them, but its subsequent performance was very close to o1's, losing points in similar places.

4. Comparing the AI scores with the human contestants', DeepSeek-R1 would place in the top three (winning a special award), though still well below the highest human score of 125 points; GPT-o1 would place in the top five (also a special award), and Claude-sonnet in the top ten (an excellence award).

Finally, some subjective impressions from marking the papers. First of all, the AIs' reasoning is genuinely good: there was essentially no question they could not find an approach to, and in many cases they hit on the correct approach right away. But unlike humans, once they have the right approach they then make some very elementary mistakes. For example, looking at R1's thinking process on the seventh question, it realized early on that it should use normal coordinates. Almost every candidate who got that far then found the correct normal coordinates (just a simple matrix diagonalization), but R1 seemed to keep guessing and trying by trial and error, and in the end it never obtained the expression for the normal coordinates.
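As an aside, "finding the normal coordinates" of a system of coupled oscillators really is just a matrix diagonalization. Below is a generic two-mass illustration, not the competition problem itself; the mass and spring constants are arbitrary.

```python
# Generic illustration: normal modes of two equal masses coupled by springs,
# with equations of motion  m * x'' = -K x.  (Not the competition problem.)
import numpy as np

m, k, kc = 1.0, 4.0, 1.0              # mass, wall springs, coupling spring (arbitrary)
K = np.array([[k + kc, -kc],
              [-kc,     k + kc]])     # stiffness matrix

# Eigenvalues of K/m are the squared normal-mode frequencies omega^2;
# the eigenvectors define the normal coordinates.
omega_sq, vecs = np.linalg.eigh(K / m)
print("normal-mode frequencies:", np.sqrt(omega_sq))  # sqrt(k/m) and sqrt((k+2*kc)/m)
print("normal coordinates (as columns):")
print(vecs)   # ~ (x1 + x2)/sqrt(2) and (x1 - x2)/sqrt(2)
```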

Another thing is that none of the AIs seem to understand what a "rigorous" proof actually means. They seem to think that producing something with the form of an answer counts as a proof. And AIs, like humans, make many "accidental" mistakes: before the official unified test we tried the questions privately several times, and Claude-sonnet often answered the first question correctly, yet it got it wrong in the official run. For rigor we should probably test each question multiple times and take the average, but that is honestly a bit of a hassle...

Planning and production

Source: Institute of Physics, Chinese Academy of Sciences (id: cas-iop)

Editor: Yang Yaping

Proofread by Xu Lai and Lin Lin

The cover image of this article comes from the copyright library. Reprinting and using it may cause copyright disputes
