We had the recently popular DeepSeek take on the competition problems set by the Institute of Physics, and the result was...

Recently, DeepSeek-R1, an open-source large model with deep-thinking and reasoning capabilities released by China's DeepSeek, has attracted worldwide attention.

Before DeepSeek-R1, GPT-o1 from OpenAI, Claude from Anthropic, and Gemini from Google all claimed to have deep-thinking and reasoning capabilities, and these models have indeed performed impressively in a variety of tests by professionals and netizens.

What particularly caught our interest was that Google's dedicated model AlphaGeometry scored 28/42 at the International Mathematical Olympiad, a competition widely recognized as difficult, reaching silver-medal level. We ourselves were exposed to Olympiad mathematics as students, and we know that contestants who win silver medals at such international Olympiads have shown considerable mathematical talent since childhood and have trained hard all along the way. It is no exaggeration to say that an AI reaching this level has strong reasoning ability. Ever since, we have been curious about how well these powerful AIs would do at physics.

On January 17, the Institute of Physics of the Chinese Academy of Sciences held the "Tianmu Cup" theoretical physics competition in Liyang City, Jiangsu Province. Two days later, the release of DeepSeek-R1 set the AI community abuzz, and it naturally became the first model we chose to test. In addition, the models we tested include GPT-o1 from OpenAI and Claude-sonnet from Anthropic.

Here is how we tested:

1. The entire test consists of 8 rounds of dialogue with each model.

2. The first message in the conversation is an "opening statement": it explains the task to be completed, the format of the questions, the format for submitting answers, and so on. We manually confirm from the AI's reply that it has understood.

3. The 7 questions are then sent in sequence; each new question is sent only after the reply to the previous one is received, with no manual feedback in between (a minimal sketch of this dialogue flow is given after the list).

4. Each question consists of two parts: the problem text and a description of the accompanying figure (questions 3, 5 and 7 have no figures).

5. The figure descriptions are plain text, generated by GPT-4o and then manually proofread.

6. The text materials given to each large model are exactly the same (see attachment).
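
For concreteness, the dialogue flow above can be sketched roughly as follows. This is a minimal sketch, assuming an OpenAI-compatible chat API; the model name, file names, and directory layout are placeholder assumptions, and the actual opening statement and question texts are those in the attachment.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

# Placeholder files; the real opening statement and question texts are in the attachment.
OPENING = open("opening_statement.txt", encoding="utf-8").read()
QUESTIONS = [open(f"question_{i}.txt", encoding="utf-8").read() for i in range(1, 8)]

# Round 1: the opening statement. A human checks the reply to confirm the model
# has understood the task before any question is sent.
messages = [{"role": "user", "content": OPENING}]
reply = client.chat.completions.create(model="placeholder-model", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Rounds 2-8: the 7 questions, sent strictly in sequence with no manual feedback.
answers = []
for question in QUESTIONS:
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="placeholder-model", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the full history
    answers.append(answer)  # one block of TeX per question, graded later
```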

After the above process, we obtained, for each large model, 7 blocks of TeX text corresponding to its answers to the 7 questions. The papers were then marked as follows:

1. The TeX text is manually adjusted so that it compiles in Overleaf, and the compiled PDF files are collected as the answer sheets (a minimal sketch of the wrapping step is given after the list).

2. The answers of the 4 models to the 7 questions are sent to a grading group of 7 examiners.

3. The marking team is exactly the same as the one for the "Tianmu Cup" competition itself, and each marker grades the same question as before. For example, marker A grades the first question of all human and AI answers, marker B grades the second question, and so on.

4. The marking team then tallies the scores for all questions.
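
As a rough illustration of step 1, the TeX answers can be wrapped in a small preamble before being compiled in Overleaf. This is only a sketch under assumed file names and an assumed preamble, not the actual script used; in practice some answers also needed manual fixes by hand.

```python
from pathlib import Path

# Assumed preamble; adjust packages as needed for the actual answers.
PREAMBLE = r"""\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
"""
POSTAMBLE = "\n\\end{document}\n"

# Hypothetical layout: answers/<model>_q<i>.txt, one TeX block per question.
for answer_file in sorted(Path("answers").glob("*_q*.txt")):
    body = answer_file.read_text(encoding="utf-8")
    out = answer_file.with_suffix(".tex")
    out.write_text(PREAMBLE + body + POSTAMBLE, encoding="utf-8")
    # The resulting .tex files are compiled in Overleaf and the PDFs collected as answer sheets.
```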

What are the results? See the table below.

Comments on the results:

1. DeepSeek-R1 performed the best. It got full marks on the basic questions (the first three) and also full marks on the sixth question, something not seen among the human contestants. Its low score on the seventh question appears to come from failing to understand what "prove" means in the question stem: it merely restated the conclusion to be proved, which earns no points. Looking at its thinking process, there are steps along the way that would have scored, but they do not appear in the final answer.

2. GPT-o1's total score is almost the same as DeepSeek's. Some miscalculations on the basic questions (questions 2 and 3) cost it points. Compared with DeepSeek, o1's answers are closer to a human style, so it scored slightly higher on the last question, which is mainly a proof.

3. Claude-sonnet can be said to have "stumbled at the start": it made blunders on the first two questions and scored 0 on them, but its subsequent performance was very close to o1's, with similar deductions.

4. Comparing the AI scores with the human ones, DeepSeek-R1 would place in the top three (winning a special award), though still far behind the highest human score of 125 points; GPT-o1 would make the top five (special award), and Claude-sonnet the top ten (excellence award).

Finally, some subjective impressions from marking the papers. First, the AIs' reasoning really is good: there was essentially no question they could not get a handle on, and in many cases they found the correct approach right away. But unlike humans, once they have the correct approach they go on to make some very simple mistakes. For example, looking at R1's thinking process on the seventh question, it knew early on to use normal coordinates. Almost every human candidate who got that far went on to find the correct normal coordinates (just a simple matrix diagonalization), but R1 seemed to keep guessing by trial and error and never obtained the expression for the normal coordinates.
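
For readers unfamiliar with the step R1 got stuck on: for small oscillations, finding the normal coordinates amounts to diagonalizing the coupling matrix of the linearized equations of motion; the eigenvalues give the squared normal-mode frequencies, and the eigenvectors define the normal coordinates. The sketch below uses a simple two-mass example chosen purely for illustration, not the actual competition problem.

```python
import numpy as np

# Illustrative system: two unit masses, each attached to a wall by a spring (k)
# and to each other by a coupling spring (kc). Equations of motion: x_ddot = -K x.
k, kc = 1.0, 0.5                       # arbitrary spring constants for the example
K = np.array([[k + kc, -kc],
              [-kc, k + kc]])

# Diagonalize K: eigenvalues are the squared normal-mode frequencies,
# eigenvectors (columns) define the normal coordinates.
omega2, vecs = np.linalg.eigh(K)
print("normal-mode frequencies:", np.sqrt(omega2))

# Normal coordinates are the projections of the displacements onto the eigenvectors;
# for this symmetric case, q1 = (x1 + x2)/sqrt(2) and q2 = (x1 - x2)/sqrt(2).
x = np.array([0.3, -0.1])              # example displacement vector
q = vecs.T @ x                          # in these coordinates the motion decouples
print("normal coordinates:", q)
```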

Another observation: none of the AIs seem to understand what a "rigorous" proof actually means; they appear to think that producing something that looks like an answer counts as a proof. AIs, like humans, also make plenty of "accidental" mistakes. For example, before the official unified test we tried the questions privately many times, and Claude-sonnet often answered the first question correctly, yet in the official test it got it wrong. For rigor we should probably run each question several times and average the scores, but that is rather troublesome...

Planning and production

Source: Institute of Physics, Chinese Academy of Sciences (id: cas-iop)

Editor: Yang Yaping

Proofread by Xu Lai and Lin Lin

The cover image of this article comes from a stock image library; reprinting it may cause copyright disputes.
