We had the recently popular DeepSeek take on the competition problems set by the Institute of Physics, and the result was...

Recently, DeepSeek-R1, an open-source large model with deep-thinking and reasoning capabilities released by China's DeepSeek, has attracted worldwide attention.

Before DeepSeek-R1, GPT-o1 from OpenAI, Claude from Anthropic, and Gemini from Google all claimed to have deep-thinking and reasoning capabilities, and these models have indeed performed impressively in a variety of tests by professionals and netizens.

What particularly caught our interest was that Google's dedicated model AlphaGeometry scored 28/42 at the International Mathematical Olympiad, a competition widely recognized as difficult, reaching silver-medal level. We ourselves were exposed to Olympiad mathematics as students, and we know that contestants who win silver medals at such international Olympiads have shown considerable mathematical talent since childhood and have trained hard all along the way. It is no exaggeration to say that an AI reaching this level has strong reasoning ability. Ever since, we have been curious about how well these powerful AIs would do at physics.

On January 17, the Institute of Physics of the Chinese Academy of Sciences held the "Tianmu Cup" theoretical physics competition in Liyang City, Jiangsu Province. Two days later, the release of DeepSeek-R1 set the AI community abuzz, and it naturally became the first model we chose to test. In addition, the models we tested include GPT-o1 from OpenAI and Claude-sonnet from Anthropic.

Here is how we tested:

1. The entire test consists of 8 rounds of dialogue with each model.

2. The first message in the conversation is an "opening statement": it explains the task to be completed, the format of the questions, the format for submitting answers, and so on. We manually confirm from the AI's reply that it has understood.

3. The 7 questions are then sent in sequence; each new question is sent only after the reply to the previous one is received, with no manual feedback in between (a minimal sketch of this dialogue flow is given after the list).

4. Each question consists of two parts: the problem text and a description of the accompanying figure (questions 3, 5 and 7 have no figures).

5. The figure descriptions are plain text, generated by GPT-4o and then manually proofread.

6. The text materials given to each large model are exactly the same (see attachment).
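
For concreteness, the dialogue flow above can be sketched roughly as follows. This is a minimal sketch, assuming an OpenAI-compatible chat API; the model name, file names, and directory layout are placeholder assumptions, and the actual opening statement and question texts are those in the attachment.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

# Placeholder files; the real opening statement and question texts are in the attachment.
OPENING = open("opening_statement.txt", encoding="utf-8").read()
QUESTIONS = [open(f"question_{i}.txt", encoding="utf-8").read() for i in range(1, 8)]

# Round 1: the opening statement. A human checks the reply to confirm the model
# has understood the task before any question is sent.
messages = [{"role": "user", "content": OPENING}]
reply = client.chat.completions.create(model="placeholder-model", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Rounds 2-8: the 7 questions, sent strictly in sequence with no manual feedback.
answers = []
for question in QUESTIONS:
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="placeholder-model", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the full history
    answers.append(answer)  # one block of TeX per question, graded later
```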

After the above process, we obtained, for each large model, 7 blocks of TeX text corresponding to its answers to the 7 questions. The papers were then marked as follows:

1. The TeX text is manually adjusted so that it compiles in Overleaf, and the compiled PDF files are collected as the answer sheets (a minimal sketch of the wrapping step is given after the list).

2. The answers of the 4 models to the 7 questions are sent to a grading group of 7 examiners.

3. The marking team is exactly the same as the one for the "Tianmu Cup" competition itself, and each marker grades the same question as before. For example, marker A grades the first question of all human and AI answers, marker B grades the second question, and so on.

4. The marking team then tallies the scores for all questions.
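
As a rough illustration of step 1, the TeX answers can be wrapped in a small preamble before being compiled in Overleaf. This is only a sketch under assumed file names and an assumed preamble, not the actual script used; in practice some answers also needed manual fixes by hand.

```python
from pathlib import Path

# Assumed preamble; adjust packages as needed for the actual answers.
PREAMBLE = r"""\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
"""
POSTAMBLE = "\n\\end{document}\n"

# Hypothetical layout: answers/<model>_q<i>.txt, one TeX block per question.
for answer_file in sorted(Path("answers").glob("*_q*.txt")):
    body = answer_file.read_text(encoding="utf-8")
    out = answer_file.with_suffix(".tex")
    out.write_text(PREAMBLE + body + POSTAMBLE, encoding="utf-8")
    # The resulting .tex files are compiled in Overleaf and the PDFs collected as answer sheets.
```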

What are the results? See the table below.

Comments on the results:

1. DeepSeek-R1 performed the best. It got full marks on the basic questions (the first three) and also full marks on the sixth question, something not seen among the human contestants. Its low score on the seventh question appears to come from failing to understand what "prove" means in the question stem: it merely restated the conclusion to be proved, which earns no points. Looking at its thinking process, there are steps along the way that would have scored, but they do not appear in the final answer.

2. GPT-o1's total score is almost the same as DeepSeek's. Some miscalculations on the basic questions (questions 2 and 3) cost it points. Compared with DeepSeek, o1's answers are closer to a human style, so it scored slightly higher on the last question, which is mainly a proof.

3. Claude-sonnet can be said to have "stumbled at the start": it made blunders on the first two questions and scored 0 on them, but its subsequent performance was very close to o1's, with similar deductions.

4. Comparing the AI scores with the human ones, DeepSeek-R1 would place in the top three (winning a special award), though still far behind the highest human score of 125 points; GPT-o1 would make the top five (special award), and Claude-sonnet the top ten (excellence award).

Finally, some subjective impressions from marking the papers. First, the AIs' reasoning really is good: there was essentially no question they could not get a handle on, and in many cases they found the correct approach right away. But unlike humans, once they have the correct approach they go on to make some very simple mistakes. For example, looking at R1's thinking process on the seventh question, it knew early on to use normal coordinates. Almost every human candidate who got that far went on to find the correct normal coordinates (just a simple matrix diagonalization), but R1 seemed to keep guessing by trial and error and never obtained the expression for the normal coordinates.
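
For readers unfamiliar with the step R1 got stuck on: for small oscillations, finding the normal coordinates amounts to diagonalizing the coupling matrix of the linearized equations of motion; the eigenvalues give the squared normal-mode frequencies, and the eigenvectors define the normal coordinates. The sketch below uses a simple two-mass example chosen purely for illustration, not the actual competition problem.

```python
import numpy as np

# Illustrative system: two unit masses, each attached to a wall by a spring (k)
# and to each other by a coupling spring (kc). Equations of motion: x_ddot = -K x.
k, kc = 1.0, 0.5                       # arbitrary spring constants for the example
K = np.array([[k + kc, -kc],
              [-kc, k + kc]])

# Diagonalize K: eigenvalues are the squared normal-mode frequencies,
# eigenvectors (columns) define the normal coordinates.
omega2, vecs = np.linalg.eigh(K)
print("normal-mode frequencies:", np.sqrt(omega2))

# Normal coordinates are the projections of the displacements onto the eigenvectors;
# for this symmetric case, q1 = (x1 + x2)/sqrt(2) and q2 = (x1 - x2)/sqrt(2).
x = np.array([0.3, -0.1])              # example displacement vector
q = vecs.T @ x                          # in these coordinates the motion decouples
print("normal coordinates:", q)
```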

Another observation: none of the AIs seem to understand what a "rigorous" proof actually means; they appear to think that producing something that looks like an answer counts as a proof. AIs, like humans, also make plenty of "accidental" mistakes. For example, before the official unified test we tried the questions privately many times, and Claude-sonnet often answered the first question correctly, yet in the official test it got it wrong. For rigor we should probably run each question several times and average the scores, but that is rather troublesome...

Planning and production

Source: Institute of Physics, Chinese Academy of Sciences (id: cas-iop)

Editor: Yang Yaping

Proofread by Xu Lai and Lin Lin

The cover image of this article comes from a stock image library; reprinting it may cause copyright disputes.
