Dongpo pork, stir-fried vegetables with mushrooms, steamed crucian carp, shrimp with tofu... After cooking a table full of dishes, you snap a photo, send it to an AI, and ask: which food in the picture has the highest protein content? Which dish should people with high uric acid avoid? The AI thinks for a few seconds, types out its entire reasoning process, and finally circles the answer on the picture.

This is a multimodal large model that has learned to reason, and it may soon be part of daily life. Until recently, an AI that both "has eyes" and reasons well existed only in the imagination. But a team of young researchers (born after 1995) at Hangzhou's Om AI Lab has now successfully migrated the training method behind DeepSeek-R1 from the pure-text domain to the vision-language domain, opening up new possibilities for multimodal large models.

They open-sourced the project, named VLM-R1, on GitHub, the world's largest code-hosting platform. Within a week of release it had collected 2.7k stars from developers around the world and reached GitHub's trending list on February 21, a standout result in the open-source community.

Star-count curve of VLM-R1 during its first week on GitHub; on February 21 it reached GitHub's trending list

The team is led by Dr. Zhao Tiancheng, founder of Om AI Lab and a member of the post-90s generation. He is also the director and a doctoral supervisor of the Om Artificial Intelligence Center at the Zhejiang University Binjiang Research Institute.

Bringing the method that taught DeepSeek-R1 to reason into machine vision

What makes the DeepSeek-R1 model distinctive is how DeepSeek restructured the usual training pipeline for reasoning. Previously, improving a model's reasoning ability usually relied on a supervised fine-tuning (SFT) stage.
Simply put, SFT takes a large model that has already learned a great deal and uses specific, labeled data to teach it to perform a particular task better. It is like already knowing how to cook, yet still needing dedicated practice to master Sichuan or Anhui cuisine.

DeepSeek-R1 skipped this step during training and went straight to the reinforcement learning stage, exploring how a large model can improve itself through pure reinforcement learning, without supervised data. The technique behind this innovation is called Group Relative Policy Optimization (GRPO).

GRPO helped DeepSeek-R1 learn to reason; could it also help AI models perform better on general computer vision tasks? After repeated experiments, the Om AI Lab team's answer is yes.

They trained Qwen2.5-VL, Tongyi's open-source visual understanding model, on a visual grounding task, and on that basis compared the R1 method against SFT. The conclusion: the R1 method maintains stable, high performance across a range of complex scenarios, which is crucial in practical applications.

In the street-view photo below, the AI's task is to locate objects that could endanger a visually impaired person. On a roadside sidewalk, humans would think of the usual obstacles: stone bollards, bus stops, pedestrians, and so on, all of which can be labeled in advance as training data. But this picture contains a special case: stairs. In the team's experiments, the model trained with the R1 method successfully inferred that the steps in this scene pose a danger to the visually impaired. "For humans this is common-sense reasoning and very easy. But for earlier, traditional computer vision models it is actually very hard," Zhao Tiancheng explained.
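The core of GRPO, as described above, is that it needs no separate critic model: for each prompt, the policy samples a group of candidate answers, a rule-based reward scores each one, and every answer's reward is normalized against the group's mean and standard deviation to get its advantage. A minimal sketch of that group-relative normalization (function name and details are illustrative, not taken from the VLM-R1 codebase):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt.

    `rewards` holds the rule-based scores of a group of sampled
    answers to the same question. Each answer's advantage is its
    reward normalized by the group mean and standard deviation,
    so no learned value (critic) model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < 1e-8:
        # All answers scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers, two correct (reward 1) and two wrong.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Answers that score above the group average get a positive advantage and are reinforced; below-average answers are suppressed. The policy update itself (a clipped objective with a KL penalty, as in the DeepSeek-R1 report) is omitted here.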
In the picture below, the table holds yam, an omelet, edamame, green vegetables, coffee, and oranges. The AI is asked to locate the food richest in vitamin C. The model trained with the R1 method quickly locks onto the orange and attaches its thinking process. "Before, the model gave the answer directly without showing its work, and the error rate was high: it might answer only four or five out of ten questions correctly, while the R1-trained model gets seven or eight right."

There is also a very common failure mode in machine learning: as a model is trained on task A, increasing the number of training steps (the iterations of the training loop) degrades its performance on a dissimilar task B (the red curve in the figure). "It's a bit like whack-a-mole: press one problem down and another pops up. So when training on multiple tasks in the past, we had to carefully balance the ratio between them." The model trained with the R1 method (the green curve in the figure) shows no such trend, which suggests the R1 method helps the model genuinely learn to understand visual content rather than simply memorize it.

The green curve shows training with the R1 method; the red curve, training with traditional SFT

Training a visual language model: a new idea

"The experiments started during the Spring Festival holiday. Fortunately, we had accumulated a lot of experience beforehand, and much of the 'infrastructure' was ready-made. Once we had an idea, we could run experiments and verify results quickly." The ten-member team includes R&D staff from the institute and doctoral students supervised by Zhao Tiancheng. On February 15, Zhao Tiancheng posted the VLM-R1 experimental results on overseas social platforms, open-sourced the project, and uploaded it to GitHub.
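Both grounding examples above (the hazardous obstacles and the vitamin-C question) share a property that makes R1-style reinforcement learning practical: the answer is a bounding box, so a simple rule can verify it automatically, with no human-labeled reasoning traces. The article does not give VLM-R1's exact reward design; the following is a hedged sketch of a common rule-based scheme for such tasks, scoring the predicted box by intersection-over-union (IoU) against the ground truth. The function names and the 0.5 threshold are illustrative assumptions:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping region, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gt_box, threshold=0.5):
    """Rule-based accuracy reward for a grounding answer:
    1.0 if the predicted box overlaps the ground truth enough, else 0.0."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0
```

Because the reward is computed by a fixed rule rather than a learned model, the same group-relative training loop can be applied to any task where correctness is machine-checkable.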
As of February 22, it had received 2.7k stars from developers around the world. Questions large and small came flooding in: How long does training take? What is the minimum GPU memory? Can you share more of the model's thinking process?

"Although the underlying logic is the same, vision is a completely different modality from mathematics and code. How do you design for the visual domain and make it work? The team went through many rounds of trial and error before finding the current, relatively effective combination." Zhao Tiancheng admits the current release is only a version 0.1, far from mature: "There are questions that can only be answered with more experiments." In his view, one of the most significant outcomes of this round of experiments is that it offers new ideas for multimodal model training and for the industry, demonstrating the generality of the R1 method: "It not only performs well in the text domain, it may also set a new trend in visual language model training."

"Being a leader who dares to try matters more than following others into a trend"

Lianhui Technology, the parent company of Om AI Lab, is located in Hangzhou's Binjiang Internet Industrial Park, the cradle from which Alibaba and NetEase rose; Internet and Internet-of-Things technologies entered daily life from here. Now that artificial intelligence has taken center stage, the company is focused on building and deploying an AI agent platform. On February 21, Om AI Lab, led by Zhao Tiancheng, brought two debuts to the 2025 Global Developer Conference (GDC) in Shanghai: VLM-R1, a visual understanding multimodal model trained with R1-style reinforcement learning, and Open Agent Leaderboard, an open-source evaluation platform for large language model agents.
Zhao Tiancheng (Photo by Chen Zhongqiu)

In an interview last August, Zhao Tiancheng said he has always remembered what his mentor told him while he was studying at Carnegie Mellon University (CMU) in the United States: be a leader, not a follower. Being a leader who dares to try is far more important than following others at the forefront of a trend. (Source: Chao News)