Currently, large language models (LLMs) may be the "optimal solution" for achieving artificial general intelligence (AGI). However, although large models appear close to human-level performance in fluency and breadth of knowledge, the challenge of evaluating them is becoming increasingly prominent. With the rapid development of large models, some traditional benchmarks have become obsolete, so new evaluation benchmarks are urgently needed.

Recently, research teams from Meta, Hugging Face, and AutoGPT jointly proposed GAIA, a benchmark for testing general AI assistants. GAIA poses real-world questions that require a combination of basic capabilities, such as reasoning, multimodal processing, web browsing, and proficient use of general tools.

According to the research team, these questions are conceptually simple for humans but challenging for most large models. To give an intuitive figure: humans answer them with a success rate of 92%, while even GPT-4 equipped with plugins reaches only 15%. This stands in stark contrast to the recent trend of large models outperforming humans on tasks that require professional skills, such as law or chemistry.

The related paper, titled "GAIA: A Benchmark for General AI Assistants", has been published on the preprint website arXiv.

Notably, GAIA's philosophy departs from the current trend in AI benchmarks, which aim for tasks that are ever more difficult for humans. The research team believes that the emergence of AGI depends on whether a system can show robustness similar to that of an average person on such questions.

General AI Assistant Benchmark: Interacting with the Real World

As the capabilities of large models grow, existing evaluation benchmarks are increasingly unable to keep up and are quickly surpassed by new models. In the effort to turn large models into general assistants, current evaluation methods lag behind. Existing evaluations mainly rely on closed systems, specific API calls, or the reuse of existing evaluation datasets. However, these methods are usually conducted in closed environments and may measure how well the assistant has learned to use a specific API rather than its more general ability to interact with the real world. In contrast, GAIA takes real-world interaction as the benchmark and does not restrict the APIs that may be used. Other approaches also explore the evaluation of general assistants, but their core difference from GAIA is that they focus more on the capabilities of current models than on future progress.

According to the paper, GAIA is a benchmark of general-assistant questions for AI systems, designed to avoid various pitfalls of LLM evaluation. GAIA contains 466 questions designed and annotated by humans. The questions are mainly in text form and sometimes come with files, such as images or spreadsheets. They cover a variety of general-assistant scenarios, including daily personal tasks, scientific problems, and general knowledge. Each question is designed to have a single short, correct answer, so it is easy to verify.

Using GAIA only requires prompting the AI assistant with these questions, along with the relevant evidence (if any). Evaluating LLMs with GAIA therefore only requires the ability to ask the model questions, that is, access to an API.
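To make this concrete, the following is a minimal sketch of what such API-only evaluation could look like. It is not the authors' official harness: the file name gaia_questions.jsonl and its record fields, the prefix-prompt wording, and the query_model() helper are illustrative assumptions.

```python
# Minimal sketch of API-only evaluation on GAIA-style questions.
# Assumptions (not from the paper): the question file path and schema,
# the prefix-prompt wording, and the query_model() helper are illustrative.
import json

PREFIX_PROMPT = (
    "You are a general AI assistant. Answer the question and end your reply "
    "with the line 'FINAL ANSWER: <answer>', where <answer> is a number or "
    "as few words as possible."
)

def query_model(prompt: str) -> str:
    """Placeholder for a call to the assistant's API (e.g. a chat-completion
    endpoint). Nothing more than sending text and reading text back is needed,
    which is the point GAIA makes about API-only evaluation."""
    raise NotImplementedError("plug in your model's API client here")

def ask_gaia_question(question: str, attached_file: str | None = None) -> str:
    """Send one question (plus a hint about any attached file) to the model."""
    prompt = f"{PREFIX_PROMPT}\n\nQuestion: {question}"
    if attached_file:
        prompt += f"\n(The question refers to an attached file: {attached_file})"
    return query_model(prompt)

if __name__ == "__main__":
    # Each record is assumed to contain the question text, an optional file
    # name, and the single short reference answer used for verification.
    with open("gaia_questions.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(ask_gaia_question(record["question"], record.get("file_name")))
```

The only requirement placed on the model is that it can receive a text prompt (plus any attached evidence) and return text, which is what makes GAIA usable even for models reachable only through an API.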
The researchers used a prefix prompt before asking the model each question; to make answer extraction easier, the prefix prompt also specified the format of the final answer. They then evaluated GPT-4 with and without plugins, as well as AutoGPT with GPT-4 as the backend. Currently, GPT-4 requires manual selection of plugins, while AutoGPT can make this selection automatically.

The results show that GAIA allows a clear ranking of capable assistants while leaving a great deal of room for improvement in the coming months and years. As can be seen in the figure, human web search performs well on Level 1 questions but struggles with more complex queries and is somewhat slower. GPT-4 with plugins outperforms GPT-4 without plugins in both answer accuracy and execution planning. AutoGPT-4 uses tools automatically, but its performance on Level 2, and even Level 1, is disappointing, probably because of the way it relies on the GPT-4 API. Overall, humans working together with GPT-4 with plugins seem to strike the best balance between score and time.

A First Step in Evaluating the Potential of General AI Assistants

The emergence of GAIA prompts us to rethink how current and future AI systems are evaluated. Models locked behind an API may change over time, which means that evaluations conducted at different points in time may not be replicable or reproducible. The problem is compounded for tools such as ChatGPT plugins, which are updated regularly and whose functionality is not exposed through ChatGPT's API. Reproducibility is all the harder because researchers often rely on real-world benchmarks when evaluating model performance, and those benchmarks may themselves change over time. GAIA, however, is robust to generation randomness because it only looks at the final answer and accepts a single correct response for evaluation. In addition, compared with large datasets of multiple-choice questions, GAIA focuses on question quality rather than quantity. With continued development, GAIA is expected to become a key component of more comprehensive assessments of the generalization ability and robustness of AI systems.

A GAIA task may involve calling various modules, any of which can fail; an image classifier, for example, may return a wrong label. Some may find such an assessment somewhat ambiguous, because GAIA evaluates the system as a whole rather than attributing errors to sub-parts of the system, such as the web-browsing or vision modules. However, tightly coupling LLMs with external tools for every task may not be a sustainable approach: future models may integrate language models more closely with other capabilities, as vision-language models already do. GAIA aims to evaluate entire AI systems, not any particular architecture.

More broadly, automatic, factual, and interpretable evaluation of complex generation has long been an open problem in generative AI. Current evaluation methods have their limitations, and more sophisticated methods may be needed in the future, such as evaluating multimodal systems by performing complex sequences of modifications on images and asking explicit questions about them in natural language.
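As noted above, GAIA's robustness to generation randomness comes from checking only the final answer against a single reference. A simplified, assumed sketch of such a matcher is shown below; the 'FINAL ANSWER' marker and the normalization rules are illustrative, and the benchmark's official scoring (for instance, how numbers or lists are compared) may differ.

```python
# Simplified sketch of final-answer scoring in the spirit of GAIA:
# only the extracted final answer is compared with the single reference,
# which is what makes the evaluation robust to generation randomness.
# The marker string and normalization rules are assumptions, not the
# benchmark's official scorer.
import re

def extract_final_answer(model_output: str) -> str:
    """Pull the text after the 'FINAL ANSWER:' marker requested in the prefix prompt."""
    match = re.search(r"FINAL ANSWER:\s*(.+)", model_output, re.IGNORECASE)
    return match.group(1).strip() if match else model_output.strip()

def normalize(text: str) -> str:
    """Lowercase and strip punctuation and extra whitespace so trivial
    formatting differences are not counted as errors."""
    text = re.sub(r"[^\w\s.%-]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Exact match on the normalized final answer."""
    return normalize(extract_final_answer(model_output)) == normalize(reference_answer)

# Both outputs below count as correct for the reference answer "Paris".
print(is_correct("The capital of France is Paris.\nFINAL ANSWER: Paris", "Paris"))  # True
print(is_correct("final answer:   Paris", "Paris"))                                 # True
```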
Despite the progress of deep learning in many fields, full automation still faces unpredictable failures, as the challenges of self-driving cars illustrate. Solving GAIA's questions requires full automation, but this may change the socio-economic landscape, with the risk that the owners of the technology capture a dominant share of the value.

GAIA also has some limitations. First, it cannot evaluate the different paths that lead to a correct answer; the authors suggest that human and model-based evaluation could be added in the future to make up for this shortcoming. Second, because OpenAI's API does not provide detailed logs of tool calls, only the most powerful language models with tool access are currently evaluated; the research team hopes to add open-source models with sufficient tool-use capabilities and logging in the future. Third, creating a realistic and easy-to-use benchmark required two rounds of annotation: in the first round annotators designed the questions, and in the second round two independent annotators answered them and resolved ambiguities, although some ambiguity may remain despite this thorough process. Finally, a significant limitation of GAIA is its lack of language diversity: all questions are asked only in "standard" English, and many rely primarily on English web pages.

Therefore, GAIA is only a first step in evaluating the potential of general AI assistants and should not be considered absolute proof of their success.

Reference link: https://arxiv.org/abs/2311.12983

Author: Yan Yimi
Editor: Academic