2025 is shaping up to be the year of the agent explosion. Thanks to their ability to handle complex, multi-step tasks and interact with different environments in real time, agent systems driven by large language models (LLMs), especially multi-agent systems (MAS), are widely considered well suited to solving real-world problems. They are therefore increasingly used in fields such as software engineering, drug discovery, scientific simulation, and general-purpose agent systems. However, compared with single-agent systems, or even simpler baselines, multi-agent systems are more error-prone on real tasks. As the figure below shows, the failure rate on AppWorld can be as high as 86.7%.

Figure | Failure rates of five commonly used multi-agent LLM systems using GPT-4o and Claude-3

Why is this? A research team from the University of California, Berkeley and Italy's Intesa Sanpaolo bank offers an answer. They conducted the first comprehensive study of the challenges facing multi-agent systems and identified 14 distinct failure modes, which they grouped into three categories: (1) specification and system design failures; (2) inter-agent misalignment; and (3) task verification and termination.

The paper, titled "Why Do Multi-Agent LLM Systems Fail?", has been published on the preprint server arXiv.

Paper link: https://arxiv.org/abs/2503.13657

Specifically, they propose MASFT, the first empirically grounded failure taxonomy for multi-agent systems, which provides a structured framework for understanding and mitigating multi-agent system failures. They also developed a scalable "LLM-as-a-judge" evaluation pipeline for analyzing the performance of new multi-agent systems and diagnosing their failure modes, and conducted intervention studies on agent specifications, dialogue management, and verification strategies.
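The "LLM-as-a-judge" idea mentioned above can be sketched in a few lines: embed the trace and the taxonomy definitions in a judging prompt, then keep only those labels in the model's reply that belong to the taxonomy. This is a minimal illustration, not the paper's pipeline; `call_llm` is a hypothetical stand-in for any chat-completion API, and the failure-mode names shown are illustrative, not the full MASFT taxonomy.

```python
# Minimal sketch of an "LLM-as-a-judge" failure annotator.
# `call_llm` is a hypothetical stand-in for any chat-completion API;
# the failure-mode names are illustrative, not the paper's full taxonomy.

FAILURE_MODES = [
    "disobey_task_specification",
    "ignore_other_agents_input",
    "premature_termination",
]

def build_prompt(trace: str) -> str:
    """Embed the trace and taxonomy definitions in a single judging prompt."""
    modes = "\n".join(f"- {m}" for m in FAILURE_MODES)
    return (
        "You are a judge of multi-agent conversation traces.\n"
        f"Known failure modes:\n{modes}\n\n"
        f"Trace:\n{trace}\n\n"
        "Reply with a comma-separated list of failure modes observed, or NONE."
    )

def parse_verdict(reply: str) -> list[str]:
    """Keep only labels that belong to the taxonomy, ignoring judge noise."""
    labels = [part.strip() for part in reply.split(",")]
    return [l for l in labels if l in FAILURE_MODES]

def judge_trace(trace: str, call_llm) -> list[str]:
    return parse_verdict(call_llm(build_prompt(trace)))

# Usage with a canned judge response:
fake_llm = lambda prompt: "premature_termination, something_else"
print(judge_trace("agent A: ... agent B: ...", fake_llm))
# -> ['premature_termination']
```

Filtering the reply against a closed label set is what makes such a pipeline scalable: free-form judge output is reduced to taxonomy labels that can be counted and compared across systems.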
Although these interventions raised task completion rates by 14%, they did not fully solve the problem of multi-agent system failure, which highlights the need for a structural redesign of multi-agent systems.

The team also open-sourced their research artifacts, including: more than 150 annotated multi-agent system conversation traces; a scalable LLM-as-a-judge evaluation pipeline with LLM annotations for over 150 trajectories; and detailed expert annotations of 15 selected trajectories.

Up to 14 failure modes

In this work, the research team used Grounded Theory, a qualitative research method that builds theory directly from empirical data rather than testing predefined hypotheses, allowing failure modes to emerge organically. They iteratively collected and analyzed execution traces of multi-agent systems through theoretical sampling, open coding, constant comparative analysis, memoing, and theorizing. After collecting the trace records and discussing preliminary findings, they derived MASFT from the observed failure modes.

Figure | Systematic research workflow for multi-agent systems

To enable automatic failure identification, they developed an LLM-based annotator and verified its reliability. They then conducted an inter-annotator agreement study, repeatedly adjusting the failure modes and categories by adding, removing, merging, splitting, or redefining them until consensus was reached. This process reflects an iterative refinement approach: the taxonomy is revised until it stabilizes, with agreement between annotators measured by the Kappa coefficient.

Figure | Multi-agent system failure mode classification

Ultimately, MASFT comprises three overall failure categories, specification and system design failures, inter-agent misalignment, and task verification and termination, and identifies 14 fine-grained failure modes that multi-agent systems may encounter during execution.
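The inter-annotator agreement measure mentioned above, Cohen's kappa, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from each annotator's label frequencies. A small self-contained sketch (with made-up labels, not the paper's data):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two annotators: (p_o - p_e) / (1 - p_e)."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative labels from two hypothetical annotators over six traces:
ann1 = ["FM1", "FM2", "FM1", "FM3", "FM2", "FM2"]
ann2 = ["FM1", "FM2", "FM1", "FM1", "FM2", "FM3"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.478
```

Values near 1 indicate near-perfect agreement; values near 0 mean the annotators agree no more often than chance, which is why taxonomy definitions are refined until kappa stabilizes at a high value.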
MASFT also divides multi-agent system execution into three phases, pre-execution, execution, and post-execution, and identifies the phase in which each fine-grained failure mode may occur.

Figure | Correlation matrix of failure categories in multi-agent systems

In addition, the team found that multi-agent systems face problems similar to those of complex human organizations, and that their failure modes mirror failure modes commonly observed in human organizations. For example, "not asking for clarification" undermines "respect for expertise", and "agent misalignment" reflects the need for clearer hierarchy and coordinated role allocation.

The effectiveness of multi-agent collaboration still needs improvement

For the failure categories above, the research team proposed tactical and structural strategies. Tactical strategies involve direct modifications targeting specific failure modes, such as improving prompts, agent network topology, and dialogue management. However, the effectiveness of these approaches is inconsistent, as demonstrated by two case studies. Structural strategies are more comprehensive approaches that affect the entire system: strong verification, enhanced communication protocols, uncertainty quantification, and memory and state management. These require deeper research and careful implementation, and remain open topics for future work.

Figure | Mitigation strategies mapped to failure categories in multi-agent systems

The team applied these strategies in two case studies. In the first, they used the MathChat scenario implementation in AG2 as a baseline, in which a student agent collaborates with an assistant agent that can execute Python code to solve a problem. As a benchmark, they randomly selected 200 exercises from the GSM-Plus dataset.
The first intervention improved the original prompt with a clearer structure and a new section dedicated to verification. The second refined the agent configuration into a more specialized system with three distinct roles: a problem solver, who solves the problem without tools using chain-of-thought reasoning; a coder, who writes and executes Python code to reach the final answer; and a verifier, who reviews the discussion and critically evaluates the solution, either confirming the answer or prompting further discussion. In this setup, only the verifier can terminate the conversation once a solution is found.

In the second case study, ChatDev simulates a multi-agent software company in which agents take on different roles, such as CEO, CTO, software engineer, and reviewer, and collaborate on a software generation task. The team implemented two interventions. The first improved role-specific prompts to enforce hierarchy and role consistency; the second fundamentally changed the framework's topology, modifying the termination structure from a directed acyclic graph (DAG) to a cyclic graph. With this change, the process terminates only when the CTO agent confirms that all reviews have been properly addressed, and a maximum iteration limit prevents infinite loops. This enables iterative improvement and more thorough quality assurance.

Figure | Performance accuracy of the various interventions

The research team notes that many "obvious" solutions have serious limitations, and that the structural strategies outlined above are needed to achieve consistent improvements. Given the information redundancy and conflicts in current multi-agent coordination, and the model biases amplified by collaboration, future multi-agent systems will need rapid response, real-time verification, and dynamic coordination to make team collaboration truly effective.
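The cyclic termination structure described above, loop until an approving agent signs off, but cap the number of rounds, can be sketched as follows. This is an illustrative skeleton under stated assumptions, not ChatDev's implementation: `revise` and `cto_approves` are hypothetical stand-ins for the coder and verifier/CTO agents.

```python
# Sketch of a cyclic review loop with a hard iteration cap, in the spirit
# of the ChatDev intervention described above. `revise` and `cto_approves`
# are hypothetical stand-ins for the coder and verifier/CTO agents.

def run_review_loop(artifact: str, revise, cto_approves, max_iters: int = 5):
    """Revise until the CTO agent approves, or the iteration cap is hit."""
    for i in range(max_iters):
        if cto_approves(artifact):
            return artifact, i          # terminated by explicit approval
        artifact = revise(artifact)     # feed review back into another cycle
    return artifact, max_iters          # cap prevents infinite loops

# Usage with toy agents: the CTO approves once two revisions are applied.
revise = lambda a: a + "+fix"
cto_approves = lambda a: a.count("+fix") >= 2
final, rounds = run_review_loop("draft", revise, cto_approves)
print(final, rounds)  # -> draft+fix+fix 2
```

The design point is that termination authority rests with a single verifying agent rather than with a fixed DAG, while the iteration cap guarantees the cycle cannot run forever.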
"LLM-based multi-agent systems still hold real potential in areas such as distributed scientific research collaboration and emergency response systems." Author: Yu Ke |