Why do "three agents" have no water to drink? Researchers identify 14 failure modes


2025 is the year of the agent explosion.

Because they can handle complex, multi-step tasks and interact with different environments in real time, agent systems driven by large language models (LLMs), especially multi-agent systems (MAS), are considered well suited to solving real-world problems. They are therefore increasingly used in fields such as software engineering, drug discovery, scientific simulation, and general-purpose agents.

However, compared with single-agent systems or even simpler baselines, multi-agent systems are more prone to errors when dealing with real problems. As shown in the figure below, the failure rate of AppWorld can be as high as 86.7%.

Figure | Failure rates of five commonly used multi-agent LLM systems using GPT-4o and Claude-3

Why is this? A research team from the University of California, Berkeley and the Italian bank Intesa Sanpaolo has offered an answer:

They conducted the first comprehensive study of the challenges facing multi-agent systems and identified 14 distinct failure modes, which they grouped into three categories: (1) specification and system design failures; (2) inter-agent misalignment; and (3) task verification and termination.

The related research paper, titled “Why Do Multi-Agent LLM Systems Fail?”, has been published on the preprint website arXiv.

Paper link: https://arxiv.org/abs/2503.13657

Specifically, they proposed the first empirically grounded Multi-Agent System Failure Taxonomy (MASFT), which provides a structured framework for understanding and mitigating multi-agent system failures.

At the same time, they also developed a scalable "LLM-as-a-judge" evaluation pipeline for analyzing the performance of new multi-agent systems and diagnosing failure modes.
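To make the idea of an "LLM-as-a-judge" pipeline concrete, here is a minimal sketch of how traces could be labeled and tallied by failure category. It is not the team's implementation: the `Judge` type, `annotate_traces`, and the keyword-based `keyword_judge` are hypothetical names introduced for illustration, and a real pipeline would replace the toy judge with a call to an LLM.

```python
from collections import Counter
from typing import Callable, Iterable, Set

# The three MASFT categories named in the article.
CATEGORIES = (
    "specification and system design failures",
    "inter-agent misalignment",
    "task verification and termination",
)

# A judge maps one trace (plain text) to the categories it detects.
# Hypothetical interface: in practice this would wrap an LLM call.
Judge = Callable[[str], Set[str]]


def annotate_traces(traces: Iterable[str], judge: Judge) -> Counter:
    """Count how many traces the judge flags for each category."""
    counts: Counter = Counter({category: 0 for category in CATEGORIES})
    for trace in traces:
        labels = judge(trace)
        unknown = labels - set(CATEGORIES)
        if unknown:
            raise ValueError(f"judge returned unknown labels: {unknown}")
        counts.update(labels)
    return counts


def keyword_judge(trace: str) -> Set[str]:
    """Toy stand-in for an LLM judge: flags traces by keyword only."""
    labels = set()
    if "never verified" in trace:
        labels.add("task verification and termination")
    if "ignored" in trace:
        labels.add("inter-agent misalignment")
    return labels
```

Keeping the judge behind a plain callable means the counting logic can be tested with a deterministic stub before any model is plugged in.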

They also conducted intervention studies on agent specifications, dialogue management, and verification strategies. Although the task completion rate increased by 14%, the interventions still failed to fully resolve multi-agent system failures, which highlights the need for a structural redesign of multi-agent systems.

They have also open-sourced their research results, including:

More than 150 annotated multi-agent system conversation traces;

A scalable LLM-as-a-judge evaluation pipeline, along with LLM annotations for more than 150 traces;

Detailed expert annotations of 15 selected traces.

Up to 14 failure modes

In this work, the research team used Grounded Theory, a qualitative research method that builds theory directly from empirical data rather than testing predefined hypotheses, so that failure modes could emerge organically from the data.

They iteratively collected and analyzed the execution traces of multi-agent systems through theoretical sampling, open coding, constant comparative analysis, memoing, and theorizing. After obtaining the traces and discussing preliminary findings, they derived MASFT by consolidating the observed failure modes.

Figure | Methodological workflow for the systematic study of multi-agent systems

To identify failures automatically, they developed an LLM-based annotator and verified its reliability.

They also conducted an inter-annotator agreement study, repeatedly adjusting the failure modes and failure categories by adding, removing, merging, splitting, or modifying definitions until consensus was reached. The process is iterative: the taxonomy is refined until it stabilizes, with agreement between annotators measured by the Kappa coefficient.
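For readers unfamiliar with the Kappa coefficient, the sketch below computes Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The article does not say which Kappa variant was used, so the two-annotator form and the function name `cohens_kappa` are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, if annotator A marks four traces as `[1, 1, 0, 0]` (1 = failure mode present) and annotator B marks them `[1, 0, 0, 0]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.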

Figure | Taxonomy of multi-agent system failure modes

Ultimately, MASFT comprises three overall failure categories (specification and system design failures, inter-agent misalignment, and task verification and termination) and identifies 14 fine-grained failure modes that multi-agent systems may encounter during execution.

MASFT also divides the execution of a multi-agent system into three phases: before execution, during execution, and after execution, and identifies the multi-agent system execution phase where each fine-grained failure mode may occur.

Figure | Correlation matrix of failure categories in multi-agent systems

They also found that multi-agent systems face problems similar to those of complex human organizations, and that their failure modes match common failure modes observed in such organizations. For example, "not asking for clarification" undermines "respect for expertise", and "agent misalignment" reflects the need for clearer hierarchical distinctions and coordinated role allocation.

The effectiveness of multi-agent collaboration still needs to be improved

For all of the above failure categories, the research team proposed tactical and structural strategies.

Tactical strategies involve direct modifications targeting specific failure modes, such as improving prompts, agent network topology, and dialogue management. However, the effectiveness of these approaches is not consistent, as demonstrated by two case studies.

Structural strategies are more comprehensive approaches that affect the entire system: strong verification, enhanced communication protocols, uncertainty quantification, and memory and state management. These strategies require deeper research and careful implementation, and remain topics for future work.

Figure | Solution strategies and failure categories for multi-agent systems

The research team applied these strategic approaches in two case studies.

In the first case, they used the MathChat scenario implementation in AG2 as a baseline, in which a student agent collaborates with an assistant agent that can execute Python code to solve a problem.

As a benchmark, they randomly selected 200 exercises from the GSM-Plus dataset. The first strategy was to improve the original prompt with a clear structure and a new section dedicated to verification. The second strategy was to refine the agent configuration into a more specialized system with three distinct roles: a problem solver, who solves problems without tools using a chain-of-thought approach; a coder, who writes and executes Python code to arrive at the final answer; and a verifier, who reviews the discussion and critically evaluates the solution, either confirming the answer or prompting further discussion.

In this setup, only the verifier can terminate the conversation once a solution is found.
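The three-role setup with verifier-only termination can be sketched as a simple loop. This is an illustrative outline, not AG2 code: the agent callables, the `ACCEPT:` prefix convention, and the `max_rounds` cap are all assumptions introduced here (the article does not describe a round limit for this case).

```python
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]            # (role, text)
Agent = Callable[[List[Message]], str]  # hypothetical agent interface


def run_conversation(
    solver: Agent,
    coder: Agent,
    verifier: Agent,
    problem: str,
    max_rounds: int = 5,  # assumed safety cap, not taken from the article
) -> Tuple[Optional[str], List[Message]]:
    """Run solver -> coder -> verifier rounds; only the verifier can stop."""
    history: List[Message] = [("user", problem)]
    for _ in range(max_rounds):
        history.append(("solver", solver(history)))
        history.append(("coder", coder(history)))
        verdict = verifier(history)
        history.append(("verifier", verdict))
        # Assumed convention: the verifier ends the conversation by
        # replying "ACCEPT: <answer>"; anything else continues discussion.
        if verdict.startswith("ACCEPT:"):
            return verdict[len("ACCEPT:"):].strip(), history
    return None, history  # no accepted answer within the round limit
```

Neither the solver nor the coder has any way to end the loop, which is the point of the design: an answer only leaves the system after an explicit check.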

In the second case, ChatDev simulates a multi-agent software company where different agents have different roles, such as CEO, CTO, software engineer, and auditor, who try to collaborate to solve a software generation task.

They implemented two different interventions. The first improved the prompts for specific roles to enforce hierarchy and role consistency. The second made a fundamental change to the framework's topology, turning it from a directed acyclic graph (DAG) into a cyclic graph.

With this change, the process terminates only when the CTO agent confirms that all review comments have been properly addressed, and a maximum number of iterations is set to prevent infinite loops. This approach enables iterative improvement and more comprehensive quality assurance.
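The cyclic topology described above amounts to a bounded review loop. The sketch below is a simplified illustration rather than ChatDev's actual code: `write_code`, `review`, and `cto_approves` are hypothetical callables standing in for the engineer, reviewer, and CTO agents, and the default of three iterations is an arbitrary assumption.

```python
from typing import Callable, List, Tuple


def review_loop(
    write_code: Callable[[str, List[str]], str],   # (task, comments) -> code
    review: Callable[[str], List[str]],            # code -> review comments
    cto_approves: Callable[[str, List[str]], bool],
    task: str,
    max_iterations: int = 3,  # assumed cap; prevents infinite loops
) -> Tuple[str, int, bool]:
    """Cycle write -> review -> CTO check until approval or the cap."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    code = write_code(task, [])
    for iteration in range(1, max_iterations + 1):
        comments = review(code)
        if cto_approves(code, comments):
            return code, iteration, True
        if iteration < max_iterations:
            # Loop back: this edge is what makes the graph cyclic.
            code = write_code(task, comments)
    return code, max_iterations, False  # cap reached without approval
```

In a DAG each phase runs once, so unresolved review comments simply pass downstream; here they feed back into another coding round until the CTO signs off or the cap is hit.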

Figure | Performance accuracy of various solutions

The research team says that many of the "obvious" solutions actually have serious limitations, and that the structural strategies they outline are needed to achieve more consistent improvements.

Given the information redundancy and conflicts in current multi-agent coordination, and the way model biases are amplified during collaboration, future multi-agent systems will need rapid response, real-time verification, and dynamic coordination to make team collaboration more effective.

"LLM-based multi-agent systems still have real potential in areas such as distributed scientific research collaboration and emergency response systems."

Author: Yu Ke