2025 is shaping up to be the year of the agent explosion. Thanks to their ability to handle complex, multi-step tasks and interact with different environments in real time, agent systems driven by large language models (LLMs), especially multi-agent systems (MAS), are widely considered well suited to solving real-world problems. They are therefore increasingly used in fields such as software engineering, drug discovery, scientific simulation, and general-purpose agent systems. However, compared with single-agent systems, or even simpler baselines, multi-agent systems are more error-prone on real tasks. As the figure below shows, the failure rate on AppWorld can be as high as 86.7%.

Figure | Failure rates of five commonly used multi-agent LLM systems using GPT-4o and Claude-3

Why is this? A research team from the University of California, Berkeley and Italy's Intesa Sanpaolo bank offers an answer. They conducted the first comprehensive study of the challenges facing multi-agent systems and identified 14 distinct failure modes, which they grouped into three categories: (1) specification and system design failures; (2) inter-agent misalignment; and (3) task verification and termination.

The paper, titled "Why Do Multi-Agent LLM Systems Fail?", has been published on the preprint server arXiv.

Paper link: https://arxiv.org/abs/2503.13657

Specifically, they propose MASFT, the first empirically grounded failure taxonomy for multi-agent systems, which provides a structured framework for understanding and mitigating multi-agent system failures. They also developed a scalable "LLM-as-a-judge" evaluation pipeline for analyzing the performance of new multi-agent systems and diagnosing their failure modes, and conducted intervention studies on agent specifications, dialogue management, and verification strategies.
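The "LLM-as-a-judge" idea mentioned above can be sketched in a few lines: embed the trace and the taxonomy definitions in a judging prompt, then keep only those labels in the model's reply that belong to the taxonomy. This is a minimal illustration, not the paper's pipeline; `call_llm` is a hypothetical stand-in for any chat-completion API, and the failure-mode names shown are illustrative, not the full MASFT taxonomy.

```python
# Minimal sketch of an "LLM-as-a-judge" failure annotator.
# `call_llm` is a hypothetical stand-in for any chat-completion API;
# the failure-mode names are illustrative, not the paper's full taxonomy.

FAILURE_MODES = [
    "disobey_task_specification",
    "ignore_other_agents_input",
    "premature_termination",
]

def build_prompt(trace: str) -> str:
    """Embed the trace and taxonomy definitions in a single judging prompt."""
    modes = "\n".join(f"- {m}" for m in FAILURE_MODES)
    return (
        "You are a judge of multi-agent conversation traces.\n"
        f"Known failure modes:\n{modes}\n\n"
        f"Trace:\n{trace}\n\n"
        "Reply with a comma-separated list of failure modes observed, or NONE."
    )

def parse_verdict(reply: str) -> list[str]:
    """Keep only labels that belong to the taxonomy, ignoring judge noise."""
    labels = [part.strip() for part in reply.split(",")]
    return [l for l in labels if l in FAILURE_MODES]

def judge_trace(trace: str, call_llm) -> list[str]:
    return parse_verdict(call_llm(build_prompt(trace)))

# Usage with a canned judge response:
fake_llm = lambda prompt: "premature_termination, something_else"
print(judge_trace("agent A: ... agent B: ...", fake_llm))
# -> ['premature_termination']
```

Filtering the reply against a closed label set is what makes such a pipeline scalable: free-form judge output is reduced to taxonomy labels that can be counted and compared across systems.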
Although these interventions raised task completion rates by 14%, they did not fully solve the problem of multi-agent system failure, which highlights the need for a structural redesign of multi-agent systems.

The team also open-sourced their research artifacts, including: more than 150 annotated multi-agent system conversation traces; a scalable LLM-as-a-judge evaluation pipeline with LLM annotations for over 150 trajectories; and detailed expert annotations of 15 selected trajectories.

Up to 14 failure modes

In this work, the research team used Grounded Theory, a qualitative research method that builds theory directly from empirical data rather than testing predefined hypotheses, allowing failure modes to emerge organically. They iteratively collected and analyzed execution traces of multi-agent systems through theoretical sampling, open coding, constant comparative analysis, memoing, and theorizing. After collecting the trace records and discussing preliminary findings, they derived MASFT from the observed failure modes.

Figure | Systematic research workflow for multi-agent systems

To enable automatic failure identification, they developed an LLM-based annotator and verified its reliability. They then conducted an inter-annotator agreement study, repeatedly adjusting the failure modes and categories by adding, removing, merging, splitting, or redefining them until consensus was reached. This process reflects an iterative refinement approach: the taxonomy is revised until it stabilizes, with agreement between annotators measured by the Kappa coefficient.

Figure | Multi-agent system failure mode classification

Ultimately, MASFT comprises three overall failure categories, specification and system design failures, inter-agent misalignment, and task verification and termination, and identifies 14 fine-grained failure modes that multi-agent systems may encounter during execution.
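The inter-annotator agreement measure mentioned above, Cohen's kappa, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement from each annotator's label frequencies. A small self-contained sketch (with made-up labels, not the paper's data):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two annotators: (p_o - p_e) / (1 - p_e)."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative labels from two hypothetical annotators over six traces:
ann1 = ["FM1", "FM2", "FM1", "FM3", "FM2", "FM2"]
ann2 = ["FM1", "FM2", "FM1", "FM1", "FM2", "FM3"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.478
```

Values near 1 indicate near-perfect agreement; values near 0 mean the annotators agree no more often than chance, which is why taxonomy definitions are refined until kappa stabilizes at a high value.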
MASFT also divides multi-agent system execution into three phases, pre-execution, execution, and post-execution, and identifies the phase in which each fine-grained failure mode may occur.

Figure | Correlation matrix of failure categories in multi-agent systems

In addition, the team found that multi-agent systems face problems similar to those of complex human organizations, and that their failure modes mirror failure modes commonly observed in human organizations. For example, "not asking for clarification" undermines "respect for expertise", and "agent misalignment" reflects the need for clearer hierarchy and coordinated role allocation.

The effectiveness of multi-agent collaboration still needs improvement

For the failure categories above, the research team proposed tactical and structural strategies. Tactical strategies involve direct modifications targeting specific failure modes, such as improving prompts, agent network topology, and dialogue management. However, the effectiveness of these approaches is inconsistent, as demonstrated by two case studies. Structural strategies are more comprehensive approaches that affect the entire system: strong verification, enhanced communication protocols, uncertainty quantification, and memory and state management. These require deeper research and careful implementation, and remain open topics for future work.

Figure | Mitigation strategies mapped to failure categories in multi-agent systems

The team applied these strategies in two case studies. In the first, they used the MathChat scenario implementation in AG2 as a baseline, in which a student agent collaborates with an assistant agent that can execute Python code to solve a problem. As a benchmark, they randomly selected 200 exercises from the GSM-Plus dataset.
The first intervention improved the original prompt with a clearer structure and a new section dedicated to verification. The second refined the agent configuration into a more specialized system with three distinct roles: a problem solver, who solves the problem without tools using chain-of-thought reasoning; a coder, who writes and executes Python code to reach the final answer; and a verifier, who reviews the discussion and critically evaluates the solution, either confirming the answer or prompting further discussion. In this setup, only the verifier can terminate the conversation once a solution is found.

In the second case study, ChatDev simulates a multi-agent software company in which agents take on different roles, such as CEO, CTO, software engineer, and reviewer, and collaborate on a software generation task. The team implemented two interventions. The first improved role-specific prompts to enforce hierarchy and role consistency; the second fundamentally changed the framework's topology, modifying the termination structure from a directed acyclic graph (DAG) to a cyclic graph. With this change, the process terminates only when the CTO agent confirms that all reviews have been properly addressed, and a maximum iteration limit prevents infinite loops. This enables iterative improvement and more thorough quality assurance.

Figure | Performance accuracy of the various interventions

The research team notes that many "obvious" solutions have serious limitations, and that the structural strategies outlined above are needed to achieve consistent improvements. Given the information redundancy and conflicts in current multi-agent coordination, and the model biases amplified by collaboration, future multi-agent systems will need rapid response, real-time verification, and dynamic coordination to make team collaboration truly effective.
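The cyclic termination structure described above, loop until an approving agent signs off, but cap the number of rounds, can be sketched as follows. This is an illustrative skeleton under stated assumptions, not ChatDev's implementation: `revise` and `cto_approves` are hypothetical stand-ins for the coder and verifier/CTO agents.

```python
# Sketch of a cyclic review loop with a hard iteration cap, in the spirit
# of the ChatDev intervention described above. `revise` and `cto_approves`
# are hypothetical stand-ins for the coder and verifier/CTO agents.

def run_review_loop(artifact: str, revise, cto_approves, max_iters: int = 5):
    """Revise until the CTO agent approves, or the iteration cap is hit."""
    for i in range(max_iters):
        if cto_approves(artifact):
            return artifact, i          # terminated by explicit approval
        artifact = revise(artifact)     # feed review back into another cycle
    return artifact, max_iters          # cap prevents infinite loops

# Usage with toy agents: the CTO approves once two revisions are applied.
revise = lambda a: a + "+fix"
cto_approves = lambda a: a.count("+fix") >= 2
final, rounds = run_review_loop("draft", revise, cto_approves)
print(final, rounds)  # -> draft+fix+fix 2
```

The design point is that termination authority rests with a single verifying agent rather than with a fixed DAG, while the iteration cap guarantees the cycle cannot run forever.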
"LLM-based multi-agent systems still hold real potential in areas such as distributed scientific research collaboration and emergency response systems." Author: Yu Ke |