Which is bigger, 9.11 or 9.9? A question that a kindergarten kid can answer stumps a bunch of AIs…

Which one is bigger, 9.11 or 9.9?

This question, which even kindergarten children can answer, has stumped many large language models (LLMs) in the past (and still does).

However, to reach the level of artificial general intelligence (AGI), LLMs must handle not only simple logical reasoning such as comparing magnitudes, but also harder reasoning, such as understanding and executing complex rules and multi-step planning, which are core capabilities of LLM agents and decision-making systems.

Therefore, effectively evaluating LLMs as rule-based executors and planners is crucial. However, there has been little research in this area in either academia or industry.

A research team from Tsinghua University and Zhipu has released a new benchmark, LogicGame, which aims to comprehensively evaluate LLMs' abilities in rule understanding, execution, and planning. Let's look at the evaluation results first:

Figure | LogicGame evaluation results and sample display. The upper panel shows the performance of various models in the execution and planning categories; the lower panels (left and right) show an execution-category and a planning-category case study, respectively.

In addition to o1-preview and o1-mini being far ahead, we also see that more than half of the models score less than 10%, as shown in the red area in the figure above.

This evaluation result reveals a fact that cannot be ignored: most LLMs have obvious defects in rule-based logical reasoning .

The related research paper, titled “LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models”, has been published on the preprint website arXiv.

Unlike traditional benchmarks, LogicGame contains a series of diverse games, each defined by an initial state and a set of rules; the model must not only understand these predefined rules but also apply them to solve the problem. Moreover, LogicGame considers both the final result and the intermediate steps to evaluate model performance comprehensively.

The results show that, by setting game scenarios of varying difficulty, LogicGame can accurately measure a model's performance on rule understanding and on multi-step execution and planning tasks.

LogicGame: "Level 4" difficulty game scenario

The combination of rule-following and reasoning is key to accomplishing many real-world tasks. However, existing benchmarks often fail to adequately capture this.

To fill this gap, the research team developed a set of novel problems through extensive research and crowdsourcing. These tasks resemble certain game mechanics, because real-world tasks often share features with games, such as the need to follow specific rules and make decisions. The team therefore adopted a gamified approach, which enables a detailed evaluation of a model's rule-following reasoning ability.

The data construction of LogicGame comprises the following four parts:

Design rule reasoning problems inspired by real-world scenarios. Since real-world tasks often have game features, such as following specific rules and making decisions, LogicGame uses a gamification approach to evaluate the model's rule-following and reasoning abilities.

Develop output constraints to ensure that model outputs conform to a standard format. To enable accurate evaluation and simplify answer matching, model responses are required to follow a structured JSON output format. For single-step problems (Level 0), the model only needs to output the final answer, and evaluation is based solely on answer correctness. For problems involving multiple steps or more complex reasoning (Levels 1, 2, and 3, plus some Level 0 problems), both the answer and the steps are evaluated.

Implement different difficulty levels and include example problems. There are four difficulty levels, covering the range of a model's reasoning capabilities from simple rule application to complex reasoning chains. The difficulty gradient is determined by the complexity of the relevant rules and the number of reasoning steps required to arrive at a solution.

Include both Chinese and English versions of the benchmark to ensure fairness and wider applicability.

As shown in the figure below, each model receives as its input prompt a set of rules specific to the given problem, the corresponding question, and an output constraint requiring a JSON response that contains the answer and the steps.
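To make this input/output contract concrete, here is a minimal sketch of what such a prompt and a conforming JSON response might look like. The rule text, the question, and the field names ("steps", "answer") are illustrative assumptions, not LogicGame's actual prompts or schema.

```python
import json

# Hypothetical example: the rule text and the JSON field names below are
# illustrative assumptions, not LogicGame's actual prompts or schema.
rules = (
    "Rule 1: You may swap any two adjacent characters in the string.\n"
    "Rule 2: Each swap counts as one step.\n"
)
question = "Transform 'bca' into 'abc' and report each intermediate string."
output_constraint = (
    "Respond with a JSON object containing two keys: "
    "'steps' (a list of strings, one per step) and 'answer' (the final string)."
)

prompt = f"{rules}\nQuestion: {question}\n\n{output_constraint}"

# A response that satisfies the constraint parses cleanly and exposes both the
# intermediate steps and the final answer for automatic checking.
model_response = '{"steps": ["bca -> bac", "bac -> abc"], "answer": "abc"}'
parsed = json.loads(model_response)
print(parsed["steps"])   # ['bca -> bac', 'bac -> abc']
print(parsed["answer"])  # abc
```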

Figure | Illustration of the classification and evaluation methods in LogicGame. In the classification diagram, categories involving mathematics are highlighted in purple.

LogicGame's evaluation uses an automated approach that checks not only the correctness of the answer but also the correctness of the steps leading to it. Specifically, it measures the model's answer accuracy (A-Acc), step accuracy (P-Acc), and answer-step accuracy (AP-Acc).

The answer score for each question is determined by comparing the model's response to the reference answer. Similarly, the step score is obtained by evaluating how well the model's steps agree with the reference steps, as defined by the JSON format constraints.

A-Acc: This metric evaluates the correctness of all answers to a given question, providing a binary evaluation (0/1) for each answer indicating whether it is correct.

P-Acc: This metric evaluates the correctness of the steps, measured by the character-level similarity between the provided steps and the expected steps. In the rare case where a Level 0 question involves only single-step reasoning and no steps are provided for evaluation, step accuracy is treated the same as answer accuracy when scoring.

AP-Acc: This composite metric evaluates the overall accuracy of the answer and steps. It is calculated by combining answer accuracy and step accuracy with a logical AND operation to produce the total score.

This evaluation method ensures that the model follows the rules for reasoning and comprehensively evaluates the model's reasoning ability.
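To illustrate how these three metrics relate, the following sketch computes them for a tiny batch of graded examples. The thresholded character-level similarity used for the step score here is an assumption for illustration; the paper's exact matching procedure may differ.

```python
from difflib import SequenceMatcher

def step_similarity(pred_steps: str, ref_steps: str) -> float:
    # Illustrative stand-in for "character-level similarity"; the paper's
    # exact matching procedure may differ.
    return SequenceMatcher(None, pred_steps, ref_steps).ratio()

def score_example(pred_answer, ref_answer, pred_steps, ref_steps, threshold=1.0):
    a = int(pred_answer == ref_answer)          # answer score (0/1) -> A-Acc
    if ref_steps is None:                       # single-step Level 0 case:
        p = a                                   # step score falls back to the answer score
    else:
        p = int(step_similarity(pred_steps, ref_steps) >= threshold)
    ap = int(a and p)                           # logical AND -> AP-Acc
    return a, p, ap

# Toy batch: one fully correct example, one with the right answer but wrong steps.
batch = [
    score_example("abc", "abc", "bca -> bac, bac -> abc", "bca -> bac, bac -> abc"),
    score_example("abc", "abc", "bca -> abc", "bca -> bac, bac -> abc"),
]
n = len(batch)
print("A-Acc: ", sum(a for a, _, _ in batch) / n)    # 1.0
print("P-Acc: ", sum(p for _, p, _ in batch) / n)    # 0.5
print("AP-Acc:", sum(ap for _, _, ap in batch) / n)  # 0.5
```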

How do the models fare? OpenAI o1 is far ahead

As shown in the figures below, in both the Chinese and English versions, o1-preview and o1-mini lead the 14 evaluated models by a wide margin at the highest difficulty (Level 3) of the execution category, while the domestic models failed to break 10 points and several scored 0; at Level 3 of the planning category, OpenAI o1's lead is just as pronounced.

Figure | AP-Acc% performance of 14 models on the Chinese version of LogicGame.

Figure | AP-Acc% performance of 14 models on the English version of LogicGame.

In the execution category, model accuracy improves significantly as the number of shots increases. Specifically, stronger models such as GPT-4o show larger gains in AP-Acc when moving from 0-shot to 1-shot and 2-shot, indicating that they are better able to use the additional contextual information to improve execution accuracy.

Figure | Few-shot differences of the Chinese version of LogicGame in the execution and planning categories.

We also observe that in execution tasks, adding examples generally improves model performance, especially on simple tasks (Level 0).

Figure | Shot-difference settings of the Chinese version of LogicGame at different difficulty levels, analogous to the figure above.

However, the 1-shot and 2-shot settings have different effects on the model at different difficulty levels. The model benefits the most from examples at Level 0, but the effect of examples gradually decreases as the difficulty level increases.

In planning tasks, the effect of adding examples is more complex. Some models' performance drops when switching from the 0-shot to the 1-shot or 2-shot setting, suggesting that the additional contextual information may introduce noise and interfere with the model's grasp of key information. Overall, 1-shot has the most noticeable effect, but its impact weakens as the difficulty level increases, while 2-shot is more unstable and shows no clear pattern.

In a case study, the LLMs' performance in the game of Reversi was almost "terrible". Except for OpenAI o1, the other models scored close to 0, which again shows that LLMs still struggle to handle complex rules and perform multi-step reasoning.

Figure | Average AP-Acc% scores for the five worst-performing categories. The heat map shows the average AP-Acc% score for each category. The models perform poorly in both execution and planning scenarios, especially in "Reversi", where many models score close to zero.

Figure | An example of a Reversi game with model output, including the answer and steps.

The research team analyzed this failure and found the following three reasons:

Inadequate handling of details: For example, the Claude 3.5 Sonnet model failed to correctly handle details such as placing or flipping certain pieces, indicating an insufficiently deep understanding of the rules.

Poor understanding of execution/planning rules: Models were unable to correctly execute or plan actions in the game, indicating that their understanding of game mechanics such as flipping was flawed (see the sketch after this list).

Excessive changes: The llama-3-8b-chat model made excessive changes to the board state, indicating a clear bias in its understanding of the game's rules.
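For reference, the flipping rule that trips these models up can be written in a few lines. The sketch below is a generic implementation of Reversi's capture rule, not the benchmark's own code, and the board encoding ('B', 'W', '.') is an assumption made for illustration.

```python
# A minimal, generic sketch of Reversi's flipping rule (not the benchmark's code).
# Board cells: 'B' (black), 'W' (white), '.' (empty). All 8 directions are checked;
# in each direction, a contiguous run of opponent pieces that is closed off by one
# of the player's own pieces gets flipped.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def place_and_flip(board, row, col, player):
    opponent = 'W' if player == 'B' else 'B'
    size = len(board)
    grid = [list(r) for r in board]      # copy so the caller's board is untouched
    grid[row][col] = player
    for dr, dc in DIRECTIONS:
        r, c = row + dr, col + dc
        run = []                          # opponent pieces seen so far in this direction
        while 0 <= r < size and 0 <= c < size and grid[r][c] == opponent:
            run.append((r, c))
            r, c = r + dr, c + dc
        # Flip only if the run is bounded by one of the player's own pieces.
        if run and 0 <= r < size and 0 <= c < size and grid[r][c] == player:
            for fr, fc in run:
                grid[fr][fc] = player
    return ["".join(r) for r in grid]

# Tiny 4x4 example: black plays at (1, 3) and captures the white piece at (1, 2).
small_board = ["....",
               ".BW.",
               "....",
               "...."]
print(place_and_flip(small_board, 1, 3, 'B'))  # ['....', '.BBB', '....', '....']
```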

The reasoning ability of LLMs still needs to improve

In this paper, the research team proposes LogicGame, a novel benchmark for evaluating the rule-based reasoning ability of LLMs. The benchmark spans multiple difficulty levels and focuses on evaluating models' understanding of rules, execution based on those rules, and planning capabilities.

At the same time, they also developed methods to evaluate the results and reasoning process to ensure that the model faithfully follows the given rules rather than just guessing the answer.

Extensive experiments show that current large models still exhibit significant deficiencies in rule-based reasoning tasks.

In this regard, the research team believes that LLM's reasoning ability still needs to be improved, especially in understanding complex rules, performing multi-step reasoning, and learning and applying new rules.

In order for LLMs to better understand and execute rules, their reasoning capabilities need to be further improved, for example through more effective training methods or the introduction of new reasoning mechanisms.

Furthermore, to evaluate the reasoning ability of LLMs more comprehensively, more effective evaluation methods need to be developed, for example by introducing more complex rules and more difficult reasoning tasks.

Let’s battle together!

Want to prove your large model's logical reasoning ability? Why not take the LogicGame test and go head-to-head with many large models from home and abroad?

The research team maintains a leaderboard on GitHub showing model performance on the English and Chinese versions of LogicGame. The ranking is based on AP-Acc%. The main evaluation metrics include:

AP-Acc% (answer-and-step accuracy)

A-Acc% (answer accuracy)

P-Acc% (step accuracy)

IFError% (instruction-following error rate)

JSError% (JSON format output error rate)

Figure|Performance of 14 large models on the Chinese version of LogicGame

Figure|Performance of 14 large models on the English version of LogicGame

So, how do you get your model to perform in the English and Chinese versions of LogicGame?

The research team has put dev data for demonstration on GitHub and provides the input data required for submission to Codabench (a platform dedicated to model evaluation that offers an efficient, fair, and unified evaluation environment). You can download the zh_all and en_all files (the full Chinese and English datasets, respectively), feed them to your model to obtain its responses, and submit those responses to Codabench to receive evaluation feedback.
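As a rough sketch of that workflow (the zh_all/en_all file names come from the article, but the JSONL format, the `my_model_generate` function, and the field names are assumptions; consult the LogicGame GitHub and Codabench pages for the exact specification):

```python
import json

def my_model_generate(prompt: str) -> str:
    # Replace this with a call to your own model (API or local inference).
    raise NotImplementedError

def build_submission(input_path: str, output_path: str) -> None:
    # Assumes one problem per line in JSON Lines format with a "prompt" field;
    # the real input layout may differ.
    with open(input_path, encoding="utf-8") as fin, \
         open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            item = json.loads(line)
            item["response"] = my_model_generate(item["prompt"])
            fout.write(json.dumps(item, ensure_ascii=False) + "\n")

# e.g. build_submission("zh_all.jsonl", "zh_all_responses.jsonl")
# then upload the responses file to Codabench for scoring.
```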
