Better than GPT-4: a 2-billion-parameter model solves arithmetic problems with almost 100% accuracy

Large language models (LLMs) have demonstrated strong capabilities across a wide range of downstream NLP tasks. Pioneering models such as GPT-4 and ChatGPT, trained on vast amounts of text, have acquired robust text understanding and generation abilities: they produce coherent, contextually relevant responses and perform well across many NLP tasks.

However, LLMs' performance on mathematical reasoning remains unsatisfactory. They struggle to carry out complex arithmetic accurately, especially multiplication of numbers with more than 8 digits and operations involving decimals and fractions.

Motivated by this gap, researchers from Tsinghua University, TAL AI Lab, and Zhipu AI jointly proposed a new model, MathGLM, which performs complex arithmetic operations with near-perfect accuracy.

The study shows that, given sufficient training data, a 2-billion-parameter language model can perform multi-digit arithmetic with nearly 100% accuracy and without data leakage, far exceeding GPT-4, whose multi-digit multiplication accuracy is only 4.3%.

Method

The paper proposes MathGLM to explore how effective LLMs can be at mathematical reasoning.

The arithmetic tasks handled by MathGLM fall roughly into two categories: basic arithmetic operations and complex mixed operations. Basic arithmetic operations are simple calculations on two numbers, while complex mixed operations combine different arithmetic operations and number formats (integers, decimals, fractions, etc.). Table 1 shows the task classification.

To endow MathGLM with strong arithmetic capabilities, the authors adopt a decoder-only Transformer architecture and train it from scratch on a generated arithmetic dataset with an autoregressive objective.
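
To make the training objective concrete, here is a minimal sketch of how an arithmetic record could be turned into next-token prediction targets; the character-level vocabulary it uses is an illustrative assumption rather than the authors' tokenizer.

```python
# Minimal sketch of the autoregressive (next-token prediction) objective on an
# arithmetic string. The character-level vocabulary below (each digit is its
# own token) is an illustrative assumption, not the paper's actual tokenizer.
VOCAB = {ch: i for i, ch in enumerate("0123456789+-*/^.%()= ")}

def encode(example: str) -> list:
    """Map every character of an arithmetic record to an integer token id."""
    return [VOCAB[ch] for ch in example]

def next_token_pairs(example: str):
    """For every prefix of the sequence, the target is simply the next token."""
    ids = encode(example)
    return [(ids[:t], ids[t]) for t in range(1, len(ids))]

if __name__ == "__main__":
    # The decoder-only model is trained to maximize the likelihood of each
    # target given its prefix, digit by digit, over millions of such records.
    for prefix, target in next_token_pairs("35*21=735"):
        print(prefix, "->", target)
```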

Learning arithmetic tasks

The arithmetic training dataset is carefully designed to cover addition, subtraction, multiplication, division, and exponentiation, as well as multiple number formats: integers, decimals, percentages, fractions, and negative numbers. The datasets range in size from 1 million to 50 million records.

In each dataset, a single arithmetic expression consists of 2 to 10 steps, covering a range of mathematical operations such as addition (+), subtraction (-), multiplication (×), division (/), and exponentiation (^). Figure 3 shows some training examples extracted from the arithmetic dataset.
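
As a rough illustration of how such a corpus might be generated, the sketch below builds random multi-step expressions and their answers. The operand ranges and formatting choices are assumptions of this sketch rather than the authors' actual data pipeline, and exponentiation is deliberately omitted to keep values bounded.

```python
import random

# Rough sketch of a synthetic arithmetic-expression generator. The operator set
# and the 2-to-10-step range follow the paper's description; operand ranges,
# formatting, and the omission of exponentiation are assumptions of this sketch.
OPERATORS = ["+", "-", "*", "/"]

def random_operand() -> str:
    """Mix of number formats: integers, decimals, and negative numbers."""
    kind = random.choice(["int", "decimal", "negative"])
    if kind == "int":
        return str(random.randint(0, 10 ** random.randint(1, 5)))
    if kind == "decimal":
        return f"{random.uniform(0, 1000):.2f}"
    return str(-random.randint(1, 10 ** 4))

def random_expression(min_steps: int = 2, max_steps: int = 10) -> str:
    """Chain a random number of binary operations, occasionally parenthesized."""
    expr = random_operand()
    for _ in range(random.randint(min_steps, max_steps)):
        op = random.choice(OPERATORS)
        if random.random() < 0.3:
            expr = f"({expr}){op}{random_operand()}"
        else:
            expr = f"{expr}{op}{random_operand()}"
    return expr

def make_record() -> str:
    """Produce one 'expression=answer' training record."""
    expr = random_expression()
    try:
        answer = eval(expr)      # safe here: expressions are generated locally
    except ZeroDivisionError:
        return make_record()     # resample if a division by zero was produced
    return f"{expr}={round(answer, 6)}"

if __name__ == "__main__":
    for _ in range(3):
        print(make_record())
```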

Table 2 summarizes the four MathGLM model sizes: the largest has 2B parameters and the highest capacity, followed by 500M-, 100M-, and 10M-parameter variants.

Learning math word problems

In addition to arithmetic tasks, the paper also fine-tunes a series of Transformer-based language models, the General Language Model (GLM) and its chat variants, to solve math word problems. Training uses the public Chinese Ape210K dataset, which contains 210,000 Chinese elementary-school math problems, each annotated only with its directly computed final answer.

To improve MathGLM's performance on math word problems, the authors adopt a step-by-step strategy that reconstructs the Ape210K dataset into a version in which the answer to each problem is computed step by step. Figure 4 compares the original Ape210K dataset with the reconstructed version.
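
The sketch below illustrates the idea behind this reconstruction: it decomposes a solution expression into the intermediate calculation lines a step-by-step answer would contain. It is a simplified stand-in, not the authors' actual conversion script; the expression format and output style are assumptions.

```python
import ast
import operator

# Simplified stand-in for the step-by-step reconstruction: given the final
# solution expression of a word problem, emit the intermediate calculation
# lines a step-by-step answer would contain.
OPS = {
    ast.Add: (operator.add, "+"),
    ast.Sub: (operator.sub, "-"),
    ast.Mult: (operator.mul, "*"),
    ast.Div: (operator.truediv, "/"),
    ast.Pow: (operator.pow, "^"),
}

def step_by_step(expr: str):
    """Evaluate `expr` while recording every binary operation as one step."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            fn, symbol = OPS[type(node.op)]
            value = fn(left, right)
            steps.append(f"{left}{symbol}{right}={value}")
            return value
        raise ValueError(f"unsupported expression node: {ast.dump(node)}")

    tree = ast.parse(expr.replace("^", "**"), mode="eval")
    return steps, walk(tree.body)

if __name__ == "__main__":
    steps, answer = step_by_step("(10+2)*5-8/4")
    print("\n".join(steps))   # 10+2=12, 12*5=60, 8/4=2.0, 60-2.0=58.0
    print("answer:", answer)
```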

The paper trains MathGLM on several GLM backbones, including GLM-Large (335M parameters), GLM-6B, GLM2-6B, and GLM-10B, as well as ChatGLM-6B and ChatGLM2-6B. These backbones provide MathGLM with basic language understanding, enabling it to effectively parse the linguistic information contained in math word problems.

Experiments

The paper designs two types of experiments: arithmetic tasks and math word problems.

For arithmetic tasks, the authors pre-train a Transformer-based MathGLM with 500M parameters and compare it with leading LLMs such as GPT-4 and ChatGPT. As Table 3 shows, MathGLM outperforms all other models, demonstrating excellent performance on arithmetic tasks.

Even with just 10 million parameters, the results are striking: MathGLM-10M outperforms GPT-4 and ChatGPT on a comprehensive range of arithmetic tasks.

Furthermore, comparing MathGLM variants of different parameter sizes shows that arithmetic performance improves consistently as the number of parameters grows, indicating that arithmetic ability scales with model size.

In summary, the evaluation on complex arithmetic tasks shows that MathGLM performs excellently: by decomposing arithmetic tasks into steps, these models significantly exceed GPT-4 and ChatGPT.

The paper also compares MathGLM against GPT-4, ChatGPT, text-davinci-003, code-davinci-002, Galactica, LLaMA, OPT, BLOOM, and GLM on a compact arithmetic test set of 100 cases randomly drawn from the larger dataset discussed earlier. The results are shown in Table 4.

From these results, the 2-billion-parameter MathGLM achieves 93.03% accuracy, surpassing all other LLMs.
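
For reference, a scoring routine along the following lines could compute such an accuracy figure; the relative tolerance used below is an assumption, since the paper's exact matching rule is not reproduced here.

```python
# Minimal scoring sketch for an arithmetic test set. Counting an answer as
# correct when it falls within a small relative tolerance of the ground truth
# is an assumption here; the paper's exact scoring rule may differ.
def is_correct(predicted: float, truth: float, rel_tol: float = 1e-4) -> bool:
    if truth == 0:
        return abs(predicted) < rel_tol
    return abs(predicted - truth) / abs(truth) < rel_tol

def accuracy(predictions, truths) -> float:
    hits = sum(is_correct(p, t) for p, t in zip(predictions, truths))
    return hits / len(truths)

if __name__ == "__main__":
    # Two of the three predictions match within tolerance -> accuracy 0.666...
    print(accuracy([78.0, 58.0, 2.5], [78.0, 58.0, 2.4]))
```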

For math word problems, the paper conducts experiments on the Ape210K dataset. Table 8 reports results for the MathGLM variants, GPT-4, ChatGPT, and others.

The results show that, built on GLM-10B, MathGLM achieves answer accuracy comparable to the state-of-the-art GPT-4.

Furthermore, comparing MathGLM built on GLM-Large, GLM-6B, and GLM-10B reveals a clear trend: both arithmetic accuracy and answer accuracy improve significantly as the backbone grows.

To assess the ability to solve math problems across grade levels, the study evaluates several models on the K6 dataset: GPT-4, ChatGPT, Chinese-Alpaca-13B, MOSS-16B, Ziya-LLaMA-13B, Baichuan-7B, ChatGLM-6B, ChatGLM2-6B, and MathGLM-GLM-10B. The results are shown in Figure 8 below.

Paper address: https://arxiv.org/pdf/2309.03241v2.pdf

Project address: https://github.com/THUDM/MathGLM#arithmetic-tasks
