Better than GPT-4: a 2-billion-parameter model solves arithmetic problems with almost 100% accuracy

Large language models (LLMs) have demonstrated strong capabilities across a wide range of downstream NLP tasks. Pioneering models such as GPT-4 and ChatGPT, trained on vast amounts of text, have acquired robust text understanding and generation abilities: they produce coherent, contextually relevant responses and perform well across many NLP tasks.

However, LLMs' performance on mathematical reasoning remains unsatisfactory. They struggle to carry out complex arithmetic accurately, especially multiplication of numbers with more than 8 digits and operations involving decimals and fractions.

Motivated by this gap, researchers from Tsinghua University, TAL AI Lab, and Zhipu AI jointly proposed a new model, MathGLM, which performs complex arithmetic operations with near-perfect accuracy.

The study shows that, given sufficient training data, a 2-billion-parameter language model can perform multi-digit arithmetic with nearly 100% accuracy and without data leakage, far exceeding GPT-4, whose multi-digit multiplication accuracy is only 4.3%.

Method

The paper proposes MathGLM to explore how effective LLMs can be at mathematical reasoning.

The arithmetic tasks handled by MathGLM fall roughly into two categories: basic arithmetic operations and complex mixed operations. Basic arithmetic operations are simple calculations on two numbers, while complex mixed operations combine different arithmetic operations and number formats (integers, decimals, fractions, etc.). Table 1 shows the task classification.

To endow MathGLM with strong arithmetic capabilities, the authors adopt a decoder-only Transformer architecture and train it from scratch on a generated arithmetic dataset with an autoregressive objective.
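
To make the training objective concrete, here is a minimal sketch of how an arithmetic record could be turned into next-token prediction targets; the character-level vocabulary it uses is an illustrative assumption rather than the authors' tokenizer.

```python
# Minimal sketch of the autoregressive (next-token prediction) objective on an
# arithmetic string. The character-level vocabulary below (each digit is its
# own token) is an illustrative assumption, not the paper's actual tokenizer.
VOCAB = {ch: i for i, ch in enumerate("0123456789+-*/^.%()= ")}

def encode(example: str) -> list:
    """Map every character of an arithmetic record to an integer token id."""
    return [VOCAB[ch] for ch in example]

def next_token_pairs(example: str):
    """For every prefix of the sequence, the target is simply the next token."""
    ids = encode(example)
    return [(ids[:t], ids[t]) for t in range(1, len(ids))]

if __name__ == "__main__":
    # The decoder-only model is trained to maximize the likelihood of each
    # target given its prefix, digit by digit, over millions of such records.
    for prefix, target in next_token_pairs("35*21=735"):
        print(prefix, "->", target)
```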

Learning arithmetic tasks

The arithmetic training dataset is carefully designed to cover addition, subtraction, multiplication, division, and exponentiation, as well as multiple number formats: integers, decimals, percentages, fractions, and negative numbers. The datasets range in size from 1 million to 50 million records.

In each dataset, a single arithmetic expression consists of 2 to 10 steps, covering a range of mathematical operations such as addition (+), subtraction (-), multiplication (×), division (/), and exponentiation (^). Figure 3 shows some training examples extracted from the arithmetic dataset.
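
As a rough illustration of how such a corpus might be generated, the sketch below builds random multi-step expressions and their answers. The operand ranges and formatting choices are assumptions of this sketch rather than the authors' actual data pipeline, and exponentiation is deliberately omitted to keep values bounded.

```python
import random

# Rough sketch of a synthetic arithmetic-expression generator. The operator set
# and the 2-to-10-step range follow the paper's description; operand ranges,
# formatting, and the omission of exponentiation are assumptions of this sketch.
OPERATORS = ["+", "-", "*", "/"]

def random_operand() -> str:
    """Mix of number formats: integers, decimals, and negative numbers."""
    kind = random.choice(["int", "decimal", "negative"])
    if kind == "int":
        return str(random.randint(0, 10 ** random.randint(1, 5)))
    if kind == "decimal":
        return f"{random.uniform(0, 1000):.2f}"
    return str(-random.randint(1, 10 ** 4))

def random_expression(min_steps: int = 2, max_steps: int = 10) -> str:
    """Chain a random number of binary operations, occasionally parenthesized."""
    expr = random_operand()
    for _ in range(random.randint(min_steps, max_steps)):
        op = random.choice(OPERATORS)
        if random.random() < 0.3:
            expr = f"({expr}){op}{random_operand()}"
        else:
            expr = f"{expr}{op}{random_operand()}"
    return expr

def make_record() -> str:
    """Produce one 'expression=answer' training record."""
    expr = random_expression()
    try:
        answer = eval(expr)      # safe here: expressions are generated locally
    except ZeroDivisionError:
        return make_record()     # resample if a division by zero was produced
    return f"{expr}={round(answer, 6)}"

if __name__ == "__main__":
    for _ in range(3):
        print(make_record())
```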

Table 2 summarizes the four MathGLM model sizes: the largest has 2B parameters and the highest capacity, followed by 500M-, 100M-, and 10M-parameter variants.

Learning math word problems

In addition to arithmetic tasks, the paper also fine-tunes a series of Transformer-based language models, the General Language Model (GLM) and its chat variants, to solve math word problems. Training uses the public Chinese Ape210K dataset, which contains 210,000 Chinese elementary-school math problems, each annotated only with its directly computed final answer.

To improve MathGLM's performance on math word problems, the authors adopt a step-by-step strategy that reconstructs the Ape210K dataset into a version in which the answer to each problem is computed step by step. Figure 4 compares the original Ape210K dataset with the reconstructed version.
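
The sketch below illustrates the idea behind this reconstruction: it decomposes a solution expression into the intermediate calculation lines a step-by-step answer would contain. It is a simplified stand-in, not the authors' actual conversion script; the expression format and output style are assumptions.

```python
import ast
import operator

# Simplified stand-in for the step-by-step reconstruction: given the final
# solution expression of a word problem, emit the intermediate calculation
# lines a step-by-step answer would contain.
OPS = {
    ast.Add: (operator.add, "+"),
    ast.Sub: (operator.sub, "-"),
    ast.Mult: (operator.mul, "*"),
    ast.Div: (operator.truediv, "/"),
    ast.Pow: (operator.pow, "^"),
}

def step_by_step(expr: str):
    """Evaluate `expr` while recording every binary operation as one step."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            fn, symbol = OPS[type(node.op)]
            value = fn(left, right)
            steps.append(f"{left}{symbol}{right}={value}")
            return value
        raise ValueError(f"unsupported expression node: {ast.dump(node)}")

    tree = ast.parse(expr.replace("^", "**"), mode="eval")
    return steps, walk(tree.body)

if __name__ == "__main__":
    steps, answer = step_by_step("(10+2)*5-8/4")
    print("\n".join(steps))   # 10+2=12, 12*5=60, 8/4=2.0, 60-2.0=58.0
    print("answer:", answer)
```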

The paper trains MathGLM on several GLM backbones, including GLM-Large (335M parameters), GLM-6B, GLM2-6B, and GLM-10B, as well as ChatGLM-6B and ChatGLM2-6B. These backbones provide MathGLM with basic language understanding, enabling it to effectively parse the linguistic information contained in math word problems.

Experiments

The paper designs two types of experiments: arithmetic tasks and math word problems.

For arithmetic tasks, the authors pre-train a Transformer-based MathGLM with 500M parameters and compare it with leading LLMs such as GPT-4 and ChatGPT. As Table 3 shows, MathGLM outperforms all other models, demonstrating excellent performance on arithmetic tasks.

Even with just 10 million parameters, the results are striking: MathGLM-10M outperforms GPT-4 and ChatGPT on a comprehensive range of arithmetic tasks.

Furthermore, comparing MathGLM variants of different parameter sizes shows that arithmetic performance improves consistently as the number of parameters grows, indicating that arithmetic ability scales with model size.

In summary, the evaluation on complex arithmetic tasks shows that MathGLM performs excellently: by decomposing arithmetic tasks into steps, these models significantly exceed GPT-4 and ChatGPT.

The paper also compares MathGLM against GPT-4, ChatGPT, text-davinci-003, code-davinci-002, Galactica, LLaMA, OPT, BLOOM, and GLM on a compact arithmetic test set of 100 cases randomly drawn from the larger dataset discussed earlier. The results are shown in Table 4.

From these results, the 2-billion-parameter MathGLM achieves 93.03% accuracy, surpassing all other LLMs.
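
For reference, a scoring routine along the following lines could compute such an accuracy figure; the relative tolerance used below is an assumption, since the paper's exact matching rule is not reproduced here.

```python
# Minimal scoring sketch for an arithmetic test set. Counting an answer as
# correct when it falls within a small relative tolerance of the ground truth
# is an assumption here; the paper's exact scoring rule may differ.
def is_correct(predicted: float, truth: float, rel_tol: float = 1e-4) -> bool:
    if truth == 0:
        return abs(predicted) < rel_tol
    return abs(predicted - truth) / abs(truth) < rel_tol

def accuracy(predictions, truths) -> float:
    hits = sum(is_correct(p, t) for p, t in zip(predictions, truths))
    return hits / len(truths)

if __name__ == "__main__":
    # Two of the three predictions match within tolerance -> accuracy 0.666...
    print(accuracy([78.0, 58.0, 2.5], [78.0, 58.0, 2.4]))
```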

For math word problems, the paper conducts experiments on the Ape210K dataset. Table 8 reports results for the MathGLM variants, GPT-4, ChatGPT, and others.

The results show that, built on GLM-10B, MathGLM achieves answer accuracy comparable to the state-of-the-art GPT-4.

Furthermore, comparing MathGLM built on GLM-Large, GLM-6B, and GLM-10B reveals a clear trend: both arithmetic accuracy and answer accuracy improve significantly as the backbone grows.

To assess the ability to solve math problems across grade levels, the study evaluates several models on the K6 dataset: GPT-4, ChatGPT, Chinese-Alpaca-13B, MOSS-16B, Ziya-LLaMA-13B, Baichuan-7B, ChatGLM-6B, ChatGLM2-6B, and MathGLM-GLM-10B. The results are shown in Figure 8 below.

Paper address: https://arxiv.org/pdf/2309.03241v2.pdf

Project address: https://github.com/THUDM/MathGLM#arithmetic-tasks
