Recently, the artificial intelligence ChatGPT has become popular all over the world. People from all walks of life have posted their conversations with it: some of its answers are amazing, while others are confidently delivered nonsense. Some people use it to help with copywriting and code editing, trying to turn it into a capable assistant for human work. Why can ChatGPT act as a universal chat companion and answer nearly any question we ask?

Written by | Chen Qingyang

The chatbot ChatGPT launched by OpenAI has become a focus of discussion around the world. Built on the powerful GPT model, ChatGPT has extraordinary natural language generation capabilities. After being pre-trained on a large corpus of text, it can handle a wide range of natural language processing tasks. It can not only generate remarkably realistic text - papers, press releases, poems, code and more - according to user requirements, but can also answer questions on almost any topic, from astronomy to geography. Why is it so powerful? This article gives a brief introduction to the language model principles and the line of development behind ChatGPT.

Language modeling: a fundamental task

Behind ChatGPT is a powerful language model. What is a language model? Consider speech-to-text input, which we have all used. Languages are full of homophones: in Chinese, "鱿鱼" (squid) and "由于" (due to) are pronounced identically (yóuyú), so "The fried squid in this restaurant is really delicious!" sounds exactly the same as the nonsensical version with "due to" in place of "squid". How can a machine decide from the audio alone whether the speaker meant "squid" or "due to"? This is where language models come in.

The task of a language model is, given a sentence, to estimate the probability that the sentence would actually occur. A good language model gives a high probability to the first sentence (with "squid") and a low probability to the second (with "due to"). The comically bad machine translations that circulate online are, likewise, largely the result of the lack of a good language model.

So how can a language model accurately judge the probability of a given sentence? The machine needs to "read" a huge number of books so that it becomes thoroughly familiar with human language and with our habits of word choice and sentence construction. The technical problem is: how do we design an algorithm or program that learns these language patterns, so that the machine "understands" human language?

A simple method is the "bigram" model. The idea is to scan a large collection of documents and, for each word, count which words tend to follow it. For example, for the word "我" (I), after scanning many documents the machine may find that "想" (miss/want) follows it 30% of the time, "叫" (am called) 20% of the time, and so on. These frequencies serve as the machine's estimate of the probability of two words appearing together (their joint probability). If the vocabulary contains N words, then once learning is complete the machine has built a table with N rows and N columns, where entry [i][j] is the probability that word j appears immediately after word i.

With this table, the machine can act as a language model. Give it the sentence "我想你了" ("I miss you"), and suppose the table says the probability of "想" after "我" is 30%, the probability of "你" after "想" is 50%, and the probability of "了" after "你" is 10%. Then the overall probability of "我想你了" is 30% × 50% × 10% = 0.015 (ignoring, for simplicity, the probability of the first word).
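To make this concrete, here is a minimal Python sketch of a bigram language model, trained on a toy corpus of made-up English sentences (an illustration only, not how any real system is built; it uses no smoothing, so unseen word pairs simply get probability zero):

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count, for every word, how often each other word follows it."""
    follow_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follow_counts[prev][nxt] += 1
            totals[prev] += 1
    # Convert counts into conditional probabilities P(next | prev).
    return {prev: {nxt: c / totals[prev] for nxt, c in nxts.items()}
            for prev, nxts in follow_counts.items()}

def sentence_probability(model, sentence):
    """Multiply P(next | prev) over consecutive word pairs (first word ignored for simplicity)."""
    words = sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)  # unseen pairs get probability 0 (no smoothing)
    return prob

corpus = ["i miss you", "i call you", "i miss home", "you miss me"]
model = train_bigram(corpus)
print(sentence_probability(model, "i miss you"))    # relatively high
print(sentence_probability(model, "you call home")) # 0.0 - these pairs never appeared
```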
With this table, the machine can easily determine that, in the earlier example, the sentence with "鱿鱼" (squid) is far more probable than the one with "由于" (due to). But this simple method has many problems. In many cases, the probability of the next word does not depend only on the single word before it, but on the previous k words (some of which may matter more than others). Storage is also a problem: for word pairs the table is already N × N in size, and if we condition on the previous k words it grows exponentially with k. In addition, this method cannot automatically learn that two words are synonyms, so it generalizes poorly.

Other ways to build language models include manually constructing a knowledge base that records the relationships between words, or designing a set of grammatical rules for the language so that ungrammatical sentences can be ruled out. Indeed, one of the early schools of natural language processing relied on linguistics in this way, but it requires a great deal of manual annotation and lacks flexibility (it cannot, for instance, keep learning from new text). Therefore, as data accumulated and machine learning methods and computing power advanced - especially after the 2010s - methods based on deep artificial neural networks became mainstream. Instead of hand-written grammar rules or simple statistical models, deep neural networks let machines automatically learn, from massive amounts of natural language, the context of words and the intrinsic relationships between them. (Note: from here on, "neural network" means an artificial neural network, not a biological one. Machine learning with many-layered neural networks is called "deep learning".)

Feed-Forward Neural Network

To address these problems, Bengio and colleagues proposed in 2003 a two-layer feed-forward neural network language model [1]. It is a relatively simple network, far less complex than later architectures, but it demonstrated the effectiveness of neural networks for language modeling. The idea of the model can be summarized as follows: each word is associated with a feature vector (what we now call a word embedding) that can be learned; the probability of a word sequence is expressed as a function of the feature vectors of its words; and the feature vectors and the parameters of that function are learned together from data.

Consider two sentences. Sentence 1: The cat is walking in the bedroom. Sentence 2: A dog was running in a room. After enough training, "cat" and "dog", "bedroom" and "room", "walking" and "running" end up with similar feature vectors. So even if the machine has never seen sentence 2 in its training set, it will assign it a joint probability similar to that of sentence 1, and a small change in a feature vector has only a small effect on the final probability. Interestingly, the paper specifically mentions that in order to train such a "large model" (tiny by today's standards), the researchers designed a parallel training algorithm and trained it on 40 CPUs for three weeks.
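For illustration, here is a minimal PyTorch sketch of a Bengio-style feed-forward language model. It is a simplified stand-in, not the paper's actual architecture or code: the layer sizes and word indices are made up, and details such as the direct connection from the embeddings to the output layer are omitted.

```python
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    """Toy Bengio-style model: look up learned feature vectors for the previous k words,
    pass them through a hidden layer, and output a distribution over the next word."""
    def __init__(self, vocab_size, embed_dim=16, context_size=3, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # the learnable word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                            # context_ids: (batch, context_size)
        vectors = self.embed(context_ids).flatten(start_dim=1) # concatenate the k word vectors
        h = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.out(h), dim=-1)          # log P(next word | context)

model = TinyNeuralLM(vocab_size=10000)
context = torch.tensor([[23, 7, 991]])   # indices of the previous 3 words (hypothetical)
log_probs = model(context)               # scores for every possible next word
```

Because similar words end up with similar embedding vectors during training, the network naturally assigns reasonable probabilities to word sequences it has never seen, which is exactly the generalization the bigram table lacks.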
Transformer: The Origin of Large Models

With the development of deep learning and of computing power, newer network architectures and ever larger models were proposed. A milestone is the Transformer model [2], proposed by researchers at Google Brain in 2017 - it is the "T" in ChatGPT. By introducing the "self-attention mechanism" and "positional encoding", the Transformer can learn which words in a context deserve more "attention". As mentioned above, when predicting a word, different words in the context carry different weights and play different roles.

Take the following sentence as an example: "The dog was running in a room because it was hungry." Does "it" refer to the dog or to the room? Through the attention mechanism, the Transformer can determine that in this context "it" is strongly related to "dog" and only weakly related to "room". Concretely, the embedding (word vector) of each word in the context is compared with the vectors of the other words by taking inner products: the larger the inner product, the closer the two vectors are and the stronger the correlation. More broadly, the Transformer encodes a word according to its context, so the same word can receive different encodings (meanings) in different contexts.

The core calculation of self-attention is the inner product, carried out as batched matrix multiplication. Matrix multiplication is highly parallelizable, so long-range dependencies (long contexts) can be computed efficiently, and the high degree of parallelism also makes the model easy to scale up. In large models such as GPT-3, for example, the model can take into account a context of up to 2048 tokens. Returning to machine translation: a Transformer-based translator can take the whole sentence into account, so an ambiguous word or phrase is rendered according to its context rather than word by word - avoiding the kind of "excellent translation" mentioned earlier.

With such a powerful, flexible and efficient approach to language modeling, AI entered the era of large models.
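As a rough illustration of the inner-product computation at the heart of the self-attention just described, here is a minimal NumPy sketch. It is a simplification under assumptions: a single attention head, random stand-in vectors, and none of the learned query/key/value projections used in the real Transformer.

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention: each word vector attends to every word in the
    context via inner products, then becomes a weighted mix of the context."""
    scores = X @ X.T                                   # pairwise inner products: relevance of word j to word i
    scores = scores / np.sqrt(X.shape[1])              # scale by sqrt of the vector dimension
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stabilization for the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                                 # contextualized vectors

# 5 words, each represented by a 4-dimensional vector (random stand-ins for real embeddings)
X = np.random.randn(5, 4)
contextual = self_attention(X)   # same shape, but every vector now reflects its context
```

Everything here is matrix multiplication, which is why the computation parallelizes so well and scales to long contexts.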
Large Model + Pre-training: More General Intelligence

The Transformer set off a revolution in artificial intelligence, and new Transformer-based models soon followed, such as Google's BERT and OpenAI's GPT. If you are used to searching with Google, you have in fact benefited from this technology countless times (Google Search uses BERT, a Transformer-based model, to understand queries). The GPT (Generative Pre-trained Transformer) released by OpenAI in 2018 is the predecessor of ChatGPT.

Even after new architectures such as the Transformer appeared, natural language processing long kept the practice of training a separate model for each task (question answering, translation, and so on) - specialist models for specialist jobs. In 2018, the GPT paper Improving Language Understanding by Generative Pre-Training [3] pushed the generality of machine intelligence to a new level. The researchers argued that instead of training a different model for every language task, it is better to "pre-train" one general language model. This model does nothing task-specific; it is only responsible for building a general grasp of human language - precisely the language model described above, which judges whether a given sentence sounds natural or not. They found that once such a model has been pre-trained at scale, only a small amount of "fine-tuning" (brief additional training on a small task-specific dataset) is needed for it to adapt to a new task, and it then performs better than a model trained specifically for that task.

In the subsequent GPT-2 [4] and GPT-3 [5] (ChatGPT is a model of the GPT-3 family, fine-tuned for dialogue), the researchers kept scaling the model up - GPT-3 has 175 billion parameters and was trained on hundreds of billions of tokens of text - and something almost miraculous happened: GPT-3 could generate highly realistic sentences, at times more fluent than a human writer. Models of this scale are now called "large language models".

With GPT-3, the researchers went a step further with "in-context learning": after large-scale pre-training on general text, the model often does not even need fine-tuning for a special task - it learns on the spot. Given a few examples in the prompt (the context), the machine understands what you want and can perform the task quite well. With this, artificial intelligence took another step towards more general intelligence.
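The few-shot translation prompt below, adapted from an example in the GPT-3 paper [5], shows what learning from the context looks like in practice. The API client in the comments is purely hypothetical - it only marks where a real text-completion call would go.

```python
# A few-shot prompt in the style of the GPT-3 paper [5]: the model's weights are
# never updated - the examples inside the context are all the "training" it gets.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# With a large language model behind some text-completion API (hypothetical client,
# shown only as a placeholder), the model simply continues the pattern:
# completion = client.complete(model="some-large-language-model", prompt=prompt)
# print(completion)   # expected continuation: "fromage"
```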
Conclusion and future imagination

ChatGPT and large language models are not omnipotent; they still have all kinds of problems, and sometimes they give you a wrong answer. This is because such a system is a generative model based on a probability distribution: the text it produces is whatever its training data and your context make most probable, so there is no inherent guarantee that an answer is correct. Nevertheless, GPT-3 and ChatGPT are a big step towards general intelligence.

Let us imagine the future. Today's ChatGPT learns about the world mainly from human language and text, yet humans perceive the world in many dimensions: language and text are only one channel, and a great deal of information comes from images, video, even taste and smell. Will the ChatGPT of the future no longer stay at home, but appear as a robot - with cameras for eyes, a speaker for a mouth and machinery for limbs - go out into the world, interact with people and nature, and use that feedback to correct its understanding? When such a robot sees flowers and trees, mountains, rivers and seas, sunrises and sunsets, and the joys and sorrows of human life, could it express "emotion" and "love" in some way? Besides serving as an assistant, could AI also offer us emotional companionship? Let's wait and see.

Note: ChatGPT's core technology also includes Reinforcement Learning from Human Feedback (RLHF), which makes its answers more accurate and friendlier. This article has only briefly introduced the language-model background; for more, see the further reading and references.

Further reading
1. Training language models to follow instructions with human feedback
2. Illustrating Reinforcement Learning from Human Feedback (RLHF)
3. The Road to AGI: Large Language Model (LLM) Technical Essentials

References
[1] A Neural Probabilistic Language Model, https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[2] Attention Is All You Need, https://arxiv.org/abs/1706.03762
[3] Improving Language Understanding by Generative Pre-Training, https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
[4] Language Models are Unsupervised Multitask Learners, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[5] Language Models are Few-Shot Learners, https://arxiv.org/abs/2005.14165

This article is supported by the Science Popularization China Starry Sky Project. Produced by the Department of Science Popularization, China Association for Science and Technology; producer: China Science and Technology Press Co., Ltd. and Beijing Zhongke Xinghe Culture Media Co., Ltd.