Recently, the artificial intelligence ChatGPT has become popular all over the world. People from all walks of life have posted their conversations with it: some of its answers are amazing, while others are confidently delivered nonsense. Some people use it to help with copywriting and code editing, trying to turn it into a capable assistant for human work. Why can ChatGPT act as a universal chat companion and answer nearly any question we ask?

Written by | Chen Qingyang

The chatbot ChatGPT launched by OpenAI has become a focus of discussion around the world. Built on the powerful GPT model, ChatGPT has extraordinary natural language generation capabilities. After being pre-trained on a large corpus of text, it can handle a wide range of natural language processing tasks. It can not only generate remarkably realistic text - papers, press releases, poems, code and more - according to user requirements, but can also answer questions on almost any topic, from astronomy to geography. Why is it so powerful? This article gives a brief introduction to the language model principles and the line of development behind ChatGPT.

Language modeling: a fundamental task

Behind ChatGPT is a powerful language model. What is a language model? Consider speech-to-text input, which we have all used. Languages are full of homophones: in Chinese, "鱿鱼" (squid) and "由于" (due to) are pronounced identically (yóuyú), so "The fried squid in this restaurant is really delicious!" sounds exactly the same as the nonsensical version with "due to" in place of "squid". How can a machine decide from the audio alone whether the speaker meant "squid" or "due to"? This is where language models come in.

The task of a language model is, given a sentence, to estimate the probability that the sentence would actually occur. A good language model gives a high probability to the first sentence (with "squid") and a low probability to the second (with "due to"). The comically bad machine translations that circulate online are, likewise, largely the result of the lack of a good language model.

So how can a language model accurately judge the probability of a given sentence? The machine needs to "read" a huge number of books so that it becomes thoroughly familiar with human language and with our habits of word choice and sentence construction. The technical problem is: how do we design an algorithm or program that learns these language patterns, so that the machine "understands" human language?

A simple method is the "bigram" model. The idea is to scan a large collection of documents and, for each word, count which words tend to follow it. For example, for the word "我" (I), after scanning many documents the machine may find that "想" (miss/want) follows it 30% of the time, "叫" (am called) 20% of the time, and so on. These frequencies serve as the machine's estimate of the probability of two words appearing together (their joint probability). If the vocabulary contains N words, then once learning is complete the machine has built a table with N rows and N columns, where entry [i][j] is the probability that word j appears immediately after word i.

With this table, the machine can act as a language model. Give it the sentence "我想你了" ("I miss you"), and suppose the table says the probability of "想" after "我" is 30%, the probability of "你" after "想" is 50%, and the probability of "了" after "你" is 10%. Then the overall probability of "我想你了" is 30% × 50% × 10% = 0.015 (ignoring, for simplicity, the probability of the first word).
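To make this concrete, here is a minimal Python sketch of a bigram language model, trained on a toy corpus of made-up English sentences (an illustration only, not how any real system is built; it uses no smoothing, so unseen word pairs simply get probability zero):

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count, for every word, how often each other word follows it."""
    follow_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follow_counts[prev][nxt] += 1
            totals[prev] += 1
    # Convert counts into conditional probabilities P(next | prev).
    return {prev: {nxt: c / totals[prev] for nxt, c in nxts.items()}
            for prev, nxts in follow_counts.items()}

def sentence_probability(model, sentence):
    """Multiply P(next | prev) over consecutive word pairs (first word ignored for simplicity)."""
    words = sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)  # unseen pairs get probability 0 (no smoothing)
    return prob

corpus = ["i miss you", "i call you", "i miss home", "you miss me"]
model = train_bigram(corpus)
print(sentence_probability(model, "i miss you"))    # relatively high
print(sentence_probability(model, "you call home")) # 0.0 - these pairs never appeared
```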
With this table, the machine can easily determine that, in the earlier example, the sentence with "鱿鱼" (squid) is far more probable than the one with "由于" (due to). But this simple method has many problems. In many cases, the probability of the next word does not depend only on the single word before it, but on the previous k words (some of which may matter more than others). Storage is also a problem: for word pairs the table is already N × N in size, and if we condition on the previous k words it grows exponentially with k. In addition, this method cannot automatically learn that two words are synonyms, so it generalizes poorly.

Other ways to build language models include manually constructing a knowledge base that records the relationships between words, or designing a set of grammatical rules for the language so that ungrammatical sentences can be ruled out. Indeed, one of the early schools of natural language processing relied on linguistics in this way, but it requires a great deal of manual annotation and lacks flexibility (it cannot, for instance, keep learning from new text). Therefore, as data accumulated and machine learning methods and computing power advanced - especially after the 2010s - methods based on deep artificial neural networks became mainstream. Instead of hand-written grammar rules or simple statistical models, deep neural networks let machines automatically learn, from massive amounts of natural language, the context of words and the intrinsic relationships between them. (Note: from here on, "neural network" means an artificial neural network, not a biological one. Machine learning with many-layered neural networks is called "deep learning".)

Feed-Forward Neural Network

To address these problems, Bengio and colleagues proposed in 2003 a two-layer feed-forward neural network language model [1]. It is a relatively simple network, far less complex than later architectures, but it demonstrated the effectiveness of neural networks for language modeling. The idea of the model can be summarized as follows: each word is associated with a feature vector (what we now call a word embedding) that can be learned; the probability of a word sequence is expressed as a function of the feature vectors of its words; and the feature vectors and the parameters of that function are learned together from data.

Consider two sentences. Sentence 1: The cat is walking in the bedroom. Sentence 2: A dog was running in a room. After enough training, "cat" and "dog", "bedroom" and "room", "walking" and "running" end up with similar feature vectors. So even if the machine has never seen sentence 2 in its training set, it will assign it a joint probability similar to that of sentence 1, and a small change in a feature vector has only a small effect on the final probability. Interestingly, the paper specifically mentions that in order to train such a "large model" (tiny by today's standards), the researchers designed a parallel training algorithm and trained it on 40 CPUs for three weeks.
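For illustration, here is a minimal PyTorch sketch of a Bengio-style feed-forward language model. It is a simplified stand-in, not the paper's actual architecture or code: the layer sizes and word indices are made up, and details such as the direct connection from the embeddings to the output layer are omitted.

```python
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    """Toy Bengio-style model: look up learned feature vectors for the previous k words,
    pass them through a hidden layer, and output a distribution over the next word."""
    def __init__(self, vocab_size, embed_dim=16, context_size=3, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # the learnable word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                            # context_ids: (batch, context_size)
        vectors = self.embed(context_ids).flatten(start_dim=1) # concatenate the k word vectors
        h = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.out(h), dim=-1)          # log P(next word | context)

model = TinyNeuralLM(vocab_size=10000)
context = torch.tensor([[23, 7, 991]])   # indices of the previous 3 words (hypothetical)
log_probs = model(context)               # scores for every possible next word
```

Because similar words end up with similar embedding vectors during training, the network naturally assigns reasonable probabilities to word sequences it has never seen, which is exactly the generalization the bigram table lacks.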
Transformer: The Origin of Large Models

With the development of deep learning and of computing power, newer network architectures and ever larger models were proposed. A milestone is the Transformer model [2], proposed by researchers at Google Brain in 2017 - it is the "T" in ChatGPT. By introducing the "self-attention mechanism" and "positional encoding", the Transformer can learn which words in a context deserve more "attention". As mentioned above, when predicting a word, different words in the context carry different weights and play different roles.

Take the following sentence as an example: "The dog was running in a room because it was hungry." Does "it" refer to the dog or to the room? Through the attention mechanism, the Transformer can determine that in this context "it" is strongly related to "dog" and only weakly related to "room". Concretely, the embedding (word vector) of each word in the context is compared with the vectors of the other words by taking inner products: the larger the inner product, the closer the two vectors are and the stronger the correlation. More broadly, the Transformer encodes a word according to its context, so the same word can receive different encodings (meanings) in different contexts.

The core calculation of self-attention is the inner product, carried out as batched matrix multiplication. Matrix multiplication is highly parallelizable, so long-range dependencies (long contexts) can be computed efficiently, and the high degree of parallelism also makes the model easy to scale up. In large models such as GPT-3, for example, the model can take into account a context of up to 2048 tokens. Returning to machine translation: a Transformer-based translator can take the whole sentence into account, so an ambiguous word or phrase is rendered according to its context rather than word by word - avoiding the kind of "excellent translation" mentioned earlier.

With such a powerful, flexible and efficient approach to language modeling, AI entered the era of large models.
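As a rough illustration of the inner-product computation at the heart of the self-attention just described, here is a minimal NumPy sketch. It is a simplification under assumptions: a single attention head, random stand-in vectors, and none of the learned query/key/value projections used in the real Transformer.

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention: each word vector attends to every word in the
    context via inner products, then becomes a weighted mix of the context."""
    scores = X @ X.T                                   # pairwise inner products: relevance of word j to word i
    scores = scores / np.sqrt(X.shape[1])              # scale by sqrt of the vector dimension
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stabilization for the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                                 # contextualized vectors

# 5 words, each represented by a 4-dimensional vector (random stand-ins for real embeddings)
X = np.random.randn(5, 4)
contextual = self_attention(X)   # same shape, but every vector now reflects its context
```

Everything here is matrix multiplication, which is why the computation parallelizes so well and scales to long contexts.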
Large Model + Pre-training: More General Intelligence

The Transformer set off a revolution in artificial intelligence, and new Transformer-based models soon followed, such as Google's BERT and OpenAI's GPT. If you are used to searching with Google, you have in fact benefited from this technology countless times (Google Search uses BERT, a Transformer-based model, to understand queries). The GPT (Generative Pre-trained Transformer) released by OpenAI in 2018 is the predecessor of ChatGPT.

Even after new architectures such as the Transformer appeared, natural language processing long kept the practice of training a separate model for each task (question answering, translation, and so on) - specialist models for specialist jobs. In 2018, the GPT paper Improving Language Understanding by Generative Pre-Training [3] pushed the generality of machine intelligence to a new level. The researchers argued that instead of training a different model for every language task, it is better to "pre-train" one general language model. This model does nothing task-specific; it is only responsible for building a general grasp of human language - precisely the language model described above, which judges whether a given sentence sounds natural or not. They found that once such a model has been pre-trained at scale, only a small amount of "fine-tuning" (brief additional training on a small task-specific dataset) is needed for it to adapt to a new task, and it then performs better than a model trained specifically for that task.

In the subsequent GPT-2 [4] and GPT-3 [5] (ChatGPT is a model of the GPT-3 family, fine-tuned for dialogue), the researchers kept scaling the model up - GPT-3 has 175 billion parameters and was trained on hundreds of billions of tokens of text - and something almost miraculous happened: GPT-3 could generate highly realistic sentences, at times more fluent than a human writer. Models of this scale are now called "large language models".

With GPT-3, the researchers went a step further with "in-context learning": after large-scale pre-training on general text, the model often does not even need fine-tuning for a special task - it learns on the spot. Given a few examples in the prompt (the context), the machine understands what you want and can perform the task quite well. With this, artificial intelligence took another step towards more general intelligence.
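The few-shot translation prompt below, adapted from an example in the GPT-3 paper [5], shows what learning from the context looks like in practice. The API client in the comments is purely hypothetical - it only marks where a real text-completion call would go.

```python
# A few-shot prompt in the style of the GPT-3 paper [5]: the model's weights are
# never updated - the examples inside the context are all the "training" it gets.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# With a large language model behind some text-completion API (hypothetical client,
# shown only as a placeholder), the model simply continues the pattern:
# completion = client.complete(model="some-large-language-model", prompt=prompt)
# print(completion)   # expected continuation: "fromage"
```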
Conclusion and future imagination

ChatGPT and large language models are not omnipotent; they still have all kinds of problems, and sometimes they give you a wrong answer. This is because such a system is a generative model based on a probability distribution: the text it produces is whatever its training data and your context make most probable, so there is no inherent guarantee that an answer is correct. Nevertheless, GPT-3 and ChatGPT are a big step towards general intelligence.

Let us imagine the future. Today's ChatGPT learns about the world mainly from human language and text, yet humans perceive the world in many dimensions: language and text are only one channel, and a great deal of information comes from images, video, even taste and smell. Will the ChatGPT of the future no longer stay at home, but appear as a robot - with cameras for eyes, a speaker for a mouth and machinery for limbs - go out into the world, interact with people and nature, and use that feedback to correct its understanding? When such a robot sees flowers and trees, mountains, rivers and seas, sunrises and sunsets, and the joys and sorrows of human life, could it express "emotion" and "love" in some way? Besides serving as an assistant, could AI also offer us emotional companionship? Let's wait and see.

Note: ChatGPT's core technology also includes Reinforcement Learning from Human Feedback (RLHF), which makes its answers more accurate and friendlier. This article has only briefly introduced the language-model background; for more, see the further reading and references.

Further reading
1. Training language models to follow instructions with human feedback
2. Illustrating Reinforcement Learning from Human Feedback (RLHF)
3. The Road to AGI: Large Language Model (LLM) Technical Essentials

References
[1] A Neural Probabilistic Language Model, https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[2] Attention Is All You Need, https://arxiv.org/abs/1706.03762
[3] Improving Language Understanding by Generative Pre-Training, https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
[4] Language Models are Unsupervised Multitask Learners, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[5] Language Models are Few-Shot Learners, https://arxiv.org/abs/2005.14165

This article is supported by the Science Popularization China Starry Sky Project. Produced by the Department of Science Popularization, China Association for Science and Technology; producer: China Science and Technology Press Co., Ltd. and Beijing Zhongke Xinghe Culture Media Co., Ltd.