In-depth | A full explanation of the principles of Toutiao’s recommendation algorithm

Algorithmic distribution has gradually become a standard feature of almost all software, including information platforms, search engines, browsers, and social apps. At the same time, it has begun to face doubts, challenges, and misunderstandings.

In January 2018, Dr. Cao Huanhuan, a senior algorithm architect at Toutiao, publicly disclosed the principles of Toutiao's algorithm for the first time, hoping to invite the whole industry to examine the algorithm and offer advice, and to clear up misunderstandings about it by making it transparent.

Toutiao's information recommendation algorithm has reportedly undergone four major revisions since the first version went into operation in September 2012, and it currently serves hundreds of millions of users around the world.

The following is Cao Huanhuan’s talk on “The Principles of Toutiao’s Algorithm” (published with authorization):

This talk gives an overview of Toutiao's recommendation system and then covers the principles of content analysis, user tags, evaluation and analysis, and content security.

1. System Overview

Described formally, a recommendation system is a function that fits a user's satisfaction with content. This function takes input variables along three dimensions.

The first dimension is content. Toutiao has become a comprehensive content platform that carries articles with pictures and text, videos, UGC short videos, Q&A, and Micro Toutiao. Each type of content has its own characteristics, and we need to consider how to extract features from each type in order to recommend it well. The second dimension is user features, including interest tags, occupation, age, gender, and so on, as well as many implicit user interests captured by the model. The third dimension is environmental features. These are characteristic of recommendation in the mobile Internet era: users move around at any time, and their information preferences shift across scenarios such as work, commuting, and travel.

Combining these three dimensions, the model produces an estimate: a prediction of whether the recommended content suits this user in this scenario.

This raises another question: how do we bring in goals that cannot be measured directly?

In the recommendation model, click-through rate, reading time, likes, comments, and reposts are all quantifiable goals. The model can fit them directly and produce estimates, and the online lift shows whether it is doing well. However, a large-scale recommendation system that serves a very large number of users cannot be evaluated by metrics alone. It is also important to bring in factors beyond data metrics.

One example is frequency control for advertisements and special content. Q&A cards are a special form of content: the goal of recommending them is not only to have users browse, but also to attract users to answer and contribute content to the community. How to mix such content with ordinary content, and how to control its frequency, both need to be considered.

In addition, out of consideration for the content ecosystem and social responsibility, the platform needs to intervene in the content: suppressing vulgar content, clickbait, and low-quality content; pinning, boosting, and force-inserting important news; and demoting content from low-grade accounts. The algorithm cannot accomplish these things on its own, so further intervention is required.

Below I will briefly describe how we implement the algorithm goals described above.

The function described above, y = F(Xi, Xu, Xc), is a classic supervised learning problem. There are many feasible approaches, such as traditional collaborative filtering models, the supervised Logistic Regression (LR) model, deep-learning-based models, Factorization Machines (FM), and GBDT.
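As a minimal sketch of what y = F(Xi, Xu, Xc) can look like, the example below uses a plain logistic regression over the concatenated content, user, and context features. The feature vectors, weights, and the function name `predict_satisfaction` are hypothetical illustrations, not Toutiao's actual model.

```python
import math

def predict_satisfaction(x_item, x_user, x_context, weights, bias=0.0):
    """Logistic-regression form of y = F(Xi, Xu, Xc): a score in (0, 1)."""
    # Concatenate the three feature groups into one input vector.
    features = list(x_item) + list(x_user) + list(x_context)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy input: two content features, two user features, one context feature.
x_item = [1.0, 0.0]
x_user = [0.0, 1.0]
x_context = [1.0]
weights = [0.8, -0.2, 0.1, 0.5, -0.3]  # assumed, would normally be learned

score = predict_satisfaction(x_item, x_user, x_context, weights)
print(round(score, 3))
```

Any of the other model families named above could stand behind the same interface; only the body of the function would change.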

An excellent industrial-grade recommendation system needs a very flexible algorithm experiment platform that supports combinations of multiple algorithms, including adjustments to the model structure, because it is hard for any single model architecture to suit every recommendation scenario. Combining LR and DNN is now very popular, and a few years ago Facebook combined LR with GBDT. Several Toutiao products share the same powerful recommendation system, but the model architecture is adjusted for each business scenario.

After the model, let’s take a look at the typical recommendation features. There are four main types of features that play a relatively important role in recommendation.

The first category is relevance features, which evaluate the attributes of the content and whether it matches the user. Explicit matching includes keyword matching, category matching, source matching, topic matching, etc. There are also some implicit matches in the FM model, which can be derived from the distance between the user vector and the content vector.
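To make the two kinds of matching concrete, here is a small sketch: explicit matching counts shared tags, and implicit matching compares a user vector with a content vector. Cosine similarity is used here as one possible closeness measure; the talk does not say which measure is used in production, and the vectors are made up.

```python
import math

def explicit_match(user_tags, item_tags):
    """Explicit matching: number of tags the user and the item share."""
    return len(set(user_tags) & set(item_tags))

def cosine_similarity(u, v):
    """Implicit matching: closeness of two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no direction to compare
    return dot / (norm_u * norm_v)

# Hypothetical data for illustration only.
print(explicit_match({"technology", "Meizu"}, {"Meizu", "phones"}))
print(cosine_similarity([0.2, 0.9], [0.1, 0.8]))
```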

The second category is environmental features, including geographical location and time. These serve both as bias features and as building blocks for matching features.

The third category is popularity features, including global popularity, category popularity, topic popularity, and keyword popularity. Content popularity information is very effective in large recommendation systems, especially for user cold start.

The fourth category is collaborative features, which can partially help with the so-called problem of recommendations becoming narrower and narrower. Collaborative features do not rely on the user's own history. Instead, they analyze the similarity between different users through their behavior, such as click similarity, interest category similarity, topic similarity, interest word similarity, and even vector similarity, and in this way expand the model's ability to explore.
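One simple way to picture click similarity is the sketch below: it measures the overlap of two users' click sets with the Jaccard index and proposes items that similar users clicked but the target user has not seen. The Jaccard index, the threshold, and the data are assumptions for illustration; the talk only names the kinds of similarity used.

```python
def jaccard(a, b):
    """Overlap of two click sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def collaborative_candidates(target, clicks_by_user, min_similarity=0.3):
    """Items clicked by users similar to `target` that `target` has not clicked."""
    seen = set(clicks_by_user[target])
    scores = {}
    for user, clicks in clicks_by_user.items():
        if user == target:
            continue
        sim = jaccard(seen, clicks)
        if sim < min_similarity:
            continue  # not similar enough to learn from
        for item in set(clicks) - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest accumulated similarity first; item id breaks ties.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical click logs.
clicks = {"a": ["n1", "n2"], "b": ["n1", "n2", "n3"], "c": ["n9"]}
print(collaborative_candidates("a", clicks))
```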

In terms of model training, most of Toutiao’s recommendation products use real-time training. Real-time training saves resources and gives fast feedback, which is very important for feed products: user behavior is captured quickly by the model and reflected in the recommendations of the next refresh. We currently process sample data online in real time on a Storm cluster, covering action types such as clicks, impressions, favorites, and shares. The model parameter server is a high-performance system developed in-house. Because Toutiao's data volume grew so fast, comparable open source systems could not meet our stability and performance requirements, so we made many targeted low-level optimizations in our own system and built complete operations tooling that fits our business scenarios better.

At present, Toutiao's recommendation model is among the larger ones in the world, containing tens of billions of raw features and billions of vector features. The overall training process is as follows: the online servers record real-time features and write them to a Kafka queue; a Storm cluster consumes the Kafka data; the client sends back the labels for the recommended items, which are used to construct training samples; online training then runs on the latest samples to update the model parameters; finally, the online model is updated. The main delay in this process is the user's feedback delay, because a user may not read an article immediately after it is recommended. Excluding that, the entire system is almost real-time.
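The core idea of online training, updating parameters one sample at a time as feedback arrives, can be sketched with a tiny logistic regression trained by stochastic gradient descent. This is only an illustration under assumed names and data: the in-memory list stands in for the sample stream, and nothing here reflects the internals of Toutiao's parameter server.

```python
import math

class OnlineLogisticRegression:
    """Sparse logistic regression updated one sample at a time (SGD)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight

    def predict(self, features):
        z = sum(self.weights.get(name, 0.0) * value
                for name, value in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        # Gradient of the log loss with respect to each active weight.
        error = self.predict(features) - label
        for name, value in features.items():
            old = self.weights.get(name, 0.0)
            self.weights[name] = old - self.learning_rate * error * value

# Hypothetical stream of (features, clicked) samples.
stream = [
    ({"category=tech": 1.0}, 1),
    ({"category=gossip": 1.0}, 0),
] * 50

model = OnlineLogisticRegression()
for features, label in stream:
    model.update(features, label)

print(model.predict({"category=tech": 1.0}) > 0.5)
```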

However, Toutiao's content volume is now very large, with tens of millions of short videos alone, so the recommendation system cannot run the model over all of the content. We therefore need recall strategies that, on each request, select a candidate set of a few thousand items from the massive content pool. The most important requirement for a recall strategy is extreme performance: the timeout generally cannot exceed 50 milliseconds.

There are many types of recall strategies, and we mainly use an inverted-index approach. An inverted index is maintained offline. Its keys can be categories, topics, entities, sources, and so on, and the lists are sorted by popularity, freshness, user actions, and so on. Online recall can then quickly truncate the inverted lists according to the user's interest tags and efficiently pick a small set of more reliable content out of a very large content pool.
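The inverted-index recall described above can be sketched as follows. The item fields (`id`, `keys`, `score`) and the single precomputed `score` standing in for popularity, freshness, and actions are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(items):
    """Offline step: map each key (category, topic, entity, source) to a sorted list."""
    index = defaultdict(list)
    for item in items:
        for key in item["keys"]:
            index[key].append(item)
    for posting in index.values():
        posting.sort(key=lambda it: it["score"], reverse=True)
    return index

def recall(index, interest_tags, per_tag=2):
    """Online step: truncate each matching list and merge without duplicates."""
    seen, result = set(), []
    for tag in interest_tags:
        for item in index.get(tag, [])[:per_tag]:
            if item["id"] not in seen:
                seen.add(item["id"])
                result.append(item["id"])
    return result

# Hypothetical content pool.
items = [
    {"id": "a1", "keys": ["tech"], "score": 0.9},
    {"id": "a2", "keys": ["tech", "meizu"], "score": 0.7},
    {"id": "a3", "keys": ["tech"], "score": 0.4},
    {"id": "a4", "keys": ["sports"], "score": 0.8},
]
index = build_inverted_index(items)
print(recall(index, ["tech", "meizu"]))
```

Because the sorting happens offline, the online step is only a few list slices, which is what makes a tight latency budget realistic.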

2. Content Analysis

Content analysis includes text analysis, image analysis, and video analysis. Toutiao started out with news, so today I will mainly talk about text analysis. A very important role of text analysis in a recommendation system is user interest modeling. Without content and text tags, user interest tags cannot be obtained. For example, only when we know that an article is tagged "Internet" and that the user read it can we know that the user has the "Internet" tag. The same is true for other keywords.

On the other hand, text content tags can be used directly as recommendation features. For example, Meizu content can be recommended to users who follow Meizu, which is a user tag match. If the main feed performs poorly for a while and its recommendations become narrow, users find that after reading in specific channels (such as technology, sports, entertainment, or military) and then returning to the main feed, the recommendations improve. This is because the whole model is connected, and a sub-channel has a smaller exploration space, which makes it easier to meet user needs. It is hard to improve recommendation accuracy through feedback from a single channel alone, so doing sub-channels well matters, and that in turn requires good content analysis.

The picture above is an actual text case from Toutiao. As you can see, this article has text features such as category, keywords, topic, and entity words. Of course, this does not mean a recommendation system cannot work without text features. Recommendation systems were already in use in the early days of Amazon and even in the Walmart era, and Netflix's video recommendation also applied collaborative filtering directly, without text features. For news products, however, most content is consumed on the day it is published. Without text features, cold-starting new content is very difficult, and collaborative features cannot solve the article cold-start problem.

The main text features extracted by Toutiao's recommendation system fall into the following categories. The first is semantic tag features, which explicitly label an article with semantic tags. These labels are defined by people: each label has a clear meaning, and the label system is predefined. In addition, there are implicit semantic features, mainly topic features and keyword features. Topic features describe a probability distribution over words and have no explicit meaning, while keyword features are based on a unified feature description and have no predefined set.

In addition, text similarity features are also very important. One of the most common complaints from Toutiao users used to be that repetitive content kept being recommended. The difficulty is that everyone defines repetition differently. For example, some people consider an article about Real Madrid and Barcelona repetitive because they saw similar content yesterday and today's article covers the same two teams again. But a heavy football fan, especially a Barcelona fan, cannot wait to read every report. To solve this problem, we need to judge the topic, wording, and main subject of similar articles and build online strategies on those features.

Similarly, there are spatiotemporal features, which capture the location and timeliness of the content. For example, it may not make sense to push news about traffic restrictions in Wuhan to users in Beijing. Finally, we also need quality-related features that determine whether the content is vulgar or pornographic, or whether it is an advertorial or "chicken soup" inspirational writing.

The above picture shows the features and usage scenarios of Toutiao semantic tags. They have different levels and requirements.

The goal of classification is coverage: we want every piece of content and every video to have a category. The entity system, by contrast, requires precision: the same name or content must be clearly resolved to the person or thing it refers to, but full coverage is not required. The concept system is responsible for the semantics of more precise but abstract concepts. This was our initial division. In practice, we found that categories and concepts are technically interchangeable, so we later unified them under a single technical architecture.

At present, implicit semantic features already help recommendations very well, whereas semantic tags need continuous annotation. As new terms and concepts keep emerging, the annotations must keep iterating too. The difficulty and resource investment required to do this well are far greater than for implicit semantic features, so why do we need semantic tags at all? One reason is product requirements: channels, for example, need clearly defined categories and an easy-to-understand text label system. The quality of semantic tagging is also a touchstone for a company’s NLP capability.

The online classification in Toutiao's recommendation system uses a typical hierarchical text classification algorithm. At the top is Root, and the first level below it contains broad categories such as technology, sports, finance, and entertainment. Sports is further divided into football, basketball, table tennis, tennis, athletics, swimming, and so on. Football is further divided into international football and Chinese football. Chinese football is further divided into China League One, China Super League, the national team, and so on. Compared with a single classifier, a hierarchical text classification algorithm handles data skew better. There are some exceptions: where we want to improve recall, you can see that we have added some cross-level connections ("flying wires"). The architecture is general, but depending on the difficulty of the problem, each meta-classifier can be heterogeneous. For example, for some categories an SVM works very well, some need to be combined with a CNN, and some need further processing with an RNN.
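The routing idea behind hierarchical classification can be sketched as a tree in which every internal node owns its own classifier and passes the text down to the child it picks. In this sketch the per-node classifiers are toy keyword rules standing in for the SVM, CNN, or RNN models mentioned above; the tree, the rules, and the function names are all hypothetical.

```python
def keyword_classifier(rules, default):
    """Toy node classifier: the first label whose keyword appears in the text wins."""
    def classify(text):
        for keyword, label in rules:
            if keyword in text:
                return label
        return default
    return classify

# Each internal node has its own (possibly different) classifier over its children.
TREE = {
    "root": keyword_classifier(
        [("football", "sports"), ("basketball", "sports"), ("stock", "finance")],
        "other"),
    "sports": keyword_classifier(
        [("football", "football"), ("basketball", "basketball")],
        "other sports"),
    "football": keyword_classifier(
        [("Super League", "Chinese football")],
        "international football"),
}

def classify_path(text, tree=TREE):
    """Walk from the root to a leaf, recording the label chosen at each level."""
    path, node = [], "root"
    while node in tree:
        node = tree[node](text)
        path.append(node)
    return path

print(classify_path("China Super League football round-up"))
```

Since each node only has to separate its own children, a rare leaf category competes with a handful of siblings rather than with every category at once, which is one way to read the data-skew benefit described above.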

The above picture shows an example of the entity word recognition algorithm. Candidates are selected based on word segmentation and part-of-speech tagging. During this process, some words may need to be joined based on the knowledge base: some entities are combinations of several words, and we have to determine which combination of words maps to an entity description. If the result maps to multiple entities, they are disambiguated using word vectors, topic distribution, and even word frequency itself, and finally a relevance model is computed.

3. User Tags

Content analysis and user tags are the two cornerstones of a recommendation system. Content analysis involves more machine learning; for user tags, by comparison, the engineering challenge is greater.

Common user tags on Toutiao include the categories and topics a user is interested in, keywords, sources, interest-based user clusters, and various vertical interest features (car models, sports teams, stocks, and so on), as well as information such as gender, age, and location. Gender comes from the third-party social account the user logs in with. Age is usually predicted by a model, estimated from signals such as device model and reading-time distribution. The permanent location comes from location information the user has authorized us to access; traditional clustering methods are applied to that information to obtain the permanent location points. Combined with other information, the permanent location can be used to infer the user's work location, business trip locations, and travel destinations. These user tags are very helpful for recommendations.

Of course, the simplest user tags are the tags of the content a user has browsed, but some data processing strategies are involved. The main ones are: 1. Noise filtering. Clicks with a short dwell time are used to filter out clickbait. 2. Hot-topic penalty. A user's actions on very popular articles (such as the PG One news some time ago) are down-weighted; in theory, the more widely a piece of content spreads, the lower the confidence it carries as an interest signal. 3. Time decay. User interests drift, so the strategy favors recent behavior: as user actions accumulate, old feature weights decay over time and the weights contributed by new actions are larger. 4. Impression penalty. If an article recommended to a user is not clicked, the weights of the related features (category, keyword, source) are penalized. At the same time, the overall context must be considered, such as whether many related items were pushed, along with related dismiss and dislike signals.
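The four strategies above can be combined into a single weight update for one user tag, as in the sketch below. The half-life, dwell-time threshold, penalty size, and the `global_popularity` field (assumed to lie in [0, 1]) are all invented constants for illustration; the talk gives no concrete values or formulas.

```python
HALF_LIFE_DAYS = 30.0      # assumed: a weight halves after 30 days without activity
MIN_DWELL_SECONDS = 5.0    # assumed: shorter reads are treated as noise
IMPRESSION_PENALTY = 0.05  # assumed: cost of being shown but not clicked

def decay(weight, days_elapsed):
    """Time decay: exponentially shrink an old weight."""
    return weight * 0.5 ** (days_elapsed / HALF_LIFE_DAYS)

def update_tag_weight(weight, days_since_update, event):
    """Apply decay, then one click or impression event, to a tag weight."""
    weight = decay(weight, days_since_update)
    if event["type"] == "click":
        if event["dwell_seconds"] < MIN_DWELL_SECONDS:
            return weight  # noise filtering: likely a clickbait bounce
        # Hot-topic penalty: very popular content contributes a smaller gain.
        weight += 1.0 - event["global_popularity"]
    elif event["type"] == "impression":
        # Impression penalty: shown but not clicked.
        weight = max(0.0, weight - IMPRESSION_PENALTY)
    return weight

click = {"type": "click", "dwell_seconds": 40.0, "global_popularity": 0.2}
print(update_tag_weight(1.0, 30.0, click))
```

Because each event touches only the tags it concerns, an update of this shape can run per event in a streaming job instead of as a daily batch.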

User tag mining is relatively simple in itself; the main challenge is the engineering challenge just mentioned. The first version of Toutiao's user tags was a batch computing framework with a fairly simple process: every day it extracted the past two months of action data for the previous day's daily active users and computed the results in batch on a Hadoop cluster.

The problem was that, with the rapid growth in users, the number of interest models and other batch jobs kept increasing, and the amount of computation became too large. By 2014, the Hadoop job that batch-updated tags for millions of users could barely finish within the same day. The shortage of cluster computing resources easily affected other work, the pressure of concentrated writes on the distributed storage system kept growing, and the delay in updating user interest tags got longer and longer.

Facing these challenges, Toutiao launched a streaming computation system for user tags on a Storm cluster at the end of 2014. In streaming mode, tags are updated whenever a user action arrives. The CPU cost is relatively small: the system saves 80% of CPU time and greatly reduces computing resource overhead. Only dozens of machines are needed to support daily interest model updates for tens of millions of users, and features are updated very quickly, essentially in near real time. This system has been in use ever since it went online.

Of course, we also found that not all user tags require a streaming system. Information such as the user's gender, age, and permanent location does not need to be recalculated in real time and can still be updated daily.

4. Evaluation and Analysis

The above introduces the overall architecture of the recommendation system. So how do we evaluate the effectiveness of the recommendation?

There is a saying that I think is very wise, "If you can't measure something, you can't optimize it." The same goes for recommendation systems.

In fact, many factors affect recommendation quality: changes in the candidate set, improvements or additions to recall modules, new recommendation features, improvements to the model architecture, tuning of algorithm parameters, and so on. Evaluation matters because many optimizations may end up having a negative effect; results do not necessarily improve once an optimization is launched.

Evaluating a recommendation system comprehensively requires a complete evaluation framework, a powerful experiment platform, and easy-to-use experiment analysis tools. A complete framework means that quality is not measured by a single metric. We cannot look only at click-through rate or only at time spent; a comprehensive evaluation is needed. Over the past few years we have been trying to combine as many metrics as possible into one single evaluation metric, but we are still exploring. At present, our launches are still decided after in-depth discussion by a review committee made up of the more senior colleagues from each business.

The reason many companies do poorly at algorithm work is not that their engineers lack ability, but that they lack a powerful experiment platform and convenient experiment analysis tools that can intelligently analyze the confidence of data metrics.

The establishment of a good evaluation system needs to follow several principles, the first of which is to take into account both short-term and long-term indicators. When I was in charge of e-commerce in my previous company, I observed that many strategy adjustments seemed fresh to users in the short term, but were actually of no help in the long run.

Secondly, we must take into account both user metrics and ecosystem metrics. As a content creation platform, Toutiao must provide value to content creators so that they can create with more dignity, and it also has an obligation to satisfy users. The two must be balanced. The interests of advertisers must be considered as well. This is a process of multi-party negotiation and balance.

In addition, attention should be paid to the impact of synergistic effects. Strict traffic isolation is difficult to achieve in experiments, and attention should be paid to external effects.

A very direct advantage of a powerful experiment platform is that when many experiments are online at the same time, the platform allocates traffic automatically without manual coordination, and traffic is reclaimed as soon as an experiment ends, which improves management efficiency. This helps a company reduce analysis costs, speed up algorithm iteration, and move the algorithm optimization of the whole system forward quickly.

This is the basic principle of Toutiao’s A/B test experiment system. First, users are bucketed offline; then experiment traffic is allocated online, the users in the buckets are labeled, and they are assigned to experiment groups. For example, to run an experiment on 10% of traffic with two groups of 5% each, one 5% group is the baseline, using the same strategy as the main online traffic, and the other uses the new strategy.

During the experiment, user actions are collected in near real time, and results can be viewed hourly. Because hourly data fluctuates, however, results are usually read on a daily basis. After collection, the actions go through log processing and distributed statistics and are written to a database, which is very convenient.
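A common way to implement this kind of bucketing, shown here only as a sketch, is to hash the user ID into a fixed number of buckets and map bucket ranges to experiment groups. The hash function, the salt, the bucket count, and the group names are assumptions; the talk says only that users are bucketed offline and then assigned to groups.

```python
import hashlib

NUM_BUCKETS = 100  # assumed: each bucket is 1% of traffic

def bucket_of(user_id, salt="experiment-layer-1"):
    """Deterministically map a user to one of NUM_BUCKETS buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assign_group(user_id):
    """10% experiment: buckets 0-4 are the baseline, buckets 5-9 get the new strategy."""
    bucket = bucket_of(user_id)
    if bucket < 5:
        return "baseline"
    if bucket < 10:
        return "new_strategy"
    return None  # not part of this experiment

groups = [assign_group(f"user-{i}") for i in range(10000)]
print(groups.count("baseline"), groups.count("new_strategy"))
```

Changing the salt reshuffles users across buckets, which is one way to keep consecutive experiments from reusing exactly the same user groups.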

In this system, engineers only need to set traffic requirements, experimental time, define special filtering conditions, and customize the experimental group ID. The system can automatically generate: experimental data comparison, experimental data confidence, experimental conclusion summary and experimental optimization suggestions.
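For a rate metric such as click-through rate, one standard way to compute the confidence of a comparison is a two-proportion z-test, sketched below. The talk does not say which statistical test Toutiao's platform uses, so this is an assumed stand-in, and the counts are invented.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / views_a + 1.0 / views_b))
    if se == 0.0:
        return 0.0, 1.0  # all clicks or no clicks: no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical daily totals: baseline group vs. new-strategy group.
z, p_value = two_proportion_z_test(1000, 10000, 1200, 10000)
print(z > 0, p_value < 0.05)
```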

Of course, an experimental platform alone is far from enough. Online experimental platforms can only infer changes in user experience through changes in data indicators, but there are differences between data indicators and user experience, and many indicators cannot be fully quantified. Many improvements still require manual analysis, and major improvements require manual evaluation and secondary confirmation.

5. Content Security

Finally, I would like to introduce some of Toutiao’s initiatives in content security. Toutiao is now the largest content creation and distribution company in China, and must pay more and more attention to its social responsibilities and the responsibilities of an industry leader. If there is a problem with 1% of the recommended content, it will have a big impact.

Therefore, Toutiao has made content security the company’s top priority since its founding. A dedicated review team responsible for content security was set up at the very beginning. At that time, fewer than 40 people in total were working on the clients, backend, and algorithms, which shows how much importance Toutiao attached to content review.

At present, Toutiao's content comes mainly from two sources. One is the PGC platform, which has mature content production capabilities. The other is UGC user content, such as Q&A, user comments, and Micro Toutiao posts. Both go through a unified review mechanism. PGC content, which is relatively small in volume, goes directly to risk review and is recommended at scale if no problem is found. UGC content is first filtered by a risk model, and content flagged as problematic enters a second round of risk review. Only after passing review is content actually recommended. If it then receives more than a certain number of negative comments or reports, it goes back to the review stage, and if a problem is found it is taken down directly. The mechanism as a whole is relatively sound. As an industry leader, Toutiao has always held itself to the highest standards in content security.
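The review flow described above can be written down as two small decision functions, one for new content and one for content that is already being recommended. The thresholds, the `risk_score` input, and the function names are hypothetical; the real review process is certainly richer than this sketch.

```python
def initial_decision(source, risk_score, passed_review, risk_threshold=0.5):
    """Decide whether a new item is recommended.

    PGC goes straight to risk review. UGC is screened by a risk model first,
    and only items the model flags go on to a second, manual review.
    """
    if source == "PGC":
        return "recommend" if passed_review else "reject"
    if risk_score < risk_threshold:
        return "recommend"  # UGC that the risk model considers safe
    return "recommend" if passed_review else "reject"

def after_feedback(negative_reports, passed_re_review, report_threshold=10):
    """Re-check a live item once negative feedback exceeds a threshold."""
    if negative_reports <= report_threshold:
        return "recommend"
    return "recommend" if passed_re_review else "take_down"

print(initial_decision("UGC", risk_score=0.8, passed_review=False))
print(after_feedback(negative_reports=25, passed_re_review=False))
```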

Content recognition mainly relies on a pornography detection model, an insult detection model, and a vulgarity detection model. Toutiao’s vulgarity model is trained with deep learning; its sample library is very large, and images and text are analyzed together. This model favors recall, and precision can even be sacrificed for it. The sample library of the insult model also exceeds one million, with recall above 95% and precision above 80%. Users who frequently post abusive or inappropriate comments are subject to penalty mechanisms.
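To make the two figures concrete, the sketch below computes them from a confusion count, reading the second figure as precision (the share of flagged comments that really are insults) and recall as the share of real insults that get flagged. The counts are invented so that they reproduce 95% recall and 80% precision.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: flagged items that are truly bad. Recall: bad items that get flagged."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical batch: 800 real insults, of which 760 are caught,
# plus 190 harmless comments flagged by mistake.
precision, recall = precision_recall(
    true_positives=760, false_positives=190, false_negatives=40)
print(precision, recall)
```

A model tuned this way misses few insults at the cost of some false alarms, which fits a setup where flagged content can still be checked by a second review.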

General low-quality content identification covers many situations, such as fake news, smear articles, mismatches between title and body, clickbait, and poorly written content. This kind of content is very hard for machines to understand and requires a lot of feedback information, including comparison with other samples. Currently, neither the precision nor the recall of the low-quality models is particularly high, so they are combined with manual review and a raised threshold. The final recall has now reached 95%, but there is still a lot of work to do in this area. Professor Li Hang of Toutiao's Artificial Intelligence Laboratory is currently working with the University of Michigan on a research project to build a rumor identification platform.

That concludes this talk on the principles of Toutiao’s recommendation system. We hope to receive more suggestions in the future to help us improve our work.

This article, by @36kr, was compiled and published by Qinggua Media. Please credit the author and source when reprinting.
