From 2017 to the end of 2018, I worked on a project on content interest preference tags for an information-feed product. What are content interest preference tags? Simply put, we analyze the types of articles a user likes to read to derive their interest preferences, and on that basis serve personalized content recommendations and push notifications, which in turn drive app activity and extend the user life cycle. In essence, this is a two-step process: tag the articles, then tag the users.
So is it really that simple in practice? How are these two seemingly simple steps actually carried out? First, let's talk about classifying articles. For this project I reviewed the article taxonomies of many competing apps and found them broadly consistent, with some differences in the details. The more serious problem is that the categories of information articles are hard to enumerate exhaustively. We referred to the existing taxonomies on the market, combined them with other material, and built a complete content interest preference system. When defining the categories we followed the MECE principle, so the categories are essentially mutually exclusive and collectively exhaustive. Next we had to classify the articles themselves, and for this we used supervised classification algorithms. Ideally the process is straightforward: train a classifier on labeled articles and let it label new ones. In practice, however, we faced two problems. Since we chose supervised learning, we had to provide a base of labeled samples. Generally, there are three ways to obtain samples.
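As a concrete, if toy, illustration of what "supervised classification from labeled samples" means here, below is a minimal sketch assuming jieba for Chinese word segmentation and a scikit-learn TF-IDF plus logistic regression pipeline. The post does not name the actual stack, features, or model, and (as described later) the production system scores multiple labels per article rather than predicting a single class.

```python
# A minimal sketch of the "labeled samples -> supervised classifier" step.
# Library choices (jieba for segmentation, scikit-learn for the model) are
# assumptions for illustration; the post does not name the actual stack.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled samples: (article text, first-level interest tag).
samples = [
    ("央行宣布下调存款准备金率，股市应声上涨", "finance"),
    ("新款手机发布，搭载最新处理器和折叠屏", "technology"),
    ("国家队在世界杯预选赛中三比零获胜", "sports"),
    ("新剧开播首日播放量破亿，主演话题登上热搜", "entertainment"),
]
texts, labels = zip(*samples)

def segment(text: str) -> str:
    """Segment Chinese text into space-separated tokens for the vectorizer."""
    return " ".join(jieba.cut(text))

pipeline = Pipeline([
    # Turn each segmented article into a TF-IDF weighted term vector.
    ("tfidf", TfidfVectorizer()),
    # Fit a simple linear classifier over those vectors.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit([segment(t) for t in texts], labels)

# Classify a new article the same way: segment, vectorize, predict.
new_article = "国家队将在世界杯预选赛中迎战劲敌"
print(pipeline.predict([segment(new_article)])[0])  # most likely "sports"
```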
After obtaining the samples, the next step is to train and test the model. Training works roughly as follows: segment the sample articles, extract entities, build feature engineering on top of them, treat each feature word as a dimension of a vector, and fit a function over those vectors. When a new article arrives, it is segmented in the same way and run through the model to compute its result.

However, a model is not accurate simply because samples have been collected; it still needs to be tested and corrected. And a model that has passed testing is not a one-time solution either: misclassifications can still appear later, caused either by the samples or by the model itself. This requires us to find these abnormal articles and their classifications, correct the labels, and feed them back into the model as training samples. On the one hand, we can manually inspect articles in categories with relatively low conversion rates to determine whether the problem lies with the algorithm. On the other hand, since every article label is assigned a score, we can set a threshold on those scores: when an article's highest label score falls below the threshold, the article and its labels are recalled, manually annotated and corrected, and then added back to the sample library.

Note that an article can carry multiple labels; this is not a binary classification result. The method we adopted is to use a similarity calculation to score each candidate label for an article: the higher the score, the closer the article is to that label, and the article is tagged accordingly. At this point, the article tagging part is complete.
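To make the multi-label scoring and low-confidence recall mechanism concrete, here is a minimal sketch. It assumes each label is represented by a centroid vector in the article feature space and that the similarity measure is cosine similarity; the post does not specify the actual similarity algorithm, score scale, or threshold values.

```python
# Minimal sketch of multi-label scoring with a low-confidence recall queue.
# Cosine similarity against per-label centroid vectors is an assumption;
# the project's actual similarity algorithm and thresholds are not specified.
import numpy as np

def score_labels(article_vec, label_centroids):
    """Return {label: cosine similarity between the article and the label centroid}."""
    scores = {}
    for label, centroid in label_centroids.items():
        denom = np.linalg.norm(article_vec) * np.linalg.norm(centroid)
        scores[label] = float(article_vec @ centroid / denom) if denom else 0.0
    return scores

def tag_article(article_vec, label_centroids, assign_cutoff=0.4, recall_cutoff=0.6):
    scores = score_labels(article_vec, label_centroids)
    # An article may receive several labels, not a single binary decision.
    labels = [l for l, s in scores.items() if s >= assign_cutoff]
    # If even the best score is weak, recall the article for manual annotation,
    # then feed the corrected result back into the sample library.
    needs_review = max(scores.values()) < recall_cutoff
    return labels, scores, needs_review

# Toy example with 3 labels in a 4-dimensional feature space.
label_centroids = {
    "finance":    np.array([0.9, 0.1, 0.0, 0.0]),
    "sports":     np.array([0.0, 0.9, 0.1, 0.0]),
    "technology": np.array([0.0, 0.1, 0.9, 0.2]),
}
article_vec = np.array([0.1, 0.2, 0.8, 0.3])
labels, scores, needs_review = tag_article(article_vec, label_centroids)
print(labels, needs_review)  # -> ['technology'] False
```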
How to tag users

There are actually two ways to label users: statistical labeling and algorithmic labeling. The former can be implemented first when algorithm resources are scarce and operational demand is high, while the latter can be verified and tuned on a slice of traffic split off from the former, so the algorithm model can be continuously optimized.

However, when using the first method, we found that the types of articles a user reads over a period of time are not stable. Most users have one or a few main interest preferences and read more articles of those types, but they also read other types of articles to a greater or lesser extent, and some users read whatever they come across. Because of this, we need to rank a user's interest preferences: we rank the article types by the number of articles the user read in each type within a period of time and take the user's top 10 tags. This tells the operations team clearly which types of articles a user likes and in what order of priority, so they can make push selections. User tags therefore also need to be flexible, allowing operations staff to combine and select user groups based on weights such as event time and event frequency.

Currently, a large part of push work is done manually, from selecting articles to selecting users to matching the two, and a large number of A/B tests are usually run before the official push. But there are many types of information articles: more than 30 first-level tags, between one hundred and several hundred second-level tags, and likely thousands of tags in total, which is impossible to cover by relying on operations staff alone. So when operational resources are limited and automation is not yet in place, operators generally test the tags and pick those that cover many users and convert well. The side effect is that users with more niche interests are excluded from the push population. To handle this, we take the user's top 10 second-level tags together with their corresponding first-level tags as the user's tags, which solves the coverage problem while still letting operators focus pushes on specific tags and groups.

Another problem then arises: when looking at a user's behavior within a period of time, how long should that period be so that it fully reflects the user's interests while still covering as many people as possible? (Users churn every day, so the longer the window, the more users are covered; the shorter the window, the fewer.) We found that long-term interest preferences tend to be fairly stable, while short-term interest preferences reflect users chasing hot topics. A short window may therefore serve users' immediate needs better, but it covers fewer users. This is the eternal tension between coverage and conversion rate. Our approach is to segment users by their browsing time: assign each user both long-term and short-term interest preferences, give priority to the short-term ones, exclude short-term interest users from the long-term groups, and send them different pushes. A sketch of this statistical tagging step follows below.
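Here is a minimal sketch of the statistical labeling just described, assuming a flat list of read events with a timestamp and a (first-level, second-level) tag per article. The field names, window lengths, and the top-10 cutoff are illustrative, not the production values.

```python
# Minimal sketch of statistical user tagging: count reads per tag inside a time
# window, keep the top 10 second-level tags (with their first-level parents),
# and prefer the short-term profile when it exists. Field names and window
# lengths are illustrative assumptions.
from collections import Counter
from datetime import datetime, timedelta

def top_tags(events, user_id, now, window_days, k=10):
    """Top-k (first_level, second_level) tags a user read within the window."""
    since = now - timedelta(days=window_days)
    counts = Counter(
        (e["first_level"], e["second_level"])
        for e in events
        if e["user_id"] == user_id and e["read_at"] >= since
    )
    return [tag for tag, _ in counts.most_common(k)]

def user_profile(events, user_id, now):
    short_term = top_tags(events, user_id, now, window_days=7)
    long_term = top_tags(events, user_id, now, window_days=90)
    # Prefer the short-term profile; fall back to the long-term one for users
    # with no recent reads, so the longer window keeps coverage up.
    if short_term:
        return {"type": "short_term", "tags": short_term}
    return {"type": "long_term", "tags": long_term}

# Toy read log.
now = datetime(2018, 6, 30)
events = [
    {"user_id": 1, "read_at": now - timedelta(days=2),  "first_level": "sports",  "second_level": "football"},
    {"user_id": 1, "read_at": now - timedelta(days=3),  "first_level": "sports",  "second_level": "football"},
    {"user_id": 1, "read_at": now - timedelta(days=40), "first_level": "finance", "second_level": "stocks"},
    {"user_id": 2, "read_at": now - timedelta(days=60), "first_level": "tech",    "second_level": "phones"},
]
print(user_profile(events, 1, now))  # short-term profile: [('sports', 'football')]
print(user_profile(events, 2, now))  # falls back to the long-term profile
```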
As for lost users, they most likely have no access records in the last three months (at the time, our product defined churn as three months without access). For such users, we take the last tags recorded for the user as their tags and use them for win-back pushes. At this point, every user has their own labels, and operations staff can push different articles to different users based on their active time and reading frequency, getting genuinely close to personalized service for each user. It is fair to say we fell into many traps on this point.

The second way is to label users directly through algorithms. Besides active time and reading frequency, more feature dimensions can be added to the model, such as how long ago the user read an article, how long they spent reading it, comments, likes, and so on. At the same time, the weight of hot articles and hot events can be reduced, so that chasing a trending story does not dominate a user's profile. A sketch of this kind of weighting follows below.
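Below is a minimal sketch of such a weighted scoring scheme, assuming an exponential recency decay, simple engagement weights for read time, comments, and likes, and a popularity penalty for hot articles. The actual model, features, and coefficients used in the project are not described in the post.

```python
# Minimal sketch of algorithmic user tagging as a weighted score per tag.
# The decay rate, engagement weights, and hotness penalty are illustrative
# assumptions, not the project's actual model.
import math
from collections import defaultdict

def tag_scores(events, now_ts, half_life_days=14.0):
    """Aggregate one user's read events into a score per tag."""
    scores = defaultdict(float)
    for e in events:
        # Recency: exponentially decay old reads (half-life in days).
        age_days = (now_ts - e["read_ts"]) / 86400.0
        recency = 0.5 ** (age_days / half_life_days)
        # Engagement: longer reads, comments, and likes count for more.
        engagement = 1.0 + 0.002 * e["read_seconds"] + 0.5 * e["commented"] + 0.3 * e["liked"]
        # Down-weight hot articles/events so trending stories do not dominate.
        hotness_penalty = 1.0 / (1.0 + math.log1p(e["article_views"] / 10000.0))
        scores[e["tag"]] += recency * engagement * hotness_penalty
    return dict(scores)

# Toy example: one recent hot-topic read vs. two older niche reads.
now_ts = 1_530_000_000  # arbitrary "now" as a unix timestamp
events = [
    {"tag": "entertainment", "read_ts": now_ts - 1 * 86400, "read_seconds": 30,
     "commented": 0, "liked": 0, "article_views": 2_000_000},
    {"tag": "photography", "read_ts": now_ts - 10 * 86400, "read_seconds": 240,
     "commented": 1, "liked": 1, "article_views": 8_000},
    {"tag": "photography", "read_ts": now_ts - 20 * 86400, "read_seconds": 180,
     "commented": 0, "liked": 1, "article_views": 5_000},
]
print(tag_scores(events, now_ts))
```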
Conclusion

Looking back and summarizing this experience now, and even as you, the reader, follow along with it, it may all seem quite simple. In reality we stepped into countless pits along the way. In particular, we not only had to collect data and build the tags, but also had to guide the business through launch and analyze the problems that came up. That period was painful and happy at the same time: painful because there were so many problems, with the business asking me every day why the conversion rate was low again; happy because our conversion rate eventually more than doubled, ending up above the industry level, which is the best reward.

Author: Tangtang is the queen of traditional pickled cabbage
Source: Tangtang is the queen of traditional pickled cabbage