Business Background

Recommended ads are an important part of JD.com's recommendation traffic. They cover a wide variety of ad materials: products, aggregation pages, activities, stores, videos, live broadcasts, and more. The quality of recommended ads determines both the advertising experience of users on the JD.com platform and the platform's advertising revenue. Precision ranking is the most important stage of recommended advertising: it estimates users' click-through rate (CTR) for candidate items and is the most typical application of machine learning in recommended ads. Precision-ranking CTR estimation is the core module through which machine learning drives business growth, and a classic field in which engineers continually pursue the best accuracy. The figure below shows several typical ad materials in JD.com's recommended ads. On June 18, 2022, JD.com's homepage was revamped and upgraded, the ad ranking technology was upgraded accordingly, and the new stack was applied to projects such as the homepage ranking model and intelligent event selection.

Technical Challenges

The user population of the JD homepage recommendation scene is very heterogeneous: some users have very diverse interests while others have narrow ones, and product materials change rapidly. These factors make it hard to model ad CTR accurately. We summarize the challenges along three directions:

(1) Effective mitigation of cold start. Users and products in JD's homepage ad recommendation scenario show a pronounced long tail; data for long-tail users and products is sparse and hard to train on adequately. The key to improving CTR here is handling cold start properly. To this end we designed a general variational embedding learning framework (VELF) that makes better use of limited data, gives cold-start users and ads more robust representations, and avoids overfitting.

(2) In-depth mining of user interests. When learning the distribution of user interests, the existing model does not fully integrate the prior knowledge connecting user behavior with JD's material library, and lacks control over the overall semantics of ad recommendation. To address this, we redesigned the user interest network and added the PPNet+, NeNet, and Weighted-MMoE modules, improving overall prediction through deep personalized modeling of user interests.

(3) Full utilization of global data. The existing model draws on relatively limited user and ad data sources and underuses the global collaborative information generated by user interactions, which caps its predictive ability. We therefore start from pre-training on global user information and from modeling user exposure data, expanding global user information along several dimensions to raise that cap.

Technical Solution

In response to these challenges, we upgraded both the engineering and the algorithms of precision ranking and delivered a systematic optimization solution.
Through this systematic optimization we have gained more than 1% cumulative AUC on the precision-ranking CTR model, with an equally clear lift in online advertising revenue. The overall structure of the current precision-ranking model is shown in the figure below. In what follows we present our optimizations from three perspectives: the variational embedding learning framework, user interest network optimization, and global user collaborative information modeling.

01 Variational Embedding Learning Framework

To alleviate cold start and handle long-tail users and items in the homepage recommendation advertising scenario, we designed a general variational embedding learning framework (VELF) that uses limited data more effectively, yields more reliable representations for cold-start users and ads, and avoids overfitting. We model user and ad features through distribution estimation rather than point estimation, and learn the distributions with variational inference (VI). Traditional variational inference takes the standard normal as the prior, which blurs the differences between features. To strengthen the information expressed by users and ads, we instead use their secondary attributes as parameterized priors and correct the priors through the posterior distribution.

The overall framework is shown in the figure below, where u is the user ID, i is the item ID, c(u) and c(i) are the user- and item-related features, and z is a feature's embedding vector; z_u and z_i, the user and item embeddings, correspond to the upper and lower halves of the figure. In VELF, the posterior distribution of z is the latent quantity to be learned, and p(z|x) is estimated through variational inference, where x denotes all features: user, item, and context. Because distribution modeling makes the usual point-estimate objective intractable to optimize directly, we solve it with variational inference. The loss can be simplified as follows (for the full derivation see the paper "Alleviating Cold-start Problem in CTR Prediction with A Variational Embedding Learning Framework"):

L = E_{q(z|x)}[-log p(y|x,z)] + KL(q(z|x) || p(z))

The first term is the model likelihood (a cross-entropy loss): the model's predictions should match the observed labels as closely as possible. The second term constrains the feature distribution (a KL divergence): the learned posterior should stay close to the assumed prior. To strengthen the information expressed by users and ads, we use their secondary attributes as parameterized priors, which pulls together the feature spaces of users and ads with similar characteristics; the loss is then rewritten as

L = E_q[-log p(y|x, z_u, z_i)] + KL(q(z_u|u) || p(z_u|c(u))) + KL(q(z_i|i) || p(z_i|c(i)))

The parameterized priors are computed from the secondary attributes of users and ads, and regularizing both the variational posteriors and the parameterized priors prevents overfitting.
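To make the framework concrete, here is a minimal PyTorch-style sketch of a variational embedding layer in the spirit of VELF. It is an illustration under assumed names (VariationalEmbedding, attr_ids, beta), not the production implementation: each ID maps to a diagonal Gaussian, a sample is drawn with the reparameterization trick, and the KL term is taken against a prior parameterized by the ID's secondary attribute instead of a standard normal.

```python
import torch
import torch.nn as nn

class VariationalEmbedding(nn.Module):
    """Gaussian embedding with an attribute-parameterized prior (VELF-style sketch)."""
    def __init__(self, num_ids, num_attrs, dim):
        super().__init__()
        # Posterior q(z|id): one diagonal Gaussian per ID.
        self.post_mu = nn.Embedding(num_ids, dim)
        self.post_logvar = nn.Embedding(num_ids, dim)
        # Parameterized prior p(z|attr): one Gaussian per secondary attribute.
        self.prior_mu = nn.Embedding(num_attrs, dim)
        self.prior_logvar = nn.Embedding(num_attrs, dim)

    def forward(self, ids, attr_ids):
        mu, logvar = self.post_mu(ids), self.post_logvar(ids)
        # Reparameterization trick keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        p_mu, p_logvar = self.prior_mu(attr_ids), self.prior_logvar(attr_ids)
        # Closed-form KL(q || p) between diagonal Gaussians, summed over dims.
        kl = 0.5 * (p_logvar - logvar
                    + (logvar.exp() + (mu - p_mu) ** 2) / p_logvar.exp()
                    - 1.0).sum(-1)
        return z, kl

# Usage sketch: the total loss pairs cross-entropy with the KL terms.
user_emb = VariationalEmbedding(num_ids=10_000, num_attrs=50, dim=16)
z_u, kl_u = user_emb(torch.tensor([3, 7]), torch.tensor([1, 4]))
# loss = bce(logits, labels) + beta * (kl_u + kl_i).mean()
```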
The final loss additionally regularizes the parameters of the priors themselves; see the paper for its exact form. Our method achieves significant gains on public data; experimental results on public datasets are shown in the table below. This work was accepted at WWW 2022, a top recommendation venue: "Alleviating Cold-start Problem in CTR Prediction with A Variational Embedding Learning Framework", https://arxiv.org/abs/2201.10980

02 User Interest Network Optimization

To mine user interests in greater depth, we optimized the model's network structure from three modeling angles: strengthening personalized bias, enhancing the model's semantic connections, and handling heterogeneous user distributions.

1. Strengthening personalized bias: PPNet+

In the current network, the semantic model built jointly from individual users and target ads does not account for each user's personalized deviation from the semantic model built from global users. To personalize the DNN's parameters, we adapted the Parameter Personalized Network (PPNet) proposed by Kuaishou's recommendation team to the JD advertising recommendation scenario, producing PPNet+. In addition to key features such as the user ID, ad ID, and third-level category ID, PPNet+ also feeds item features, cross features, and user behavior features into the gating network (Gate NN). We further provide users' historical click and exposure sequences as side information to help PPNet+ learn personalized interests. The PPNet+ model structure is shown below.

As the figure shows, PPNet+ inherits PPNet's main structure: the bottom layer consists of features and their embeddings, and an MLP on top learns and gates the output. Given the complexity of JD's homepage recommendation advertising scenario, we also process sequence information: a fusion-embedding module produces a fusion_emb vector covering the whole scene, which is concatenated with the ID feature embeddings on the right as the Gate NN's input. As in PPNet, the feature embeddings on the left do not receive gradients back-propagated from the Gate NN, which limits the Gate NN's impact on the convergence of the existing embeddings. We also modified the Gate NN itself: the original ReLU layers are replaced with the parameter-sensitive Dice activation, and the gate's input layer is normalized so that embeddings from different feature fields fall into the same range, helping the gate's weights converge.
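The sketch below shows how such a gate might look; the class names and dimensions are our own illustrative assumptions. The two details carried over from the description above are the stop-gradient on the shared embeddings and the normalized, Dice-activated gate input; the 2*sigmoid output range follows the published PPNet design.

```python
import torch
import torch.nn as nn

class Dice(nn.Module):
    """Dice activation (introduced with DIN): a data-adaptive soft rectifier."""
    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim, affine=False)
        self.alpha = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        p = torch.sigmoid(self.bn(x))
        return p * x + (1 - p) * self.alpha * x

class GateNN(nn.Module):
    """PPNet+-style gate: personalized element-wise scaling of a hidden layer."""
    def __init__(self, id_dim, shared_dim, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(id_dim + shared_dim)  # align field scales
        self.fc = nn.Sequential(
            nn.Linear(id_dim + shared_dim, hidden_dim),
            Dice(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, id_emb, shared_emb, hidden):
        # The shared embeddings are detached so the gate cannot disturb
        # the convergence of the main network's feature embeddings.
        g = torch.cat([id_emb, shared_emb.detach()], dim=-1)
        gate = 2.0 * torch.sigmoid(self.fc(self.norm(g)))  # scale in (0, 2)
        return hidden * gate
```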
2. Enhancing model semantic connection: NeNet

We noticed that after the personalization upgrade, PPNet+ increases the model's personalized-bias capacity, but that capacity is heavily driven by the behavior of short-term active users; over subsequent training the model gradually loses its grip on long-tail users' interests, so its effectiveness decays as it is updated day by day. To make up for this defect, we need to shore up the existing network and restore the gradient signal that over-learned personalization bias erases during training. To this end we proposed Needle Net (NeNet), which draws on the residual-network idea to compensate for the gradient information lost during training. Its main idea can be written, schematically, as

y = σ(f(x)) + x

where σ is a nonlinear activation function and f is a learned transformation. NeNet combines the learning power of nonlinear functions with a direct path for the original input features; through residual learning it reduces the impact of short-term active-user behavior and lets the model learn the underlying unbiased vector features directly. NeNet requires neither strict dimensional alignment nor a particular module depth, so it can be attached to any submodule of the large model; compared with the original residual network its parameters are more flexible and adapt to the model's main vectors and subnetworks.
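Below is a minimal sketch of a NeNet-style block as we read the description above; the class name and the linear projection on the skip path when dimensions differ are our own assumptions, not the published design.

```python
import torch
import torch.nn as nn

class NeedleNetBlock(nn.Module):
    """Residual-style block: a nonlinear branch plus a skip path for the input.

    The skip path preserves gradient flow to the underlying (unbiased)
    features even when a gated, personalized branch dominates training.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(in_dim, out_dim), nn.PReLU())
        # Assumption: a linear projection stands in for the identity when
        # dimensions differ, so no strict dimensional alignment is required.
        self.skip = nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.branch(x) + self.skip(x)  # y = sigma(f(x)) + x
```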
3. Heterogeneous user distribution: Weighted-MMoE

From the entrance of JD.com's homepage one can see that, beyond mainstream product ads, recommendations come in several display formats: aggregation-page ads, event ads, store ads, and video/live-broadcast ads. Online data and offline analysis show that users' click and consumption habits differ across these ad scenarios, and the differing exposure volumes of the scenarios on the same interface further shift the distribution of user interests. In the current model, however, all scenarios share one set of outputs, so the scenarios' estimates constrain one another and limit the precision of ad ranking.

To address this we model these interrelated but inconsistent estimation targets jointly, introducing multi-task learning to improve contextual recommendation. Unlike the traditional multi-task setting, where tasks are serial in time (for example, the model infers whether the user will place an order after inferring whether the user will click), the tasks in JD's scenario are largely parallel in time (a user's clicks in different scenarios have no sequential relationship). In both settings the tasks can still share highly similar low-level inputs, so we introduced MMoE (Multi-gate Mixture-of-Experts). Note that the towers for the different scenarios in the figure above share the same set of experts. Because each advertising scenario weights the experts differently, each scenario gets its own gate network: for task k, the output of gate g^k gives the probability of each expert being selected, the experts are combined by that weighted sum, and the result feeds task k's tower. As a formula:

y_k = h^k( sum_{n=1..N} g^k(x)_n * f_n(x) )

where the f_n are the N experts, g^k(x) is the softmax output of gate k, and h^k is tower k. We also found that the original MMoE couples tasks only through the gates and does not fully account for information sharing and weight distribution between network layers. We therefore modified the model: the core expert networks still share the low-level input, and in addition their outputs are aggregated into the expert output network through learned weights. Schematically, the formula above becomes

y_k = h^k( sum_{n=1..N} g^k(x)_n * f~_n(x) ),   f~_n(x) = f_n(x) + sum_{m=1..N} a_m(x) * f_m(x)

where N stays equal to the number of experts and an attention module produces the weights a(x) assigned to the learned expert outputs (the weighted "empowerment" step). With this design, the experts share each other's information flow during back-propagation, so the model maintains a unified information-sharing framework.
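Here is a compact sketch of this Weighted-MMoE idea; the module layout and names are our assumptions, with the attention step mixing expert outputs before the per-scenario gates, as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedMMoE(nn.Module):
    """MMoE plus an attention step that lets experts share information."""
    def __init__(self, in_dim, expert_dim, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
            for _ in range(n_experts))
        self.gates = nn.ModuleList(nn.Linear(in_dim, n_experts) for _ in range(n_tasks))
        self.towers = nn.ModuleList(nn.Linear(expert_dim, 1) for _ in range(n_tasks))
        self.attn = nn.Linear(expert_dim, 1)  # scores experts for the shared mix

    def forward(self, x):
        e = torch.stack([f(x) for f in self.experts], dim=1)   # (B, N, D)
        a = F.softmax(self.attn(e), dim=1)                     # (B, N, 1)
        shared = (a * e).sum(dim=1, keepdim=True)              # (B, 1, D)
        mixed = e + shared   # every expert receives the attention-weighted mix
        outs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=1).unsqueeze(-1)        # (B, N, 1)
            outs.append(tower((w * mixed).sum(dim=1)))         # (B, 1)
        return outs  # one CTR logit per advertising scenario
```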
Together, the three network-structure optimizations for user interests, PPNet+, NeNet, and Weighted-MMoE, contributed a combined AUC improvement of 0.45% and a significant lift in online revenue.

03 Global User Collaborative Information Modeling

The original precision-ranking model drew on thin data sources and made insufficient use of the collaborative information generated by user interactions, such as exposure and click data, which caps its estimation ability. JD has comprehensive global data spanning multiple apps and scenarios, online and offline, a potential source of information we can tap. In this upgrade we start from pre-training on global click data and from modeling user exposure data, making fuller use of global data to raise the ceiling of personalized estimation.

Users' interactions with products on an e-commerce platform (browsing, clicking, adding to cart, searching, purchasing, and so on) deeply reflect their interests, and user behavior modeling has long been a focus of CTR research in academia and industry. Mainstream approaches are built on the attention mechanism: the candidate item serves as the query, and each item in the user's behavior sequence receives its own weight, which is used to aggregate the sequence. On this basis we carried out a series of upgrades and extensions for our scenarios, mining and characterizing users and products from multiple perspectives and dimensions, with clearly measurable gains both offline and in the live system.

1. Global information pre-training

In end-to-end CTR training, the modeling of item-to-item relationships is driven only by CTR accuracy; the items' intrinsic relatedness is ignored. The original intent of applying attention to behavior sequences is to select the parts related to the current candidate item. This relevance is not identical to intrinsic item relatedness, but the two are positively correlated; works such as DIN demonstrate this when visualizing attention weights: similar items receive higher attention scores. Moreover, end-to-end training sees only the model's own training data, which generally comes from the click and exposure logs of its serving scenario, so long-tail items with little data coverage are modeled poorly. Directly adding training data from other scenarios is problematic: positive transfer is hard to guarantee (experiments show that simply adding data rarely helps in large scenarios), offline training time multiplies, and features are hard to align across scenarios.

We therefore use JD-wide data to pre-train the relatedness between items, and we inject it into the model both as embeddings and as posterior statistical features derived from similarity scores. In a recommendation system, the relationships among users and items naturally form a graph, and graph models have an inherent advantage in modeling item relatedness, so we generate each item's embedding vector offline with graph embedding; the main pipeline is as follows, and the details follow EGES [1]. Once every item has a pre-trained vector, faiss can build an offline lookup table that records, for each item, the N most similar items in the catalog together with their similarity scores.

During model training, the pre-trained item embedding is used on one hand as side information combined with the model's randomly initialized item embedding (by addition, dot product, or concatenation, tuned by experiment) for joint training. Offline experiments show that, compared with random initialization, introducing pre-trained graph embeddings helps the model better learn the relationship between candidate items and the items in user behavior. On the other hand, many entries in a behavior sequence have nothing to do with the candidate SKU, i.e., they are noise, and the longer the sequence the more noise it carries; as SIM observes, filtering by the same category already removes most of it. Likewise, we use the faiss lookup table to filter out items whose similarity to the candidate falls below a threshold, and we discretize the similarity scores and add them to the model as posterior statistical features.
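To illustrate how the offline similarity table might be used at training time, here is a small sketch; the table format, threshold, and bucket count are our own assumptions, not the production pipeline.

```python
import numpy as np

# Hypothetical offline table built with faiss over the graph embeddings:
# sim_table[candidate_id] -> {similar_item_id: similarity_score}
sim_table = {101: {202: 0.91, 303: 0.42, 404: 0.07}}

def filter_and_bucketize(candidate, behavior_seq, threshold=0.2, n_buckets=10):
    """Drop behavior items unrelated to the candidate; discretize the
    remaining similarity scores into a posterior statistical feature."""
    scores = sim_table.get(candidate, {})
    kept, buckets = [], []
    for item in behavior_seq:
        s = scores.get(item, 0.0)
        if s >= threshold:                                   # similarity filtering
            kept.append(item)
            buckets.append(int(min(s, 0.999) * n_buckets))   # discretization
    return kept, buckets

kept, buckets = filter_and_bucketize(101, [202, 303, 404])
# kept == [202, 303]; buckets == [9, 4] enter the model as an extra feature.

# The pre-trained vector can also be fused with the learned embedding as
# side information, e.g. by concatenation (addition and dot product are
# the alternatives mentioned above):
# item_repr = np.concatenate([learned_emb[item], pretrained_emb[item]])
```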
2. Interest modeling based on exposure information (Gama)

Although positive behaviors such as clicks, add-to-cart, and purchases reflect a user's recent and long-term interests, in feed recommendation a user's real-time interest is continuously shaped by the products the platform displays. For example, a user may never have clicked a T-shirt while browsing, yet after the platform exposes a particular T-shirt, the user may be interested in it at that moment, perhaps because the price is low or the style appeals. Such real-time interest cannot be modeled from clicks, add-to-cart, or purchase behaviors, because it never shows up there; the user's exposure sequence must be introduced to characterize it. Exposure-sequence modeling faces two challenges: (1) exposure sequences are long and computationally heavy, while the online system has strict latency requirements; (2) most items in the exposure sequence are irrelevant to the current candidate, so the signal is noisy.

To address both problems we proposed Gama, a gating-adapted wavelet multiresolution analysis model that combines a non-parametric signal-processing method with exposure-sequence modeling. It adaptively mines multi-dimensional user interests from massive exposure sequences without degrading model performance. The model structure is shown in the figure below; its main modules are the wavelet analysis module (Wavelet MRA) and the interest gate network (Interest Gate Net). The wavelet module applies a parameter-free, efficient multi-level decomposition to the exposure sequence, removing noise and extracting the coherent interests the sequence contains; the interest gate network adaptively adjusts the aggregation weights of the multi-resolution decomposition results.

Wavelet analysis module (Wavelet MRA): the vectorized exposure sequence E_u is treated as a multi-channel signal and decomposed level by level. Level j yields a smooth low-frequency signal a_j and an isolated high-frequency signal d_j, and a_j continues to the next level:

a_j = H * a_{j-1},   d_j = G * a_{j-1},   a_0 = E_u

where H and G are the low-pass and high-pass filters, instantiated from a wavelet basis; common bases include Daubechies, Coiflet, and Haar (see the wavelet-analysis literature for their forms).

Interest Gate Net: the simplest way to use the decomposed multi-channel signals is to average them, but averaging cannot learn per-component weights, so we add an interest gating network. With the target item represented by e_q, a decomposed signal s is aggregated by attention, schematically

s^ = sum_t alpha_t * s_t,   alpha_t = softmax_t(e_q · s_t)

and for all the decomposed signals we keep (for example d_1, d_2, a_3), gate weights g_k combine the aggregated vectors s^(k) into the user representation w_u:

w_u = sum_k g_k * s^(k)

where the g_k are produced by the gate network from e_q and the s^(k).
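The sketch below implements one reading of this pipeline: a parameter-free Haar decomposition, attention aggregation of each component with the target item as the query, and a gate that weights the aggregated components. The Haar basis, two decomposition levels, and dot-product gating are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def haar_level(a):
    """One Haar MRA level on a (B, T, D) sequence: returns the smooth
    low-frequency part and the detail high-frequency part, each T/2 long.
    Parameter-free: fixed low-pass (average) and high-pass (difference) filters."""
    even, odd = a[:, 0::2], a[:, 1::2]
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5

def attend(eq, s):
    """Aggregate one decomposed signal s (B, T, D) with query eq (B, D)."""
    alpha = F.softmax(torch.einsum('bd,btd->bt', eq, s), dim=1)
    return torch.einsum('bt,btd->bd', alpha, s)

# Decompose an exposure sequence E_u, then gate the retained components.
E_u = torch.randn(4, 8, 16)              # batch of vectorized exposure sequences
eq = torch.randn(4, 16)                  # target-item embedding (the query)
a, signals = E_u, []
for _ in range(2):                       # keep d1, d2 and the final a2
    a, d = haar_level(a)
    signals.append(d)
signals.append(a)

agg = torch.stack([attend(eq, s) for s in signals], dim=1)   # (B, K, D)
gate = F.softmax(torch.einsum('bd,bkd->bk', eq, agg), dim=1) # interest gate
w_u = torch.einsum('bk,bkd->bd', gate, agg)                  # user representation
```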
We first verified the method's effectiveness on a public dataset (Taobao), obtaining improvements of about 10% across several user-interest-based CTR modeling frameworks; the method is especially effective for cold-start users. Offline experiments on JD data also showed good AUC gains, and the online A/B effect improved significantly. The two global collaborative information modeling efforts above raised AUC by a cumulative 0.35%, with a marked increase in online revenue. The related work was accepted at SIGIR 2022, a top information retrieval conference: "Gating-adapted Wavelet Multiresolution Analysis for Exposure Sequence Modeling in CTR prediction", https://arxiv.org/abs/2204.14069

04 Other Work

Besides the variational embedding learning framework, user interest network optimization, and global user collaborative information modeling, we made further upgrades to the precision-ranking model: ranking features by a comprehensive XGBoost score and widening the embedding vectors of the important ones; upgrading the optimizer of the network's dense layers to Nadam and of the sparse layers to Adagrad; introducing time and position information into the user behavior sequence to enrich its attributes; and introducing a topic-ID frequency sub-network for product ads. Combining all of the above, the cumulative AUC gain of the precision-ranking model exceeds 1%. Several innovations are still in progress, including a data-generation CTR framework, an Item-server bucketed sequence framework, and item collaborative alternative representation learning.

Summary and Outlook

In summary, after half a year of technical exploration, the JD Retail advertising algorithm targeting group and engineering team tackled the challenges of cold-start scenarios, user interest mining, and global collaborative modeling along three dimensions, variational embedding learning, user interest network optimization, and global user collaborative information modeling, and distilled a set of techniques that lifts the AUC of recommended-ad precision ranking by more than a percentage point. The solution went fully live in the JD app on the eve of 618 and has also been applied to projects such as intelligent selection for general-merchandise activities, bringing significant benefits to the JD 618 promotion.

Lin Zhangang, head of the JD Retail advertising data and algorithm team, said: "In the past, we built more accurate user behavior modeling and prediction through technical innovation, improving user experience and platform returns, a win for both the platform and its users. In the future, we will keep extending the length, width, and thickness of our data, build a deep understanding of global users, and on that basis construct more sophisticated and accurate intelligent algorithm models to support JD's advertising business under the new circumstances."

Looking ahead, we will first explore new precision-ranking paradigms, including the data-generation CTR framework and the item collaborative substitution representation learning framework. Second, for deep mining of user interests, we are building a User Server dynamic representation framework around user characteristics; facing more diverse training data and longer, wider, thicker global user collaborative information, we designed an item global behavior sequence architecture. Precision-ranking CTR prediction remains the core module through which machine learning drives business growth, and a classic field in which engineers continually pursue the best accuracy.
We will continue to improve, and to explore future technologies together with our peers.

References

[1] Wang J, Huang P, Zhao H, et al. Billion-scale commodity embedding for e-commerce recommendation in Alibaba[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 839-848.
[2] Zhou G, Zhu X, Song C, et al. Deep interest network for click-through rate prediction[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1059-1068.
[3] Ma J, Zhao Z, Yi X, et al. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1930-1939.
[4] Veit A, Wilber M, Belongie S. Residual networks behave like ensembles of relatively shallow networks[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 2016: 550-558.