1. Overview

In 2024, based on the scenario characteristics and data conditions of Dewu's transaction search, the Dewu algorithm team carried out a series of projects around "multi-scenario modeling" and achieved significant improvements in online business metrics. At the same time, we used spare time to distill the accumulated technical experience into papers, which were accepted by top conferences in the search/recommendation and data-mining fields: WWW'2025 (CCF-A) and DASFAA'2025 (CCF-B). The three papers are:

(1) WWW'2025 (industry track, long paper): Disentangling Scenario-wise Interest Network for Multi-scenario Recommendation (acceptance rate 22.4%, 63/281).

(2) DASFAA'2025 (research track, oral): Towards Scenario-adaptive User Behavior Modeling for Multi-scenario Recommendation (acceptance rate 20.7%, 137/662).

(3) DASFAA'2025 (industry track, long paper, oral): When Multi-scenario Meets Multi-attribute: Scenario and Attribute-aware Recommendation with Contrastive Learning (acceptance rate 41.7%, 15/36).

The three pieces of work are introduced in detail below.

2. Background

In recent years, online shopping platforms have played an increasingly important role in users' daily lives. To meet users' diverse shopping needs, most e-commerce apps integrate multiple shopping scenarios (homepage waterfall feed, product detail page, order page, etc.) to provide customized shopping services for different users. As a result, multi-scenario learning (MSL) has flourished in the search and recommendation systems of e-commerce platforms. Below we analyze scenario characteristics from two perspectives: the Dewu App as a whole, and Dewu App search.

Multiple scenarios across the Dewu App

Taking the Dewu App as an example, Figure 1 shows several common scenarios:

(1) Homepage waterfall feed: users see recommended products as soon as they open the app. This scenario displays a variety of products chosen according to the user's historical behavior preferences.

(2) Shopping cart page or order page: products are recommended after the user adds certain items to the cart or purchases them. The recommendations depend more heavily on the user's past purchase history.

Figure 1. Examples of the Dewu App's overall multi-scenario layout

Clearly, these typical scenarios differ greatly from one another, and user behavior and interests in different scenarios (for example, preferences for specific brands, prices, or categories) also vary widely. Specifically, user interests on the homepage waterfall feed are more divergent than on the shopping cart or order page, because products displayed in the latter two scenarios are more constrained by the user's purchase and order history. For example, users on the homepage tend to have diverse interests (they browse many categories and brands across a wide price range); in contrast, users on the order page show more specific and concentrated preferences. If a user recently bought a new mobile phone, he is likely to pay more attention to accessories such as phone cases and headphones next; user behavior is guided by the products already purchased.
Multiple scenarios in Dewu App search

Taking Dewu App search as an example, we divide scenarios by the source of the user's search traffic; Figure 2 shows the corresponding scenarios.

Figure 2. Examples of multiple scenarios in Dewu App search

After analyzing user data from different query sources, we found that users from different sources differ significantly in price, category, and brand preferences. Therefore, we treat the query source as scenario information and introduce multi-scenario learning for refined modeling in the CTR prediction task.

Main issues

We can summarize the two main problems to be solved in Dewu search multi-scenario modeling:

(1) How can we effectively characterize the differences in users' behavioral interests (price, category, brand, and other preferences) across scenarios?

(2) In search and recommendation, user behavior sequence modeling is usually the most effective way to express user interest. How can we elegantly integrate scenario information into behavior sequence modeling to achieve the most effective multi-scenario interest modeling?

3. Overall Optimization Idea

SACN: Regarding problem (1), users show explicit differences in their preferences for product attributes such as price, category, and brand across scenarios, and common practice in user behavior sequence modeling already introduces product attributes as side information to enrich interest representations. Can we deeply combine the prior information of multiple scenarios with multi-attribute sequence modeling to solve the problem? Based on this idea, we propose to introduce multi-attribute information into multi-scenario modeling to characterize users' multi-granularity preferences in a specific scenario and, at the same time, better capture the differences in user interests across scenarios. We proposed the Scenario and Attribute-aware Contrastive Network (SACN) in DASFAA'2025 (industry track), which for the first time combines multi-scenario modeling with multi-attribute modeling of user behavior sequences. It constructs two user preference extraction components of different granularities, Item-level Preference Extracting (IPE) and Attribute-level Preference Extracting (APE), and uses contrastive learning to distinguish users' interests across target scenarios.

Although SACN partially addresses problem (1), like other multi-scenario methods commonly used in industry (PEPNet, SAR-Net, etc.) it only considers the target scenario information and does not distinguish which scenario each historical behavior comes from; that is, it does not explore or optimize problem (2) in depth. Inspired by our past experience iterating on user behavior sequence modeling, we follow two optimization ideas to introduce rich scenario information into the behavior sequence: (idea 1) inject the scenario from which each behavior originates into the sequence as a kind of side information; (idea 2) split the behavior sequence by scenario and model intra- and cross-scenario interests separately.
SAINet: SAINet follows idea 1. We proposed the Scenario-adaptive Interest Network (SAINet) in DASFAA'2025 (research track). SAINet stacks multiple Scenario-adaptive Blocks, each consisting of Scenario-aware Interest Extracting (SIE) and a Scenario Tailoring Module (STM). SIE adaptively integrates the scenario context (treated similarly to the attribute information of each product in the behavior sequence) into the user behavior sequence to capture differences in user interests more accurately. Stacking the blocks yields a deep network that strengthens the model's ability to capture differences between scenarios.

DSWIN: DSWIN follows idea 2. Inspired by the SIM line of sequence modeling (which filters the historical sequence by the target product's category), we partition the user behavior sequence by scenario to more effectively extract interest preferences related to the target scenario. DSWIN proposes the Global Interest Aggregation (GIA) and Local Interest Resolution (LIR) modules to model the user's global interest preferences and the user's within-scenario interest preferences, respectively. Finally, an Interest Disentangling Module is introduced to explicitly separate interest differences across scenarios.

4. Related Work

Traditional methods usually train a separate model for each scenario with scenario-specific data. However, this isolated training usually incurs higher computation and maintenance costs, and a separate model cannot capture the interrelationships between scenarios. State-of-the-art models therefore train a unified model on the combined data of all scenarios to learn both the commonalities and the specific characteristics of each scenario. There are currently two mainstream families of multi-scenario modeling methods.

Scenario-specific network structures

Scenario-specific network structures are inspired by multi-task learning (MTL). These methods apply a specific network to each scenario and output multiple scenario-specific CTR scores; that is, each scenario is modeled as a task and the model is built with multi-task ideas. MoE proposes selecting sub-experts based on shared bottom inputs. MMoE adapts the MoE structure by sharing the expert sub-networks across all tasks while training a lightweight gating network per task. However, MMoE suffers from the seesaw phenomenon (improving one task often degrades another). To address this, PLE explicitly separates shared experts from task-specific experts and adopts a progressive routing mechanism to gradually extract and separate deeper semantic knowledge, improving the efficiency of cross-task joint representation learning and information transfer. HMoE uses MMoE to implicitly identify the differences and commonalities between scenarios, but due to the complexity of multi-scenario datasets it struggles to explicitly capture shared and specific information. With this in mind, SAML introduces a scenario-aware embedding module to learn feature representations from global and scenario-specific perspectives, and proposes an interaction unit to learn similarities between scenarios. AESM2 proposes a novel expert network structure that automatically selects the most appropriate shared and specific experts by computing KL divergence. A minimal sketch of the multi-gate mixture-of-experts idea underlying this family is given below.
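To make this family concrete, here is a minimal, hedged sketch of a multi-gate mixture-of-experts layer in the spirit of MMoE: shared experts, one lightweight gate per scenario, and one scenario-specific tower per CTR score. Layer sizes, names, and the tower design are illustrative choices of ours, not the exact architecture of any cited paper.

```python
# Hedged MMoE-style sketch: shared experts, per-scenario gates and towers.
import torch
import torch.nn as nn

class MMoELayer(nn.Module):
    def __init__(self, input_dim: int, expert_dim: int, num_experts: int, num_scenarios: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU()) for _ in range(num_experts)]
        )
        # One lightweight gate per scenario (task), producing weights over the shared experts.
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in range(num_scenarios)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_scenarios)]
        )

    def forward(self, x: torch.Tensor):
        # x: [batch, input_dim]; returns one CTR logit per scenario.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [B, E, D]
        logits = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # [B, E, 1]
            fused = (w * expert_out).sum(dim=1)                        # [B, D]
            logits.append(tower(fused).squeeze(-1))                    # [B]
        return logits

# Usage: scores = MMoELayer(64, 32, num_experts=4, num_scenarios=3)(torch.randn(8, 64))
```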
However, scenario-specific network structures ignore the differences between scenarios in the model's underlying representations, and they rely on heavy network structures, which is unfavorable for online serving. Therefore, lightweight multi-scenario models such as parameter-adaptive network structures are currently the industry mainstream.

Parameter-adaptive network structures

Influenced by the LHUC algorithm from speech recognition, these methods take scenario context as input and dynamically adjust the model's bottom embeddings and top DNN hidden units through gating mechanisms to learn scenario diversity. AdaSparse learns an adaptive sparse structure for each scenario to enhance cross-domain generalization. SASS designs a scenario-adaptive transfer module to inject useful information from the whole-scenario view into each single scenario. DFFM incorporates scenario-related information into the parameters of the feature-interaction and user-behavior modules. 3MN proposes a novel ternary-network-based approach to model complex task-task, scenario-scenario, and task-scenario relationships. PEPNet takes scenario-related features as input and dynamically scales the bottom embeddings and top DNN hidden units through gate units. SFPNet contains a series of scenario-customization modules that redefine basic input features and integrate scenario information at a coarse granularity.

However, these methods all treat the user's historical behavior sequence as a whole and cannot distinguish which scenario each behavior comes from. As shown in Figure 3(b), coarse-grained weight-adjustment methods (such as PEPNet) uniformly apply the same weight to the representation vector aggregated from historical behaviors. SAR-Net attempts to integrate the target scenario information when computing the weight of each behavior in the sequence through a target attention mechanism. Still, neither can distinguish the scenario to which each behavior belongs. As mentioned in the background, user behaviors differ greatly across scenarios, and ignoring this inevitably reduces the accuracy of the CTR model.

Figure 3. How scenario information is used in PEPNet and SAR-Net

5. SACN: When Multi-scenario Meets Multi-attribute: Scenario and Attribute-aware Recommendation with Contrastive Learning

Overall structure

The SACN model structure is shown in Figure 4.

Figure 4. Schematic diagram of SACN

Problem definition

Consider a user set, a product set, a product ID set, m product attribute sets, and a scenario set. Each user's historical behavior sequence is arranged in chronological order; the i-th element is the i-th product the user interacted with, and N is the maximum sequence length. With m types of product attributes, each interacted product is described by its product ID and the values of its m attributes. Given the target product, the target scenario, and the interaction history, the goal of multi-scenario modeling is to predict the probability that user u is interested in the target product in that scenario. Considering the CTR prediction task, this can be formalized as shown below.
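As a hedged illustration, using our own notation rather than the paper's exact symbols, the objective can be written as:

```latex
% Illustrative notation (ours): u = user, v = target product, s = target scenario,
% S_u = {b_1, ..., b_N} = chronological behavior sequence, Theta = model parameters.
\hat{y}_{u,v,s}
  = P\bigl(y = 1 \mid u,\, v,\, s,\, \mathcal{S}_u\bigr)
  = \mathcal{F}\bigl(u,\, v,\, s,\, \mathcal{S}_u;\ \Theta\bigr),
\qquad \mathcal{S}_u = \{\, b_1, b_2, \dots, b_N \,\}
```

Here u is the user, v the target product, s the target scenario, the set is the user's chronological behavior sequence of length N, and the function is the unified ranking model with its parameters.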
Specific methods

Item-level Preference Extracting

To capture the user's coarse-grained (item-level) preferences, the Item-level Preference Extracting (IPE) module applies a scenario-aware multi-head self-attention (MHSA) mechanism to the user's historical sequence of product IDs, strengthening the role of the target scenario and target product information in the model. Let the embedding matrix of the behavior sequence's product IDs and the embedding vectors of the target scenario and target product ID be given; the output of the scenario-aware MHSA is then computed as in the standard MHSA formulation. To let the target product interact fully with the historical products, and to let the target scenario guide the encoding of the user's historical behaviors, the Q, K, and V projections integrate the embeddings of the target scenario and target product ID into the self-attention parameters: a vector reshaped from the target scenario and product ID embeddings, with the same shape as the transformation matrix, is multiplied element-wise with that matrix. In this way, the target scenario and product ID information participate element by element in the coarse-grained preference extraction, and the attention output is pooled into the final IPE preference representation. Similarly, preference representations conditioned on each of the other scenarios can be computed by replacing the target-scenario embedding in Q, K, and V with the embedding of each of the K scenarios, where K is the number of scenarios.

Attribute-level Preference Extracting

Product attributes are crucial for capturing user preferences in different scenarios more comprehensively. However, to the best of our knowledge, existing multi-attribute modeling methods do not make use of the target scenario and target product information, which can lead to information loss and limit the network's expressive power. Let the embedding matrices of the j-th attribute of the historical behaviors, and of the target scenario and the j-th attribute of the target product, be given. Symmetrically to IPE, we substitute these matrices into the Q, K, V computation of the IPE module; the Attribute-level Preference Extracting (APE) module then produces m attribute-level preference representations, one per attribute type. To capture the user's different degrees of attention to different attributes (such as category or brand), we fuse the m attribute-level representations that incorporate the target scenario information with a vanilla attention network. Analogously, attribute-level preference representations conditioned on the other scenarios can be obtained. A sketch of the scenario-aware attention shared in spirit by IPE and APE is given below.
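The following is a minimal, hedged sketch of this scenario-aware attention: the Q/K/V inputs are modulated element-wise by a vector derived from the target-scenario and target-product embeddings. The sigmoid gate, module names, and mean pooling are our own simplifications, not the paper's exact formulation.

```python
# Hedged sketch of scenario-aware self-attention for IPE/APE-style preference extraction.
import torch
import torch.nn as nn

class ScenarioAwareSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Maps [target-scenario emb ; target-item emb] to a modulation vector; the paper
        # reshapes this information to the shape of the transformation matrix instead.
        self.modulation = nn.Linear(2 * dim, dim)

    def forward(self, seq_emb, scene_emb, target_emb):
        # seq_emb: [B, N, dim]; scene_emb, target_emb: [B, dim]
        gate = torch.sigmoid(self.modulation(torch.cat([scene_emb, target_emb], dim=-1)))
        gate = gate.unsqueeze(1)                 # [B, 1, dim], broadcast over positions
        q = self.q_proj(seq_emb) * gate          # element-wise involvement of scene/item info
        k = self.k_proj(seq_emb) * gate
        v = self.v_proj(seq_emb) * gate
        out, _ = self.attn(q, k, v)              # [B, N, dim]
        return out.mean(dim=1)                   # pooled preference representation

# Usage: rep = ScenarioAwareSelfAttention(32)(torch.randn(8, 20, 32),
#                                             torch.randn(8, 32), torch.randn(8, 32))
```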
Scenario Contrastive Module

As mentioned above, user interests differ significantly across scenarios, and we use self-supervised learning to characterize this difference. Specifically, we regard the item-level and attribute-level preference representations that incorporate the current target scenario as positive samples, and the corresponding representations that incorporate the other scenarios as negative samples. A contrastive loss guides the model to raise the similarity between the two positive samples and lower the similarity between each negative sample and the positives. This yields two contrastive tasks, whose losses use a similarity function (the cosine distance between two instances) and a temperature parameter. The Scenario Contrastive Module (SCM) thus learns the differences between scenarios in a self-supervised way and improves the model's ability to distinguish scenarios.

Prediction and Optimization

We concatenate the outputs of IPE and APE with the scenario features and the target product features, and feed them into a multi-layer DNN tower, whose output is the predicted probability that the user interacts with the target item. We use the standard cross-entropy loss as the supervised objective, where each of the M samples has a binary label. The joint loss uses a hyperparameter gamma to balance the supervised objective and the self-supervised contrastive objectives.

Experiments

Experimental setup

Dataset: We collect and sample logs from multiple Dewu scenarios as our experimental dataset. The product attributes include price, third-level category, and brand; all attributes are discrete categorical features.

Evaluation metrics: For offline evaluation we use the area under the ROC curve (AUC), which is widely adopted in industrial recommender systems.

Baselines: To verify the effectiveness of SACN, we compare it with a series of state-of-the-art multi-scenario learning (MSL) methods: MMoE, PLE, M2M, PEPNet, and MARIA.

Overall experimental results

We run each model three times and report the average results. The offline comparison is shown in Table 1.

Table 1. Offline comparison of SACN and other methods

The main observations are as follows:

(1) MMoE is chosen as the base model because it is representative in MSL. Compared with MMoE, PLE divides the expert networks into two groups and extracts the differences and commonalities between scenarios more effectively, achieving higher stability across scenarios and better performance.

(2) Both MMoE and PLE add scenario-specific DNN towers on top of the model and output multiple scores for different scenarios, but they ignore the optimization of the lower layers (e.g., the embedding layer), which seriously limits multi-scenario modeling. PEPNet uses scenario-aware gating units to adaptively adjust the embedding and hidden layers. M2M introduces a novel meta unit that incorporates rich scenario knowledge to explicitly learn scenario correlations and enhance scenario-specific feature representations. Both outperform MMoE and PLE.

(3) MARIA outperforms the other baselines by optimizing both the lower and upper parts of the model. However, all of these methods only use product IDs when leveraging user behavior in multi-scenario modeling, without considering the products' multiple attributes, which are essential for generating rich latent interest representations and reflecting user interest differences across scenarios. By exploiting product attribute information in multi-scenario modeling, our SACN achieves the best performance in all scenarios.
Ablation study

To verify the effectiveness of each component in SACN, especially the multi-attribute modules, we run several ablations:

(1) w/o APE: the APE module is removed from SACN, i.e., no product attribute information is introduced;

(2) w/o APE (w DIF-SR): the APE module is replaced by DIF-SR, an attribute-aware behavior modeling method based on self-attention that does not consider the target scenario and product information;

(3) w/o APE (w ASIF): the APE module is replaced by ASIF, another attribute-aware behavior sequence modeling method that aligns product ID and attribute information;

(4) w/o SCM: the SCM module is removed from SACN, i.e., interest differences across scenarios are not explicitly distinguished.

The ablation results are shown in Table 2. After removing APE, the model degrades significantly, performing even worse than PEPNet; this confirms the correctness and necessity of using product attributes in multi-scenario modeling. When APE is replaced by other multi-attribute-aware behavior modeling methods that do not use the target scenario and product information, AUC also drops, which means that the target scenario and product information help learn fine-grained preference representations. In addition, without SCM the model degrades significantly in all scenarios, showing that using item-level and attribute-level preferences that incorporate scenario information as self-supervision signals is very helpful for distinguishing interests across scenarios.

Table 2. Ablation results of SACN variants

A/B test

To further demonstrate the effectiveness of SACN, we deployed it on the platform for A/B testing. Due to constraints of the industrial environment, it is not possible to compare all baseline models online, so we choose PEPNet as the online baseline. The online evaluation metric is pvctr, i.e., the number of clicks divided by the number of impressions. After one week of online A/B testing, SACN achieved a consistent improvement over PEPNet, with an overall pvctr gain of +1.02%. The online A/B test results again confirm the effectiveness and practicality of SACN in industrial environments.

Conclusion

In this work we propose SACN, a new multi-scenario learning method that introduces both scenario information and product attribute information when modeling user behavior. SACN captures users' coarse-grained and fine-grained preferences from product IDs and attributes, and the preference extraction process also leverages the target scenario and target product as prior information. With the help of self-supervised learning, SACN combines item-level and attribute-level interest representations to distinguish users' preference differences across scenarios. Extensive experiments show that SACN consistently outperforms state-of-the-art baselines.

6. SAINet: Towards Scenario-adaptive User Behavior Modeling for Multi-scenario Recommendation

Overall structure

As shown in Figure 5(a), PEPNet uniformly applies the same weight to the representation vector aggregated from historical behaviors (aggregation such as concatenation or pooling can be used).
As shown in Figure 5(b), SAR-Net attempts to integrate the target scenario information when computing the weight of each behavior in the sequence through a target attention mechanism. However, both methods only use the target scenario and cannot distinguish the scenario to which each behavior in the sequence belongs. SAINet, shown in Figure 5(c), adaptively adds the scenario information to the user's historical behavior sequence as a kind of attribute side information, and uses cascaded stacking to express the scenario information effectively in a deep network. The specific structure of SAINet is shown in Figure 6; it consists of three parts: a series of Scenario-adaptive Blocks, Target-aware Interest Fusion, and a Scenario-aware DNN Tower.

Figure 5. Comparison of SAINet, PEPNet, and SAR-Net

Figure 6. SAINet schematic diagram

Problem definition

Consider a set of items and a set of K scenarios. A user u's historical behaviors form a chronological sequence of length N. Given the target item, the target scenario, and other features, the multi-scenario modeling task aims to design a unified ranking model that provides accurate and personalized recommendations in all K scenarios simultaneously. We choose click-through rate (CTR) prediction as the task: predicting the probability that user u interacts with the target product in the given scenario. Following common industrial practice, we use embedding techniques to convert sparse features into low-dimensional dense vectors, e.g., the embeddings of the target product and the other features.

Specific methods

Scenario-adaptive Block

As shown in Figure 5, existing methods cannot distinguish the differences in user behaviors across scenarios, while the scenario-aware prior knowledge contained in the behavior sequence has an important impact on model accuracy. We therefore design the Scenario-adaptive Block, which adaptively injects the scenario-aware context into the user behavior sequence to obtain a comprehensive and fine-grained interest representation, while tailoring the interest representation according to the target scenario to further capture interests closely related to the current scenario. The model stacks L such blocks, each containing two modules: Scenario-aware Interest Extracting (SIE) and the Scenario Tailoring Module (STM). By stacking blocks, SAINet builds a deep network that progressively enhances its ability to model behavioral differences across scenarios. We describe the computation inside the l-th block as follows.

Scenario-aware Interest Extracting

The Scenario-aware Interest Extracting (SIE) module integrates the scenario prior information carried by historical behaviors and extracts finer-grained user interests. We adopt an improved multi-head self-attention (MHA) that incorporates scenario knowledge. The input is the output of the (l-1)-th block, covering the N historically interacted products, and MHA encodes it into an interest matrix. The Q, K, and V projections integrate the scenario embeddings into the output of the (l-1)-th block to obtain a more accurate interest representation; each behavior's scenario of origin is represented by a scenario embedding matrix.
The integration adds a linear transformation of each behavior's scenario embedding into the query, key, and value projections, where h denotes the number of attention heads, a weight matrix performs the output linear transformation, per-head projection matrices produce the query, key, and value, and a transformation matrix maps the scenario embeddings.

Scenario Tailoring Module

Although SIE leverages the scenario information of historical behaviors, it ignores the target scenario information, which is very important for explicitly capturing user interests related to the current scenario. We propose the Scenario Tailoring Module (STM) to further tailor the user's interest representation. STM consists of N lightweight gate units; the i-th gate is computed from the target scenario embedding through a projection matrix and a bias term, followed by a scaling factor that further compresses or amplifies the tailoring signal, and is applied to the i-th interest vector via an element-wise product. The interest representation of the l-th block is then obtained, and the Scenario-adaptive Block is iterated L times to improve its ability to capture behavioral differences across scenarios. In particular, the original input of the first block is the embedding matrix of the user behavior sequence.

Target-aware Interest Fusion

After the Scenario-adaptive Blocks, multiple interest representations are produced. They must first be fused before being combined with the other feature vectors fed to the downstream DNN. We use Target-aware Interest Fusion (TIF) to fuse them through an attention mechanism, yielding a fused interest vector corresponding to the target product and target scenario. The attention weight of each interest representation measures its correlation with the concatenation of the target product embedding and the target scenario embedding, with learnable parameters.

Scenario-aware DNN Tower

Although we prioritize optimizing the lower part of the model (i.e., user behavior modeling), the top of the model should not be ignored. We therefore dynamically scale the hidden units of the top DNN by introducing a Scenario-aware DNN Tower (SDT) in the prediction stage. We first concatenate all the outputs, then use the SDT to predict the probability that the user clicks on the target product; in the j-th layer of the scenario-aware DNN, the hidden units are scaled by a scenario-conditioned gate with a sigmoid activation and layer-specific weights and biases. For the CTR task we use the cross-entropy loss as the objective function, where the label is the ground truth of each sample and |D| is the number of samples. A minimal sketch of one Scenario-adaptive Block is given below.
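The sketch below gives one hedged reading of a single Scenario-adaptive Block: SIE injects each behavior's scenario embedding into the self-attention input, and STM rescales the result with a gate driven by the target-scenario embedding. The sigmoid gate and the default scaling factor of 2 follow the text above loosely; the exact formulation is ours.

```python
# Hedged sketch of one Scenario-adaptive Block (SIE + STM), with illustrative shapes.
import torch
import torch.nn as nn

class ScenarioAdaptiveBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, scale: float = 2.0):
        super().__init__()
        self.scene_proj = nn.Linear(dim, dim, bias=False)   # transforms per-behavior scenario embeddings
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)                      # STM gate from the target-scenario embedding
        self.scale = scale

    def forward(self, h, behavior_scene_emb, target_scene_emb):
        # h: [B, N, dim] output of the previous block (or the raw behavior embeddings)
        # behavior_scene_emb: [B, N, dim] scenario embedding of each behavior
        # target_scene_emb: [B, dim] embedding of the target scenario
        x = h + self.scene_proj(behavior_scene_emb)          # SIE: inject scenario context into Q/K/V
        interest, _ = self.attn(x, x, x)                     # [B, N, dim]
        gate = self.scale * torch.sigmoid(self.gate(target_scene_emb))  # [B, dim]
        return interest * gate.unsqueeze(1)                  # STM: element-wise tailoring per position

# Stacking L blocks deepens the scenario-aware interaction, e.g.:
# blocks = nn.ModuleList([ScenarioAdaptiveBlock(32) for _ in range(2)])
```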
Experiments

We conduct extensive experiments to verify the effectiveness of SAINet and answer the following questions:

RQ1: How does SAINet perform compared to state-of-the-art baselines?

RQ2: How does each module in SAINet contribute?

RQ3: How do the hyperparameters of SAINet affect its performance?

Experimental setup

1. Datasets. We conduct experiments on two real-world datasets; their statistics are shown in Table 3.

AliCCP: a public dataset with training and test sets released by Taobao, widely used in the recommendation literature. We divide it into three scenarios (#C1 to #C3) according to the context feature values.

Dewu dataset: user logs of five scenarios (#A1 to #A5) obtained by random sampling.

Table 3. Basic statistics of the datasets

2. Evaluation metrics. We use the widely adopted AUC (area under the ROC curve on the test set) to measure model performance. Even a small increase in AUC can lead to a significant increase in CTR on a real industrial platform.

3. Baselines. To demonstrate the effectiveness of SAINet, we compare it with three categories of multi-scenario methods:

General recommenders: samples from all scenarios are combined to train a unified ranking model.
Scenario-specific network structures: each scenario is treated as a separate task, and the model consists of multiple scenario-specific networks.
Parameter-adaptive network structures: the scenario context is applied directly to the embedding layer and the hidden layers of the DNN, and the model parameters are adjusted dynamically as the scenario changes.
Overall experimental results

Table 4 shows the comparison results of all methods on the two datasets. For each method we repeat the experiment five times and report the average; statistical significance is assessed with a t-test, and the improvement of our method over the best baseline is significant at the 0.05 level.

Table 4. Offline comparison of SAINet and other methods

Based on Table 4, SAINet consistently outperforms all baselines on both datasets.
Ablation study

To evaluate the effectiveness of each module in SAINet, we compare SAINet with the following variants: w/o SIE (the SIE module is removed), w/o STM (the STM module is removed), w/o TIF (the TIF module is replaced by mean pooling), and w/o SDT (the SDT module is removed).
The ablation results are shown in Table 5.

Table 5. Ablation results of SAINet variants

As shown in Table 5, each module contributes to SAINet's performance. Specifically, removing the SIE module (w/o SIE) hurts prediction in all scenarios, indicating that integrating scenario prior knowledge from historical behaviors is crucial for modeling user interest differences across scenarios. Removing STM (w/o STM) also leads to a clear drop, which verifies the effectiveness of tailoring the interest representation to capture preferences closely related to the current scenario. Without the TIF module, performance decreases on both datasets, showing that fusing interests with target attention rather than mean pooling helps prediction accuracy. Finally, removing SDT hurts the model as well, showing that adjusting the top-level network parameters with scenario information should not be ignored.

Hyperparameter study

We examine several key hyperparameters: the number of Scenario-adaptive Blocks L, the scaling factor in the STM module, and the number of heads h in the SIE module.

Effect of the number of blocks L: Figure 7(a) shows the effect of different L. As L increases, AUC improves, mainly because a larger L strengthens the interaction between the interest representation and the scenario context; increasing L beyond 2 does not bring significant benefits.

Effect of the scaling factor: the scaling factor is searched over {0.8, 1.2, 1.6, 2.0, 2.4, 2.8}. As shown in Figure 7(b), SAINet performs best in AUC when the factor equals 2, and increasing it beyond 2 degrades performance. Therefore, we set the scaling factor to 2 in SAINet and its variants in all experiments.

Effect of the number of heads h: h is selected from {2, 4, 6, 8, 10}. Figure 7(c) shows the effect of the number of heads in SIE's multi-head attention. AUC peaks when h equals 4, and more heads lead to worse performance, so h is set to 4 in all experiments.

Figure 7. SAINet hyperparameter curves

Conclusion

This work emphasizes the need to distinguish user behavior differences across scenarios for interest modeling and proposes SAINet. It introduces a series of Scenario-adaptive Blocks to integrate scenario prior knowledge into user behaviors, capture fine-grained interests, and tailor interest representations according to the target scenario context; stacking blocks further strengthens cross-scenario modeling. SAINet also uses a Scenario-aware DNN Tower (SDT) to automatically adjust the top-level DNN hidden units for better predictions. Extensive experiments demonstrate the advantages of SAINet in multi-scenario modeling.

7. DSWIN: Disentangling Scenario-wise Interest Network for Multi-scenario Recommendation

Overall structure

As shown in Figure 8, and analogous to the comparison for SAINet, we compare DSWIN with PEPNet and SAR-Net.
As shown in Figure 8(a), PEPNet uniformly applies the same weight to the representation vector aggregated from historical behaviors (using aggregation such as concatenation or pooling). As shown in Figure 8(b), SAR-Net attempts to fuse the target scenario information when computing the weight of each behavior in the sequence through a target attention mechanism. However, both only use the target scenario information and cannot distinguish the scenario of each behavior in the sequence. DSWIN, shown in Figure 8(c), groups user behaviors by their scenario of origin: Global Interest Aggregation models the user's global behavioral interests, Local Interest Resolution models the user's local interests within each scenario, and the Interest Disentangling Module disentangles the user's interests across scenarios. The specific structure of DSWIN is shown in Figure 9; it consists of three parts: Global Interest Aggregation, Local Interest Resolution, and the Interest Disentangling Module.

Figure 8. Comparison of DSWIN with PEPNet and SAR-Net

Figure 9. DSWIN schematic diagram

Problem definition

Consider a set of products and a set of K scenarios. A user u's historical behaviors form a chronological sequence of length N, and the behaviors that occurred in the k-th scenario form a subsequence of it. Given the target product, the target scenario, and other features, the multi-scenario modeling task aims to design a unified ranking model that provides accurate and personalized recommendations in multiple scenarios simultaneously. We consider the click-through rate (CTR) prediction task: predicting whether user u clicks the target product in the given scenario using the behavior sequence and other contextual features. We use standard embedding techniques to convert sparse features into low-dimensional dense vectors, e.g., the embedding vectors of the target product and the other features.

Specific methods

Global Interest Aggregation

As shown in Figure 8, previous work treats the user's historical behavior sequence as a whole and ignores the scenario in which each behavior was generated. We therefore first design the Global Interest Aggregation (GIA) module, which dynamically integrates the user's global behaviors with scenario-aware contextual information to obtain a comprehensive, fine-grained interest representation.

Scenario-aware Context Aggregation Module

We first design the Scenario-aware Context Aggregation Module (SCAM), which uses an attention mechanism to aggregate behaviors from different scenarios. At the same time, we integrate the scenario prior information of both the historical behaviors and the current sample into SCAM to better capture the user's scenario-aware global interest. SCAM produces an interest representation vector corresponding to the current sample and the target scenario, formed as a weighted aggregation of the historically interacted products with learnable parameters. Each attention weight measures the correlation between the target product and the j-th interacted product in the behavior sequence, where the query is the concatenation of the target product embedding and the target scenario embedding.
Similarly, the key for the j-th behavior is the concatenation of the j-th clicked product's embedding and its corresponding scenario embedding, with learnable parameters.

Context Feedback Fusion Module

Although SCAM uses scenario information when computing the attention weights, its ability to express complex interactions between the target scenario and user behaviors is limited. To further capture global user interests closely related to the current target scenario, we propose the Context Feedback Fusion Module (CFFM), which fuses the behavioral interest representation with the corresponding context (i.e., the target product and scenario) through nonlinear feature interactions. Specifically, CFFM consists of an MLP with k blocks, each taking the previous block's output with learnable parameters; the initial input is the element-wise product of the interest representation and the context embedding. The final output is the global interest corresponding to the target scenario context and the current sample.

Local Interest Resolution

To explicitly distinguish user interest differences across scenarios, we design the Local Interest Resolution (LIR) module to extract the user's interests within each individual scenario. LIR splits the global behavior sequence into multiple subsequences by scenario and consists of multiple structurally symmetric Interest Extracting Units (IEUs). Considering that user behavior within the same scenario is more concentrated and consistent, each IEU adopts an improved multi-head self-attention (MHA) that introduces the specific scenario information as a bias term to model each subsequence and obtain the user's local interest representation in that scenario. MHA also enables LIR to model user preferences from multiple interest perspectives.

Interest Extracting Unit

For the subsequence of behaviors the user performed in the k-th scenario, with its own number of interacted products, the embedding transformation produces the behavioral embedding matrix of that scenario, which is then encoded by MHA. The Q, K, and V projections add the scenario embedding as a bias term to the behavior embedding matrix to guide local interest extraction, where h denotes the number of heads, a weight matrix performs the output linear transformation, per-head projection matrices produce the query, key, and value, and a transformation matrix maps the bias term. The output matrix is then processed by a mean pooling layer to obtain a vector representing the user's local interest in the k-th scenario, including in particular the current scenario. Finally, the output of LIR consists of the representations of all K scenarios. A sketch of this scenario-wise splitting is given below.
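Below is a minimal, hedged sketch of the scenario-wise splitting behind LIR: behaviors are grouped by their scenario of origin, each subsequence is encoded by self-attention with its scenario embedding added as a bias, and mean pooling yields one local interest vector per scenario. Names, shapes, and the zero-vector fallback for empty scenarios are our own choices.

```python
# Hedged sketch of LIR-style per-scenario interest extraction for a single user.
import torch
import torch.nn as nn

class LocalInterestResolution(nn.Module):
    def __init__(self, dim: int, num_scenarios: int, num_heads: int = 2):
        super().__init__()
        self.scene_emb = nn.Embedding(num_scenarios, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_scenarios = num_scenarios

    def forward(self, behavior_emb, behavior_scene_ids):
        # behavior_emb: [N, dim] one user's behavior embeddings
        # behavior_scene_ids: [N] scenario id of each behavior
        locals_ = []
        for k in range(self.num_scenarios):
            sub = behavior_emb[behavior_scene_ids == k]              # split the sequence by scenario
            if sub.numel() == 0:                                     # scenario with no behavior
                locals_.append(torch.zeros(behavior_emb.size(-1)))
                continue
            biased = (sub + self.scene_emb.weight[k]).unsqueeze(0)   # scenario embedding as a bias term
            out, _ = self.attn(biased, biased, biased)               # [1, n_k, dim]
            locals_.append(out.mean(dim=1).squeeze(0))               # local interest of scenario k
        return torch.stack(locals_)                                  # [K, dim]

# Usage: reps = LocalInterestResolution(16, num_scenarios=3)(torch.randn(10, 16),
#                                                            torch.randint(0, 3, (10,)))
```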
Interest Disentangling Module

As mentioned earlier, user interests in different scenarios both overlap and differ. Since there is no labeled information about user interests (no explicit signal tells the model whether two interest representations should be similar), purely supervised modeling lacks a clear signal to fully distinguish interests across scenarios. We therefore use self-supervised learning to disentangle interests across scenarios. Unlike existing contrastive-learning approaches, which tend to rely on complex data augmentation, we analyze the specific domain problem and design the contrastive strategy directly from the original data. Specifically, we take the global interest representation and the local interest representation of the current target scenario as positive samples, and the local interest representations of the other scenarios as negatives. A contrastive loss teaches the model to raise the similarity between the two positive samples and lower the similarity between each negative and the positives. This yields two contrastive tasks, optimized with an InfoNCE loss that uses a similarity function (the cosine distance between two instances) and a temperature parameter. The Interest Disentangling Module (IDM) thus supervises the separation of scenario interests through a strong self-supervised signal and improves the model's ability to distinguish scenarios.

Prediction and Optimization

We concatenate the outputs of GIA and LIR with the scenario features, target product features, and other features, then use a multi-layer DNN tower with a sigmoid activation to predict the probability that the user clicks the target product. For the supervised CTR task we use the cross-entropy loss, where the label is the ground truth of each sample and |D| is the number of samples. We train the model end to end on the supervised and self-supervised objectives; the joint loss uses a hyperparameter to balance the two. A minimal sketch of the joint objective is given below.
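As a hedged sketch of this joint objective: an InfoNCE-style loss pulls the global interest and the target scenario's local interest together while pushing the other scenarios' local interests away, and is added to the CTR cross-entropy with a balancing weight. The exact pairing of the two contrastive tasks and all function names are our reading, not the paper's code; the same structure applies in spirit to SACN's SCM.

```python
# Hedged sketch of an InfoNCE-style disentangling loss plus the supervised CTR loss.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature: float = 0.2):
    # anchor, positive: [B, d]; negatives: [B, K-1, d]
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature                # [B]
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # [B, K-1]
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                               # positive is class 0
    return F.cross_entropy(logits, torch.zeros(anchor.size(0), dtype=torch.long))

def joint_loss(ctr_logits, labels, global_int, local_target, local_others, alpha=0.1):
    # Supervised CTR loss plus two contrastive tasks, balanced by the weight alpha.
    ce = F.binary_cross_entropy_with_logits(ctr_logits, labels)
    ssl = info_nce(global_int, local_target, local_others) \
        + info_nce(local_target, global_int, local_others)
    return ce + alpha * ssl

# Usage: loss = joint_loss(torch.randn(8), torch.randint(0, 2, (8,)).float(),
#                          torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 4, 16))
```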
Experiments

We conduct extensive experiments to verify the effectiveness of the proposed framework and answer the following questions:

RQ1: How does DSWIN perform compared to state-of-the-art baselines?

RQ2: How does each module in DSWIN contribute?

RQ3: How do the hyperparameters of DSWIN affect its performance?

RQ4: Can DSWIN effectively disentangle user interests by scenario?

Experimental setup

The setup is the same as for SAINet above.

Overall experimental results

Table 6 shows the comparison results of all methods on the industrial and public datasets. For each method we repeat the experiment five times and report the average; statistical significance is assessed with a t-test, and the improvement of our model over the best baseline is significant at the 0.05 level. The following conclusions can be drawn.

Table 6. Offline comparison of DSWIN and other methods

All General Recommenders methods are consistently unsatisfactory compared with the other approaches on both datasets. DNN uses a single DNN tower to handle all scenarios, completely ignoring the interrelationships and differences between them. Although DeepFM attempts some optimization of feature interactions, the improvement is not significant when the dataset exhibits obvious scenario variation. Scenario-specific network structures were proposed to address these shortcomings, and they all outperform the General Recommenders methods.

Ablation study

To evaluate the effectiveness of each module in DSWIN, we compare DSWIN with three variants: w/o GIA (the GIA module is removed), w/o LIR (the LIR module is removed), and w/o IDM (the IDM module is removed).
Table 7. Ablation results of DSWIN variants

As shown in Table 7, each module contributes considerably to the quality of DSWIN's predictions in all scenarios. Specifically, removing GIA (w/o GIA) hurts prediction in all scenarios, indicating the importance of integrating scenario prior information from historical behaviors and the current instance when enhancing scenario-specific customization of user behaviors; the CFFM inside GIA also helps capture global user interests closely related to the current target scenario. Removing LIR (w/o LIR) likewise causes a clear drop, which strongly supports distinguishing which scenario an individual behavior comes from and explicitly capturing the differences in user interests across scenarios; at the same time, LIR effectively learns the intrinsic correlation of user behaviors within the same scenario. Finally, without IDM the performance drops significantly on both datasets, reflecting that the self-supervised signal injected by contrastive learning is very helpful for disentangling scenario interests and ensuring prediction accuracy.

Hyperparameter study

We examine several key hyperparameters: the number of blocks k in CFFM, the temperature in the contrastive loss, and the weight balancing the supervised and self-supervised losses.

Effect of the number of blocks k in CFFM: Figure 10(a) shows the effect of different k. As k increases, AUC improves, mainly because a larger k deepens the interaction between the interest representation and the context; increasing k beyond 2 does not bring significant benefits.

Effect of the temperature coefficient: the temperature is tuned within {0.1, 0.2, 0.4, 0.8}. As shown in Figure 10(b), the best choice varies by dataset; values that are too small (e.g., 0.1) or too large (e.g., 0.8) are both unsuitable. A value that is too large weakens the ability to distinguish negative samples, while a value that is too small over-amplifies the effect of some negative samples, both resulting in poor performance.

Effect of the loss-balancing weight: we vary the weight over {1e0, 1e-1, 1e-2, 1e-3, 1e-4}; in particular, a weight of 0 is equivalent to removing the IDM module. As shown in Figure 10(c), performance peaks at 1e-1; as the weight increases further, the model becomes worse and worse. We attribute this to the decreasing importance of the main prediction task as the weight grows, which verifies the necessity of a hyperparameter to balance the different objectives.

Figure 10. DSWIN hyperparameter curves

Visual analysis

We explore experimentally how IDM promotes interest representation learning and whether the module captures interest similarity across scenarios. We compute the cosine similarity between scenarios based on the embedding vectors produced by LIR, and plot the distribution of similarity scores with and without the contrastive loss (Figure 11). The similarity scores of the embeddings learned by DSWIN are smaller than those learned without IDM.
This phenomenon shows that IDM enables the model to effectively disentangle interests by scenario.

Figure 11. Distribution of interest similarities output by LIR

A/B test

We conduct online A/B tests on real traffic. Specifically, we deploy DSWIN and the baseline method in the online serving system and run inference for users' daily requests. Due to industrial constraints, not all baseline models can be compared online, so we choose PEPNet as the baseline. Averaged over a one-week test, DSWIN achieved a pvctr gain of +1.51% over PEPNet, which is a significant improvement for a mature industrial system.

Conclusion

In this work we emphasize the need to distinguish user behavior differences across scenarios for interest modeling and design a novel scenario-wise interest disentangling network, DSWIN. It first introduces the GIA module to integrate the user's global behaviors with scenario-aware context, dynamically obtaining comprehensive and fine-grained user interest representations. The LIR module then explicitly extracts the user's interests within each scenario. Finally, DSWIN uses contrastive learning to disentangle interests by scenario and distinguish the differences between scenarios; the model can also effectively capture the migration of interests across scenarios. Extensive offline and online experiments demonstrate the advantages of DSWIN in multi-scenario modeling.

8. Summary and Outlook