Intelligent Question Answering: BERT-based Semantic Model

Author: Luo Zijian

Background

Feishu Intelligent Q&A serves employee-service scenarios. It aims to reduce customer-service manpower while efficiently satisfying users' knowledge-seeking needs in the form of answer cards. Feishu Intelligent Q&A integrates the question-answer pairs in the service desk and the wiki into a Q&A knowledge base, and delivers knowledge to users in a one-question-one-answer manner in both comprehensive search and the service desk.

As an enterprise-level SaaS application, Feishu has extremely high requirements for data security and service stability, which leads to a serious shortage of training data and a heavy reliance on public data rather than business data. During model iteration, this reliance on public data also causes a mismatch between the training data distribution and the business data distribution. Through cooperation with multiple pilot service desks, and only after obtaining full authorization from users, training is performed in a way that does not touch the data: the model can see the data, but humans cannot obtain the plaintext in any way.

For the above reasons, our offline test data are all artificially constructed. The AUC (Area Under Curve) computed on them therefore does not match the business data distribution; it can only serve as a reference for verifying model performance and cannot be used as the technical indicator driving optimization and iteration. Instead, we use user click behavior to verify whether the model effect has improved.

In the business setting, whether an answer is displayed is determined by the similarity calculated by the model, and the threshold that controls answer display has a significant impact on both the click-through rate and the display rate. To avoid interference from the threshold value, Feishu Q&A uses SSR (Session Success Rate) as the decisive indicator for evaluating model effect. It is computed from two counters: total_search_number records each question asked by the user, and search_click_number records whether the user clicks after each question and which answer is clicked.
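From these two counters, the SSR can presumably be written as:

$$\mathrm{SSR} = \frac{\mathrm{search\_click\_number}}{\mathrm{total\_search\_number}}$$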

bot_solve_rate (BSR) is used to evaluate the effect of robot interception: the more work orders the robot intercepts, the less customer-service manpower is consumed.

Feishu Intelligent Question Answering Model Technology

Original 1.0 version model

The earliest model used in the question-answering service was the SBERT (Sentence Embeddings using Siamese BERT) model (1), which is also a widely used model in the industry. Its model structure is as follows:

The query and the FAQ questions are fed into the twin (weight-sharing) BERT encoders for training, and BERT's parameters are adjusted through a binary classification objective. All FAQ questions can then be converted into vectors offline and stored in the index library. At inference time, the user's query is converted into an embedding and used for recall against the index library; vector similarity is computed with cosine similarity, referred to simply as similarity below.

The biggest advantage of this scheme over an interaction-based model is that the text-similarity computation time is decoupled from the number of FAQs and does not grow linearly with it.
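A minimal sketch of this offline-index / online-recall flow, assuming the sentence-transformers library; the model name and FAQ texts are placeholders, not the production setup:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder bi-encoder; the production model is a fine-tuned SBERT-style encoder.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Offline: encode every FAQ question once and store the vectors in an index.
faq_questions = ["How do I reset my password?", "How do I book a meeting room?"]
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True, normalize_embeddings=True)

# Online: encode the user query and recall the most similar FAQ by cosine similarity.
query_embedding = model.encode("forgot my login password", convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_embedding, faq_embeddings)   # shape [1, num_faqs]
best = int(scores.argmax())
print(faq_questions[best], float(scores[0, best]))
```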

Improved 2.0 version model

The representation learning of the 1.0 model was still not good enough: even when two sentences were not very similar, the model would still give a relatively high score, so overall discrimination was low. The 2.0 model therefore increases the interaction between the two sentences to obtain more information and better distinguish whether they are similar. In addition, ideas from face recognition are introduced to make the distribution of similar content more compact and to enlarge the distance between dissimilar classes, thereby improving the model's discrimination. Its structure is shown below:

Compared with the 1.0 model, version 2.0 emphasizes the importance of interaction. On top of the original concatenation, u*v is added as a feature, and an interaction layer (essentially an MLP) is added to further enhance the interaction. In addition, CosineAnnealingLR(2) and ArcMarginLoss(3) are introduced to optimize the training process, and, inspired by the experiments in BERT-Whitening(3), multiple pooling methods are tried to find the best result.
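A minimal sketch of such an interaction head over the two pooled sentence embeddings u and v, assuming an SBERT-style concatenation plus the element-wise product u*v; the feature layout and layer sizes are illustrative, not the exact production configuration:

```python
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    """MLP interaction layer over pairwise features of two sentence embeddings."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size * 4, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),  # binary classification: similar / not similar
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Concatenate u, v, |u - v| and the element-wise product u * v as interaction features.
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(features)

# Usage with dummy pooled BERT embeddings for a batch of 8 sentence pairs:
u, v = torch.randn(8, 768), torch.randn(8, 768)
logits = InteractionHead()(u, v)  # shape [8, 2]
```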

CosineAnnealingLR

The cosine-annealing learning rate decays slowly and then jumps back up, which lets the model "escape" the neighbourhood of a local optimum just as it is about to converge to it and go on to find a better local optimum. The following figure is from the CosineAnnealingLR paper: "default" is StepLR decay, while the others are periodic cosine decays in which the learning rate is restored to its maximum each time it decays to 0.

Compared with traditional StepLR decay, cosine-annealing decay makes it easier for the model to find a better local optimum. The following figure simulates the gradient-descent process under StepLR and CosineAnnealingLR (4):

Correspondingly, because cosine annealing explores multiple local optima, training takes longer than with traditional StepLR decay. Also, since the sudden increases in learning rate cause the loss to rise, early stopping during training can be controlled by steps instead of epochs.

In the service-desk deployment scenario, the semantic spaces of different service desks are clearly different, and the amount of data (including the ratio of positive to negative examples) also varies greatly between service desks, so the model inevitably carries bias. Cosine annealing alleviates this problem to a certain extent.
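A minimal sketch of configuring such a schedule, assuming PyTorch's warm-restarts variant (which matches the periodic decay-and-restart behaviour described above); the optimizer, T_0, T_mult and learning-rate values are illustrative:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(768, 2)                      # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Decay the learning rate along a cosine curve and restart it to the maximum every T_0 steps.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000, T_mult=2)

for step in range(3000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()                                  # stepping per batch produces the periodic decay-and-restart curve
```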

ArcMarginLoss

The inspiration comes from the TripletLoss used in the SBERT paper, a loss that was first applied to face recognition. However, TripletLoss relies on triplet inputs, and we had no way to obtain such data when the model was built. We therefore adopted ArcMarginLoss, a recent face-recognition loss that has reached SOTA in that field and that widens the distance between classes on top of Softmax.

However, unlike face recognition, we cannot divide semantics into groups and run an N-way classification over those groups. NLP problems are far more complicated than face recognition, and training data is also much harder to obtain. Even so, binary classification can still push the two classes, related and unrelated, further apart.
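For reference, a sketch of the ArcFace/ArcMargin objective as given in the original paper, with scale s, additive angular margin m, target class y_i, and θ_j the angle between the feature and the class-j weight (the s and m values used in our training are not specified here):

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s\cos(\theta_{y_i}+m)}}{e^{\,s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i} e^{\,s\cos\theta_{j}}}$$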

Experimental Results

According to the results of the AB experiment, the indicators of the 1.0 version model and the 2.0 version model are as follows:


| | Top1 SSR | Top10 SSR | Top1 Click Rate | BSR |
| --- | --- | --- | --- | --- |
| Relative improvement | +7.75% | +6.43% | +1.24% | +3.55% |

Top1 SSR only counts sessions in which the first result was clicked, while Top10 SSR counts clicks on any of the first 10 results. Since only 10 similar results are recalled for a user request, Top10 SSR is the overall SSR.

Top1 Click Rate is the probability that the clicked result is the first one, given that a click happened, i.e., Top1 SSR / Top10 SSR.

As the table shows, the overall SSR increased significantly and the Top1 click share also rose slightly, so from the business indicators it can be inferred that the 2.0 model is clearly better than the 1.0 model. BSR, the indicator the business side cares about most, is strongly affected by user behaviour and product strategy, but the AB experiment still shows that the new model intercepts significantly more work orders, which reduces downstream customer-service manpower.

Ablation experiment

An ablation experiment was conducted on the effect of ArcMarginLoss. With all other conditions unchanged, the model was trained with ArcMarginLoss and with CrossEntropyLoss respectively and evaluated on the same artificial test set.

AUC is used here to compare the two losses offline, since this ablation was not run as an AB experiment:


| | ArcMarginLoss | CrossEntropyLoss |
| --- | --- | --- |
| AUC | 0.925 | 0.919 |

The ablation shows that ArcMarginLoss brings only a slight, not obvious, improvement on the test set. A likely reason is that in face recognition ArcMarginLoss is usually trained with many images of the same identity grouped under one label, whereas this task is a 0/1 classification of the relationship between two sentences; the mismatch with the face-recognition setting keeps the loss from producing a strong effect.

Version 3.0 based on Contrastive Learning

Although the 2.0 model alleviated the problem of overly concentrated scores, it still could not solve the problem of the heavily imbalanced data distribution (positive to negative samples about 1:10): because positive samples are far fewer than negative samples, the model tends to learn mainly from the negatives. The 3.0 model borrows the idea of Contrastive Learning and turns the binary classification problem into an N-way classification problem, in which each query has exactly one positive among the candidates, ensuring that the model learns the positive samples better.

Contrastive Learning

The 3.0 model introduces the idea of Contrastive Learning and trains with reference to the recent SimCSE(5) paper, whose idea is to apply dropout twice to obtain two different representations of the same sentence and train on them, as shown in the figure below:

In actual training, the Supervised SimCSE setup is adopted: <query_i, question_i> is fed into the model as a positive pair, and each query is trained against the other questions in the batch as negatives, but does not interact with the other queries (the original paper also treats other queries as negatives). As an example, assuming 3 pairs are input, the labels are as follows:


| | query1 | query2 | query3 |
| --- | --- | --- | --- |
| question1 | 1 | 0 | 0 |
| question2 | 0 | 1 | 0 |
| question3 | 0 | 0 | 1 |

BERT converts the texts into embeddings and the cosine similarities are computed; with the labels constructed as above, CrossEntropyLoss is used to compute the loss and update the parameters.

Therefore, the objective function of Contrastive Learning can be expressed in the standard SimCSE form:

$$\ell_i = -\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/T}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(h_i,\,h_j^{+})/T}}$$

where sim is the cosine similarity, T is the temperature hyperparameter, and <h_i, h_i+> is a sentence pair (the representations of query_i and question_i).

A trick from Momentum Contrast (MoCo)

The original SimCSE feeds the similarities directly into CrossEntropyLoss to obtain the loss. However, because of the temperature hyperparameter T, the cosine similarities are amplified: after softmax, the probability mass concentrates on the most similar candidate while the probabilities of the less similar ones tend to 0. The value of T therefore strongly affects the gradient magnitude during back-propagation; as T decreases, the gradient keeps growing, making the effective learning rate much larger than the configured one and slowing down convergence. To alleviate this, Momentum Contrast (6) scales the loss as loss_i = 2 * T * l_i, and we adopt this scaled loss for training.
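A minimal sketch of this in-batch contrastive objective with the 2 * T scaling, assuming query/question embeddings produced by the BERT encoders; the temperature value and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, question_emb: torch.Tensor, T: float = 0.05) -> torch.Tensor:
    # Cosine-similarity matrix: row i holds sim(query_i, question_j) for every j in the batch.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(question_emb, dim=-1)
    sim = q @ d.t() / T                      # shape [N, N], scaled by the temperature T
    labels = torch.arange(sim.size(0))       # the diagonal pair (query_i, question_i) is the positive
    loss = F.cross_entropy(sim, labels)      # N-way classification over the in-batch candidates
    return 2 * T * loss                      # gradient-scale correction borrowed from Momentum Contrast

# Usage with dummy embeddings for a batch of 3 pairs:
loss = contrastive_loss(torch.randn(3, 768, requires_grad=True), torch.randn(3, 768))
loss.backward()
```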

Experimental Results

According to the results of the AB experiment, the business indicators of the 2.0 and 3.0 versions of the model are as follows:


| | Top1 SSR | Top8 SSR | Top1 Click Rate | BSR |
| --- | --- | --- | --- | --- |
| Relative change | +7.10% | +5.88% | +1.19% | +4.07% |

First, due to fluctuations of business data across periods, the Top1 SSR of version 2.0 deviates from the earlier figures. In addition, the product changed the total number of displayed results from 10 to 8, so Top10 SSR becomes Top8 SSR.

Finally, the BSR in this AB experiment looks similar to that of the 1.0 model in the earlier AB experiment because the statistical caliber for tickets entering the robot was changed. Previously, when the question was displayed in search, the user jumped directly to the robot and it was recorded as the robot solving the problem; now search displays the answer directly without entering the service desk, so some tickets are intercepted on the search side and no longer counted as robot interception. The lower BSR therefore reflects this caliber change rather than data fluctuation. Even so, within this AB experiment the new model's BSR is still 4.07% higher, a significant improvement.

In summary, the model has clearly improved on the core technical indicators: the user demand satisfaction rate (Top8 SSR) increased by 5.88%, and the share of answers ranked first increased by 1.19%, a comprehensive improvement in model effect.

Reasons for model improvement

  • The data is organized differently. Contrastive Learning uses only the positive examples in the dataset and treats the other questions in the same batch as negatives, so each query sees far more negatives than in the original dataset. Compared with the 2.0 model, every positive example is contrasted against more negatives, which helps the model identify truly similar sentences.
  • The loss functions define different training objectives. The 2.0 loss only considers the relationship between two sentences, while the 3.0 loss considers all sentences in a batch at once and must find the positive example among the N sentences of a batch of size N.
  • According to the paper, contrastive learning can eliminate the anisotropy of BERT. Specifically, BERT's semantic space is concentrated in a narrow cone, which makes cosine similarities systematically large: even two completely unrelated sentences can get a relatively high score. This was demonstrated in the pre-trained-embedding studies of Ethayarajh (6) and Bohan Li (7). The problem can be removed by contrastive learning or by various post-processing methods (such as whitening (8) or flow (9)).

  • High-frequency words strongly affect BERT's sentence embeddings: a small number of high-frequency words dominates the embedding distribution, which weakens the expressiveness of the BERT model. Contrastive Learning can eliminate this effect; the experiments in ConSERT (10) confirm that high-frequency words do severely affect the embeddings, as shown in the figure:

Ablation experiment

In order to explore the impact of the loss and the data on the model, this ablation experiment trains with the same positive/negative samples and the same ratio as Contrastive Learning. The training method is as follows:

  1. Let batch_size = 32; following the Contrastive Learning setup, we obtain a cosine-similarity matrix of dimension [32, 32].
  2. The label matrix is generated as in Contrastive Learning, i.e., the [32, 32] identity matrix.
  3. Flatten the cosine matrix and the label matrix into [32*32, 1]. Under this setup the data organization is exactly the same and only the loss function and training method differ. Train with this method and compute the AUC on the offline test set (see the sketch below).
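A minimal sketch of the two settings compared in this ablation, applied to the same [N, N] similarity matrix; the matrix here is random and stands in for the real cosine similarities:

```python
import torch
import torch.nn.functional as F

N = 32
sim = torch.randn(N, N, requires_grad=True)   # stand-in for the [32, 32] cosine-similarity matrix
labels = torch.eye(N)                         # identity label matrix from the Contrastive Learning setup

# (a) Contrastive Learning: an N-way classification that picks the one positive per row.
contrastive = F.cross_entropy(sim, labels.argmax(dim=1))

# (b) CrossEntropyLoss binary classification: flatten to [N*N, 1] and classify each pair as 0/1.
binary = F.binary_cross_entropy_with_logits(sim.reshape(-1, 1), labels.reshape(-1, 1))
```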

In simple terms, Contrastive Learning finds the most similar sentence among the 32 candidates, while CrossEntropyLoss performs 32 independent binary classifications. Training with Contrastive Learning and with CrossEntropyLoss binary classification respectively under the method above gives the following results:


| | Contrastive Learning classification | CrossEntropyLoss binary classification |
| --- | --- | --- |
| AUC | 0.935 | 0.861 |

Further analysis showed that the CrossEntropyLoss binary classifier tends to label text pairs as 0: the excessive negative samples cause the model to neglect learning from the positives. Simply predicting label = 0 already achieves 98% accuracy on the training set.

This ablation also shows that the loss function has a significant impact on training: with enough negative samples and the same data organization, Contrastive Learning outperforms CrossEntropyLoss. Since positive samples are far fewer than negative samples in the real business, the Contrastive-Learning-based training method is more suitable for the business application.
