Detailed explanation of sentiment analysis based on Naive Bayes and its implementation in Python

Unlike dictionary-based analysis, the machine learning-based approach does not need large annotated dictionaries, but it does need a large amount of labeled data. For example, the comment quoted below carries this label:

Service quality - Medium (there are three levels: good, medium, and poor)

╮(╯-╰)╭ That is all machine learning means here: train a model on a large amount of labeled data,

then feed in a new comment and let the model determine its label level.

Ning Xin's comments: During the National Day event, a credit card that bears the UnionPay logo and has a number starting with 62 lets you buy an ice cream for 6.2 yuan.
There are three flavors to choose from: vanilla, chocolate, and matcha. I chose vanilla, which is very rich.
In addition, with any purchase you can buy two macarons for 10 yuan. They are not very big, but they are delicious and not too sweet, so you won’t get sick of them.
Label: Service quality - Medium

Naive Bayes

1. Bayes' theorem

Assume that, for a given data set, the random variable C denotes the class a sample belongs to, and F1 denotes whether a certain feature appears in the sample. Applying the basic Bayes formula gives:

P(C|F1) = P(F1|C) P(C) / P(F1)

The formula gives the conditional probability that a sample belongs to category C, given that feature F1 appears in it. So how is this formula used to classify a test sample?

For example, if a test sample has feature F1 (F1=1), compute the two probabilities P(C=0|F1=1) and P(C=1|F1=1). If the former is larger, the sample is assigned to class 0; if the latter is larger, it is assigned to class 1.

Several concepts in this formula need to be understood:

Prior probability (prior). P(C) is the prior probability of C. It can be estimated as the proportion of samples in the existing training set that belong to category C.

Evidence. P(F1) is the probability that feature F1 appears in a sample. It can likewise be estimated as the proportion of training samples that have feature F1.

Likelihood. P(F1|C) is the probability that a sample has feature F1, given that the sample is known to belong to category C.
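As a minimal sketch of how the prior, evidence, and likelihood combine, the snippet below estimates each of them by counting in a tiny made-up training set and then compares the two posteriors. The data and the helper name `posterior` are illustrative assumptions, not part of the article's data set.

```python
from fractions import Fraction

# Hypothetical training set: (class label C, binary feature F1) pairs.
samples = [(0, 1), (0, 0), (0, 0), (1, 1), (1, 1), (1, 0)]

def posterior(c, f1, data):
    """Estimate P(C=c | F1=f1) = P(F1=f1 | C=c) * P(C=c) / P(F1=f1) by counting."""
    n = len(data)
    n_c = sum(1 for label, _ in data if label == c)
    n_f = sum(1 for _, feat in data if feat == f1)
    n_cf = sum(1 for label, feat in data if label == c and feat == f1)
    prior = Fraction(n_c, n)           # P(C=c)
    evidence = Fraction(n_f, n)        # P(F1=f1)
    likelihood = Fraction(n_cf, n_c)   # P(F1=f1 | C=c)
    return likelihood * prior / evidence

p0 = posterior(0, 1, samples)  # P(C=0 | F1=1)
p1 = posterior(1, 1, samples)  # P(C=1 | F1=1)
predicted = 0 if p0 > p1 else 1
print(p0, p1, predicted)
```

With this toy data the sample with F1=1 goes to class 1, because two of the three samples showing the feature belong to that class.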

For multiple features, the Bayes formula expands as follows:

P(C|F1,F2,...,Fn) = P(C) P(F1|C) P(F2|C,F1) ... P(Fn|C,F1,...,Fn-1) / P(F1,F2,...,Fn)

The numerator contains a long chain of likelihood values. When there are many features, computing them is extremely painful. What can be done?

2. The "naive" assumption

To simplify the calculation, the Naive Bayes algorithm makes one assumption: it naively treats all features as independent of one another. Under this assumption, the numerator of the formula above simplifies to:

P(C)P(F1|C)P(F2|C)...P(Fn|C).

After this simplification, calculation becomes much easier.

Assuming that every feature is independent looks unscientific, because in many cases features are closely related. Nevertheless, a large number of applications have shown that Naive Bayes works quite well in practice.

Secondly, Naive Bayes works by computing P(C=0|F1...Fn) and P(C=1|F1...Fn) and choosing the class with the larger value. The denominators of the two are exactly the same, so the denominator can be dropped altogether, which simplifies the calculation further.
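The decision rule described above can be sketched in a few lines. The probabilities below are hand-picked for illustration, and the function name `classify` is hypothetical. Summing logarithms instead of multiplying probabilities is an extra choice made here to avoid numerical underflow with many features; it does not change which class wins.

```python
import math

def classify(priors, likelihoods, features):
    """Return the class that maximizes P(C) * prod P(Fi | C).

    The shared denominator P(F1...Fn) is never computed.
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # A sum of logs ranks classes the same way as the product does.
        score = math.log(prior) + sum(math.log(likelihoods[c][f]) for f in features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical probabilities for two classes and two features.
priors = {0: 0.5, 1: 0.5}
likelihoods = {
    0: {"F1": 0.2, "F2": 0.6},
    1: {"F1": 0.7, "F2": 0.3},
}
result = classify(priors, likelihoods, ["F1", "F2"])
print(result)
```

Here class 0 scores 0.5 × 0.2 × 0.6 = 0.06 and class 1 scores 0.5 × 0.7 × 0.3 = 0.105, so the sample is assigned to class 1.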

In addition, the Bayes formula is only valid when the evidence is nonzero: for any feature Fx, P(Fx) cannot be 0. In practice, a feature in a test sample may never have appeared in the training set, so its count is zero. Implementations therefore apply a small correction, such as adding 1 to every count (additive smoothing, also called Laplace smoothing). If the value added is an adjustable parameter alpha greater than 0, it is called Lidstone smoothing.
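A minimal sketch of additive smoothing, assuming the common word-count formulation in which alpha is added to each word's count and alpha times the vocabulary size is added to the class total. The function name and the numbers are illustrative only.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    alpha = 1 gives Laplace smoothing; other values of alpha > 0
    give Lidstone smoothing.
    """
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    return (count + alpha) / (class_total + alpha * vocab_size)

# A word never seen in this class still gets a small nonzero probability.
unseen = smoothed_likelihood(count=0, class_total=8, vocab_size=4)
seen = smoothed_likelihood(count=5, class_total=8, vocab_size=4)
lidstone = smoothed_likelihood(count=0, class_total=8, vocab_size=4, alpha=0.5)
print(unseen, seen, lidstone)
```

A smaller alpha stays closer to the raw counts, while a larger alpha pulls every estimate toward a uniform distribution over the vocabulary.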

Sentiment classification based on Naive Bayes

The original data set; only 10 items are sampled here.

Read data

Read the Excel file into a pandas DataFrame.

Word segmentation

Segment each comment into words and remove stop words, which yields the following word lists.

Each list corresponds to a comment.
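The segmentation step can be sketched as follows. This is a simplified stand-in that assumes English text and splits it with a regular expression. The original Chinese comments would need a dedicated word segmenter instead. The stop word list and the sample comments are hypothetical.

```python
import re

# Hypothetical stop word list; a real project would load a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "it", "but"}

def segment(comment):
    """Lowercase the comment, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    return [t for t in tokens if t not in STOP_WORDS]

comments = [
    "The service is not bad and it is cost-effective",
    "The cashier is rude",
]
word_lists = [segment(c) for c in comments]
print(word_lists)
```

Note that this tokenizer splits hyphenated words such as "cost-effective" into two tokens; whether that is desirable depends on the vocabulary you want to count.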

Statistics

What is counted here? Two kinds of data:

1. The number of comments at each level

There are three levels:
c0 → good 2
c1 → medium 3
c2 → poor 5

2. The number of times each word appears at each level

This yields a dictionary:
evaluation [2, 5, 3]
Half price [0, 5, 0]
Cost-effective [1, 1, 0]
Not bad [0, 2, 0]
·········
Dissatisfied [0, 1, 0]
Important [0, 1, 0]
Clear [0, 1, 0]
Specifically [0, 1, 0]
The list indices after each word (feature), 0, 1, and 2, correspond to good, medium, and poor respectively.
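The two counts described above can be collected in a single pass. The sketch below assumes each training item is a (word list, level) pair; the function name `count_statistics` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

LEVELS = ["good", "medium", "poor"]  # list index 0, 1, 2

def count_statistics(labeled_comments):
    """Return (class_counts, word_counts) from (word_list, level) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(lambda: [0, 0, 0])
    for words, level in labeled_comments:
        idx = LEVELS.index(level)
        class_counts[level] += 1       # 1. number of comments per level
        for w in words:
            word_counts[w][idx] += 1   # 2. occurrences of each word per level
    return class_counts, dict(word_counts)

# Hypothetical segmented comments with their labels.
training = [
    (["cost-effective", "tasty"], "good"),
    (["half-price", "not-bad"], "medium"),
    (["half-price", "cost-effective"], "medium"),
    (["dissatisfied", "rude"], "poor"),
]
class_counts, word_counts = count_statistics(training)
print(class_counts)
print(word_counts["half-price"])
```

Each value in `word_counts` has the same shape as the dictionary entries listed above: one count per level, in the order good, medium, poor.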

Once these counts are collected, the model is trained. The more training data there is, the more accurate the model tends to be.

Test

For example, enter a sentence:

Comments on Century Lianhua (Bailian Xijiao Shopping Center): In a city that claims to be an international metropolis, the service attitude of the cashiers is extremely poor. The UnionPay promotion is 30-10, and you can't make consecutive orders.

The result is:

c2 → poor
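To show how a prediction could be derived from the collected counts, here is a minimal sketch. It assumes Laplace smoothing and log probabilities, neither of which the article confirms for its own implementation. The word counts are toy numbers and the function name `predict` is hypothetical.

```python
import math

LEVELS = ["good", "medium", "poor"]  # list index 0, 1, 2

def predict(words, class_counts, word_counts, alpha=1.0):
    """Return the level with the highest smoothed Naive Bayes score."""
    vocab_size = len(word_counts)
    total_comments = sum(class_counts)
    # Total word occurrences observed at each level.
    class_totals = [sum(counts[i] for counts in word_counts.values())
                    for i in range(len(LEVELS))]
    best_level, best_score = None, -math.inf
    for i, level in enumerate(LEVELS):
        score = math.log(class_counts[i] / total_comments)  # log prior
        for w in words:
            count = word_counts.get(w, [0, 0, 0])[i]
            # Additive smoothing keeps unseen words from zeroing the score.
            score += math.log((count + alpha) / (class_totals[i] + alpha * vocab_size))
        if score > best_score:
            best_level, best_score = level, score
    return best_level

# Toy statistics: comments per level, and word occurrences per level.
class_counts = [2, 3, 5]
word_counts = {
    "cost-effective": [1, 1, 0],
    "half-price": [0, 2, 0],
    "delicious": [2, 1, 0],
    "poor": [0, 0, 3],
    "attitude": [0, 1, 2],
}
result = predict(["attitude", "poor"], class_counts, word_counts)
print(result)
```

With these toy counts, a comment containing "attitude" and "poor" is assigned the level "poor", mirroring the kind of result shown above.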
