Detailed explanation of sentiment analysis based on Naive Bayes and its implementation in Python


Compared with dictionary-based analysis, the machine-learning approach does not require large annotated dictionaries, but it does require a large amount of labeled data. For example, a comment might carry a label such as:

Service quality - Medium (there are three levels: good, medium, and poor)

╮(╯-╰)╭ That is machine learning: a model is trained on a large amount of labeled data, and then, given a new comment, it predicts the label level.

Ning Xin's comments: During the National Day promotion, you can use a credit card whose number starts with 62 to buy an ice cream with a UnionPay card logo for 6.2 yuan.
There are three flavors to choose from: vanilla, chocolate and matcha. I chose the vanilla flavor, which is very rich.
In addition, you can buy two macarons for 10 yuan with any purchase. Although they are not very big, they are delicious and not too sweet, so they are not cloying.
Label: Service quality - Medium

Naive Bayes

1. Bayesian Theorem

Assume that for a certain data set, the random variable C denotes the class a sample belongs to, and F1 denotes a certain feature of the test sample. Applying the basic Bayesian formula gives:

P(C|F1) = P(F1|C) P(C) / P(F1)

The above formula gives the conditional probability that a sample belongs to category C given that feature F1 appears. So how do we use this formula to classify a test sample?

For example, if a test sample has feature F1 (F1=1), we calculate the probabilities P(C=0|F1=1) and P(C=1|F1=1). If the former is larger, the sample is assigned to class 0; if the latter is larger, it is assigned to class 1.
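As a minimal sketch of this decision rule, the snippet below plugs made-up numbers into the Bayesian formula for a two-class problem. The priors and likelihoods are illustrative assumptions, not values from the article's data set.

```python
# Illustrative numbers only (assumptions, not taken from the article's data).
prior = {0: 0.6, 1: 0.4}        # P(C=c)
likelihood = {0: 0.2, 1: 0.7}   # P(F1=1 | C=c)

# Evidence P(F1=1), obtained by summing over both classes.
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(C=c | F1=1) for each class via the Bayesian formula.
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

# The sample goes to the class with the larger posterior.
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

With these numbers the posterior for class 1 is the larger one, so the sample would be assigned to class 1.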

There are several concepts that need to be understood for this formula:

Prior probability (Prior). P(C) is the prior probability of C, which can be obtained by calculating the proportion of samples classified into category C in the existing training set.

Evidence. In the above formula, P(F1) is the probability that feature F1 appears in a test sample. It can also be estimated as the proportion of training samples that have feature F1.

Likelihood. In the above formula, P(F1|C) is the probability that a sample has feature F1, given that the sample is known to belong to category C.
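The three quantities above can all be estimated by counting. The following is a small sketch on a made-up training set of (F1, C) pairs; the data and the function name `estimate` are hypothetical and only serve to illustrate the definitions.

```python
# Toy training set of (F1, C) pairs; the values are made up for illustration.
samples = [(1, 0), (0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]

def estimate(samples, c, f=1):
    """Return (prior, evidence, likelihood) estimated by counting."""
    n = len(samples)
    in_class = [s for s in samples if s[1] == c]
    prior = len(in_class) / n                                    # P(C=c)
    evidence = sum(1 for s in samples if s[0] == f) / n          # P(F1=f)
    likelihood = sum(1 for s in in_class if s[0] == f) / len(in_class)  # P(F1=f | C=c)
    return prior, evidence, likelihood

prior, evidence, likelihood = estimate(samples, c=1)
posterior = prior * likelihood / evidence                        # P(C=1 | F1=1)
print(prior, evidence, likelihood, posterior)
```

The posterior computed this way equals the fraction of F1=1 samples that belong to class 1, which is a quick sanity check on the formula.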

For multiple features, the Bayesian formula can be expanded as follows:

P(C|F1...Fn) = P(C) P(F1|C) P(F2|C,F1) ... P(Fn|C,F1,...,Fn-1) / P(F1...Fn)

There is a long list of likelihood values in the numerator. When there are many features, calculating these likelihood values is extremely painful. What can be done?

2. The "naive" assumption

To simplify the calculation, the Naive Bayes algorithm makes an assumption: it "naively" treats all features as independent of one another, given the class. In this way, the numerator of the above formula is simplified to:

P(C)P(F1|C)P(F2|C)...P(Fn|C).

After this simplification, calculation becomes much easier.

Assuming that every feature is independent looks unscientific, because in many cases the features are closely related. However, a large number of applications have shown that Naive Bayes works quite well in practice.

Secondly, Naive Bayes works by calculating P(C=0|F1...Fn) and P(C=1|F1...Fn) and taking the class with the larger value as the classification. The denominators of the two are exactly the same, so we can omit the denominator, which further simplifies the calculation.

In addition, there is an important prerequisite for the Bayesian formula to be valid: no piece of evidence can be 0. That is, for any feature Fx, P(Fx) cannot be 0. In practice, a feature seen at test time may never have appeared in the training data, or never in a particular class, which would also make the corresponding likelihood 0 and wipe out the whole product. Therefore, a small adjustment is usually made in the implementation, such as adding 1 to all counts (additive smoothing, also called Laplace smoothing). If smoothing is performed by adding an adjustable parameter alpha greater than 0, it is called Lidstone smoothing.
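To illustrate why the denominator can be dropped, the sketch below compares the class chosen from the unnormalized product P(C)P(F1|C)...P(Fn|C) with the class chosen after normalization. All probabilities are made-up values for illustration.

```python
from math import prod

# Made-up priors and per-feature likelihoods P(Fi | C) for three features.
prior = {0: 0.5, 1: 0.5}
likelihoods = {0: [0.8, 0.1, 0.5], 1: [0.4, 0.6, 0.5]}

# Unnormalized score: P(C) * P(F1|C) * ... * P(Fn|C).
score = {c: prior[c] * prod(likelihoods[c]) for c in prior}

# Normalized posterior: divide every score by the same denominator.
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)
```

Dividing every score by the same positive number does not change their order, so both selections agree.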
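A minimal sketch of additive smoothing for a word-count feature follows. The function name `smoothed_likelihood` and its parameters are hypothetical; the formula is the usual (count + alpha) / (total + alpha * vocabulary size) estimate.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    count:       times the word occurs in the class
    class_total: total word occurrences in the class
    vocab_size:  number of distinct words (features)
    alpha:       1.0 gives Laplace smoothing, other values > 0 give Lidstone
    """
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    return (count + alpha) / (class_total + alpha * vocab_size)

# A word never seen in the class still gets a small non-zero probability.
print(smoothed_likelihood(0, class_total=8, vocab_size=2))
# Lidstone smoothing with alpha = 0.5.
print(smoothed_likelihood(3, class_total=8, vocab_size=2, alpha=0.5))
```

Because alpha is added once per vocabulary word in the denominator, the smoothed probabilities of all words in a class still sum to 1.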

Sentiment classification based on Naive Bayes

The original data set; only 10 items are sampled here

Read Data

Read the Excel file into a DataFrame using the pandas library

Word segmentation

Segment each comment into words and remove stop words to get the following word lists

Each list corresponds to a comment.
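The segmentation step can be sketched as follows. The article does not say which tool was used; for Chinese text a segmentation library such as jieba is a common choice, but to stay self-contained this sketch assumes English comments, a regular-expression tokenizer, and a tiny hypothetical stop-word list.

```python
import re

# Hypothetical stop-word list; a real project would load a much longer one.
STOP_WORDS = {"the", "a", "is", "are", "and", "but", "of", "to", "it", "they"}

def segment(comment, stop_words=STOP_WORDS):
    """Lowercase the comment, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    return [t for t in tokens if t not in stop_words]

# Made-up comments for illustration.
comments = [
    "The service is not bad and it is cost-effective",
    "Dissatisfied, the cashier attitude is poor",
]
word_lists = [segment(c) for c in comments]
for words in word_lists:
    print(words)
```

Each entry of `word_lists` is the word list for one comment, in the same order as the input comments.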

Statistics

What is being counted here? Two kinds of data:

1. The number of comments at each level

There are three levels here:
c0 → good 2
c1 → medium 3
c2 → poor 5

2. The number of times each word appears in the comments of each level

This gives a dictionary:
evaluation [2, 5, 3]
Half price [0, 5, 0]
Cost-effective [1, 1, 0]
Not bad [0, 2, 0]
·········
Dissatisfied [0, 1, 0]
Important [0, 1, 0]
Clear [0, 1, 0]
Specifically [0, 1, 0]
The list indices after each word (feature), 0, 1, and 2, correspond to good, medium, and poor respectively
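The two kinds of counts described above can be collected in one pass. This is a sketch under assumptions: the function name `count_statistics`, the toy word lists, and the labels are all made up, and labels are given as level indices 0, 1, 2.

```python
from collections import Counter, defaultdict

LEVELS = ["good", "medium", "poor"]  # list indices 0, 1, 2

def count_statistics(word_lists, labels):
    """Return (comments per level, {word: [count in c0, c1, c2]})."""
    class_counts = Counter(labels)
    word_counts = defaultdict(lambda: [0] * len(LEVELS))
    for words, label in zip(word_lists, labels):
        for word in words:
            word_counts[word][label] += 1
    per_level = [class_counts[i] for i in range(len(LEVELS))]
    return per_level, dict(word_counts)

# Made-up segmented comments and their level indices.
word_lists = [
    ["cost", "effective"],
    ["half", "price", "not", "bad"],
    ["half", "price"],
    ["attitude", "poor"],
]
labels = [0, 1, 1, 2]

class_counts, word_counts = count_statistics(word_lists, labels)
print(class_counts)
print(word_counts["half"])
```

The resulting dictionary has the same shape as the one shown above: one list of three counts per word.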

Once the above counting is done, the model is trained. The more labeled data it is trained on, the more accurate it tends to be.

Test

For example, enter a sentence:

Comments on Century Lianhua (Bailian Xijiao Shopping Center): In a city that claims to be an international metropolis, the service attitude of the people at the cashier is extremely poor. The UnionPay promotion is 10 yuan off for every 30 yuan spent, and you can't split a purchase into consecutive orders.

Get the result:

c2 → poor
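Putting the pieces together, the prediction step might look like the sketch below. It is not the article's code: the function name `classify`, the counts, and the choice to skip words missing from the vocabulary are assumptions. It uses Laplace smoothing and sums logarithms instead of multiplying probabilities, which avoids underflow on long comments.

```python
import math

LEVELS = ["good", "medium", "poor"]  # list indices 0, 1, 2

def classify(words, class_counts, word_counts, alpha=1.0):
    """Return the index of the level with the highest Naive Bayes score."""
    n_comments = sum(class_counts)
    vocab_size = len(word_counts)
    # Total word occurrences observed in each level.
    totals = [sum(counts[i] for counts in word_counts.values())
              for i in range(len(class_counts))]
    scores = []
    for i, n in enumerate(class_counts):
        score = math.log(n / n_comments)            # log prior
        for word in words:
            if word not in word_counts:             # skip unknown words
                continue
            count = word_counts[word][i]
            score += math.log((count + alpha) / (totals[i] + alpha * vocab_size))
        scores.append(score)
    return scores.index(max(scores))

# Made-up counts in the same shape as the training statistics.
class_counts = [2, 3, 5]
word_counts = {
    "delicious": [3, 1, 0],
    "ok": [0, 2, 1],
    "poor": [0, 0, 4],
    "attitude": [0, 1, 3],
}

level = classify(["cashier", "attitude", "poor"], class_counts, word_counts)
print(f"c{level} → {LEVELS[level]}")
```

With these counts, a comment containing "attitude" and "poor" scores highest for the third level, matching the kind of result shown above.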
