Detailed explanation of sentiment analysis based on Naive Bayes and its implementation in Python

Compared with dictionary-based analysis, the machine learning-based approach does not need a large annotated sentiment dictionary, but it does need a large amount of labeled data. For example, the comment quoted below carries the label:

Service quality - Medium (there are three levels: good, medium, and poor)

╮(╯-╰)╭ That is machine learning: a model is trained on a large amount of labeled data, and when you then enter a new comment, the model determines its label level.

Ning Xin's comments: During the National Day event, you can use a credit card bearing the UnionPay logo, with a number starting with 62, to buy an ice cream for 6.2 yuan.
There are three flavors to choose from: vanilla, chocolate, and matcha. I chose the vanilla flavor, which is very rich.
In addition, you can buy two macarons for 10 yuan with any purchase. Although they are not very big, they are delicious and not too sweet, so you won't get sick of them.
Label: Service quality - Medium

Naive Bayes

1. Bayes' Theorem

Assume that for a certain data set, the random variable C represents the class a sample belongs to, and F1 represents a certain feature of the test sample. Applying the basic Bayes formula gives:

P(C|F1) = P(F1|C) P(C) / P(F1)

The formula above gives the conditional probability that a sample belongs to category C, given that feature F1 appears. So how is it used to classify a test sample?

For example, if a test sample has feature F1 (F1=1), compute the values of P(C=0|F1=1) and P(C=1|F1=1). If the former is larger, the sample is assigned to class 0; if the latter is larger, it is assigned to class 1.

There are several concepts that need to be understood for this formula:

Prior probability (Prior). P(C) is the prior probability of C. It can be obtained as the proportion of samples in the existing training set that belong to category C.

Evidence. P(F1) in the formula above is the probability that feature F1 appears in a sample. It can likewise be obtained as the proportion of training samples that have feature F1.

Likelihood. P(F1|C) in the formula above is the probability that a sample has feature F1, given that the sample belongs to category C.
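
As a minimal sketch of these three quantities, the snippet below estimates them by counting over a tiny made-up training set and then compares P(C=0|F1=1) with P(C=1|F1=1). The data and the names `samples` and `posterior` are assumptions for illustration only; they do not come from the article's data set.

```python
from fractions import Fraction

# Hypothetical training set: each item is (class label C, binary feature F1).
samples = [(0, 1), (0, 0), (0, 0), (1, 1), (1, 1), (1, 0)]

def posterior(c, f1, data):
    """Estimate P(C=c | F1=f1) from counts using Bayes' theorem."""
    n = len(data)
    n_c = sum(1 for label, _ in data if label == c)
    n_f = sum(1 for _, feat in data if feat == f1)
    n_cf = sum(1 for label, feat in data if label == c and feat == f1)
    prior = Fraction(n_c, n)           # P(C)
    evidence = Fraction(n_f, n)        # P(F1)
    likelihood = Fraction(n_cf, n_c)   # P(F1 | C)
    return likelihood * prior / evidence

p0 = posterior(0, 1, samples)
p1 = posterior(1, 1, samples)
predicted = 0 if p0 > p1 else 1
print(p0, p1, predicted)
```

With this toy data the second posterior is the larger one, so the sample would be assigned to class 1.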

For multiple features, the Bayes formula can be expanded as follows:

P(C|F1,F2,...,Fn) = P(C) P(F1|C) P(F2|C,F1) ... P(Fn|C,F1,...,Fn-1) / P(F1,F2,...,Fn)

There is a long list of likelihood values in the numerator. When there are many features, calculating these likelihood values is extremely painful. What can be done?

2. The Naive Assumption

To simplify the calculation, the Naive Bayes algorithm makes one assumption: it "naively" treats all features as independent of each other. Under this assumption, the numerator of the formula above simplifies to:

P(C)P(F1|C)P(F2|C)...P(Fn|C).

After this simplification, calculation becomes much easier.
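
The simplified numerator can be sketched in a few lines. The function names and the probability values below are hypothetical. The log-space variant is a common practical choice (not something the article prescribes) because multiplying many small probabilities can underflow.

```python
import math

def naive_bayes_numerator(prior, likelihoods):
    """Return P(C) * P(F1|C) * ... * P(Fn|C) under the independence assumption."""
    return prior * math.prod(likelihoods)

def log_numerator(prior, likelihoods):
    """Same quantity in log space, which avoids underflow when n is large."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical values for one class with three features.
score = naive_bayes_numerator(0.5, [0.2, 0.4, 0.1])
log_score = log_numerator(0.5, [0.2, 0.4, 0.1])
```

Both functions rank classes identically, since the logarithm is monotonic.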

Assuming that every feature is independent may seem unscientific, because in many cases the features are closely related. However, a large number of applications have shown that Naive Bayes works quite well in practice.

Secondly, Naive Bayes works by calculating P(C=0|F1...Fn) and P(C=1|F1...Fn) and taking the class with the larger value as the classification. The denominators of the two are exactly the same, so the denominator can be omitted, which simplifies the calculation further.
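
A small sketch of why the denominator can be dropped: dividing every class score by the same positive number does not change which one is largest. The numerator values below are made up for illustration, and `normalize` is a hypothetical helper.

```python
def normalize(scores):
    """Divide each unnormalized score by their sum (the shared denominator)."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical numerators P(C) * P(F1|C) * ... * P(Fn|C) for two classes.
numerators = {0: 0.004, 1: 0.012}
posteriors = normalize(numerators)

best_raw = max(numerators, key=numerators.get)
best_norm = max(posteriors, key=posteriors.get)
```

Here `best_raw` and `best_norm` select the same class, so the normalization step is only needed when actual probabilities are wanted.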

In addition, the derivation of the Bayes formula is only valid if no evidence term is 0; that is, for any feature Fx, P(Fx) cannot be 0. In practice, a feature that appears in a test sample may never have appeared in the training set, so its estimated probability would be 0. Implementations therefore usually apply a small correction, such as adding 1 to all counts (additive smoothing, also called Laplace smoothing). If the value added is an adjustable parameter alpha greater than 0, it is called Lidstone smoothing.
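
The article does not give the exact smoothing formula, so the following is only a sketch of one common formulation for word counts, (count + alpha) / (total + alpha * vocabulary size). The function name and parameters are assumptions.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    count:       times the word appears in the class
    class_total: total word occurrences in the class
    vocab_size:  number of distinct words (features)
    alpha:       1.0 gives Laplace smoothing; other positive values
                 give Lidstone smoothing
    """
    if alpha <= 0:
        raise ValueError("alpha must be greater than 0")
    return (count + alpha) / (class_total + alpha * vocab_size)

unseen = smoothed_likelihood(0, 10, 5)               # word never seen in this class
seen = smoothed_likelihood(3, 10, 5)                 # word seen 3 times
lidstone = smoothed_likelihood(0, 10, 5, alpha=0.5)  # Lidstone variant
```

The point of the correction is that `unseen` stays strictly positive, so one unseen word no longer forces the whole product to 0.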

Sentiment classification based on Naive Bayes

The original data set; only 10 records are sampled here.

Read Data

Read the Excel file into a DataFrame using the pandas library.
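
A minimal sketch of this step, assuming the file has one column of comment text and one column of labels. The column names `comment` and `label` are hypothetical, as is the file name in the comment; the article does not show the real ones. `pd.read_excel` also needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

def load_comments(path):
    """Read the Excel file of comments into a DataFrame."""
    return pd.read_excel(path)

def to_records(df, text_col="comment", label_col="label"):
    """Turn the DataFrame into a list of (text, label) pairs."""
    return list(zip(df[text_col].astype(str), df[label_col]))

# Stand-in for load_comments("comments.xlsx"): a tiny in-memory table,
# so the sketch runs without the original file.
df = pd.DataFrame({
    "comment": ["tasty and cheap", "slow service"],
    "label": ["good", "poor"],
})
records = to_records(df)
```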

Word segmentation

Segment each comment into words and remove stop words to get the following word lists.

Each list corresponds to one comment.
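
The segmentation step might look like the sketch below. The stop-word list and the example comments are made up, and whitespace splitting stands in for a real segmenter. For Chinese text, a segmenter such as `jieba.lcut` could be passed as `tokenize`, assuming the jieba package is installed.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "a", "and", "of", "at"}

def segment(comment, tokenize=str.split, stop_words=STOP_WORDS):
    """Split a comment into words and drop stop words."""
    # Strip surrounding punctuation and lowercase each token.
    words = (w.strip(".,!?").lower() for w in tokenize(comment))
    return [w for w in words if w and w not in stop_words]

comments = [
    "The service at the cashier is extremely poor.",
    "Half price and not bad!",
]
word_lists = [segment(c) for c in comments]
```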

Statistics

What is counted here? Two kinds of data:

1. Number of comments at each level

There are three levels:
c0 → good: 2
c1 → medium: 3
c2 → poor: 5

2. The number of times each word appears in comments of each level

This produces a dictionary that maps each word to a list of counts:
Evaluation [2, 5, 3]
Half price [0, 5, 0]
Cost-effective [1, 1, 0]
Not bad [0, 2, 0]
·········
Dissatisfied [0, 1, 0]
Important [0, 1, 0]
Clear [0, 1, 0]
Specifically [0, 1, 0]
The list indexes 0, 1, and 2 after each word (feature) correspond to good, medium, and poor respectively.
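
Both kinds of counts can be gathered in one pass, as in the sketch below. The function name `train` and the sample word lists are assumptions; only the output shape (a per-level comment count and a word-to-list dictionary) follows the article.

```python
from collections import Counter, defaultdict

LEVELS = ["good", "medium", "poor"]  # list index 0, 1, 2 -> c0, c1, c2

def train(word_lists, labels):
    """Count comments per level and word occurrences per level."""
    level_counter = Counter(labels)
    word_counts = defaultdict(lambda: [0] * len(LEVELS))
    for words, label in zip(word_lists, labels):
        idx = LEVELS.index(label)
        for w in words:
            word_counts[w][idx] += 1
    return [level_counter[lv] for lv in LEVELS], dict(word_counts)

# Hypothetical segmented comments and their labels.
word_lists = [
    ["cost-effective", "tasty"],
    ["half", "price", "not", "bad"],
    ["dissatisfied", "half", "price"],
]
labels = ["good", "medium", "medium"]
level_counts, word_counts = train(word_lists, labels)
```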

Once the above work is done, the model is trained. The more data it is trained on, the more accurate it tends to be.

Test

For example, enter a sentence:

Comments on Century Lianhua (Bailian Xijiao Shopping Center): In a city that claims to be an international metropolis, the service attitude of the cashier staff is extremely poor. The UnionPay promotion is 30-10, and you can't make consecutive orders.

The result is:

c2 - poor
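
The article does not show its prediction code, so the following is only a sketch of how the counts could be turned into a prediction. It picks the level with the largest log-score, uses additive smoothing for both the prior and the word likelihoods, and is an assumption rather than the author's implementation. The per-level comment counts match the article; the word counts are made up.

```python
import math

LEVELS = ["good", "medium", "poor"]

def classify(words, level_counts, word_counts, alpha=1.0):
    """Return the level with the largest log P(C) + sum of log P(word | C)."""
    vocab_size = len(word_counts)
    n_comments = sum(level_counts)
    best_level, best_score = None, -math.inf
    for idx, level in enumerate(LEVELS):
        # Smoothed prior, so a level without training comments stays possible.
        score = math.log(
            (level_counts[idx] + alpha) / (n_comments + alpha * len(LEVELS))
        )
        class_total = sum(counts[idx] for counts in word_counts.values())
        for w in words:
            # Words never seen in training get a count of 0 for every level.
            count = word_counts.get(w, [0] * len(LEVELS))[idx]
            score += math.log(
                (count + alpha) / (class_total + alpha * vocab_size)
            )
        if score > best_score:
            best_level, best_score = level, score
    return best_level

# Counts in the article's format: word -> [good, medium, poor].
level_counts = [2, 3, 5]
word_counts = {
    "tasty": [3, 1, 0],
    "half-price": [1, 2, 0],
    "attitude": [0, 1, 4],
    "poor": [0, 0, 5],
}
result = classify(["attitude", "poor"], level_counts, word_counts)
```

With these toy counts, a comment containing "attitude" and "poor" lands in the poor level, which mirrors the kind of result shown above.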
