Write your own Bayesian classifier to classify books

Background and Purpose

First of all, this is the self-study record of a machine learning beginner: an ordinary engineer with no math background. So this article will not be heavy on theory, nor will it derive the formulas in depth. Instead, it is very hands-on: it targets a concrete, realistic problem and shares a solution from scratch, including a few optimizations.

Start with a Question

The problem we want to solve is binary classification of books. The basis for classification is the books' tags. These tags may come from experts, editors, or ordinary users; for example, "foreign literature", "detective", "computer", and "python" are all tags.

To keep this small experimental project simple, we will classify books into exactly two categories: "humanities" and "non-humanities". As stated above, this makes it a binary classification problem for books.

For example, Introduction to Computer Science has the tags "computer", "science", "classic", and "introduction", so it belongs to "non-humanities". The Catcher in the Rye has the tags "novel", "literature", and "America", so it belongs to "humanities". Our classifier can automatically assign a book to "humanities" or "non-humanities" based on its tags. Try it out; it's quite interesting!

In order to solve this problem, we state several premises:

  • Any book is classified as either "humanities" or "non-humanities", never both
  • A book has one or more tags
  • No book carries "humanities" or "non-humanities" itself as a tag (What? You don't believe it? Just browse Amazon and JD.com and see)
  • You need only a little probability knowledge: what a probability is, and what a conditional probability is

Using Python and NumPy

We will use Python as the programming language for this lab project.

NumPy is a scientific computing library for Python that you need to install yourself. Because our program involves some simple matrix operations, NumPy greatly reduces the programming workload.
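As a quick taste of the NumPy operations this article relies on (ones(), element-wise +, sum(), and element-wise /), here is a minimal demo; the variable names are just for illustration:

import numpy as np

temp = np.ones(3)          # array([1., 1., 1.])
temp = temp + [0, 1, 2]    # adding a plain list works element-wise: array([1., 2., 3.])
print(np.sum(temp))        # 6.0
print(temp / 2.0)          # element-wise division: array([0.5, 1. , 1.5])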

Rationale

How Bayesian Classifiers Work

You still need to understand some theory, but don't worry, this part goes quickly. I will explain it directly in terms of the problem we want to solve.

Basically, Bayesian classification answers a question like this: given a book with the tags tag1, tag2, tag3..., what is the probability that it belongs to the "humanities" category, and what is the probability that it belongs to "non-humanities"?

Assume that p1 represents the probability that it belongs to "humanities" in this case, and p2 represents the probability that it belongs to "non-humanities" in this case.

If p1>p2, then the book belongs to "humanities", otherwise it belongs to "non-humanities". We do not consider the case where p1=p2.

It’s simple, isn’t it?

So the question becomes: how do we calculate p1 and p2 from tag1, tag2, tag3...? After all, once we know these two values, our final problem is solved.

Conditional Probability

In fact, this is a conditional probability problem. A conditional probability is the probability that a happens given that b has happened. We write it as p(a|b).

In terms of our actual problem: given that tag1, tag2, tag3... have occurred (that is, this book's tags are tag1, tag2, tag3...), the probabilities that the book belongs to "humanities" and "non-humanities" are written:

p(cate1|tag1,tag2,tag3...), the probability that this book belongs to cate1 (cate1 = "humanities") given tag1, tag2, tag3...

p(cate2|tag1,tag2,tag3...), the probability that this book belongs to cate2 (cate2 = "non-humanities") given tag1, tag2, tag3...

The p(cate1|tag1,tag2,tag3...) here is exactly the p1 mentioned above, just written in more formal notation.

How do we calculate the conditional probability? This is the Bayesian formula:

p(a|b) = p(b|a) * p(a) / p(b)

This means: if you want p(a|b), and you know the values of p(b|a), p(a), and p(b), then you can obtain p(a|b) as p(b|a) * p(a) / p(b).
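A quick sanity check with made-up numbers (all values below are hypothetical, purely to exercise the formula):

# Hypothetical probabilities, not from the article's data
p_b_given_a = 0.8
p_a = 0.3
p_b = 0.4

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.8 * 0.3 / 0.4 = 0.6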

Mapped onto the problem we actually want to solve, this becomes:

p(cate1|tag1,tag2,tag3...) = p(tag1,tag2,tag3...|cate1) * p(cate1) / p(tag1,tag2,tag3...)

Translated into plain language: you want p(cate1|tag1,tag2,tag3...), and you can compute:

  • p(tag1,tag2,tag3...|cate1), the probability that tag1, tag2, tag3... appear together given that a book is classified as "humanities"
  • p(cate1), the proportion of books marked "humanities" among all books ("humanities" and "non-humanities") in the training set
  • p(tag1,tag2,tag3...), the probability of tag1, tag2, tag3... appearing among all tags in the training set

In other words, as long as we find the above three items one by one, we can find p(cate1|tag1,tag2,tag3...). Similarly, p(cate2|tag1,tag2,tag3...) can also be found.

Here is a tip worth noting: of the three items above, we never need to calculate the third. Our goal is to compare p(cate1|tag1,tag2,tag3...) with p(cate2|tag1,tag2,tag3...), not to obtain their exact values, and both share the same denominator p(tag1,tag2,tag3...). So we only need to compare the numerators:

p(tag1,tag2,tag3...|cate1) * p(cate1)

versus p(tag1,tag2,tag3...|cate2) * p(cate2)

This saves us some calculations.

Naive Bayes

So, how to calculate p(tag1, tag2, tag3...|cate1)? Here we need to use the concept of naive Bayes, that is, we believe that in the tags of a book, each tag is independent of each other and has nothing to do with whether the other tag appears. In other words, the probability of "computer" and "classic" appearing is unrelated to each other, and the probability of "classic" appearing will not be high just because "computer" appears.

Since they are independent of each other, then p(tag1,tag2,tag3...|cate1) is equal to:

p(tag1|cate1) * p(tag2|cate1) * p(tag3|cate1) * ...

p(tag1,tag2,tag3...|cate2) is equal to:

p(tag1|cate2) * p(tag2|cate2) * p(tag3|cate2) * ...

In other words, we can calculate the probability of each tag appearing in all tags of "humanities" and "non-humanities" books, and then multiply them together to get what we want.
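As a minimal sketch of that last sentence, with hypothetical per-tag probabilities (the dictionary and its values are invented for illustration):

# Hypothetical per-tag conditional probabilities p(tag|cate1)
p_tag_given_cate1 = {'computer': 0.02, 'science': 0.05, 'classic': 0.10}

p_tags_given_cate1 = 1.0
for tag in ['computer', 'science', 'classic']:
    p_tags_given_cate1 *= p_tag_given_cate1[tag]   # independence lets us multiply

print(p_tags_given_cate1)   # 0.02 * 0.05 * 0.10 = 0.0001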

Example Analysis

We now have a book "Introduction to Computer Science", and its labels are "computer", "science", "theory", "classic", and "introduction". We want to know the probability that "Introduction to Computer Science" belongs to "humanities" and "non-humanities" respectively when these labels appear.

So, what do we have? Fortunately, we have 10 books on hand: 6 known to be "humanities" and 4 "non-humanities". After deduplication, these 10 books carry 70 distinct tags in total, including "computer", "science", "theory", and "introduction".

Based on this, we can conclude that p(cate1)=6/10=0.6, p(cate2)=1-0.6=0.4. That is to say, the probability of "humanities" books among all books is 0.6, and "non-humanities" is 0.4.

Next come p(tag1,tag2,tag3...|cate1) and p(tag1,tag2,tag3...|cate2). That is, we need the probability of each of the tags "computer", "science", "theory", "classic", and "introduction" appearing among all tags of the "humanities" books, and likewise among all tags of the "non-humanities" books. We then multiply the per-tag probabilities together within each category, and finally multiply by 0.6 and 0.4 respectively.

Then just compare the two results. That is, compare p(cate1) * p(tag1,tag2,tag3...|cate1) with p(cate2) * p(tag1,tag2,tag3...|cate2).

Get started

1. Prepare the training set

Almost all machine learning requires a training set, and Bayesian classification is no exception. In fact, the data described above is a training set: the 10 books in the example, together with their deduplicated tags, form our training set; and the two probabilities 0.6 and 0.4, as well as p(tag1,tag2,tag3...|cate1) and p(tag1,tag2,tag3...|cate2), are what we calculate from the training set's data. Machine learning calls this "training".

For our problem, we need to prepare 100 books, manually divide them into the two categories "humanities" and "non-humanities", and collect all the tags of these books. Where do the books come from? You can crawl book data from Amazon or Douban.
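One possible in-memory shape for such a training set (the variable name, labels, and sample entries are illustrative, not prescribed by the article):

# Each entry: (list of tags, category label)
training_set = [
    (['novel', 'literature', 'America'], 'humanities'),
    (['computer', 'science', 'classic', 'introduction'], 'non-humanities'),
    # ... 98 more manually labeled books
]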

2. Form a tag set

The tags mentioned above are stored in a Python list; we call it dicts. Each element of dicts is a tag, for example:

dicts = ['science', 'theory', 'c++']
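Assuming the training_set structure sketched in step 1, the deduplicated tag set can be built like this:

dicts = []
for tags, label in training_set:
    for tag in tags:
        if tag not in dicts:   # keep each tag exactly once
            dicts.append(tag)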

3. Calculate the probability of "humanities" and "non-humanities" in the training set

It is very simple. As in our example: assuming 60 of the 100 books in the training set are "humanities", then p(cate1) = 60/100 = 0.6 and p(cate2) = 1 - p(cate1) = 0.4. In code, we use the variables:

pcate1 = 0.6
pcate2 = 0.4
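Under the same assumed training_set, these two values fall straight out of the labels:

num_cate1 = sum(1 for tags, label in training_set if label == 'humanities')
pcate1 = num_cate1 / float(len(training_set))   # e.g. 60/100 = 0.6
pcate2 = 1 - pcate1                             # e.g. 0.4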

4. Calculate the probability of each tag in the tag set appearing in the training set "Humanities" books

First, we construct a list from the training set. Each item in this list is itself a list of 0s and 1s: a 1 means the tag at that position in dicts is one of this book's tags, and a 0 means it is not.

For example, suppose our dicts looks like this:

['Computer', 'Novel', 'Psychology', 'Science', 'Programming', 'Behavior', 'Introduction', 'Classic', 'Travels', 'America']

Then we have a list like this, tag_vector_cate1:
[
[0,1,0,0,0,0,0,1,0,1],
[0,0,1,0,0,1,0,0,0,1],
..............
]

This list corresponds to the "Humanities" category.

Each row represents one "humanities" book in the training set. The first row corresponds to The Catcher in the Rye, whose tags are "novel", "classic", and "America". The second row corresponds to Predictably Irrational, whose tags are "psychology", "behavior", and "America". Note that we use the entire tag set dicts to represent a book's tags. So, counting columns from 0, the 1 in column 1 of the first row means The Catcher in the Rye has the tag "novel" (column 1 of dicts); the 0 in column 2 of the first row means it does not have the tag "psychology" (column 2 of dicts). Similarly, column 9 is 1 in both rows, indicating that The Catcher in the Rye and Predictably Irrational both carry the tag "America".
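Converting a book's tag list into such a 0/1 row over dicts is straightforward; here is a sketch (the helper name is our own):

def to_tag_vector(book_tags, dicts):
    """Represent a book as a 0/1 vector over the entire tag set."""
    return [1 if tag in book_tags else 0 for tag in dicts]

# With the 10-entry dicts above:
# to_tag_vector(['Novel', 'Classic', 'America'], dicts)
# -> [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]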

With such data, calculation becomes easy. Let's take p(tag1|cate1) as an example. For tag1, we count how many "humanities" books in the training set carry tag1. For example: among the 60 "humanities" books in the training set, 40 carry the tag "classic", so we set num_of_tag1 = 40. In the same way, we count the occurrences of every other tag, e.g. num_of_tag2 = 32, num_of_tag3 = 18, ...

Then we find the total number of tags across all "humanities" books (note: duplicates are not removed here). For example, if the "humanities" category had just 2 books, the first tagged "prose", "classic", "foreign" and the second tagged "classic", "novel", the total tag count would be 3 + 2 = 5. Now we compute this total over all 60 "humanities" books in the training set. Assume it comes to 700; let total_cate1 = 700.

Therefore, the probability of tag1 appearing in the "humanities" category is: p(tag1|cate1) = num_of_tag1 / total_cate1 = 40/700 ≈ 0.057. Similarly, we get p(tag2|cate1), p(tag3|cate1), ...

Using numpy, we can easily implement this process with just a few lines of code.

from numpy import *
num_tags_cate1 = ones(len(dicts))              # (1)
total_cate1 = 2.0                              # (2)
for item in tag_vector_cate1:
    num_tags_cate1 += item                     # (3)
    total_cate1 += sum(item)                   # (4)
p_tags_cate1 = num_tags_cate1 / total_cate1    # (5)

Let me explain here.
Line (1) generates a NumPy array. ones() is a NumPy function that returns an array filled with 1s; the parameter is the array's length. For example, temp = ones(3) assigns the NumPy array [1,1,1] to temp. So line (1) takes the length of the tag set dicts as the parameter and generates an all-ones array of the same length, assigned to num_tags_cate1. Why the same length as dicts? Remember, we represent a book using the entire tag set; what we want is the probability of each tag in dicts, stored in an array, and num_tags_cate1 accumulates the counts for that array. Why it is filled with 1s rather than 0s will be explained later.

(2) total_cate1 = 2.0. total_cate1 is the denominator, and a denominator cannot be 0, so its initial value must be non-zero. Why 2.0 specifically? We will explain later.

(3) num_tags_cate1 += item. item is simply a Python list, like the [0,1,0,0,0,0,0,1,0,1] we saw above. When you add a Python list to a NumPy array, NumPy adds the corresponding entries, effectively overloading +. For example, if a is the NumPy array [1,2,3,5,0] and b is the Python list [0,0,3,2,1], then a + b = [1,2,6,7,1], and the result is a NumPy array. In our example, the counts of the three tags "novel", "classic", and "America" each increase by 1.

(4) adds up the number of tags appearing in each book. sum(item) adds up all the entries of item; for example, sum([2,5,-1]) = 2 + 5 + (-1) = 6. If item is [0,1,0,0,0,0,0,1,0,1], corresponding to The Catcher in the Rye with its tags "novel", "classic", and "America", the total tag count increases by 3.

(5) divides num_tags_cate1 by total_cate1; NumPy also overloads the "/" operator. For example, [2,4,6]/2 divides each entry by 2, giving the NumPy array [1,2,3]. Here this divides each tag's count by the total tag count, yielding the NumPy array p_tags_cate1, where each entry is the probability of the corresponding tag appearing in cate1 ("humanities").

Similarly, we can calculate p_tags_cate2, which is the probability of each tag appearing in cate2 ("non-humanities").
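Since the two categories are computed identically, the training step can be wrapped in one helper. This is a sketch following the code above; tag_vector_cate2 is assumed to be the analogous 0/1 matrix for the "non-humanities" books:

import numpy as np

def train_category(tag_vectors, num_tags):
    """Per-tag probabilities for one category, with the ones()/2.0 initialization above."""
    counts = np.ones(num_tags)   # start every count at 1 (explained in step 5)
    total = 2.0                  # non-zero starting denominator
    for item in tag_vectors:
        counts += item
        total += np.sum(item)
    return counts / total

p_tags_cate1 = train_category(tag_vector_cate1, len(dicts))
p_tags_cate2 = train_category(tag_vector_cate2, len(dicts))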

5. What do we have now

At this point, we have almost everything. Recall the formula for Bayesian classification:

p(cate1|tag1,tag2,tag3...) = p(tag1,tag2,tag3...|cate1) * p(cate1) / p(tag1,tag2,tag3...)

We discussed earlier that the denominator can be ignored and not calculated; that is, we do not need to care about p(tag1,tag2,tag3...).

Furthermore, according to the Naive Bayesian theory, the numerator is equal to:

p(tag1,tag2,tag3...|cate1) * p(cate1) = p(tag1|cate1) * p(tag2|cate1) * p(tag3|cate1) * ... * p(cate1)

p(cate1) is equal to pcate1 mentioned above.

p(tag1|cate1), p(tag2|cate1), ... are the entries of the NumPy array p_tags_cate1 obtained above. We just need to multiply them together to get p(tag1|cate1) * p(tag2|cate1) * ...!

Now we can explain why the code above fills num_tags_cate1 with 1s. If we filled it with 0s, then whenever some tag's count stayed at 0 (unlikely in theory, but possible), the whole numerator product would become 0, and the final result would be 0, ruining the comparison. To avoid this, we pretend every tag appears at least once, hence ones(). In the worst case, num_tags_cate1 = [1,1,1,...].

With total_cate1 = 2.0 and num_tags_cate1 = [1,1,1,...], each tag's default probability is 0.5 (1/2.0). (This add-one initialization is a simple form of Laplace smoothing.) The initial values are adjustable parameters, but do not set total_cate1 = 1.0: every tag's default probability would then become 1, which is a serious problem.
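A tiny illustration of why this matters, using the 700-tag total from the example plus hypothetical probability values:

# Without smoothing: a tag never seen among "humanities" books gets probability 0,
# and a single zero wipes out the whole product.
p_unseen = 0 / 700.0
print(0.057 * 0.046 * p_unseen)    # 0.0

# With the ones()/2.0 initialization, the worst case stays non-zero:
p_unseen_smoothed = 1 / (700.0 + 2.0)
print(p_unseen_smoothed)           # ~0.0014, so the product survives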

6. Use the training data to classify new books

Now that we have our Bayesian classifier, let’s see how to classify new books.

Classifying a new book means that, after training is complete (remember? the 100 manually classified books are the training set), we want to classify the 101st book, one that is not in the training set. We classify it using the quantities computed from the formula above.

As before, we extract the new book's tags and encode them as a 0/1 vector over dicts, saved in a Python list denoted tagvects, of the form [1,0,0,1,0,0,1,...].

Next, we multiply each entry in p_tags_cate1 by the corresponding entry in tagvects:

results_tags_cate1 = p_tags_cate1 * tagvects

Then we multiply together the nonzero items of results_tags_cate1:

temp1 = 1.0
for item in results_tags_cate1:
    if item != 0:
        temp1 = temp1 * item

In the same way, calculate temp2:

results_tags_cate2 = p_tags_cate2 * tagvects
temp2 = 1.0
for item in results_tags_cate2:
    if item != 0:
        temp2 = temp2 * item

Finally:

p_cate1_tags = temp1 * pcate1
p_cate2_tags = temp2 * pcate2
if p_cate1_tags > p_cate2_tags:
    print('Humanities')
else:
    print('Non-humanities')

By comparing p_cate1_tags and p_cate2_tags, we classify the new book: it goes to whichever category has the larger value.
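Putting steps 3-6 together, here is a compact sketch of the whole classification step, reusing the names defined above:

def classify(new_book_tags, dicts, p_tags_cate1, p_tags_cate2, pcate1, pcate2):
    # 0/1 vector over the tag set for the new book
    tagvects = [1 if tag in new_book_tags else 0 for tag in dicts]
    temp1, temp2 = 1.0, 1.0
    for item in p_tags_cate1 * tagvects:   # NumPy multiplies element-wise
        if item != 0:
            temp1 *= item
    for item in p_tags_cate2 * tagvects:
        if item != 0:
            temp2 *= item
    return 'Humanities' if temp1 * pcate1 > temp2 * pcate2 else 'Non-humanities'

print(classify(['novel', 'classic', 'America'], dicts,
               p_tags_cate1, p_tags_cate2, pcate1, pcate2))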

Optimization trick

The formula above multiplies many probabilities together, which becomes a problem when the tag set dicts is large (that is, when books carry many tags). Each factor is a small decimal, and multiplying many of them can underflow: the product becomes too small to represent and collapses to 0, corrupting the result. A standard trick avoids this.

We use a very common mathematical device and take the natural logarithm, ln, to improve the algorithm. In our code, log() is the logarithm function (it comes from NumPy via the from numpy import * above, and applies element-wise to arrays).
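A quick demonstration of the underflow problem and of why summing logarithms fixes it (the numbers are hypothetical):

from math import log

probs = [0.01] * 200       # 200 small per-tag probabilities
product = 1.0
for p in probs:
    product *= p
print(product)             # 0.0 -- the true value 1e-400 underflows to zero

log_sum = sum(log(p) for p in probs)
print(log_sum)             # about -921.03 -- finite, and still comparable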

There are several places where one could take logarithms. The recommended approach here is to change the quantity we calculate to:

ln(p(tag1|cate1) * p(tag2|cate1) * ... * p(cate1))

Expanded, it becomes:

ln(p(tag1|cate1)) + ln(p(tag2|cate1)) + ... + ln(p(cate1))

Recall that p(tag1|cate1), p(tag2|cate1)... are each item of p_tags_cate1 we calculated above (p_tags_cate1 is a numpy array, each of which represents the probability of the corresponding tag appearing in the "Humanities" category). In our calculation above:

p_tags_cate1 = num_tags_cate1 / total_cate1

Now we take the logarithm of it, so we change the code to:

p_tags_cate1 = log(num_tags_cate1 / total_cate1)

Note that log here applies to each entry of the NumPy array, and the result is still an array. So p_tags_cate1 now holds the logarithms of the probabilities.

Then take the logarithms of the category probabilities as well, changing pcate1 (and likewise pcate2) to:

pcate1 = log(pcate1)
pcate2 = log(pcate2)

Therefore, the final classification code above becomes:

results_tags_cate1 = p_tags_cate1 * tagvects
temp1 = 0.0   # start at 0.0, the additive identity, since we now sum logs
for item in results_tags_cate1:
    if item != 0:
        temp1 = temp1 + item

In the same way, calculate temp2:

results_tags_cate2 = p_tags_cate2 * tagvects
temp2 = 0.0
for item in results_tags_cate2:
    if item != 0:
        temp2 = temp2 + item

Then you can classify:

p_cate1_tags = temp1 + pcate1
p_cate2_tags = temp2 + pcate2
if p_cate1_tags > p_cate2_tags:
    print('Humanities')
else:
    print('Non-humanities')
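For completeness, here is a log-space version of the classify sketch from step 6 (same assumed names; p_tags_cate1/p_tags_cate2 and pcate1/pcate2 are assumed to already hold logarithms, as computed above):

import numpy as np

def classify_log(new_book_tags, dicts, p_tags_cate1, p_tags_cate2, pcate1, pcate2):
    tagvects = np.array([1 if tag in new_book_tags else 0 for tag in dicts])
    # Multiplying by the 0/1 vector keeps only the matched tags' log-probabilities;
    # the zeroed entries contribute nothing to the sum.
    temp1 = np.sum(p_tags_cate1 * tagvects)
    temp2 = np.sum(p_tags_cate2 * tagvects)
    return 'Humanities' if temp1 + pcate1 > temp2 + pcate2 else 'Non-humanities'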

Summary

I'm glad you made it this far. This article strives to be concise and to minimize the learning cost, but it is normal if things don't fully click on a first read. Everything takes a process, especially in machine learning, which rewards repeated chewing, thinking, and practice. Writing this article also deepened my own understanding of Bayesian classification, though some points may still be expressed imperfectly. If you have any questions or suggestions, please feel free to discuss them with me.

Through this article, we understand:

  1. What naive Bayes is
  2. How to prepare a training set for naive Bayes classification (the tag set is especially important)
  3. How to compute, from the training data, the probability of each tag appearing in "humanities" and "non-humanities"
  4. How to use these probabilities to classify new books
  5. How to use logarithms and careful initial values (such as ones()) to keep the algorithm numerically sound
