Background and Purpose

First of all, this is the self-study record of a machine learning beginner, a working engineer with no formal math background. So this article will not be particularly theoretical, nor will it dig deeply into the formulas; instead it is very practical, starting from a concrete, realistic problem and walking through a solution from scratch, including a few optimizations.

Start with a Question

The problem we want to solve is binary classification of books. The basis for classification is the book's tags. These tags may come from experts, editors, or users. For example, "foreign literature", "detective", "computer", and "python" are all tags. To keep this small experimental project simple, we will classify books into just two categories: "humanities" and "non-humanities". So, as stated above, this is a binary classification problem for books. For example, Introduction to Computer Science has the tags "computer", "science", "classic", and "introduction", so it belongs to "non-humanities". The Catcher in the Rye has the tags "novel", "literature", and "America", so it belongs to "humanities". Our classifier will automatically classify a book as "humanities" or "non-humanities" based on its tags. Try it out, it's quite interesting! To solve this problem, we state a few premises:
Using Python and NumPy

We will use Python as the programming language for this small project. NumPy is a scientific computing library for Python that you need to install yourself. Because our program involves some simple matrix (array) operations, using NumPy greatly reduces the programming workload.

Rationale: How Bayesian Classifiers Work

You still need to understand some theory, but don't worry, this part will pass quickly, and I will explain it directly in terms of the problem we want to solve. Basically, Bayesian classification answers a question like this: given a book with the tags tag1, tag2, tag3..., what is the probability that it belongs to the "humanities" category, and what is the probability that it belongs to the "non-humanities" category? Assume p1 is the probability that it belongs to "humanities" and p2 is the probability that it belongs to "non-humanities". If p1 > p2, the book belongs to "humanities"; otherwise it belongs to "non-humanities". (We do not consider the case p1 = p2.) Simple, isn't it? So the question becomes: how do we calculate p1 and p2 from tag1, tag2, tag3...? Once we know these two values, our problem is solved.

Conditional Probability

This is in fact a conditional-probability problem. The conditional probability of a given b is the probability that a happens given that b has happened, written p(a|b). For our problem, given that tag1, tag2, tag3... have occurred (that is, the book's tags are tag1, tag2, tag3...), the probability that the book belongs to "humanities" is written p(cate1|tag1,tag2,tag3...), and p(cate2|tag1,tag2,tag3...) is the probability that the book belongs to cate2 ("non-humanities") given tag1, tag2, tag3.... The p(cate1|tag1,tag2,tag3...) here is exactly the p1 mentioned above; we are just writing it in a more formal way. How do we compute this conditional probability? With the Bayesian formula:

p(a|b) = p(b|a) * p(a) / p(b)
This means: if you want to find p(a|b), and you know the values of p(b|a), p(a), and p(b), then you can compute p(a|b) as p(b|a) * p(a) / p(b). Applied to the actual problem we want to solve, it becomes:

p(cate1|tag1,tag2,tag3...) = p(tag1,tag2,tag3...|cate1) * p(cate1) / p(tag1,tag2,tag3...)
Translated into plain language: you want to find p(cate1|tag1,tag2,tag3...), and you already know (or can compute):

(1) p(tag1,tag2,tag3...|cate1): the probability of the tags tag1, tag2, tag3... appearing given that the book is "humanities";
(2) p(cate1): the probability that a book is "humanities";
(3) p(tag1,tag2,tag3...): the probability of the tags tag1, tag2, tag3... appearing at all.
In other words, as long as we find these three items one by one, we can find p(cate1|tag1,tag2,tag3...), and in the same way p(cate2|tag1,tag2,tag3...). Here is a tip worth noting: of the three items above, we do not actually need to calculate the third one. Our purpose is only to compare p(cate1|tag1,tag2,tag3...) and p(cate2|tag1,tag2,tag3...), not to get their exact values, and since the denominator p(tag1,tag2,tag3...) is the same in both formulas, we only need to compare the numerators. That is, compare p(tag1,tag2,tag3...|cate1) * p(cate1) against p(tag1,tag2,tag3...|cate2) * p(cate2). This saves us some calculation.

Naive Bayes

So, how do we calculate p(tag1,tag2,tag3...|cate1)? Here we use the "naive" assumption of naive Bayes: we assume the tags of a book are independent of each other, so whether one tag appears has nothing to do with whether another tag appears. In other words, the probabilities of "computer" and "classic" appearing are unrelated; "classic" does not become more likely just because "computer" appears. Since the tags are independent, p(tag1,tag2,tag3...|cate1) equals:

p(tag1|cate1) * p(tag2|cate1) * p(tag3|cate1) * ...
and p(tag1,tag2,tag3...|cate2) equals:

p(tag1|cate2) * p(tag2|cate2) * p(tag3|cate2) * ...
In other words, we can calculate the probability of each tag appearing among all the tags of "humanities" books and of "non-humanities" books, and then multiply those probabilities together to get what we want.

Example Analysis

We now have a book, Introduction to Computer Science, and its tags are "computer", "science", "theory", "classic", and "introduction". We want to know the probability that it belongs to "humanities" and to "non-humanities" given that these tags appear. What do we have? Fortunately, we have 10 books on hand, 6 of which are known to be "humanities" and 4 "non-humanities". After removing duplicates, these 10 books have 70 distinct tags in total, including "computer", "science", "theory", and "introduction". From this we get p(cate1) = 6/10 = 0.6 and p(cate2) = 1 - 0.6 = 0.4. That is, the probability that a book is "humanities" is 0.6, and "non-humanities" is 0.4. Next come p(tag1,tag2,tag3...|cate1) and p(tag1,tag2,tag3...|cate2): we need to calculate the probability of each of the tags "computer", "science", "theory", "classic", and "introduction" appearing among all the tags of the "humanities" books, and likewise among all the tags of the "non-humanities" books. The calculation is to multiply together the probability of each tag appearing in "humanities" (or "non-humanities"), then multiply by 0.6 (or 0.4) respectively, and compare the results. That is, compare p(cate1) * p(tag1,tag2,tag3...|cate1) against p(cate2) * p(tag1,tag2,tag3...|cate2).

Get started

1. Prepare the training set

Almost all machine learning requires a training set, and Bayesian classification is no exception. In fact, the data mentioned above is a training set: the 10 books in the example, plus all their deduplicated tags, form the training set, and the two probabilities 0.6 and 0.4, as well as p(tag1,tag2,tag3...|cate1) and p(tag1,tag2,tag3...|cate2), are what we calculate from that data. Machine learning calls this "training". For our problem, we need to prepare 100 books, manually divide them into the two categories "humanities" and "non-humanities", and collect all the tags of these books. Where do the books come from? You can crawl book data from Amazon or Douban.

2. Form a tag set

The tags mentioned above are kept in a Python list, which we call dicts. Each element in dicts is a tag, for example:
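(The original example list isn't shown in this copy; here is a hypothetical dicts whose tag names and order are purely illustrative.)

```python
# A hypothetical tag set: all distinct tags collected from the training books, deduplicated.
dicts = ["prose", "novel", "psychology", "behavior", "computer",
         "science", "theory", "America", "classic"]
```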
3. Calculate the probability of "humanities" and "non-humanities" in the training set

This is very simple. As in our example, assuming 60 of the 100 books in the training set are "humanities", then p(cate1) = 60/100 = 0.6 and p(cate2) = 1 - p(cate1) = 0.4. Here we use the variables:
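(The original snippet isn't shown; presumably it was just two assignments, something like the following.)

```python
pcate1 = 60 / 100.0   # p(cate1), prior probability of "humanities"
pcate2 = 1 - pcate1   # p(cate2), prior probability of "non-humanities"
```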
4. Calculate the probability of each tag in the tag set appearing in the training set's "humanities" books

First, we build a list from the training set. Each item in this list is itself a list, whose entries are either 1 or 0: a 1 means that the tag at that position in dicts is one of this book's tags. For example, suppose our dicts and the encoded "humanities" books look like this:
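(The original listing isn't shown in this copy; this is a plausible reconstruction consistent with the explanation below. The tag names, their order, the list name cate1_books, and the second row are assumptions.)

```python
# Hypothetical tag set (same as above); the column positions matter.
dicts = ["prose", "novel", "psychology", "behavior", "computer",
         "science", "theory", "America", "classic"]

# Each row is one "humanities" book from the training set, encoded over dicts.
# Row 0: "The Catcher in the Rye"   -> "novel", "America", "classic"
# Row 1: "Predictably Irrational"   -> "psychology", "behavior", "America"
cate1_books = [
    [0, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
]
```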
This list of rows corresponds to the "humanities" category: each row represents one "humanities" book in the training set. The first row corresponds to The Catcher in the Rye, whose tags are "novel", "classic", and "America"; the second row corresponds to Predictably Irrational, whose tags are "psychology", "behavior", and "America". Note that we use the entire tag set dicts to represent the tags of one book. So the 1 in column 1 of the first row (we count columns from 0) indicates that The Catcher in the Rye has the tag "novel" (column 1 in dicts); the 0 in column 2 of the first row indicates that it does not have the tag "psychology" (column 2 in dicts). Similarly, column 7 of both rows is 1, indicating that both The Catcher in the Rye and Predictably Irrational have the tag "America". With data in this form, the calculation is easy. Let's compute p(tag1|cate1) as an example. For tag1, we count how many times tag1 appears across all "humanities" books in the training set. For example: the training set has 60 "humanities" books, and 40 of them have the tag "classic", so num_of_tag1 = 40. In the same way we count each tag, for example num_of_tag2 = 32, num_of_tag3 = 18... Then we find the total number of tags of all books in the "humanities" category (note that duplicates are not removed here). For example, if the "humanities" category had only 2 books, the first tagged "prose", "classic", "foreign" and the second tagged "classic", "novel", the total number of tags would be 3 + 2 = 5. Now suppose the total number of tags across all 60 "humanities" books in the training set is 700; let total_cate1 = 700. Then the probability of tag1 appearing in the "humanities" category is p(tag1|cate1) = num_of_tag1 / total_cate1 = 40/700 ≈ 0.057. Similarly, we get p(tag2|cate1), p(tag3|cate1)... Using NumPy, we can implement this process with just a few lines of code.
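(The original code isn't shown in this copy; below is a minimal sketch of what it likely looked like, based on the numbered explanation that follows. The loop source cate1_books is the hypothetical list from the reconstruction above.)

```python
import numpy as np

num_tags_cate1 = np.ones(len(dicts))   # (1) per-tag counts, initialized to 1 (explained later)
total_cate1 = 2.0                      # (2) total tag count, initialized to 2.0 (explained later)

for item in cate1_books:               # one 0/1 row per "humanities" book
    num_tags_cate1 += item             # (3) add this book's tag vector to the counts
    total_cate1 += np.sum(item)        # (4) add this book's number of tags to the total

p_tags_cate1 = num_tags_cate1 / total_cate1   # (5) probability of each tag within "humanities"
```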
Let me explain.

(2) total_cate1 = 2.0. total_cate1 is the denominator, and the denominator must not be 0, so we give it a non-zero initial value. Why 2.0? We will explain that later.

(3) num_tags_cate1 += item. item is a plain Python list such as the [0,1,0,0,0,0,0,1,1] we saw above. When you add a Python list to a NumPy array, NumPy adds the corresponding entries for you, effectively overloading +. For example, if a is the NumPy array [1,2,3,5,0] and b is the Python list [0,0,3,2,1], then a + b = [1,2,6,7,1], and the result is a NumPy array. In this example, the counts of the three tags "novel", "classic", and "America" each increase by 1.

(4) We add up the number of tags appearing in each book. np.sum(item) (NumPy's sum) adds up all the entries of item; for example, np.sum([2,5,-1]) = 2 + 5 + (-1) = 6. If item is [0,1,0,0,0,0,0,1,1], corresponding to The Catcher in the Rye with tags "novel", "classic", and "America", the total tag count increases by 3.

(5) Finally, we divide num_tags_cate1 by total_cate1; NumPy also overloads "/" here. For example, [2,4,6]/2 divides each entry by 2 and produces the NumPy array [1,2,3]. In our case, this divides the number of times tag1, tag2, tag3... appear by the total number of tags, producing the NumPy array p_tags_cate1. Each entry of this array is a probability: the probability that the corresponding tag appears in the cate1 ("humanities") category. Similarly, we can compute p_tags_cate2, the probability of each tag appearing in cate2 ("non-humanities").

5. What do we have now

At this point, we have almost everything. Recall the Bayesian classification formula:

p(cate1|tag1,tag2,tag3...) = p(tag1,tag2,tag3...|cate1) * p(cate1) / p(tag1,tag2,tag3...)
As discussed before, the denominator can be ignored: we do not need to care about p(tag1,tag2,tag3...), because it is the same for both categories. Furthermore, according to the naive Bayes assumption, the numerator equals:

p(tag1|cate1) * p(tag2|cate1) * p(tag3|cate1) * ... * p(cate1)
p(cate1) is the pcate1 we defined above, and p(tag1|cate1), p(tag2|cate1)... are the entries of the NumPy array p_tags_cate1 we just computed. We only need to multiply them together to get p(tag1|cate1) * p(tag2|cate1) * ...!

At this point we can explain why the code above initializes num_tags_cate1 with ones. If we initialized it with zeros, then whenever some tag never appears in a category (which can easily happen), its estimated probability would be 0, the whole product would become 0, and the comparison would be ruined. To avoid this, we pretend every tag appears at least once, so the counts start at 1. In the worst case num_tags_cate1 stays [1,1,1,...] and total_cate1 stays 2.0, so each such tag is assigned a probability of 0.5 (1/2.0). This initial value is an adjustable parameter, but do not set total_cate1 = 1.0: then each such tag would get a probability of 1, which is a big problem.

6. Use the training data to classify new books

Now that we have our Bayesian classifier, let's see how to classify a new book. "New" means a book outside the training set (remember, the 100 manually classified books are the training set); we now want to classify the 101st book, using the quantities computed above. As before, we extract the tags of the new book and encode them over dicts as a Python list of 0s and 1s, denoted tagvects, e.g. [1,0,0,1,0,0,1,...]. Next, we multiply each entry of p_tags_cate1 by the corresponding entry of tagvects:
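(The original snippet isn't shown; a minimal sketch, where masked1 is a hypothetical name.)

```python
# Element-wise product: keeps p(tag_i|cate1) where the new book has tag_i, and 0 elsewhere.
masked1 = p_tags_cate1 * tagvects
```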
Then multiply the remaining (non-zero) entries of the result together to get temp1:
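(A sketch under the assumption that only the tags the new book actually has contribute to the product; the original code isn't shown.)

```python
# Multiply together the probabilities of the tags that are present (skip the zero entries).
temp1 = 1.0
for p in masked1:
    if p > 0:
        temp1 *= p
```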
In the same way, calculate temp2 for the "non-humanities" category:
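(Sketched under the same assumptions, using p_tags_cate2.)

```python
masked2 = p_tags_cate2 * tagvects
temp2 = 1.0
for p in masked2:
    if p > 0:
        temp2 *= p
```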
Finally:
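(A sketch; the names p_cate1_tags and p_cate2_tags come from the text below.)

```python
# Multiply by the category priors to get the two (unnormalized) scores we compare.
p_cate1_tags = temp1 * pcate1
p_cate2_tags = temp2 * pcate2
```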
Obviously, by comparing p_cate1_tags and p_cate2_tags, we can classify the new book: it goes to whichever side has the larger value.

Optimization trick

The formula above multiplies many probabilities together, which is a poor approach when the tag set dicts is very large (that is, when your books have a lot of tags). Since each factor is a small decimal, multiplying so many of them can underflow: the number becomes too small to represent and the result comes out as 0. A common mathematical trick avoids this: take the natural logarithm, ln. In Python you can use math.log(), or np.log() for NumPy arrays. There are several places where you could take logarithms; the recommended approach here is to change the quantity we compute to:

ln( p(tag1|cate1) * p(tag2|cate1) * ... * p(cate1) )
Expanded, it becomes:

ln p(tag1|cate1) + ln p(tag2|cate1) + ... + ln p(cate1)

Because ln is monotonically increasing, comparing these sums gives the same result as comparing the original products, and sums of logarithms do not underflow.
Recall that p(tag1|cate1), p(tag2|cate1)... are the entries of the p_tags_cate1 we calculated above (a NumPy array in which each entry is the probability of the corresponding tag appearing in the "humanities" category). In our calculation above we had:
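(That is, the last line of the earlier sketch.)

```python
p_tags_cate1 = num_tags_cate1 / total_cate1
```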
Now we take the logarithm of it, so we change the code to:
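(A sketch of the change, using np.log, which applies element-wise to an array.)

```python
p_tags_cate1 = np.log(num_tags_cate1 / total_cate1)
```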
Note that the log here is taken of each entry of the NumPy array, and the result is still an array; p_tags_cate1 now holds the logarithm of each probability. Then we need ln(pcate1) as well, so pcate1 becomes:
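(A sketch; here we simply replace the priors with their logarithms.)

```python
pcate1 = np.log(pcate1)   # ln p(cate1)
pcate2 = np.log(pcate2)   # ln p(cate2)
```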
Therefore, the classification code above changes to:
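(A sketch, assuming pcate1 already holds ln p(cate1) as above.)

```python
# In log space, the product of probabilities becomes a sum of logs.
temp1 = np.sum(p_tags_cate1 * tagvects)   # sum of ln p(tag_i|cate1) over the book's tags
p_cate1_tags = temp1 + pcate1             # plus ln p(cate1)
```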
In the same way, calculate temp2:
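(Sketched under the same assumptions.)

```python
temp2 = np.sum(p_tags_cate2 * tagvects)
p_cate2_tags = temp2 + pcate2
```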
Then you can classify:
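(A sketch of the final comparison; the print statements are only for illustration.)

```python
if p_cate1_tags > p_cate2_tags:
    print("humanities")        # cate1
else:
    print("non-humanities")    # cate2
```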
Summary

I'm glad you made it this far. This article tries to be concise and to keep the learning cost low, but you may still not fully understand it on a first read. That is normal: everything takes a process, especially in machine learning, where you have to chew on things, think, and practice repeatedly. While writing this article I also deepened my own understanding of Bayesian classification, and there are surely still points I have not expressed clearly. If you have any questions or suggestions, feel free to discuss them with me. Through this article, we covered: how Bayesian classification turns "which category does this book belong to" into a comparison of conditional probabilities, the naive independence assumption that makes the computation tractable, how to estimate the needed probabilities from a training set with NumPy, and the logarithm trick that avoids underflow.