Text mining (extracting useful information from text) is a fairly broad concept, and it is attracting more and more attention in an era when massive amounts of text data are generated every day. With the help of machine learning models, many text-mining applications have been automated, including sentiment analysis, document classification, topic classification, text summarization, and machine translation. Among these applications, spam filtering is a good starting exercise in document classification for beginners; the "spam" folder in a Gmail account is a real-life application of spam filtering.

Next, we will write a spam filter based on Ling-spam, a public email dataset. Ling-spam can be downloaded here: http://t.cn/RKQBl9c From Ling-spam, we have extracted an equal number of spam and non-spam emails; this subset can be downloaded here: http://t.cn/RKQBkRu

We will build a realistic spam filter through the following steps: preparing the text data, creating a word dictionary, extracting features, and training the classifiers.
Finally, we will validate the filter on a test dataset.

1. Preparing the text data

We divide the dataset into two parts: a training set (702 emails) and a test set (260 emails), with spam and non-spam emails each accounting for 50%. The spam emails are easy to identify because every spam file in the dataset is named "spmsg...".

In most text-mining problems, text cleaning is the first step: we must first strip out the words and sentences that are irrelevant to the information we are after, and this example is no exception. Emails usually contain many useless characters such as punctuation marks, stop words, and numbers, none of which help in detecting spam, so they need to be cleaned up. The emails in the Ling-spam dataset have already been preprocessed in the following ways:

a) Stop-word removal - Stop words such as "and", "the", and "of" are very common in English sentences but carry little information about whether an email is spam, so they have been removed from the emails.

b) Lemmatization - This is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes", and "included" can all be represented as "include". Unlike stemming, lemmatization preserves the contextual meaning of the sentence (note: stemming is another text-mining method, one that does not take sentence meaning into account).

We also still need to remove non-words such as punctuation marks and special characters. There are many ways to do this; here, we will first create a dictionary and then remove the non-words from it. This approach is quite convenient: once the dictionary exists, each non-word symbol only needs to be removed once.

2. Creating the word dictionary

A sample email in the dataset typically looks like this:
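Schematically, each file carries the subject on its first line and the whole preprocessed body on its third line (this is a placeholder layout, not an actual message from the corpus):

```
Subject: <email title>

<entire preprocessed email body on one line>
```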
You will find that the first line of the email is the title and the third line is the body; we will analyze only the body text to determine whether the email is spam. As a first step, we need to create a dictionary of words and their frequencies of occurrence. To build this "dictionary" we use the 702 emails of the training set, as in the following Python function:
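A minimal sketch of such a function, under the assumptions above (one email per file, body on the third line):

```python
import os
from collections import Counter

def make_Dictionary(train_dir):
    # Collect the paths of all email files in the training directory.
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:  # the body of each email sits on the third line
                    all_words += line.split()
    # Count how often each word occurs across the whole training set.
    dictionary = Counter(all_words)
    # (the non-word cleanup from the next step is appended here)
    return dictionary
```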
After the dictionary is created, we only need to add a few lines of code to the above function in order to remove the non-words mentioned earlier. Here I also deleted single characters, which are irrelevant to spam detection. Note that the code below should be appended to the end of the def make_Dictionary(train_dir) function, just before its return statement.
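A sketch of that cleanup (indented to sit inside the function body); it drops non-alphabetic tokens and single characters, then keeps only the 3,000 most frequent words:

```python
    # Remove tokens that are not purely alphabetic (punctuation, numbers, ...)
    # as well as single characters, then keep the 3000 most frequent words.
    for item in list(dictionary):  # copy the keys so we can delete while iterating
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
```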
At this point you can print the dictionary with print(dictionary). You may see many irrelevant words in the printed dictionary; don't worry about this, as there will be opportunities to adjust it in later steps. If you use exactly the dataset linked above, your dictionary should contain the same set of high-frequency words (in this example, we keep the 3,000 most frequent words).
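Since the cleanup step converts the dictionary into a list of (word, count) pairs sorted by frequency, a simple slice shows its top entries:

```python
# 'dictionary' is a list of (word, count) tuples after most_common(3000),
# so the first entries are the highest-frequency words.
print(dictionary[:10])
```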
3. Feature extraction

Once the dictionary is ready, we can extract a 3000-dimensional word-count vector for each email in the training set (this vector is our feature). Each word-count vector contains the frequencies of the 3,000 high-frequency words selected above; as you may have guessed, most of those frequencies will be 0. An example: suppose our dictionary contains 500 words, so each word-count vector holds the frequencies of those 500 words for one training email. Suppose a training email contains the text "Get the work done, work done". Its word-count vector would then look like [0,0,0,0,0,…,0,0,2,0,0,0,…,0,0,1,0,0,…,0,0,1,0,0,…,2,0,0,0,0,0,0]: each word's count is placed at that word's position in the 500-length vector; here the four distinct words fall at positions 296, 359, 415, and 495, and every other position is 0.

The following Python function generates a feature-vector matrix with 702 rows and 3,000 columns, where each row represents one of the 702 emails in the training set and each column represents one of the 3,000 dictionary words. The value at position (i, j) is the number of times the j-th dictionary word appears in the i-th email.
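A minimal sketch of that function, assuming the dictionary built above (a list of (word, count) tuples) is in scope; the linear scan over the dictionary keeps the code short but slow:

```python
import os
import numpy as np

def extract_features(mail_dir):
    # One row per email, one column per dictionary word.
    files = [os.path.join(mail_dir, fi) for fi in sorted(os.listdir(mail_dir))]
    features_matrix = np.zeros((len(files), 3000))
    for doc_id, fil in enumerate(files):
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # the body is on the third line
                    words = line.split()
                    for word in words:
                        # Record the count of every body word found in the dictionary.
                        for word_id, (d, _) in enumerate(dictionary):
                            if d == word:
                                features_matrix[doc_id, word_id] = words.count(word)
    return features_matrix
```

The file list is sorted so that row order is deterministic, which matters when the label vector is built by position in the next step.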
4. Training the classifiers

Here we use the scikit-learn machine learning library to train the classifiers; the scikit-learn site is at: http://t.cn/SMzAoZ It is an open-source machine learning library bundled with the third-party Python distribution Anaconda, so it can be installed together with Anaconda, or installed independently by following the instructions here: http://t.cn/8kkrVlQ Once installed, we only need to import it into our program.

We train two models: a Naive Bayes classifier and an SVM (Support Vector Machine). The Naive Bayes classifier is a classic supervised probabilistic classifier that is very commonly used in text classification; it is based on Bayes' theorem and assumes that the features are independent of one another given the class. The SVM is a supervised binary classifier that is very effective when the number of features is large; its goal is to find the subset of the training data, called the support vectors, that defines the boundary of the separating hyperplane. The SVM decision function that assigns a final category to test data is based on these support vectors and on the kernel trick.

After the classifiers are trained, we can test the models' performance on the test set: we extract the word-count vector for each email in the test set and then use the trained Naive Bayes classifier and SVM model to predict its category (normal email or spam). Below is the complete Python code for the spam classifier, which also needs the two functions defined in steps 2 and 3.
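A sketch of such a driver script, under a few assumptions not fixed by the text: the directories are named train-mails and test-mails, the sorted file names place all 351 non-spam emails before the 351 spam ones (the spam files are the ones named spmsg..., so verify this ordering against your copy of the dataset), and LinearSVC is used as the SVM implementation:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# make_Dictionary and extract_features from steps 2 and 3 are assumed
# to be defined above.

train_dir = 'train-mails'  # assumed directory name
dictionary = make_Dictionary(train_dir)

# Labels: 0 = non-spam, 1 = spam. Assumes sorted file names place the
# 351 non-spam emails before the 351 'spmsg...' spam emails.
train_labels = np.zeros(702)
train_labels[351:] = 1
train_matrix = extract_features(train_dir)

# Train both classifiers.
model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix, train_labels)
model2.fit(train_matrix, train_labels)

# Evaluate on the held-out test set (130 non-spam + 130 spam emails).
test_dir = 'test-mails'  # assumed directory name
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:] = 1
print(confusion_matrix(test_labels, model1.predict(test_matrix)))
print(confusion_matrix(test_labels, model2.predict(test_matrix)))
```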
Performance testing

Our test set contains 130 spam and 130 non-spam emails. If you have successfully completed all the previous steps, you can evaluate both models through their confusion matrices on the test data: the diagonal elements count correctly identified emails, while the off-diagonal elements count misclassifications. The two models perform similarly on the test set, although the SVM is slightly more inclined to flag emails as spam. Note that the test dataset was used neither to create the dictionary nor to train the models.

Extension

Interested readers can follow the steps described above and extend the experiment; here we introduce the database used for the extension and the results. The extension uses the preprocessed Enron-spam database, which contains 33,716 emails in 6 directories, each directory holding non-spam and spam subdirectories; in total there are 16,545 non-spam and 17,171 spam emails. The Enron-spam database can be downloaded here: http://t.cn/RK84mv6 Note that because Enron-spam is organized differently from the Ling-spam corpus used above, some of the functions above need slight modifications before they can be applied to it (a hypothetical sketch of the adapted file-collection step appears after the summary below). We split the Enron-spam database into a training set and a test set in a 3:2 ratio; following the steps above, on the 13,478 test emails the SVM again performs slightly better than Naive Bayes.

Summary

In this article we have tried to keep the description simple and easy to follow, omitting many technical explanations and terms; we hope this tutorial is helpful to beginners interested in text analysis. Some readers may be curious about the mathematics behind the Naive Bayes and SVM models: SVM is the mathematically more complex of the two, while Naive Bayes is relatively easy to understand. We encourage readers interested in the underlying mathematics to explore further; very detailed tutorials and examples of both models are available online. Trying different methods toward the same goal is also a good way to learn: for example, you can vary parameters such as the size of the training set or the number of words kept in the dictionary (3,000 in this example) and observe their effect on the filter's actual performance.
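As promised in the extension section, here is a hypothetical sketch of how the file-collection step could be adapted to a layout in which each directory holds ham and spam subdirectories; the subdirectory names are an assumption, so check them against your copy of the corpus:

```python
import os

def collect_mails(root_dir):
    # Walk an Enron-spam style layout: root_dir/{ham,spam}/<mail files>.
    # Returns file paths together with labels (0 = non-spam, 1 = spam).
    filepaths, labels = [], []
    for sub_name, label in (('ham', 0), ('spam', 1)):
        sub_dir = os.path.join(root_dir, sub_name)
        for f in sorted(os.listdir(sub_dir)):
            filepaths.append(os.path.join(sub_dir, f))
            labels.append(label)
    return filepaths, labels
```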
Finally, the complete Python code mentioned in this post is available at the following link: http://t.cn/R6ZeuiN If you have any questions, please leave a comment below for discussion.

This article is reproduced from Leifeng.com. The original text comes from a blog post by a foreign author and was compiled by Peng Yanlei and Lin Lihong of the Leifeng.com subtitle team.