A step-by-step guide to spam filtering with Python and Scikit-Learn

Text mining (extracting information from text) is a broad field, and it is attracting more and more attention now that massive amounts of text data are generated every day. With the help of machine learning models, many text mining applications have been automated, including sentiment analysis, document classification, topic classification, text summarization, and machine translation.

Among these applications, spam filtering is a good starting point for beginners to practice document classification. The "Spam" folder in a Gmail account, for example, is a real-life application of spam filtering. In this article we will build a spam filter based on Ling-spam, a public email dataset. The Ling-spam dataset can be downloaded here:

http://t.cn/RKQBl9c

Here we have extracted an equal number of spam and non-spam emails from Ling-spam; this subset can be downloaded here:

http://t.cn/RKQBkRu

We will go through the following steps to build a working spam filter:

1. Prepare text data;

2. Create a word dictionary;

3. Feature extraction;

4. Train the classifier.

Finally, we will validate the filter on a test dataset.

1. Prepare text data

Here we divide the dataset into a training set (702 emails) and a test set (260 emails), each containing 50% spam and 50% non-spam emails. The spam emails are easy to identify because each spam file is named with the prefix "spmsg".

In most text mining problems, text cleaning is the first step: we must remove words and characters that are irrelevant to the target information, and this example is no exception. Emails usually contain many useless characters, such as punctuation marks, stop words, and numbers, which do not help detect spam, so we clean them out. The emails in the Ling-spam dataset have already been preprocessed in the following ways:

a) Stop-word removal - Stop words like "and", "the", and "of" are very common in English sentences, but they are not very useful for determining whether an email is spam, so they have been removed from the emails.

b) Lemmatization - This is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes", and "included" can all be represented as "include". Unlike stemming (another text mining method, which chops words down to their stems without considering meaning), lemmatization preserves the contextual meaning of the sentence.

In addition, we still need to remove non-words such as punctuation marks and special characters. There are many ways to do this; here, we will first create a dictionary and then remove the non-words from it. This approach is convenient because, once you have a dictionary, each non-word token only needs to be removed once.
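The Ling-spam corpus used here has already been preprocessed, but if you start from raw email text, a minimal sketch of the two cleaning steps above might look like the following (this uses NLTK, which is not part of the original pipeline, and assumes its "stopwords" and "wordnet" resources have been downloaded):

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time setup: nltk.download('stopwords'); nltk.download('wordnet')
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def clean(text):
        # Lowercase, tokenize on whitespace, drop stop words, lemmatize the rest
        # (lemmatize() treats each word as a noun unless a part of speech is given)
        tokens = text.lower().split()
        return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]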

2. Create a word dictionary

A sample email in the dataset looks like this:

  1. Subject: posting
  2.  
  3. hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

You will notice that the first line of the email is the subject and the third line is the body. We perform the text analysis only on the email body to determine whether the email is spam. As a first step, we need to create a dictionary of words and their frequencies. To build this dictionary we use the 702 emails in the training set. See the following Python function for the implementation:

    def make_Dictionary(train_dir):
        emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
        all_words = []
        for mail in emails:
            with open(mail) as m:
                for i, line in enumerate(m):
                    if i == 2:  # body of the email is only the 3rd line of the text file
                        words = line.split()
                        all_words += words

        dictionary = Counter(all_words)
        # Paste the code for non-word removal here (snippet given below)
        return dictionary

After the dictionary is created, we just add a few more lines to the function above to remove the non-word tokens mentioned earlier. I also removed single characters, which are irrelevant for spam detection. See the code below; note that it should go at the end of the make_Dictionary(train_dir) function, in place of the comment.

    list_to_remove = list(dictionary.keys())  # copy the keys so we can delete while iterating
    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)

You can inspect the dictionary with print(dictionary). You may see many irrelevant words in the output; don't worry about them, since there is always a chance to adjust the dictionary in later steps. If you are using the dataset linked above, your dictionary should contain high-frequency words like the ones below (in this example we selected the 3,000 most frequent words):

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762)]

3. Feature extraction

After the dictionary is ready, we can extract a word-count vector of dimension 3000 for each email in the training set (this vector is our feature). Each entry of the vector records how often the corresponding high-frequency word occurs in that email, and, as you may have guessed, most entries will be 0. For instance, suppose our dictionary contained only 500 words and an email in the training set read "Get the work done, work done". Its word-count vector would look like [0, 0, ..., 0, 2, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 2, 0, ..., 0]: the four distinct words of the sentence fall at positions 296, 359, 415, and 495 of the length-500 vector, those entries hold the word counts ("work" and "done" each appear twice), and every other entry is 0.
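To make this concrete, here is a toy sketch (not part of the original code) that builds such a count vector against a made-up four-word dictionary:

    from collections import Counter

    # Hypothetical 4-word dictionary standing in for the real 3000-word one
    toy_dictionary = ['done', 'get', 'the', 'work']
    tokens = "Get the work done, work done".lower().replace(',', '').split()
    counts = Counter(tokens)
    vector = [counts[w] for w in toy_dictionary]
    print(vector)  # -> [2, 1, 1, 2]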

The following Python function generates a feature-vector matrix with 702 rows and 3000 columns: each row represents one of the 702 emails in the training set, and each column represents one of the 3000 dictionary words. The value at position (i, j) is the number of times the j-th dictionary word appears in the i-th email.

    def extract_features(mail_dir):
        files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
        features_matrix = np.zeros((len(files), 3000))
        docID = 0
        for fil in files:
            with open(fil) as fi:
                for i, line in enumerate(fi):
                    if i == 2:  # body is the 3rd line
                        words = line.split()
                        for word in words:
                            wordID = 0
                            # 'dictionary' is the (word, count) list built in step 2
                            for j, d in enumerate(dictionary):
                                if d[0] == word:
                                    wordID = j
                                    features_matrix[docID, wordID] = words.count(word)
            docID = docID + 1
        return features_matrix

4. Training the classifier

Here we will use the scikit-learn machine learning library to train the classifiers. The library's home page is here:

http://t.cn/SMzAoZ

This is an open-source machine learning library that is bundled with the third-party Python distribution Anaconda. It can be installed together with Anaconda, or installed independently by following the instructions at the link below:

http://t.cn/8kkrVlQ

After installation (for example via pip install scikit-learn), we just need to import it into our program to use it.

Here we train two models: a Naive Bayes classifier and an SVM (Support Vector Machine). Naive Bayes is a classic supervised probabilistic classifier that is very commonly used for text classification. It is based on Bayes' theorem and assumes the features are independent of each other given the class. SVM is a supervised binary classifier that is very effective when the number of features is large. Its goal is to find a separating hyperplane between the classes; the hyperplane is defined by a subset of the training points, called the support vectors, and the decision function that assigns the final category to a test sample is based on these support vectors and the kernel trick.

Once the classifiers are trained, we can test the performance of the models on the test set: we extract the word-count vector for each email in the test set and then use the trained Naive Bayes and SVM models to predict its category (normal email or spam). Below is the complete Python code for the spam classifier; remember to include the two functions defined in steps 2 and 3.

    import os
    import numpy as np
    from collections import Counter
    from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
    from sklearn.svm import SVC, NuSVC, LinearSVC
    from sklearn.metrics import confusion_matrix

    # Create a dictionary of words with their frequencies
    train_dir = 'train-mails'
    dictionary = make_Dictionary(train_dir)

    # Prepare feature vectors per training mail and their labels
    # (the first 351 mails are ham = 0, the last 351 are spam = 1)
    train_labels = np.zeros(702)
    train_labels[351:702] = 1
    train_matrix = extract_features(train_dir)

    # Train SVM and Naive Bayes classifiers
    model1 = MultinomialNB()
    model2 = LinearSVC()
    model1.fit(train_matrix, train_labels)
    model2.fit(train_matrix, train_labels)

    # Test the unseen mails for spam
    test_dir = 'test-mails'
    test_matrix = extract_features(test_dir)
    test_labels = np.zeros(260)
    test_labels[130:260] = 1
    result1 = model1.predict(test_matrix)
    result2 = model2.predict(test_matrix)
    print(confusion_matrix(test_labels, result1))
    print(confusion_matrix(test_labels, result2))

Performance Testing

Here our test set contains 130 spam and 130 non-spam emails. If you have completed all the previous steps successfully, the code prints the confusion matrix of each model on the test data. The diagonal elements count the correctly identified emails, while the off-diagonal elements count the misclassifications.

As you can see, the two models perform similarly on the test set, although the SVM is slightly more inclined to flag mail as spam. Note that the test dataset was used neither to create the dictionary nor to train the models.
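If you want a single summary number alongside the confusion matrices, a small addition (not in the original article) can report each model's accuracy:

    from sklearn.metrics import accuracy_score

    # Fraction of test mails classified correctly by each model
    print(accuracy_score(test_labels, result1))  # Multinomial Naive Bayes
    print(accuracy_score(test_labels, result2))  # Linear SVM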

Extension

Interested readers can follow the steps described above and try some extensions. Here we introduce the dataset used for the extension and the results.

The extension uses the preprocessed Enron-spam dataset, which contains 33,716 emails in 6 directories, each with non-spam and spam subdirectories. In total there are 16,545 non-spam and 17,171 spam emails. The Enron-spam dataset can be downloaded here:

http://t.cn/RK84mv6

Note that since the Enron-spam dataset is organized differently from the Ling-spam dataset described above, some of the functions above need slight modifications before they can be applied to it, as in the sketch below.
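For example, a minimal sketch of the adjusted file collection might look like this (the directory and subfolder names are assumptions based on the ham/spam layout described above):

    import os

    # Gather mail paths and labels (0 = ham, 1 = spam) from per-directory subfolders
    def collect_mails(base_dir):
        mails, labels = [], []
        for folder in sorted(os.listdir(base_dir)):
            for label, sub in [(0, 'ham'), (1, 'spam')]:
                sub_dir = os.path.join(base_dir, folder, sub)
                if not os.path.isdir(sub_dir):
                    continue
                for f in os.listdir(sub_dir):
                    mails.append(os.path.join(sub_dir, f))
                    labels.append(label)
        return mails, labels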

Here we split the Enron-spam dataset into a training set and a test set in a 3:2 ratio. Following the steps above, we evaluated both models on the 13,478 test emails.

On this larger dataset, the SVM again performs slightly better than Naive Bayes.

Summary

In this article, we have tried to keep the description simple and easy to follow, omitting many technical explanations and terms. We hope this straightforward tutorial is helpful to beginners who are interested in text analysis.

Some readers may be curious about the mathematics behind the Naive Bayes and SVM models. SVM is the mathematically more complex of the two, while Naive Bayes is relatively easy to understand; we encourage readers interested in the underlying mathematics to explore further, since there are very detailed tutorials and examples online. Trying different ways of achieving the same goal is also a good way to learn. For example, you can adjust the following parameters and observe their effect on the actual performance of the spam filter (a sketch of items c) and f) follows the list):

a) Size of training data

b) The size of the dictionary

c) Different machine learning models, including GaussianNB, BernoulliNB, SVC

d) Different SVM model parameters

e) Improve the dictionary by removing insignificant words (e.g. manual deletion)

f) Use other feature representations (look up tf-idf)
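As a hedged sketch of items c) and f), reusing the train_matrix, train_labels, test_matrix, and test_labels variables from the full program above:

    from sklearn.naive_bayes import GaussianNB, BernoulliNB
    from sklearn.svm import SVC
    from sklearn.feature_extraction.text import TfidfTransformer

    # c) Try the alternative models already imported in the full program
    for model in [GaussianNB(), BernoulliNB(), SVC(kernel='rbf')]:
        model.fit(train_matrix, train_labels)
        print(type(model).__name__, model.score(test_matrix, test_labels))

    # f) Re-weight the raw counts with tf-idf before training
    # (MultinomialNB or LinearSVC can then be fit on these matrices instead)
    transformer = TfidfTransformer()
    train_tfidf = transformer.fit_transform(train_matrix)
    test_tfidf = transformer.transform(test_matrix)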

Finally, the complete Python code from this post is available at the link below:

http://t.cn/R6ZeuiN

If you have any questions, please leave a message at the end of the article for discussion.

This article is reproduced from Leifeng.com. The original text comes from a blog post by a foreign author and was translated by Peng Yanlei and Lin Lihong, two members of the Leifeng.com subtitle team.
