Abstract: As the mobile Internet matures, new apps appear in an endless stream, and their quality varies widely. Given the choice, we would all rather use well-crafted, conscientious software than run-of-the-mill apps, but such apps are not easy to find. This article uses the Scrapy framework to crawl more than 6,000 apps from the well-known app market Coolapk and, through analysis, identifies the best apps in each category. These apps can fairly be called conscientious works, and using them will give you a brand-new phone experience.

1. Analysis background

1.1. Why choose Coolapk

If GitHub is a paradise for programmers, then Coolapk is a paradise for mobile app enthusiasts (also known as "gadget geeks"). Compared with traditional app download markets, Coolapk has three special features:
As an app lover, I have found many good apps on Coolapk, and the more I use it, the more I feel that what I know is just the tip of the iceberg. I wanted to dig into how many good things this website really holds. Checking them one by one by hand is clearly unrealistic, so I naturally turned to the best tool for the job: a crawler. To that end, I recently learned the Scrapy crawler framework and crawled about 6,000 apps from the site. Through analysis, I found high-quality apps in different fields. Let's take a look.

1.2. Analysis content

An overall analysis of the ratings, downloads, size, and other indicators of the 6,000 apps. Based on daily usage scenarios, the apps are divided into 10 categories, including system tools, information reading, social entertainment, and so on, and high-quality apps are selected in each category.

1.3. Analysis tools
2. Data capture

Since the Coolapk mobile app has anti-scraping measures in place, capturing its traffic with Charles failed, so for now we use Scrapy to crawl the app information from the web version. Crawling finished on November 23, 2018, covering 6,086 apps with 8 fields each: app name, download count, rating, number of ratings, number of comments, number of followers, size, and category tags.

2.1. Target website analysis

This is the target web page we want to crawl. Clicking through the pages reveals two useful pieces of information:
Next, let's see what information to capture. The main list page shows the app name, download count, rating, and so on. Clicking an app's icon opens its detail page, which provides more complete information, including category tags, number of ratings, number of followers, etc. Since we need to classify and filter the apps later, the category tags are very useful, so here we choose to visit each app's detail page to capture the required indicators. From this analysis we can settle the crawling flow: first traverse the list pages and collect the detail-page URLs of the 10 apps on each page, then crawl each app's indicators from its detail page. Traversed this way, we need to crawl about 6,000 web pages, which is no small workload, so we will use the Scrapy framework.

2.2. Introduction to the Scrapy framework

Before introducing Scrapy, let's recall the Pyspider framework, which we previously used to crawl 50,000 articles from Huxiu.com. It is a crawler tool written by a Chinese developer; its GitHub stars exceed 10K, but its overall functionality is relatively limited. Is there a more powerful framework? Yes: the Scrapy framework discussed here. Its GitHub stars exceed 30K, and it is the most widely used crawler framework in the Python world, one you must know how to use. There are many official documents and tutorials about Scrapy online; here are a few.
Scrapy is relatively more complex than Pyspider: it has distinct processing modules, and a project consists of several files, with different crawler concerns placed in different files. When first getting started, the code can therefore feel scattered and confusing. The following approach is recommended for getting up to speed with Scrapy quickly:
This learning path is fast and effective, and much better than following a tutorial without ever writing anything yourself. Next, we take Coolapk as an example and crawl it with Scrapy.

2.3. Capturing the data

First, we need to install the Scrapy framework. On Windows with Anaconda already installed, this is very simple: open the Anaconda Prompt command window and enter the following command, which automatically installs Scrapy and all the libraries it depends on.
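The original command is not preserved here; a minimal sketch, assuming Anaconda's default channels carry Scrapy:

```
conda install scrapy
```

Outside Anaconda, `pip install scrapy` works as well.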
2.3.1. Create a project

Next, we need to create a crawler project, so we first switch from the root directory to the working path where the project will live. For example, the storage path I set here is E:\my_Python\training\kuan. Then enter the following command to create the kuan crawler project:
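```
scrapy startproject kuan
```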
After executing the above command, a scrapy crawler project named kuan will be generated, which contains the following files:
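The generated layout follows Scrapy's standard project structure (annotations are ours):

```
kuan/
├── scrapy.cfg          # deployment configuration
└── kuan/
    ├── __init__.py
    ├── items.py        # field definitions for the crawled data
    ├── middlewares.py  # downloader / spider middlewares
    ├── pipelines.py    # storage logic (MongoDB, see below)
    ├── settings.py     # delays, pipelines, MongoDB parameters
    └── spiders/
        └── __init__.py # kuan.py (the main spider) will live here
```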
Next, we need to create the main crawling program, kuan.py, in the spiders folder, which is done by running the following two commands:
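The two commands are presumably the standard pair: change into the project directory, then generate the spider (the domain argument is an assumption):

```
cd kuan
scrapy genspider kuan www.coolapk.com
```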
2.3.2. Declare the Item

After the project files are created, we can start writing the crawler. First, the names of the fields to be crawled need to be pre-defined in the items.py file, as shown below:
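A sketch of items.py with the 8 fields; the field names follow the ones used later in the text:

```python
import scrapy


class KuanItem(scrapy.Item):
    # the 8 indicators located on the detail page
    name = scrapy.Field()       # App name
    volume = scrapy.Field()     # size
    download = scrapy.Field()   # number of downloads
    follow = scrapy.Field()     # number of followers
    comment = scrapy.Field()    # number of comments
    tags = scrapy.Field()       # category tags
    score = scrapy.Field()      # rating
    num_score = scrapy.Field()  # number of ratings
```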
The fields here are the 8 pieces of information we identified on the web page earlier, including name for the app name, volume for the size, and download for the download count. Once defined here, these fields will be used in the main crawling program.

2.3.3. The main crawling program

After the kuan project is created, the Scrapy framework automatically generates some skeleton crawling code. Next, we need to add the field-parsing logic for the crawled pages in the parse() method.
Open Dev Tools on the homepage, find the node holding each indicator, and then extract it with CSS selectors, XPath, regular expressions, or similar; Scrapy supports all of these, so choose whichever you like. Here we use CSS syntax to locate nodes, but note that Scrapy's CSS syntax differs slightly from the CSS syntax we used with pyquery before. A few comparative examples follow. First, we locate the homepage URL node of the first app: the URL node is an a element under the div whose class attribute is app_left_list, and its href attribute holds the URL we need. It is a relative address that becomes the complete URL after joining. Next, we open the Coolapk detail page and locate the app name: the name lives in the text of the p element whose class attribute is detail_app_title. With these two nodes located, we can extract the fields with CSS. Here is the conventional (pyquery) way compared with the Scrapy way:
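A side-by-side sketch (the exact original snippets are not preserved; the class names follow the ones just mentioned):

```python
# conventional pyquery writing:
#   url = doc('.app_left_list a').attr('href')
#   name = doc('.detail_app_title').text()

# Scrapy CSS writing -- note the :: pseudo-element syntax:
url = response.css('.app_left_list a::attr(href)').extract_first()
name = response.css('.detail_app_title::text').extract_first()
```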
As you can see, getting an href or text attribute requires ::, for example ::text to get text. extract_first() extracts the first element; if there are multiple elements, use extract(). Following this pattern, we can write the parsing code for all 8 fields. First, we extract the list of app URLs on the home page, then visit each app's detail page to extract the 8 fields:
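A minimal sketch of the parse() method (selector as above):

```python
def parse(self, response):
    # collect the detail-page URLs of the 10 apps on the current list page
    for rel_url in response.css('.app_left_list a::attr(href)').extract():
        # the href is a relative address; join it into a complete URL
        url = response.urljoin(rel_url)
        yield scrapy.Request(url, callback=self.parse_url)
```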
Here, we use the response.urljoin() method to join each extracted relative URL into a complete URL, and then construct a request to each app's detail page with scrapy.Request(). Two parameters are passed: url and callback. The url is the detail-page URL, and the callback is the callback function, which hands the response returned by that detail-page request to the parse_url() method dedicated to parsing the field content, as shown below:
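A sketch of parse_url(); apart from detail_app_title, the selectors here are assumptions about the page markup:

```python
# in spiders/kuan.py; KuanItem comes from the items.py shown earlier
def parse_url(self, response):
    item = KuanItem()
    item['name'] = response.css('.detail_app_title::text').extract_first()
    # volume, download, follow, comment come from one text block (see below)
    results = self.get_comment(response)
    item['volume'], item['download'], item['follow'], item['comment'] = results
    item['tags'] = self.get_tags(response)
    item['score'] = response.css('.rank_num::text').extract_first()
    item['num_score'] = response.css('.apk_rank_p1::text').extract_first()
    yield item
```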
Here, two helper methods, get_comment() and get_tags(), are defined. The get_comment() method extracts the four fields volume, download, follow, and comment via regular-expression matching, as sketched below:
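Both helpers follow; the selector and the regular expression are assumptions about the page text, which reads roughly "size / N downloads / N followers / N comments" in Chinese:

```python
import re

def get_comment(self, response):
    # one text block holds volume, download, follow and comment together
    text = response.css('.apk_topba_message::text').extract_first(default='')
    result = re.findall(
        r'\s+(.*?)\s+/\s+(.*?)下载\s+/\s+(.*?)人关注\s+/\s+(.*?)个评论', text)
    if result:
        # result[0] is a 4-tuple: (volume, download, follow, comment)
        return result[0]
    return None, None, None, None

def get_tags(self, response):
    # all category tags on the detail page
    return response.css('.apk_left_span2::text').extract()
```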
Then result[0], result[1], and so on extract the four pieces of information respectively. Taking volume as an example, here are the extraction results for the first page:
In this way, all the fields of the 10 apps on the first page are successfully extracted and yielded as items. Let's output their content:
2.3.4. Paginated crawling

Above, we crawled the first page of content. Next, we need to crawl all 610 pages. There are two ways to do this:
Here are the parsing approaches for both methods. The first is very simple: just append the following lines at the end of the parse() method:
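A sketch of this next-page approach (the pagination selector is an assumption):

```python
def parse(self, response):
    ...  # field extraction as above
    # follow the "next page" link until page 610 is exhausted
    next_page = response.css(
        '.pagination li:last-child a::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```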
The second method is to define a start_requests() method before parse() to generate all 610 page URLs in one batch, and then hand them to parse() through the callback parameter of scrapy.Request().
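A sketch, assuming the list pages follow a ?page=N URL pattern:

```python
def start_requests(self):
    # generate all 610 list-page URLs in one batch
    base_url = 'https://www.coolapk.com/apk/?page={}'
    for page in range(1, 611):
        yield scrapy.Request(base_url.format(page), callback=self.parse)
```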
That is the idea for crawling all the pages. Once crawled successfully, the results need to be stored. Here I chose MongoDB; I have to say that compared with MySQL, MongoDB is much more convenient and hassle-free.

2.3.5. Storing results

In the pipelines.py program, we define the storage method. Some MongoDB parameters, such as the address and database name, are kept separately in the settings.py configuration file and then referenced in the pipeline program.
First, we define a MongoPipeline() storage class with several methods; briefly:
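A sketch of the pipeline; the setting names MONGO_URI and MONGO_DB are conventional choices, not confirmed by the source:

```python
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull the MongoDB address and database name out of settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # connect once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # store each crawled item as one document in the 'kuan' collection
        self.db['kuan'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Remember to enable the pipeline in settings.py via ITEM_PIPELINES and to define MONGO_URI and MONGO_DB there.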
After completing the above code, enter the following command to start the whole crawl-and-store process. Running on a single machine, 6,000 web pages take quite a while to finish, so be patient.
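The command, using the spider name declared above:

```
scrapy crawl kuan
```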
Two additional points. First, to reduce pressure on the website, it is best to put a delay of a few seconds between requests; this can be done by adding the following lines at the top of the KuanSpider class:
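For instance, via custom_settings on the spider class (the exact values are illustrative):

```python
class KuanSpider(scrapy.Spider):
    name = 'kuan'
    # pause a few seconds between requests to reduce pressure on the site
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }
```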
Second, to better monitor the crawler while it runs, it helps to set up an output log file, which can be done with Python's built-in logging package:
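A sketch of the logging setup (file name and message format are illustrative):

```python
import logging

logging.basicConfig(
    filename='kuan.log',            # write the log to a file
    filemode='w',
    format='%(asctime)s %(name)s %(levelname)s: %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',    # prefix every line with a timestamp
    level=logging.WARNING,          # record only WARNING and above
)
```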
The level parameter here indicates the severity threshold. From low to high, the levels are: DEBUG < INFO < WARNING < ERROR < CRITICAL. If you don't want the log file to record too much, set a higher level; here it is set to WARNING, so only messages at WARNING level and above are written to the log. The datefmt parameter prefixes each log line with a specific timestamp, which is very useful.

Above, we completed the capture of all the data. With the data in hand, we can start the analysis, but before that the data still needs some simple cleaning and processing.

3. Data cleaning

First, we read the data from MongoDB, convert it into a DataFrame, and take a look at its basic characteristics.
From the first five rows output by data.head(), we can see that apart from the score column, which is a float, all other columns are of object (text) type. Some rows in the comment, download, follow, and num_score columns carry the suffix "万" (10,000); the character needs to be removed and the values converted to numeric. The volume column carries "M" and "K" suffixes; to unify the units, values in "K" need to be divided by 1024 and expressed in "M". The whole dataset is 6,086 rows x 8 columns, with no missing values in any column. The df.describe() method gives basic statistics on the score column: the average app score is 3.9 (out of 5), the lowest is 1.6, and the highest is 4.8. Next, we convert these text columns into numeric data. A sketch of the code follows:
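The original code is not preserved here; a minimal sketch, assuming the raw values look like "3.2万", "5M", and "500K":

```python
def clean_data(df):
    # strip the "万" (10,000) suffix and convert to numbers
    for col in ['comment', 'download', 'follow', 'num_score']:
        df[col] = df[col].apply(
            lambda x: float(x[:-1]) * 10000 if str(x).endswith('万') else float(x))
    # unify volume in MB: "500K" -> 500/1024, "5M" -> 5
    df['volume'] = df['volume'].apply(
        lambda x: float(x[:-1]) / 1024 if str(x).endswith('K') else float(str(x)[:-1]))
    return df
```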
That completes the conversion of these text columns. Checking the basics: the download column holds the app download counts. The most-downloaded app has 51.9 million downloads, the least-downloaded have 0 (a very small number of apps), and the average download count is 140,000. From this we can see the following:
That completes the basic data cleaning. Next, we conduct an exploratory analysis of the data.

4. Data analysis

We mainly analyze app downloads, ratings, size, and other indicators along two dimensions: overall and by category.

4.1. Overall situation

4.1.1. Download ranking

First, let's look at app downloads. When we download an app, the download count is often a very important reference indicator. Since most apps have relatively few downloads, a plain histogram shows no trend, so we segment the data into discrete ranges and plot a bar chart, using Pyecharts as the drawing tool. As many as 5,517 apps (84% of the total) have fewer than 100,000 downloads, while only 20 apps exceed 5 million. For a profitable app, user downloads matter a great deal, and from this point of view most apps are in an awkward position, at least on the Coolapk platform. A sketch of the plotting code follows:
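A sketch using the pyecharts v1 API (the original likely used an older pyecharts release; data is the cleaned DataFrame from above, and the bucket labels are ours):

```python
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Bar

# discretize downloads into ranges, then count apps per range
bins = [0, 10, 100, 1000, 10000, 100000, 1000000, 5000000, 100000000]
labels = ['<10', '10-100', '100-1k', '1k-10k', '10k-100k',
          '100k-1M', '1M-5M', '>5M']
counts = pd.cut(data['download'], bins=bins, labels=labels).value_counts(sort=False)

bar = (
    Bar()
    .add_xaxis([str(i) for i in counts.index])
    .add_yaxis('Number of apps', [int(v) for v in counts.values])
    .set_global_opts(title_opts=opts.TitleOpts(title='App downloads, segmented'))
)
bar.render('download_distribution.html')
```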
Next, let's look at the 20 most-downloaded apps. The Coolapk app itself is far ahead with more than 50 million downloads, nearly twice the 27 million of second-place WeChat. Such a huge lead is easy to understand: it is the platform's own app, and if you don't have Coolapk on your phone, you are not a real "gadget geek". From the chart we can also see the following:
For comparison, let's look at the 20 least-downloaded apps. Compared with the most-downloaded apps above, these pale in comparison; the least-downloaded, "Guangzhou Traffic Restriction Pass", has only 63 downloads. That is not surprising: the app may not have been promoted, or it may have only just been released. Even with so few downloads, some keep a good rating and keep shipping updates, and I give those developers a thumbs up. In fact, apps like these are not the embarrassing ones. The really embarrassing ones are those with huge download counts and the lowest ratings; they give the impression of saying, "I'm this bad, deal with it; don't use me if you can help it."

4.1.2. Rating ranking

Next, let's look at the overall app scores. Here the score is divided into the following 4 intervals, with a level defined for each range (a binning sketch follows). Several interesting phenomena can be found:
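The original interval boundaries are not preserved in this text; a binning sketch with illustrative boundaries only:

```python
# hypothetical boundaries -- the article's actual 4 intervals are not
# reproduced here
score_bins = [1.5, 3.0, 3.5, 4.0, 5.0]
score_labels = ['poor', 'fair', 'good', 'excellent']
data['score_level'] = pd.cut(data['score'], bins=score_bins, labels=score_labels)
print(data['score_level'].value_counts())
```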
Next, let's look at the 20 highest-rated apps. Often we download whichever app has the highest rating. The top 20 all score 4.8 points, including RE Manager (which appears again) and Pure Light Rain Icon Pack, among others. There are also some less common names that may well be good apps, but we still need to check the downloads: all of them exceed 10,000, and with a certain download base the ratings are fairly reliable, so we can download and try them with confidence. The overall analysis above already surfaced some good apps, but it is not enough, so next we subdivide them and apply certain filtering conditions.

4.2. By category

According to app function and daily usage scenario, the apps are divided into the following 9 categories, and the 20 best apps are then selected from each category. To find the best apps possible, three conditions are applied:
After selection, we got the 20 highest-scoring apps in each category. Most of these are indeed conscientious software.

4.2.1. System tools

System tools include input methods, file management, system cleaning, desktops, plug-ins, lock screens, and so on. First place goes to the well-known veteran file manager "RE Manager". At only 5M, besides all the functions of an ordinary file manager, its biggest feature is the ability to uninstall the apps preinstalled on the phone, though it requires root. The file analyzer of "ES File Explorer" is very powerful and can effectively clean up bloated phone storage. The app "A Mu Han" is quite impressive; just as its description says, "better to have me than many things", opening it reveals dozens of practical functions, such as translation, reverse image search, package tracking, and meme making. "Super SU", "Storage Cleaner", "Lanthanum", "MT Manager", and "My Android Tools" are all highly recommended. In short, every app on this list deserves a place on your phone.

4.2.2. Social chat

In the social chat category, "Share Weibo Client" ranks first. As a third-party client, it is naturally better than the official one: compared with the official version's 70M, it is only one-tenth the size, has almost no advertisements, and adds many powerful extra features. If you love browsing Weibo, give "Share" a try. The "Ji Ke" app is also quite good, and scrolling down you can spot "Bullet Messenger", which was hugely popular a while ago and claims it will replace WeChat, though that seems unlikely in the short term. You may notice that common apps such as Zhihu, Douban, and Jianshu are absent from this social list. That is because their ratings are low, only 2.9, 3.5, and 2.9 points respectively, so naturally they did not make the cut. If you really want to use them, try their third-party clients or historical versions instead.

4.2.3. Information reading

In the information reading category, "Jingdu Tianxia" firmly holds first place. I have previously written a dedicated article introducing it: the most powerful reader on Android. Apps in the same category such as "Duokan Reading", "Book Chasing Tool", and "WeChat Reading" also made the list. In addition, if you often struggle with where to download e-books, try "Book Search Master" or "Laozi Book Search".

4.2.4. Audiovisual entertainment

Next is the audiovisual entertainment section, where NetEase's "NetEase Cloud Music" takes the top spot without breaking a sweat, a rare high-quality product from a big company. If you love playing games, try Adobe AIR. If you lean artistic, you will probably like the short-video app "VUE"; post your creations to your Moments and you can definitely show off. The last one, "Hiby Music", is great; I recently discovered it has a powerful feature that works with Baidu Netdisk, automatically identifying and playing audio files.

4.2.5. Communication and networking

Next is the communication and networking category, which mainly covers browsers, contacts, notifications, email, and other subcategories. Every one of us has a browser on our phone, and usage varies widely.
Some people use the browser that ships with the phone, while others use big names such as Chrome and Firefox. You may not have heard of the first three on this list, but they are genuinely excellent, best described as "minimal, efficient, clean, and fast". Among them, "Via" and "X Browser" are under 1M, truly "small but complete", and highly recommended.

4.2.6. Photography and images

Taking and editing photos is also a common need. You may already have your own photo manager, but here I strongly recommend the top app, "Quick Picture Browser". It is only 3M yet can instantly find and load tens of thousands of photos; if you are a photo fanatic, it will open as many as you have. It can also hide private photos and automatically back up to Baidu Netdisk. It is one of the apps I have used the longest.

4.2.7. Documents and writing

We often write and take notes on our phones, so we naturally need good writing apps. "Evernote" needs no introduction; I think it is the best note-taking and summarizing app. If you like writing in Markdown, the exquisite "Pure Writing" should suit you well. At under 3M it offers dozens of functions, such as cloud backup, long-image generation, and automatic spacing between Chinese and English, while still keeping a minimal design style. That is probably why its downloads soared roughly tenfold, from twenty or thirty thousand, in just two or three months. Behind this app is a developer who has devoted several years of spare time to continuous development and updates, which is truly admirable.

4.2.8. Travel, transportation, and shopping

In this category, first place goes to 12306, a name that conjures up those bizarre CAPTCHAs. The app here, however, is not the official one but a third-party client, and its most amazing function is ticket grabbing. If you are still relying on posts in your Moments to grab tickets, give it a try.

4.2.9. Xposed plug-ins

The last category is Xposed, which many people may not know, though many will have heard of WeChat's red-envelope grabbing and message anti-recall functions. Those impressive, out-of-the-ordinary functions come from the various modules of the Xposed framework, which originates from the famous XDA developer forum abroad; some of the so-called cracked software by XDA experts that you hear about comes from there. Simply put, after installing the Xposed framework you can install fun and interesting plug-ins inside it, and with these plug-ins your phone gains far more capabilities, such as removing ads, cracking in-app payment, killing power-hungry auto-start processes, and spoofing phone location. However, using the framework and its plug-ins requires flashing and root, which raises the barrier a bit.

5. Summary

This article used the Scrapy framework to crawl and analyze 6,000 apps on Coolapk. Beginners may find a Scrapy program's structure rather scattered, so you can first try ordinary functions, write the whole program in one piece, and then split it into a Scrapy project; this also helps shift your thinking from a single script to a framework. I will write a separate article about that later.
Because the web version lists fewer apps than the Coolapk app itself, many useful apps are missing, such as Chrome, MX Player, and Snapseed, so the Coolapk app is recommended, where there are more fun things to find. The above is the complete crawling and analysis process. The article mentions a lot of fine software; if you are interested, download some and try them.