Two weeks ago, I was working on a project that needed a batch of simulated user review data. To make the data look realistic, I needed random user nicknames and avatars; if they were all the same, anyone could tell at a glance that the data was fake. So I wrote a web crawler that started from my own QQ space and ran intermittently for two weeks.

In total, I crawled 30 million Tencent QQ records, of which 3 million contain detailed user data (QQ number, nickname, space name, member level, avatar, content of the last post, time of the last post, space introduction, gender, birthday, province, city, marital status). The crawl has reached my 7th circle of friends (depth=7). The current bottlenecks are my home Internet speed and my computer's configuration. At its fastest, the crawler collected 5 million new QQ records per day.

Talk is cheap without pictures. The current data volume is about 2 GB. Here are some interesting statistics I generated from this data. (The data set is too large to load into memory at once, so the following statistics use only about 800,000 records that have small depth values and relatively complete data.) The memory filled up, and that is not my fault. Can anyone sponsor a server?

1. When do people usually post on Weibo?

From the graph, we can see that the quietest time of day is 4 a.m., when most people are asleep. The busiest time is 10 to 11 p.m., when people like to browse other people's spaces and post comments before going to bed. There is also a small peak around 12 noon. Later I will make a chart of when Chinese people usually get up, eat, and go to bed.

2. In which months do Chinese people like to have children?

The most popular months are January and October, and the least popular is April.
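As a rough illustration of how a tally like the birth-month one can be produced when the data does not fit in memory, here is a minimal Python sketch that counts months in a single streaming pass. The CSV layout, the `birthday` column name and its `YYYY-MM-DD` format are assumptions made for the example; the article does not say how the records are actually stored.

```python
import csv
import io
from collections import Counter


def count_birth_months(rows):
    """Tally birthdays by month in one pass, without keeping the rows in memory."""
    months = Counter()
    for row in rows:
        birthday = (row.get("birthday") or "").strip()
        parts = birthday.split("-")
        # Skip empty or malformed values instead of guessing a month.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        month = int(parts[1])
        if 1 <= month <= 12:
            months[month] += 1
    return months


# Tiny made-up sample standing in for the real export (hypothetical data).
sample = io.StringIO(
    "qq,birthday\n"
    "10001,1990-01-15\n"
    "10002,1988-10-02\n"
    "10003,\n"
    "10004,1992-01-30\n"
)
counts = count_birth_months(csv.DictReader(sample))
for month, n in counts.most_common():
    print(month, n)
```

Because `csv.DictReader` yields one row at a time, the same function would work on a file far larger than the available memory; only the twelve counters are kept.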
It is easy to understand why October has many births: the busy part of the year is nearly over and the weather is neither too hot nor too cold, which makes it a good time to have a child. January is harder to understand. It is the coldest month, and the gap from February is huge. Why are people not put off by such cold weather? My guess is that the Chinese New Year is approaching in January, so people who have been apart finally get together, become impulsive, and have sex. It is also easy to understand why April has the fewest birthdays: Chinese people don't like the number 4. Big data is interesting, isn't it! I think it is very interesting, and there is much more to come.

3. The user location distribution I have crawled so far

Can you guess where I am from? The top four are Guangdong, Hunan, Sichuan, and Jiangsu. Yes, I am from Hunan! So many Hunanese work in Guangdong, which explains why Guangdong ranks first. Jiangsu is where I went to school. What is a bit puzzling is that I have no connection to Sichuan at all, yet it ranks third. Friends, which of you planted the seed? Stand up! Another possibility is that Sichuan people are the most sociable. I often eat Chongqing Xiaomian. Sichuan people are really something: they speak so fast and in such high tones that I can't stand it!

4. Age distribution of the data population

I accidentally revealed my age. That's right, I was born in 1990. In the current data, the age and location distributions are still strongly correlated with my own age and location. As the amount of data grows, this correlation will gradually weaken, and the charts will approach the real distribution of users across the country. I really want to set up a few servers for distributed crawling. I estimate that I could then crawl hundreds of millions of basic records in a week.
My laptop and the very slow Internet connection at home are still far from reaching this goal.

5. Gender distribution of the data population

There are 23% more males than females. I think the actual difference is not that large; rather, females generally set stricter QQ space access permissions than males, so the data I crawled skews male.

6. The following series of charts is compiled from how often certain "keywords" appear in the comments, which is quite interesting.

6.1 The stock market in charts

In the Zhihu question "What cool, interesting and useful things can be done with crawler technology?", a Google intern, @Emily L, crawled 40 billion tweets and made a lot of interesting observations. Among them was a paper about using people's moods on Twitter to predict the stock market (http://battleofthequants.net/wp-content/uploads/2013/03/2010-10-15_JOCS_Twitter_Mood.pdf), which is very interesting. I also attached my answer to the question, "Use crawlers to monitor her (his) Zhihu dynamics". I just did it for fun, so please don't criticize me for being vulgar.

If we had a large amount of QQ space and Sina Weibo data, I think it would be feasible to use it for analysis and prediction of the stock market or other areas, and I expect the accuracy would be high. I may consider doing this next. We could analyze stock-related keywords in the massive data to get, for example, a ranking of the stocks discussed on a given day. We could then identify the many users who discuss stocks, use actual market feedback to find the factors that correlate positively with a stock's rise and fall, and analyze these users to produce a ranking of the most reliable stock recommenders. We could classify these users and crawl their data according to priority and crawl density, then use that data to analyze which stocks are reliable.
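The idea of a daily "most discussed stocks" ranking comes down to counting keyword mentions across posts. The following is only a minimal Python sketch of that counting step, not the program used for the charts: the function name, the sample posts and the stock names are all made up for illustration, and plain substring matching is an assumption that would produce false positives on real text.

```python
from collections import Counter


def rank_keywords(posts, keywords):
    """Count how many posts mention each keyword, most-mentioned first."""
    mentions = Counter()
    for text in posts:
        for kw in keywords:
            # Naive substring match; a real system would need smarter matching.
            if kw in text:
                mentions[kw] += 1
    # Keywords that never appear are simply absent from the result.
    return mentions.most_common()


# Hypothetical posts and stock names, for illustration only.
posts = [
    "Bought more AlphaCorp today",
    "AlphaCorp and BetaBank both went up",
    "Nothing about stocks here",
]
ranking = rank_keywords(posts, ["AlphaCorp", "BetaBank", "GammaOil"])
print(ranking)
```

Each post counts at most once per keyword, so one user repeating a name many times in a single post does not inflate the ranking.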
6.2 The ranking of celebrities most discussed by the public, which is still quite reliable

I have also attached the QQ numbers of celebrities that I grabbed, just for entertainment. I make no guarantee as to whether they are real or fake. Some of the spaces do have a lot of private photos of their daily lives.

Zhang Jie QQ: 419998
Zhao Liying of The Journey of Flower QQ: 427794
Xie Na QQ: 500746
Yang Mi QQ: 456773
Fan Bingbing QQ: 88597
Jay Chou QQ: 332661

6.3 Most popular mobile phone brands

6.4 The Internet companies that people like to talk about the most

Alibaba probably ranks so low because everyone likes to call it Taobao or Tmall instead. Giving it so many names is asking for trouble.

6.5 Ranking of the most frequently discussed social platforms in QQ space

6.6 Statistics of life

Love > hate; happy > sad; laughter > sighs; there are a lot of foodies. Who the hell said that China is not happy? This data is full of positive energy.

There are many other analyses that could be done. If there is an interesting analysis you would like to see, please leave me a message.

I won't say much about the technology. The program itself is not difficult, but multi-threaded database access is what made me suffer. Fortunately, the program is now almost stable. The process was also very interesting, and I will write about the interesting parts of upgrading the program when I have time. I think a great program should closely imitate reality, just as airplanes imitate dragonflies and radar imitates bats. This time the program's design imitates a factory production line. The design diagram is attached.

In addition, I am soliciting everyone's clever ideas on whether this data could be used to build an interesting website or app. It is fine whether the idea is just fun or can make a little money, as long as it is not illegal.
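For readers curious what a "factory production line" design with a depth limit might look like, here is a small Python sketch under stated assumptions; it is not the author's program. Several fetcher threads pass records down a queue to a single writer thread, so only one thread touches the database while the crawl runs, which is one common way to sidestep multi-threaded database trouble. `FAKE_GRAPH`, `fetch_friends`, the table layout and the use of SQLite are all hypothetical stand-ins.

```python
import queue
import sqlite3
import threading

# Hypothetical friend graph standing in for real network requests.
FAKE_GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
STOP = object()  # sentinel that tells a station to shut down


def fetch_friends(qq):
    """Placeholder for the real download step (an assumption, not a real API)."""
    return FAKE_GRAPH.get(qq, [])


def crawl(start, max_depth, db, workers=3):
    tasks = queue.Queue()    # station 1 input: (qq, depth) pairs to fetch
    records = queue.Queue()  # station 2 input: rows waiting to be stored
    seen = {start}
    seen_lock = threading.Lock()

    def fetcher():
        while True:
            item = tasks.get()
            if item is STOP:
                tasks.task_done()
                return
            qq, depth = item
            try:
                friends = fetch_friends(qq)
                records.put((qq, depth))
                if depth < max_depth:
                    for friend in friends:
                        with seen_lock:
                            is_new = friend not in seen
                            seen.add(friend)
                        if is_new:
                            tasks.put((friend, depth + 1))
            finally:
                tasks.task_done()

    def writer():
        # Only this thread uses the database while the crawl is running.
        while True:
            row = records.get()
            if row is STOP:
                db.commit()
                return
            db.execute("INSERT INTO users (qq, depth) VALUES (?, ?)", row)

    db.execute(
        "CREATE TABLE IF NOT EXISTS users (qq TEXT PRIMARY KEY, depth INTEGER)"
    )
    fetchers = [threading.Thread(target=fetcher) for _ in range(workers)]
    writer_thread = threading.Thread(target=writer)
    for t in fetchers + [writer_thread]:
        t.start()

    tasks.put((start, 0))
    tasks.join()  # returns once every queued profile has been processed
    for _ in fetchers:
        tasks.put(STOP)
    for t in fetchers:
        t.join()
    records.put(STOP)
    writer_thread.join()


# check_same_thread=False lets the writer thread use a connection made here.
db = sqlite3.connect(":memory:", check_same_thread=False)
crawl("A", max_depth=2, db=db)
rows = db.execute("SELECT qq, depth FROM users ORDER BY qq").fetchall()
print(rows)
```

With several fetchers running at once, the stored depth is the depth at which a user was first discovered, which on a real graph is not guaranteed to be the shortest path from the starting account.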