I "stole" 30 million QQ user records from Tencent and produced a very interesting exclusive report!

Two weeks ago, I was working on a project that required simulating a batch of user review data. To make the data look realistic, I needed random user nicknames and avatars. If the avatars or nicknames were all the same, anyone could tell at a glance that the data was fake.

So I wrote a web crawler that started from my own QQ space and ran intermittently for two weeks. In total it crawled 30 million Tencent QQ accounts, 3 million of which include detailed user data (QQ number, nickname, space name, membership level, avatar, content of the latest post, time of the latest post, space introduction, gender, birthday, province, city, marital status).

So far I have crawled out to my 7th circle of friends (depth=7), 30 million records in total. The current bottlenecks are my home Internet speed and my computer's hardware. At its fastest, the crawler picks up 5 million new QQ records per day.
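The "circle of friends" depth described above corresponds to a breadth-first traversal with a depth limit. The following is a minimal sketch of that idea, not the author's actual crawler: `get_friends` is a hypothetical callback, and the toy graph stands in for real pages so that nothing is fetched over the network.

```python
from collections import deque

def crawl_breadth_first(start, get_friends, max_depth):
    """Visit accounts level by level and record the depth at which each was first seen."""
    depth_of = {start: 0}
    pending = deque([start])
    while pending:
        account = pending.popleft()
        depth = depth_of[account]
        if depth == max_depth:
            continue  # accounts at the depth limit are recorded but not expanded
        for friend in get_friends(account):
            if friend not in depth_of:  # skip accounts already seen
                depth_of[friend] = depth + 1
                pending.append(friend)
    return depth_of

# Toy friend graph (hypothetical data) used instead of real requests.
TOY_GRAPH = {
    "me": ["a", "b"],
    "a": ["me", "c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def toy_get_friends(account):
    return TOY_GRAPH.get(account, [])

result = crawl_breadth_first("me", toy_get_friends, max_depth=2)
print(result)
```

With `max_depth=2` the accounts `e` and `f` are never reached, which is the same cut-off that `depth=7` applies on a much larger graph.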

No pictures, no proof.

The current data volume is about 2 GB.

Let's look at some interesting statistics I generated from this data. The full data set is too large to load into memory at once, so the statistics below use only about 800,000 records that have small depth values and relatively complete fields:
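One way to take such a subset without loading everything is to stream the records and keep only the shallow, complete ones. This is a rough sketch under assumptions: the field names, the `depth` key, and the toy generator are all hypothetical stand-ins for a real database cursor or file reader.

```python
from itertools import islice

# Hypothetical field names used to decide whether a record is "relatively complete".
REQUIRED_FIELDS = ("nickname", "gender", "birthday", "province")

def is_complete(record):
    return all(record.get(field) for field in REQUIRED_FIELDS)

def select_sample(records, max_depth, limit):
    """Lazily keep shallow, complete records and stop after `limit` of them."""
    wanted = (r for r in records if r["depth"] <= max_depth and is_complete(r))
    return list(islice(wanted, limit))

def toy_records():
    # A generator stands in for a cursor, so only one record is in memory at a time.
    yield {"depth": 1, "nickname": "x", "gender": "M", "birthday": "1990-01-05", "province": "Hunan"}
    yield {"depth": 5, "nickname": "y", "gender": "F", "birthday": "1991-10-02", "province": "Sichuan"}
    yield {"depth": 2, "nickname": "", "gender": "M", "birthday": "1989-04-04", "province": "Jiangsu"}
    yield {"depth": 2, "nickname": "z", "gender": "F", "birthday": "1992-01-20", "province": "Guangdong"}
    yield {"depth": 3, "nickname": "w", "gender": "M", "birthday": "1990-10-10", "province": "Hunan"}

sample = select_sample(toy_records(), max_depth=3, limit=2)
print([r["nickname"] for r in sample])
```

Because `islice` stops pulling from the generator once the limit is reached, the remaining records are never read.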

The memory is full, it's not my fault. Who can sponsor a server?

1. When do people usually post on Weibo?

From the chart, we can see that the quietest time of day is 4 a.m., when most people are asleep. The busiest time is 10 to 11 p.m., when people like to browse other people's spaces and post comments before going to bed. There is also a small peak around 12 noon.
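A chart like this comes down to counting posts per hour of the day. Here is a small sketch of that counting step; the timestamp format and the sample values are assumptions for illustration, not the crawled data.

```python
from collections import Counter
from datetime import datetime

def posts_per_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a 24-element list where index h holds the number of posts made in hour h."""
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

# Hypothetical "time of the last post" values.
toy_timestamps = [
    "2015-07-01 22:15:00",
    "2015-07-01 22:40:10",
    "2015-07-02 12:05:00",
    "2015-07-02 04:30:00",
    "2015-07-03 23:01:00",
]

hourly = posts_per_hour(toy_timestamps)
busiest_hour = max(range(24), key=lambda h: hourly[h])
print(hourly, busiest_hour)
```

The same counting works for the birthday chart by grouping on the month instead of the hour.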

I will make a chart later about what time Chinese people usually get up, eat, and go to bed.

2. In which months are Chinese people most often born?

The most common birth months are January and October, and the least common is April. October is easy to explain: by then most of the year's work is done and the weather is neither too hot nor too cold, so it is a good time to have a child. January is harder to explain: it is the coldest month, and the gap from February is huge. Why is nobody put off by the cold? My guess is that Chinese New Year is around the corner in January, couples who have been apart are finally reunited, and they get carried away. April having the fewest birthdays is also easy to explain, because Chinese people don't like the number 4. Big data is fun, isn't it! I think so, and there is plenty more to come.

3. Location distribution of the users crawled so far

Can you guess where I am from? The top four are Guangdong, Hunan, Sichuan, and Jiangsu. Yes, I am from Hunan! A great many Hunanese work in Guangdong, which explains why Guangdong comes first. Jiangsu is where I went to school. The puzzling one is Sichuan: I have no connection to it at all, yet it ranks third. Which of my friends planted that seed? Own up! Another possibility is that people from Sichuan are simply the most sociable. I often eat Chongqing Xiaomian, and Sichuan people really are something else: they talk so fast and at such a high pitch that I can't take it!

4. Age distribution of the data population

I accidentally revealed my age. That's right, I was born in 1990. The current data is still strongly skewed toward my own age group and region. As the amount of data grows, this skew will shrink and the charts will gradually approach the real picture of users across the country. I would love to set up a few servers for distributed crawling; I estimate I could then crawl hundreds of millions of basic records in a week. With my laptop and the terrible Internet speed at home, that goal is still far off.

5. Gender distribution of data population

There are 23% more males than females. I think the real gap is smaller: females generally set stricter access permissions on their QQ space than males do, so the data I could crawl skews male.

6. The following charts are based on how often certain "keywords" appear in the comments, which is quite interesting.

6.1 Stock Market in Pictures

In the Zhihu question "What cool, interesting and useful things can be done with crawler technology?", @Emily L, a Google intern, crawled 40 billion tweets and did a lot of interesting analysis with them. The answer mentions a very interesting paper on using people's moods on Twitter to predict the stock market (http://battleofthequants.net/wp-content/uploads/2013/03/2010-10-15_JOCS_Twitter_Mood.pdf). I have also attached my own answer to that question, "Use crawlers to monitor her (his) Zhihu dynamics". I did it purely for fun, so please don't call me vulgar.

With a large amount of QQ space and Sina Weibo data, I think it would be feasible to analyze and predict the stock market or other trends, and I would expect the accuracy to be quite high. I may try this interesting project next.

We could mine the data for stock-related keywords, for example to rank the stocks discussed most on a given day. From there we could identify the many users who discuss stocks, use the market's actual movements to find which factors correlate positively with rises and falls, and score those users to produce a ranking of the most reliable stock pickers. We could then group these users, crawl them with a priority and frequency that match their rank, and use the resulting data to judge which stocks look reliable.

6.2 Ranking of the celebrities people discuss most, which looks quite reliable.

I have also included the celebrity QQ numbers I came across, purely for entertainment; judge for yourself whether they are real or fake. Some of the spaces do contain a lot of private photos from daily life.

Zhang Jie QQ: 419998; Zhao Liying of The Journey of Flower QQ: 427794; Xie Na QQ: 500746; Yang Mi QQ: 456773; Fan Bingbing QQ: 88597; Jay Chou QQ: 332661

6.3 Most Popular Mobile Phone Brands

6.4 The Internet companies people talk about most. Alibaba probably ranks so low because everyone calls it Taobao or Tmall instead; giving it so many names is asking for trouble.
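The alias problem noted above can be reduced by grouping several keywords under one name before counting. This is only a sketch of that idea: the alias table and sample comments are hypothetical, and plain substring matching is assumed to be good enough for illustration.

```python
from collections import Counter

# Hypothetical alias table: each company is matched by any of its keywords.
ALIASES = {
    "Alibaba": ["alibaba", "taobao", "tmall"],
    "Baidu": ["baidu"],
}

def count_mentions(comments, aliases):
    """Count, per company, how many comments contain at least one of its aliases."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for name, keywords in aliases.items():
            if any(keyword in text for keyword in keywords):
                counts[name] += 1  # one hit per comment, however many aliases match
    return counts

toy_comments = [
    "Bought shoes on Taobao today",
    "The Tmall sale is crazy",
    "Just Baidu it yourself",
    "Alibaba went public",
    "Nothing to see here",
]

mentions = count_mentions(toy_comments, ALIASES)
print(mentions.most_common())
```

Counted by the single keyword "alibaba", the toy data would give Alibaba one mention; with aliases grouped it gets three.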

6.5 Ranking of the most frequently discussed social platforms in QQ space.

6.6 Statistics of life

Love > hate; happy > sad; laughter > sighs; and there are a lot of foodies. Who on earth said China is not happy? This data is full of positive energy.

Well, there are many other analyses that could be done. If there is a data analysis you would like to see, please leave me a message.

I won't say much about the technology. The program itself is not difficult, but multi-threaded database operations are what made me suffer. Fortunately, the program is now nearly stable. The process was fun too, and I will write about the interesting parts of upgrading the program when I have time. I think a good program should be closely modeled on reality, just as airplanes imitate dragonflies and radar imitates bats. This time the program design imitates a factory production line. The design diagram is attached.
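The author's design diagram is not reproduced here, so the following is only one plausible reading of a "production line": several fetch threads hand finished records over a queue to a single stage that owns the database connection, which sidesteps concurrent writes. The table layout, `toy_fetch`, and the use of SQLite are assumptions, and error handling is omitted.

```python
import queue
import sqlite3
import threading

STOP = object()  # sentinel that tells a stage there is no more work

def run_pipeline(accounts, fetch, conn, n_workers=3):
    """Fetch accounts on worker threads; write results from the calling thread only."""
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            account = tasks.get()
            if account is STOP:
                results.put(STOP)  # report that this worker has finished
                return
            results.put(fetch(account))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for account in accounts:
        tasks.put(account)
    for _ in threads:
        tasks.put(STOP)  # one sentinel per worker

    finished = 0
    while finished < n_workers:
        item = results.get()
        if item is STOP:
            finished += 1
            continue
        # Only this thread touches the connection, so writes never overlap.
        conn.execute(
            "INSERT OR REPLACE INTO users (account, nickname) VALUES (?, ?)", item
        )
    conn.commit()
    for thread in threads:
        thread.join()

def toy_fetch(account):
    # Hypothetical stand-in for downloading and parsing one profile.
    return (account, "nick-" + account)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (account TEXT PRIMARY KEY, nickname TEXT)")
run_pipeline(["1001", "1002", "1003", "1004"], toy_fetch, conn)
rows = conn.execute("SELECT account, nickname FROM users ORDER BY account").fetchall()
print(rows)
```

Funnelling all writes through one stage trades some throughput for simplicity; the slow part of a crawler is usually the network, so the fetch stage is where extra threads tend to help.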

I am also looking for everyone's clever ideas on whether this data could be turned into an interesting website or app. It can be just for fun or make a little money, as long as it is not illegal.
