Everyone knows the background: it's Chinese New Year, and red envelopes are flying everywhere. I learned Python just two days ago and was quite excited, so I studied how to grab Weibo red envelopes. Why Weibo red envelopes instead of Alipay red envelopes? Because I only know the Web; if I have energy left over, I may study the whack-a-mole algorithm later. Since I'm a Python beginner and this is only the third program I've written since learning the language, please don't point out the bad parts of the code. The key is the idea. Well, if there are bad parts in the idea, please don't point those out either. Look, IE has the nerve to set itself as the default browser, so surely I can get away with showing off a crappy article. I use Python 2.7; I hear there are big differences between Python 2 and Python 3, so friends who know even less than me should take note.

0x01 Thoughts

I'm too lazy to describe the idea in words, so I drew a sketch that I think you can understand. First, as usual, import a bunch of libraries that I don't really understand but can't do without:

```python
import re
import urllib
import urllib2
import cookielib
import base64
import binascii
import os
import json
import sys
import cPickle as p
import rsa
```
Then declare a few variables that will be used later:

```python
reload(sys)
sys.setdefaultencoding('utf-8')  # set the default character encoding to utf-8
luckyList = []  # the red envelope list
lowest = 10     # the lowest recorded payout we are willing to tolerate
```
The rsa library is used here. Python does not ship with it, so you need to install it from https://pypi.python.org/pypi/rsa/ — after downloading, run setup.py install, and then we can start development.

0x02 Weibo login

Grabbing a red envelope is only possible after logging in, so there must be a login function. Logging in itself is not the key part; the key is preserving cookies, which requires the cooperation of cookielib:

```python
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
```
This way, every request made through opener carries and updates the cookie state. I don't fully understand it, but it feels magical. Next we wrap two helper modules: one to simply GET data, the other to POST data. They differ by only a few parameters and could be merged into one function, but I'm lazy and stupid, and I don't want to and can't change the code.

```python
def getData(url):
    try:
        req = urllib2.Request(url)
        result = opener.open(req)
        text = result.read()
        text = text.decode("utf-8").encode("gbk", 'ignore')
        return text
    except Exception, e:
        print u'Request exception, url:' + url
        print e

def postData(url, data, header):
    try:
        data = urllib.urlencode(data)
        req = urllib2.Request(url, data, header)
        result = opener.open(req)
        text = result.read()
        return text
    except Exception, e:
        print u'Request exception, url:' + url
```
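A quick sanity check that the jar really is shared across calls (the URL here is a placeholder, not part of the project):

```python
# Any cookies set by the first response are stored in cj and replayed
# automatically on the second request, because both go through opener.
getData('http://weibo.com/')   # placeholder URL
getData('http://weibo.com/')   # same jar, so stored cookies are sent back
print len(cj)                  # how many cookies the jar currently holds
```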
With these two helpers we can GET and POST data. getData decodes and then re-encodes the text because the output was always garbled when I debugged under Win7, hence the encoding juggling. None of that is the point, though; the login function below is the core of the Weibo login.

```python
def login(nick, pwd):
    print u"----------Logging in----------"
    print "----------......----------"
    prelogin_url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=%s&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.15)&_=1400822309846' % nick
    preLogin = getData(prelogin_url)
    servertime = re.findall('"servertime":(.+?),', preLogin)[0]
    pubkey = re.findall('"pubkey":"(.+?)",', preLogin)[0]
    rsakv = re.findall('"rsakv":"(.+?)",', preLogin)[0]
    nonce = re.findall('"nonce":"(.+?)",', preLogin)[0]
    su = base64.b64encode(urllib.quote(nick))
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(pwd)
    sp = binascii.b2a_hex(rsa.encrypt(message, key))
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}
    param = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': su,
        'service': 'miniblog',
        'servertime': servertime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'sp': sp,
        'encoding': 'UTF-8',
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META',
        'rsakv': rsakv,
    }
    s = postData('http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)', param, header)

    try:
        urll = re.findall("location.replace\(\'(.+?)\'\);", s)[0]
        login = getData(urll)
        print u"---------Login successful!-------"
        print "----------......----------"
    except Exception, e:
        print u"---------Login failed!-------"
        print "----------......----------"
        exit(0)
```
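The only genuinely opaque step above is the password encryption, so here it is in isolation. This is just a sketch: a freshly generated throwaway key stands in for Sina's real public key, and the servertime/nonce/password values are made up.

```python
import binascii
import rsa

servertime = '1423000000'   # dummy; really comes from the prelogin response
nonce = 'ABCDEF'            # dummy; really comes from the prelogin response
pwd = 'my_password'         # dummy password

# The real code builds the key as rsa.PublicKey(int(pubkey, 16), 65537);
# a throwaway keypair is used here so the sketch actually runs.
key, _priv = rsa.newkeys(1024)
message = str(servertime) + '\t' + str(nonce) + '\n' + str(pwd)  # the exact layout Sina expects
sp = binascii.b2a_hex(rsa.encrypt(message, key))  # hex ciphertext -> the 'sp' POST field
print sp
```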
The parameters and the encryption algorithm here are all copied from the Internet, and I don't really understand them. Roughly: first request a timestamp and public key, encrypt the password material with RSA, then submit everything to the Sina login interface. On success, Sina returns a Weibo URL that must be requested once more for the login state to fully take effect; after that, every subsequent request carries the current user's cookie.

0x03 Grabbing a single red envelope

Having logged into Weibo, I couldn't wait to find a red envelope to try — in the browser first, of course. After much clicking around I finally landed on a page with a grab button, pressed F12 to summon the debugger, and watched how the request was made. The requested address is http://huodong.weibo.com/aj_hongbao/getlucky, with two main parameters: ouid, the red envelope ID (visible in the URL), and share, which controls whether the result is shared to Weibo. There is also a _t, whose purpose I don't know. So in theory, submitting those three parameters to that URL completes the draw. In practice, however, the server magically replies with a string like this:

```
{"code": 303403, "msg": "Sorry, you do not have permission to access this page", "data": []}
```
Don't panic. Based on my many years of web development experience, the other side's programmer is probably checking the Referer. The fix is simple: copy all the headers from the browser's request.

```python
def getLucky(id):  # draw a red envelope
    print u"---Drawing red envelope:" + str(id) + "---"
    print "----------......----------"

    if checkValue(id) == False:  # doesn't meet the conditions (function defined below)
        return
    luckyUrl = "http://huodong.weibo.com/aj_hongbao/getlucky"
    param = {
        'ouid': id,
        'share': 0,
        '_t': 0
    }

    header = {
        'Cache-Control': 'no-cache',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Origin': 'http://huodong.weibo.com',
        'Pragma': 'no-cache',
        'Referer': 'http://huodong.weibo.com/hongbao/' + str(id),
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 BIDUBrowser/6.x Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    res = postData(luckyUrl, param, header)
```
In theory there is no problem, and in fact there was none. Once the draw completes we need to check the status: the returned res is a JSON string in which code 100000 means success and 901114 means today's draw limit has been reached; any other value is also a failure. So:

```python
    # (continuing inside getLucky, after res = postData(...))
    hbRes = json.loads(res)
    if str(hbRes["code"]) == '901114':  # today's draws are used up (the code may come back as a number, hence str())
        print u"---------The upper limit has been reached---------"
        print "----------......----------"
        log('lucky', str(id) + '---' + str(hbRes["code"]) + '---' + hbRes["data"]["title"])
        exit(0)
    elif str(hbRes["code"]) == '100000':  # success
        print u"---------Congratulations on your prosperity---------"
        print "----------......----------"
        log('success', str(id) + '---' + res)
        exit(0)

    if hbRes["data"] and hbRes["data"]["title"]:
        print hbRes["data"]["title"]
        print "----------......----------"
        log('lucky', str(id) + '---' + str(hbRes["code"]) + '---' + hbRes["data"]["title"])
    else:
        print u"---------Request error---------"
        print "----------......----------"
        log('lucky', str(id) + '---' + res)
```
Here, log is a little helper I wrote to record logs:

```python
def log(type, text):
    fp = open(type + '.txt', 'a')  # append to e.g. lucky.txt / success.txt
    fp.write(text)
    fp.write('\r\n')
    fp.close()
```
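For example (the ID and title below are made up), this appends one CRLF-terminated line to lucky.txt, creating the file on first use:

```python
log('lucky', '1234567890---901114---some red envelope title')  # invented values
```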
0x04 Crawling the red envelope list

With the single-envelope grab tested successfully, we reach the core module of the program: crawling the red envelope list. There must be many ways in, such as searching Weibo for various keywords, but I use the simplest one here: crawling the lists on the red envelope activity homepage (http://huodong.weibo.com/hongbao), from which everything is reachable by clicking around. Although the lists contain many links, they boil down to two kinds (ignoring the richest-envelope list): theme pages and ranking pages. Summon F12 again and analyze the two page formats. First the theme lists, for example http://huodong.weibo.com/hongbao/special_quyu. The red envelope information all sits in divs with the class name info_wrap, so we just fetch the page source, grab every info_wrap, and do a little processing to get that page's red envelope list. This calls for some regular expressions:

```python
def getThemeList(url, p):  # theme red envelopes
    print u"---------Page " + str(p) + "---------"
    print "----------......----------"
    html = getData(url + '?p=' + str(p))
    # NOTE: the HTML literals inside these two patterns were mangled in
    # transcription; the tag structure below is a best guess built around
    # the info_wrap class and the four capture groups the code relies on.
    pWrap = re.compile(r'<div class="info_wrap">(.+?)</div>', re.DOTALL)  # every info_wrap div
    pInfo = re.compile(r'.+?<span.+?>(.+?)</span>.+?<span.+?>(.+?)</span>.+?<span.+?>(.+?)</span>.+?href="(.+?)" class="btn"', re.DOTALL)  # cash, gift value, sent count, link
    List = pWrap.findall(html)
    n = len(List)
    if n == 0:
        return
    for i in range(n):  # walk every info_wrap div
        s = pInfo.match(List[i])  # extract the red envelope info
        info = list(s.groups(0))
        info[0] = float(info[0].replace('\xcd\xf2', '0000'))  # cash: GBK 万 (10,000) -> '0000'
        try:
            info[1] = float(info[1].replace('\xcd\xf2', '0000'))  # gift value: 万 -> '0000'
        except Exception, e:
            info[1] = float(info[1].replace('\xd2\xda', '00000000'))  # gift value: GBK 亿 (100 million) -> '00000000'
        info[2] = float(info[2].replace('\xcd\xf2', '0000'))  # number sent: 万 -> '0000'
        if info[2] == 0:
            info[2] = 1  # prevent division by zero
        if info[1] == 0:
            info[1] = 1  # prevent division by zero
        info.append(info[0] / (info[2] + info[1]))  # envelope value = cash / (recipients + gift value)
        luckyList.append(info)
    if 'class="page"' in html:  # a next page exists
        p = p + 1
        getThemeList(url, p)  # recurse to crawl the next page
```
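Those odd byte strings are just GBK-encoded Chinese units (remember getData re-encodes pages to GBK). The conversion trick is easier to see in isolation — and note that it only works cleanly for whole numbers:

```python
print float('3\xcd\xf2'.replace('\xcd\xf2', '0000'))      # '3万'   -> '30000'     -> 30000.0
print float('2\xd2\xda'.replace('\xd2\xda', '00000000'))  # '2亿'   -> '200000000' -> 200000000.0
print float('1.5\xcd\xf2'.replace('\xcd\xf2', '0000'))    # '1.5万' -> '1.50000'   -> 1.5 (not 15000!)
```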
Regular expressions are hard; it took me a long time to learn enough to write those two patterns. Note the extra info[4] appended to each entry: it's a rough measure I invented for how valuable a red envelope is. Why bother? Because there are far more envelopes than the four draws we get, so in this vast sea of red envelopes we must find the most valuable ones first. Three numbers are available for reference: cash amount, gift value, and number of recipients. Clearly, if the cash is small but the recipients are many, or the nominal gift value is absurdly high (some run to the billions), it isn't worth grabbing. So after much head-scratching I arrived at a weighting: envelope value = cash / (number of recipients + gift value). A worked example follows the next code block. The ranking pages work on the same principle: find the key tags and match them with regular expressions.

```python
def getTopList(url, daily, p):  # ranking-list red envelopes
    print u"---------Page " + str(p) + "---------"
    print "----------......----------"
    html = getData(url + '?daily=' + str(daily) + '&p=' + str(p))
    # Same caveat as getThemeList: the HTML literals in these patterns were
    # mangled in transcription and are reconstructed around the list_info class.
    pWrap = re.compile(r'<div class="list_info">(.+?)</div>', re.DOTALL)  # every list_info div
    pInfo = re.compile(r'.+?<span.+?>(.+?)</span>.+?<span.+?>(.+?)</span>.+?<span.+?>(.+?)</span>.+?href="(.+?)" class="btn rob_btn"', re.DOTALL)  # count, cash, gift value, link
    List = pWrap.findall(html)
    n = len(List)
    if n == 0:
        return
    for i in range(n):  # walk every list_info div
        s = pInfo.match(List[i])  # extract the red envelope info
        topinfo = list(s.groups(0))
        info = list(topinfo)
        info[0] = topinfo[1].replace('\xd4\xaa', '')  # strip GBK 元 (yuan)
        info[0] = float(info[0].replace('\xcd\xf2', '0000'))  # cash: 万 -> '0000'
        info[1] = topinfo[2].replace('\xd4\xaa', '')  # strip 元
        try:
            info[1] = float(info[1].replace('\xcd\xf2', '0000'))  # gift value: 万 -> '0000'
        except Exception, e:
            info[1] = float(info[1].replace('\xd2\xda', '00000000'))  # gift value: 亿 -> '00000000'
        info[2] = topinfo[0].replace('\xb8\xf6', '')  # strip GBK 个 (counter word)
        info[2] = float(info[2].replace('\xcd\xf2', '0000'))  # number sent: 万 -> '0000'
        if info[2] == 0:
            info[2] = 1  # prevent division by zero
        if info[1] == 0:
            info[1] = 1  # prevent division by zero
        info.append(info[0] / (info[2] + info[1]))  # envelope value = cash / (recipients + gift value)
        luckyList.append(info)
    if 'class="page"' in html:  # a next page exists
        p = p + 1
        getTopList(url, daily, p)  # recurse to crawl the next page
```
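To make the scoring concrete, here it is applied to two made-up envelopes; all numbers are invented for illustration:

```python
# envelope value = cash / (recipients + gift value); bigger is better
candidates = [
    [10000.0, 1.0, 50.0],          # 10k cash, no real gifts, few recipients
    [500.0, 100000000.0, 9000.0],  # little cash, "gift value" inflated into the hundreds of millions
]
for cash, gift, sent in candidates:
    print cash / (sent + gift)
# ~196.1 for the first, ~0.000005 for the second: the inflated one is worthless
```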
OK, now we can crawl both kinds of list pages. Next we need the list of lists, that is, the collection of all those list addresses, and then crawl them one by one:

```python
def getList():
    print u"---------Search target---------"
    print "----------......----------"

    themeUrl = {  # theme lists
        'theme': 'http://huodong.weibo.com/hongbao/theme',
        'pinpai': 'http://huodong.weibo.com/hongbao/special_pinpai',
        'daka': 'http://huodong.weibo.com/hongbao/special_daka',
        'youxuan': 'http://huodong.weibo.com/hongbao/special_youxuan',
        'qiye': 'http://huodong.weibo.com/hongbao/special_qiye',
        'quyu': 'http://huodong.weibo.com/hongbao/special_quyu',
        'meiti': 'http://huodong.weibo.com/hongbao/special_meiti',
        'hezuo': 'http://huodong.weibo.com/hongbao/special_hezuo'
    }

    topUrl = {  # ranking lists
        'mostmoney': 'http://huodong.weibo.com/hongbao/top_mostmoney',
        'mostsend': 'http://huodong.weibo.com/hongbao/top_mostsend',
        'mostsenddaka': 'http://huodong.weibo.com/hongbao/top_mostsenddaka',
        'mostsendpartner': 'http://huodong.weibo.com/hongbao/top_mostsendpartner',
        'cate': 'http://huodong.weibo.com/hongbao/cate?type=',
        'clothes': 'http://huodong.weibo.com/hongbao/cate?type=clothes',
        'beauty': 'http://huodong.weibo.com/hongbao/cate?type=beauty',
        'fast': 'http://huodong.weibo.com/hongbao/cate?type=fast',
        'life': 'http://huodong.weibo.com/hongbao/cate?type=life',
        'digital': 'http://huodong.weibo.com/hongbao/cate?type=digital',
        'other': 'http://huodong.weibo.com/hongbao/cate?type=other'
    }

    for (theme, url) in themeUrl.items():
        print "----------" + theme + "----------"
        print url
        print "----------......----------"
        getThemeList(url, 1)

    for (top, url) in topUrl.items():
        print "----------" + top + "----------"
        print url
        print "----------......----------"
        getTopList(url, 0, 1)
        getTopList(url, 1, 1)
```
0x05 Determining whether a red envelope is worth grabbing

This part is relatively simple. First search the page source for the keyword that marks a grab button, then look at the receiving ranking to see the highest payout on record. If the biggest amount anyone has received is only a few yuan, then bye-bye... The record for a red envelope can be viewed at http://huodong.weibo.com/aj_hongbao/detailmore?page=1&type=2&_t=0&__rnd=1423744829265&uid=<red envelope id>:

```python
def checkValue(id):
    infoUrl = 'http://huodong.weibo.com/hongbao/' + str(id)
    html = getData(infoUrl)

    if 'action-type="lottery"' in html or True:  # the grab button exists ("or True" bypasses the check; left in from testing)
        logUrl = "http://huodong.weibo.com/aj_hongbao/detailmore?page=1&type=2&_t=0&__rnd=1423744829265&uid=" + id  # ranking data
        param = {}
        header = {
            'Cache-Control': 'no-cache',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Pragma': 'no-cache',
            'Referer': 'http://huodong.weibo.com/hongbao/detail?uid=' + str(id),
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 BIDUBrowser/6.x Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }
        res = postData(logUrl, param, header)
        pMoney = re.compile(r'<span class="money">(\d+?.+?)\xd4\xaa</span>', re.DOTALL)  # payout amounts (GBK 元)
        luckyLog = pMoney.findall(res)  # search the ranking response, not the envelope page

        if len(luckyLog) == 0:
            maxMoney = 0
        else:
            maxMoney = float(luckyLog[0])

        if maxMoney < lowest:  # the biggest payout on record is below our threshold
            return False
    else:
        print u"---------One step slower---------"
        print "----------......----------"
        return False
    return True
```
0x06 Finishing work

The main modules are done; now we string all the steps together:

```python
def start(username, password, low, fromfile):
    global lowest  # without this, the assignment below would only create a local
    gl = False
    lowest = low
    login(username, password)
    if fromfile == 'y':
        if os.path.exists('luckyList.txt'):
            try:
                f = file('luckyList.txt')
                newList = []
                newList = p.load(f)
                print u'---------Loading list---------'
                print "----------......----------"
            except Exception, e:
                print u'Parsing the local list failed, crawling the online pages.'
                print "----------......----------"
                gl = True
        else:
            print u'luckyList.txt does not exist locally, crawling the online pages.'
            print "----------......----------"
            gl = True
    else:
        gl = True  # user chose not to use the cached list, so crawl fresh
    if gl == True:
        getList()
        from operator import itemgetter
        newList = sorted(luckyList, key=itemgetter(4), reverse=True)  # most valuable first
        f = file('luckyList.txt', 'w')
        p.dump(newList, f)  # save the crawled list so we don't have to crawl again next time
        f.close()

    for lucky in newList:
        if not 'http://huodong.weibo.com' in lucky[3]:  # not a red envelope link
            continue
        print lucky[3]
        id = re.findall(r'(\w*[0-9]+)\w*', lucky[3])  # pull the numeric envelope ID out of the URL
        getLucky(id[0])
```
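One thing worth spelling out: each luckyList entry is a flat list, which is why the code above sorts on index 4 and reads the URL from index 3. A made-up example entry:

```python
# [cash, gift_value, sent_count, grab_url, score]  -- values invented for illustration
entry = [10000.0, 1.0, 50.0, 'http://huodong.weibo.com/hongbao/1234567890', 196.08]
```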
Because re-crawling the whole red envelope list on every test run is a pain, I added code to dump the finished list to a file, so later runs can read the local list and go straight to grabbing. With start() built, all that's left is an entry point that feeds it a Weibo account:

```python
if __name__ == "__main__":
    print u"------------------Weibo Red Packet Assistant------------------"
    print "---------------------v0.0.1---------------------"
    print u"-------------by @All-powerful Soul Master----------------"
    print "-------------------------------------------------"

    try:
        uname = raw_input(u"Please enter your Weibo account: ".decode('utf-8').encode('gbk'))
        pwd = raw_input(u"Please enter your Weibo password: ".decode('utf-8').encode('gbk'))
        low = int(raw_input(u"Participate when the maximum cash received by the red envelope is greater than n: ".decode('utf-8').encode('gbk')))
        fromfile = raw_input(u"Do you want to use the red envelope list in luckyList.txt? (y/n) ".decode('utf-8').encode('gbk'))
    except Exception, e:
        print u"Parameter error"
        print "----------......----------"
        print e
        exit(0)

    print u"---------Program starts---------"
    print "----------......----------"
    start(uname, pwd, low, fromfile)
    print u"---------Program ends---------"
    print "----------......----------"
    os.system('pause')
```
0x07 Go away!

The basic crawler skeleton is complete. There is still plenty of room for improvement in the details: supporting batch login, refining the red envelope value algorithm, and no doubt many spots in the code itself could be optimized, but this is as far as my ability takes me. You've all seen the result: I wrote hundreds of lines of code and thousands of words, and all I got in return was a couple of Shuangseqiu (double-color ball) lottery tickets. What a rip-off! How could it be lottery tickets? (Aside: the author grew more and more agitated as he spoke and actually started crying. People nearby tried to console him: "Brother, it's not that serious. It's just a Weibo red envelope. I shook my phone until my hands hurt yesterday and didn't get a single WeChat red envelope.") Sigh. Actually, that's not why I'm crying. I'm sad because I'm already in my twenties and still doing something as pointless as writing a program to grab Weibo red envelopes. This is not the life I want at all!

Source code download: http://download..com/data/1984536