Fetching HTML on iOS and parsing the data with CSS/XPath

In the past, we used AFN (AFNetworking) to get JSON data, for example from http://news-at.zhihu.com/api/4/news/latest

However, some sites, such as Baidu Tieba and Douban Reading below, do not provide an API for obtaining their data.

Baidu Tieba:

Baidu Tieba data.png

Douban Reading:

Douban reading data.png

In such cases, we can parse their HTML to get the data we want.

Tool Preparation

We need two tools: Firefox and FireBug.

You can download the Firefox browser from http://www.firefox.com.cn/download/, and then install the FireBug plug-in from the Add-ons Manager in the menu in the upper right corner.

FireBug has powerful JavaScript debugging capabilities and can also edit HTML and CSS in real time. It is a favorite tool of front-end developers.

After installing it, click the bug icon in the upper right corner to debug the current web page with FireBug.

If you don't know XPath, you can learn from w3school's tutorial.

Open FireBug.png

Ono Open Source Library

Ono is an open source project on GitHub that helps us parse XML and HTML, and supports searching for specific nodes with CSS selectors and XPath.

You may not have heard of this library, but you certainly know its author: Mattt Thompson, the author of AFN and of the blog NSHipster.

A similar open source library for Swift is Ji.

For Java or Android, you can use Jsoup.

Getting started

All the preparations are done, so let's start coding. Create a new blank project. Note that to allow plain HTTP requests, you need to add two entries to Info.plist: App Transport Security Settings, and under it Allow Arbitrary Loads set to YES.

App allows Http.png
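When Info.plist is opened as source code rather than in the property list editor, the two entries described above correspond to the raw keys shown in this fragment. It is a sketch of the relevant part only, not a complete Info.plist. Keep in mind that this setting turns off App Transport Security for every domain the app talks to.

```xml
<!-- Raw form of "App Transport Security Settings" > "Allow Arbitrary Loads" = YES -->
<key>NSAppTransportSecurity</key>
<dict>
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>
```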

Then use CocoaPods to add the third-party library: pod 'Ono'.

The HTML we will parse here comes from my blog.

Create a Post class that inherits from NSObject to represent each article. Modify the .h file as follows:

    #import <Foundation/Foundation.h>

    @class ONOXMLElement;

    @interface Post : NSObject

    @property (copy, nonatomic) NSString *title;    // Article title
    @property (copy, nonatomic) NSString *postDate; // Article publication date
    @property (copy, nonatomic) NSString *postUrl;  // URL of the article content

    + (NSArray *)getNewPosts; // Get all articles
    + (instancetype)postWithHtmlStr:(ONOXMLElement *)element; // Create a Post from an HTML node

    @end

Import Ono in the .m file and add a constant for the URL:

    #import <Ono/Ono.h>

    static NSString *const kUrlStr = @"http://BigPi.me";

Next we download the HTML of this URL (with AFN or, as in the code below, with NSData) and then use XPath to locate the node that represents each article.

First open Firefox and FireBug, and click the element selector shown in the picture below:

FireBug element selector.png

Move the mouse over the web page and click to select an article.

Post data.png

FireBug's HTML tree now expands, and we can see that each article's data is contained in its own tag.

Right-click the parent node of these tags and copy its XPath:

Copy XPath.png

The copied result is //*[@id="posts"]. Each child node under this node represents an article.

Now let's use this XPath to get all the HTML data. Add the following method in Post.m:

    + (NSArray *)getNewPosts {
        NSMutableArray *array = [NSMutableArray array];

        // Download the web page data (synchronous: blocks the calling thread)
        NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:kUrlStr]];

        NSError *error;
        ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithData:data error:&error];

        // Find the HTML node matched by the XPath
        ONOXMLElement *postsParentElement = [doc firstChildWithXPath:@"//*[@id='posts']"];

        // Traverse its child nodes
        [postsParentElement.children enumerateObjectsUsingBlock:^(ONOXMLElement *element, NSUInteger idx, BOOL * _Nonnull stop) {
            NSLog(@"%@", element);
        }];

        return array;
    }

And call this method in ViewController.m:

    @implementation ViewController

    - (void)viewDidLoad {
        [super viewDidLoad];
        [Post getNewPosts];
    }

    @end
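One thing to keep in mind: `dataWithContentsOfURL:` blocks the calling thread until the download finishes, so calling `getNewPosts` from `viewDidLoad` freezes the UI while the page loads. Below is a minimal sketch of an asynchronous download using `NSURLSession`. The helper name `FetchHTMLData` is hypothetical and is not part of the demo project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original demo): downloads the data at
// `url` without blocking the caller. With the shared session, the completion
// block runs on a background queue, so hop to the main queue before touching UI.
static void FetchHTMLData(NSURL *url, void (^completion)(NSData *data, NSError *error)) {
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            // Pass the raw bytes (or the error) on to the caller.
            completion(data, error);
        }];
    [task resume]; // Tasks are created suspended; resume starts the request.
}
```

The data passed to the completion block could then be handed to `HTMLDocumentWithData:error:` exactly as in `getNewPosts`, with the resulting array delivered to the UI through `dispatch_async(dispatch_get_main_queue(), ...)`.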

Run the app and check the console: we can already get the HTML of each article. Next we will parse the specific data of each article.

Switch to FireBug and expand the node of one of the articles:

Article HTML node.png

We can see that under the <h2 class="title"> node, the a tag

    <a href="/post/jazzhands/jazzhands-yuan-ma-shi-xian-fen-xi">
      <i class="fa fa-leaf"></i>
      JazzHands source code implementation analysis
    </a>

contains the article's URL and title.

Under the <div class="info"> node, the span tag

    <span class="date">
      <i class="fa fa-clock-o"></i>
      2016-03-04 21:39
    </span>

holds the time the article was published. We could right-click these nodes and copy their XPaths, just as we did for the parent node.

Here, however, we will use XPaths relative to each article node instead.

The HTML structure of each article is as follows:

  • Article title, URL, etc.

So, relative to each article node, we have:

  • Article title XPath: "h2/a" (the text of the a tag)
  • Article URL: the href attribute value of that a tag
  • Article publication time XPath: "div[2]/span[1]"

Next, let's parse the detailed data of each article.

Add the following method to Post.m:

    + (instancetype)postWithHtmlStr:(ONOXMLElement *)element {
        // Get the a tag containing the article title via the relative XPath
        ONOXMLElement *titleElement = [element firstChildWithXPath:@"h2/a"];
        if (!titleElement) {
            return nil; // Not an article node; the caller skips nil results
        }

        Post *p = [Post new];
        p.postUrl = [titleElement valueForAttribute:@"href"]; // href attribute of the a tag
        p.title = [titleElement stringValue];                 // Text content of the a tag

        // Get the span tag that holds the publication time
        ONOXMLElement *dateElement = [element firstChildWithXPath:@"div[2]/span[1]"];
        p.postDate = [dateElement stringValue];

        return p;
    }

Then modify the +(NSArray*)getNewPosts method as follows:

    ...
    [postsParentElement.children enumerateObjectsUsingBlock:^(ONOXMLElement *element, NSUInteger idx, BOOL * _Nonnull stop) {
        // NSLog(@"%@", element);
        Post *post = [Post postWithHtmlStr:element];
        if (post) {
            [array addObject:post];
        }
    }];
    ...
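To check the relative XPaths without hitting the network, the same lookups can be run against an inline HTML string. The sketch below assumes Ono is installed through CocoaPods. The sample markup and the function name `ParseSamplePosts` are hypothetical; the markup only mimics the structure described above, and the real page may differ.

```objc
#import <Foundation/Foundation.h>
#import <Ono/Ono.h>

// Hypothetical sample markup: one article node under the #posts parent, with
// the date span inside the second div child so that "div[2]/span[1]" matches.
static NSString *const kSampleHTML =
    @"<div id='posts'>"
     "<div class='post'>"
     "<h2 class='title'><a href='/post/a'><i class='fa fa-leaf'></i>First post</a></h2>"
     "<div class='summary'>Summary text</div>"
     "<div class='info'><span class='date'>2016-03-04 21:39</span></div>"
     "</div>"
     "</div>";

// Returns one dictionary per article node, or nil if the HTML cannot be parsed.
static NSArray<NSDictionary *> *ParseSamplePosts(void) {
    NSError *error = nil;
    ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithString:kSampleHTML
                                                         encoding:NSUTF8StringEncoding
                                                            error:&error];
    if (!doc) {
        NSLog(@"Parse failed: %@", error);
        return nil;
    }

    // Absolute XPath for the parent, relative XPaths for each child.
    ONOXMLElement *parent = [doc firstChildWithXPath:@"//*[@id='posts']"];
    NSMutableArray<NSDictionary *> *posts = [NSMutableArray array];
    for (ONOXMLElement *child in parent.children) {
        ONOXMLElement *link = [child firstChildWithXPath:@"h2/a"];
        ONOXMLElement *date = [child firstChildWithXPath:@"div[2]/span[1]"];
        if (!link || !date) {
            continue; // Skip children that do not look like an article
        }
        [posts addObject:@{
            @"title": [link stringValue] ?: @"",
            @"href": [link valueForAttribute:@"href"] ?: @"",
            @"date": [date stringValue] ?: @""
        }];
    }
    return posts;
}
```

If the site's markup changes, editing `kSampleHTML` to match the new structure gives a quick way to see which XPath stopped matching.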

Finally, the article URL we get from the HTML is a relative URL, such as

/post/jazzhands/jazzhands-yuan-ma-shi-xian-fen-xi

So we prepend the domain name, http://BigPi.me, in the setter method:

    - (void)setPostUrl:(NSString *)postUrl {
        // Guard against nil: stringByAppendingString: raises an exception for a nil argument
        _postUrl = postUrl ? [kUrlStr stringByAppendingString:postUrl] : nil;
    }
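String concatenation works as long as every href starts with a slash. As an alternative sketch, `NSURL` can resolve the href against the base URL, which also copes with hrefs that are already absolute. The helper name `AbsoluteURLString` is hypothetical and is not part of the demo project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: resolves `href` against `base` and returns the absolute
// URL as a string, or nil if either input is empty or cannot be parsed.
static NSString *AbsoluteURLString(NSString *href, NSString *base) {
    if (href.length == 0 || base.length == 0) {
        return nil; // Nothing to resolve
    }
    NSURL *baseURL = [NSURL URLWithString:base];
    // A relative href is resolved against baseURL; an absolute href keeps its own host.
    NSURL *resolved = [NSURL URLWithString:href relativeToURL:baseURL];
    return resolved.absoluteString;
}
```

The setter could then store `AbsoluteURLString(postUrl, kUrlStr)` instead of concatenating the two strings.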

Set a breakpoint at the position shown below to view the results:

Code breakpoint.png

Running it, the results are as follows:

Crawl article data results.png

With that, we can use FireBug, Ono, and XPath to parse HTML data.

I used this method to fetch the HTML of our school's academic management system and built an app that tallies grades and calculates GPA.

Additional notes

  • FireBug is a very powerful front-end debugging tool.
  • You can also use regular expressions to parse HTML data. However, according to a well-known StackOverflow discussion, parsing HTML with regular expressions is not recommended.
  • Ray Wenderlich has an older tutorial that uses a similar technique to parse HTML.
  • Finally, it is very important to note that HTML data may change frequently, especially if the web page is not something we can manage ourselves, so XPath parsing may fail at any time.
  • If you must use XPath to parse HTML data, you can do this on the server side, then modify it to an API, and let the mobile side GET JSON data as before.
  • At the same time, the server can also set exception handling, caching and other strategies.

The demo for this article can be found at https://github.com/iShawnWang/BlogDemo/tree/master/ParseHTMLDemo
