Scraping HTML on iOS and Parsing the Data with CSS and XPath

Previously, we used AFN (AFNetworking) to fetch JSON data from an API, for example http://news-at.zhihu.com/api/4/news/latest.

However, some sites, such as Baidu Tieba and Douban Reading below, do not provide an API for obtaining their data.

Baidu Tieba:

Baidu Tieba data.png

Douban Reading:

Douban reading data.png

At this point we can parse their HTML to get the data we want.

Tool Preparation

We need two tools: Firefox and Firebug.

You can download Firefox from http://www.firefox.com.cn/download/, and then install the Firebug plug-in from the Add-ons Manager, which is in the menu in the upper right corner.

Firebug has powerful JavaScript debugging capabilities and can also edit HTML and CSS in real time. It is a favorite tool of front-end developers.

After installing it, click the bug icon in the upper right corner to debug the current web page with Firebug.

If you don't know XPath, you can learn it from the w3school tutorial.

Open FireBug.png

Ono Open Source Library

Ono is an open source project on GitHub that helps us parse XML and HTML, and supports both CSS selectors and XPath for finding specific nodes.

You may not have heard of this library, but you certainly know its author: Mattt Thompson, the author of AFN and of the blog NSHipster.

For Swift, there is a similar open source library, Ji.

For Java or Android, you can use Jsoup.

Getting Started

All preparations are done, so let's start coding. Create a new blank project. Note that to allow plain HTTP requests, you need to add App Transport Security Settings to Info.plist and set Allow Arbitrary Loads to YES under it.

App allows Http.png
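If you open Info.plist as source code instead of using the property list editor, the two rows described above correspond to the fragment below. This is only a sketch of the relevant keys, which belong inside the top-level dict of the file. Keep in mind that this setting turns off App Transport Security for every domain, not just the one being scraped.

```xml
<!-- App Transport Security Settings -->
<key>NSAppTransportSecurity</key>
<dict>
    <!-- Allow Arbitrary Loads = YES -->
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>
```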

Then use CocoaPods to add the third-party library: pod 'Ono'.

The HTML we are going to parse here is my blog.

Create a Post class that inherits from NSObject to represent each article. Modify the .h file as follows:

 #import <Foundation/Foundation.h>

 @class ONOXMLElement;

 @interface Post : NSObject

 @property (copy, nonatomic) NSString *title;    // Article title
 @property (copy, nonatomic) NSString *postDate; // Article publication date
 @property (copy, nonatomic) NSString *postUrl;  // URL of the article content

 + (NSArray *)getNewPosts;                                 // Get all articles
 + (instancetype)postWithHtmlStr:(ONOXMLElement *)element; // Create a Post from an HTML node

 @end

Import Ono in the .m file and add a constant for the URL.

 #import "Post.h"
 #import <Ono.h> // The header path may differ depending on how Ono is integrated

 static NSString *const kUrlStr = @"http://BigPi.me";

Then we can download the HTML of that URL and use XPath to locate the nodes that represent each article.

First open Firefox and Firebug, then click the element selector button shown below.

FireBug element selector.png

Move the mouse appropriately and click to select an article on the web page.

Post data.png

At this point the HTML tree in Firebug is expanded, and we can see that the data for each article sits in its own tag.

Right-click the parent node that contains these tags and copy its XPath.

Copy XPath.png

The copied result is //*[@id="posts"]. Each child node under this node represents an article.

Now let's use this XPath to get all the HTML data. Add the following method in Post.m:

 + (NSArray *)getNewPosts {
     NSMutableArray *array = [NSMutableArray array];

     // Download the web page data (synchronous, so it blocks the calling thread)
     NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:kUrlStr]];
     if (!data) {
         return array; // Download failed, nothing to parse
     }

     NSError *error;
     ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithData:data error:&error];

     // Find the HTML node that the XPath points to
     ONOXMLElement *postsParentElement = [doc firstChildWithXPath:@"//*[@id='posts']"];

     // Traverse its child nodes
     [postsParentElement.children enumerateObjectsUsingBlock:^(ONOXMLElement *element, NSUInteger idx, BOOL * _Nonnull stop) {
         NSLog(@"%@", element);
     }];

     return array;
 }
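The method above downloads the page synchronously, which is fine for a demo but would freeze the UI if it is called on the main thread. As a rough sketch of one alternative, the download could be moved to NSURLSession. The helper name BPFetchHTML is hypothetical and not part of the original demo.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original demo: fetches a page without
// blocking the calling thread and reports the result on the main queue.
static void BPFetchHTML(NSString *urlString,
                        void (^completion)(NSData *data, NSError *error)) {
    NSURL *url = urlString.length > 0 ? [NSURL URLWithString:urlString] : nil;
    if (url == nil) {
        // Report a bad URL right away instead of starting a request.
        completion(nil, [NSError errorWithDomain:NSURLErrorDomain
                                            code:NSURLErrorBadURL
                                        userInfo:nil]);
        return;
    }

    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data,
                                                        NSURLResponse *response,
                                                        NSError *error) {
            // Hop back to the main queue so the caller can update the UI safely.
            dispatch_async(dispatch_get_main_queue(), ^{
                completion(data, error);
            });
        }];
    [task resume];
}
```

The data passed to the completion block can then be handed to HTMLDocumentWithData:error: in the same way as before.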

Call getNewPosts in ViewController.m:

 #import "Post.h"

 @implementation ViewController

 - (void)viewDidLoad {
     [super viewDidLoad];
     [Post getNewPosts];
 }

 @end

After running, check the console: we can already get the HTML of each article. Next we will parse the specific data of each article.

Switch to Firebug and expand the node of one of the articles.

Article HTML node.png

We can see that under the <h2 class="title"> node, the tag

<a href="/post/jazzhands/jazzhands-yuan-ma-shi-xian-fen-xi">
<i class="fa fa-leaf"></i>
JazzHands source code implementation analysis</a>

contains the article's URL and title.

Under the <div class="info"> node, the tag

<span class="date">
<i class="fa fa-clock-o"></i>
2016-03-04 21:39
</span>

holds the time when the article was published. We could right-click these nodes and copy the absolute XPath of the article title, the publication time, and so on.

But here we use relative XPath.

The HTML structure of each article is as follows:

Article title, URL, etc.

So our relative XPaths are:

Article title XPath: "h2/a"
Article URL: the href attribute of that a tag
Article publication time XPath: "div[2]/span[1]"
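To see how these relative expressions behave, here is a small sketch that runs them against an inline HTML string instead of the live page. The sample markup and the helper name BPFirstPostFields are assumptions made up for illustration; the real page has more nodes. It also shows a CSS selector as an alternative way to reach the same a tag.

```objective-c
#import <Foundation/Foundation.h>
#import <Ono.h> // The header path may differ depending on how Ono is integrated

// Hypothetical sample markup that only mimics the structure described above.
static NSString *const kSampleHTML =
    @"<html><body><div id=\"posts\">"
    @"<div class=\"post\">"
    @"<h2 class=\"title\"><a href=\"/post/a\">First post</a></h2>"
    @"<div class=\"entry\">Summary</div>"
    @"<div class=\"info\"><span class=\"date\">2016-03-04 21:39</span></div>"
    @"</div>"
    @"</div></body></html>";

// Hypothetical helper: returns the fields of the first article, or nil if
// any of the expected nodes is missing.
static NSDictionary *BPFirstPostFields(NSString *html) {
    NSError *error = nil;
    ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithString:html
                                                        encoding:NSUTF8StringEncoding
                                                           error:&error];
    // Absolute XPath: find the parent of all articles.
    ONOXMLElement *parent = [doc firstChildWithXPath:@"//*[@id='posts']"];
    ONOXMLElement *post = parent.children.firstObject;

    // Relative XPaths: evaluated from the article node itself.
    ONOXMLElement *link = [post firstChildWithXPath:@"h2/a"];
    ONOXMLElement *date = [post firstChildWithXPath:@"div[2]/span[1]"];

    // The same a tag, located with a CSS selector instead.
    ONOXMLElement *linkViaCSS = [post firstChildWithCSS:@"h2.title a"];

    if (!link || !date || !linkViaCSS) {
        return nil;
    }
    return @{ @"title" : [link stringValue],
              @"href" : [link valueForAttribute:@"href"],
              @"date" : [date stringValue],
              @"hrefViaCSS" : [linkViaCSS valueForAttribute:@"href"] };
}
```

Calling BPFirstPostFields(kSampleHTML) returns a dictionary with the title, href, and date of the sample article. Because "div[2]" depends on the position of the node, it breaks as soon as the page adds or removes a div, whereas a class-based selector only depends on the class name.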

Next, let's parse the detailed data of each article.

Add the following method to Post.m:

 + (instancetype)postWithHtmlStr:(ONOXMLElement *)element {
     // Get the a tag that holds the article title, using a relative XPath
     ONOXMLElement *titleElement = [element firstChildWithXPath:@"h2/a"];
     if (!titleElement) {
         return nil; // Not an article node
     }

     Post *p = [Post new];
     p.postUrl = [titleElement valueForAttribute:@"href"]; // href attribute of the a tag
     p.title = [titleElement stringValue];                 // Text content of the a tag

     // Get the span tag that holds the publication time
     ONOXMLElement *dateElement = [element firstChildWithXPath:@"div[2]/span[1]"];
     p.postDate = [dateElement stringValue];

     return p;
 }
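Because the a and span tags shown earlier contain an icon tag and line breaks, the strings returned by stringValue will probably carry surrounding whitespace. The sketch below shows one way to clean them up and, if needed, to turn the date text into an NSDate. Both helper names are hypothetical, and the date format is an assumption based on the single sample value "2016-03-04 21:39".

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: strips the spaces and newlines around the text.
static NSString *BPTrimmed(NSString *raw) {
    return [raw stringByTrimmingCharactersInSet:
                    [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}

// Hypothetical helper: parses text such as "2016-03-04 21:39".
// Returns nil when the text is empty or does not match the assumed format.
// The time is interpreted in the device's current time zone.
static NSDate *BPDateFromPostDate(NSString *raw) {
    NSString *trimmed = BPTrimmed(raw);
    if (trimmed.length == 0) {
        return nil;
    }
    NSDateFormatter *formatter = [[NSDateFormatter alloc] init];
    // A fixed locale keeps the parsing independent of the user's settings.
    formatter.locale = [NSLocale localeWithLocaleIdentifier:@"en_US_POSIX"];
    formatter.dateFormat = @"yyyy-MM-dd HH:mm";
    return [formatter dateFromString:trimmed];
}
```

In postWithHtmlStr:, the title and date could be passed through BPTrimmed before they are stored.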

Then modify the +(NSArray*)getNewPosts method as follows:

 ...
 [postsParentElement.children enumerateObjectsUsingBlock:^(ONOXMLElement *element, NSUInteger idx, BOOL * _Nonnull stop) {
     // NSLog(@"%@", element);
     Post *post = [Post postWithHtmlStr:element]; // Build a Post from each child node
     if (post) {
         [array addObject:post];
     }
 }];
 ...

Finally, the article URL we get from the HTML is a relative URL, such as

/post/jazzhands/jazzhands-yuan-ma-shi-xian-fen-xi

so we prepend the domain name, http://BigPi.me, in the setter method:

 - (void)setPostUrl:(NSString *)postUrl {
     // stringByAppendingString: raises an exception for nil, so guard against a missing href
     _postUrl = postUrl ? [kUrlStr stringByAppendingString:postUrl] : nil;
 }
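String concatenation works here because every href on the page starts with a slash. As a sketch of a more general approach, NSURL can resolve an href against a base URL, which also leaves an already absolute href untouched. The helper name BPAbsoluteURLString is hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: resolves an href against a base URL string.
// Returns nil when either input is empty or cannot be parsed as a URL.
static NSString *BPAbsoluteURLString(NSString *href, NSString *base) {
    if (href.length == 0 || base.length == 0) {
        return nil;
    }
    NSURL *baseURL = [NSURL URLWithString:base];
    NSURL *resolved = [NSURL URLWithString:href relativeToURL:baseURL];
    return resolved.absoluteString;
}
```

For example, BPAbsoluteURLString(@"/post/jazzhands/jazzhands-yuan-ma-shi-xian-fen-xi", @"http://BigPi.me") gives the same result as the setter above.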

Set a breakpoint at the position shown below to view the results:

Code breakpoint.png

Running it, the results are as follows:

Crawl article data results.png

At this point we can use Firebug + Ono + XPath to parse HTML data.

I used this method to obtain the HTML of our school's academic management system and created an App that counts grades and calculates GPA.

Additional Notes

Firebug is a very powerful front-end debugging tool.
You can also use regular expressions to parse HTML data. However, according to a well-known StackOverflow discussion, parsing HTML with regular expressions is not recommended.
Ray Wenderlich has an older tutorial that uses a similar technique to parse HTML.
Finally, it is very important to note that HTML may change frequently, especially when the web page is not one we manage ourselves, so XPath parsing may fail at any time.
If you must use XPath to parse HTML data, you can do it on the server side, expose the result as an API, and let the mobile client GET JSON data as before.
The server can then also handle exceptions, caching, and other strategies.

The demo for this article can be found at https://github.com/iShawnWang/BlogDemo/tree/master/ParseHTMLDemo
