Deeply debug network requests using WireShark

Background

Recently I found that our product is sometimes very slow to open an ad link in a WebView, with the white screen lasting more than 10 seconds. While tracking down the bug I ran into many interesting things and found the process quite rewarding. I would like to share them here, mainly to discuss the methodology of tracking down bugs. Of course, it can't be too abstract, so I will also include some practical knowledge, such as how to use WireShark.

Bug Reproduction

The first thing to do after encountering a bug is, of course, to reproduce it. After some testing, I found that the bug occurs almost exclusively on older models such as the iPhone 6, while my 7 Plus has basically no problem. It occurs with some probability on both 4G and Wi-Fi, and seems to be more frequent on Wi-Fi.

In fact, experienced developers should already have a hunch at this point: this is probably not a client bug, but more likely the result of a low-quality advertiser web page or an unstable network. But as a reliable programmer, how can you report such unfounded speculation to your superiors?

Separation of concerns

Loading a web page takes time in two parts: local processing time and network loading time. The dividing line between the two is UIWebView's shouldStartLoadWithRequest delegate method. Time spent before this method is called is local processing; time spent after it is network loading. So we can divide things into two parts:

From didSelectRowAtIndexPath, when the cell receives the tap, to shouldStartLoadWithRequest of the UIWebView.

From shouldStartLoadWithRequest to webViewDidFinishLoad of the UIWebView.

Since the bug only occurs occasionally, we cannot keep the phone attached to the Xcode debugger waiting for it. Instead, we should write a simple tool that persists a log for every attempt, recording the function calls, time spent, and specific parameters of each step. Once the bug is reproduced, the logs on the phone can be read by connecting it to the computer.
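Such a logging tool might look something like the following minimal sketch. It is written in Python purely for illustration (the app itself would do this in Objective-C or Swift), and the class name StageLogger, the log file, and the stage labels are all hypothetical, chosen only to mirror the callbacks discussed in this section.

```python
import json
import time


class StageLogger:
    """Append one JSON line per event so the log survives app restarts."""

    def __init__(self, path, clock=time.time):
        self.path = path
        self.clock = clock  # injectable so durations can be checked in tests

    def mark(self, session_id, stage, **params):
        """Persist a timestamped event for one page-open attempt."""
        event = {
            "session": session_id,
            "stage": stage,
            "time": self.clock(),
            "params": params,
        }
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")


def stage_durations(path, session_id):
    """Return the seconds elapsed between consecutive stages of one session."""
    events = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["session"] == session_id:
                events.append(event)
    events.sort(key=lambda e: e["time"])
    return {
        f'{a["stage"]} -> {b["stage"]}': b["time"] - a["time"]
        for a, b in zip(events, events[1:])
    }
```

Calling mark at the tap, at shouldStartLoadWithRequest, and at webViewDidFinishLoad would then let stage_durations report the local processing time and the network loading time separately for each attempt.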

Local processing

Local processing takes relatively little time, but the logic is far from simple. In my opinion, the path from displaying a UITableView to handling the tap event is enough to reflect the technical strength of a team. It is no exaggeration to say that doing this small piece of business well touches on core topics such as the choice and implementation of an MVC/MVVM architecture, the encapsulation of the network and persistence layers, and the modular splitting of the project. I will write separate articles on these when I find the time, so I will not go into details here.

After spending some time sorting out the business flow and collecting statistics, I did find something. The client only sends the request after the pushViewController animation has finished, wasting about 0.5s of animation time that could have been used to load the web page.

Network Request

With the help of the logs, I also found that although local processing wastes time, the cost is relatively stable at about 1s. The larger cost comes from the network request. Generally there is a short white screen when opening a web page. During this time the system loads the HTML and other resources and renders them, and a loading spinner (activity indicator) is shown on the interface.

When the white screen disappears depends on when the system finishes loading the web page, which we cannot control. When the spinner disappears, however, is known: our logic hides it in webViewDidFinishLoad. This may not be accurate, because webViewDidFinishLoad is also called when the page is redirected, causing the client to mistakenly think that loading has completed. For a more accurate approach, please refer to: How to accurately determine whether WebView has been loaded. Of course, this is only more accurate. As far as UIWebView is concerned, it is almost impossible to determine precisely whether the page has finished loading (thanks to @JackAlan for testing this in practice).

Therefore, network loading can be divided into two parts: the pure white screen time, and the time when the web page is visible but the spinner is still spinning. This happens because webViewDidFinishLoad is called only after a frame (the main HTML document or an iframe) has fully loaded, including its CSS and JS, so the page may already be rendered while JS requests are still in flight. From the user's point of view, the page can be seen but the spinner keeps spinning. If this lasts too long users will become impatient, but it is more acceptable than a pure white screen.

At the same time, we can be sure that if the page is already displayed but JS requests are still running, the cause is the poor quality of the advertiser's web page. The loss should be borne by them, and there is nothing we can do about it. The long white screen is the problem we should focus on.

Summary

In fact, we could already report to the boss at this point. The total loading time is divided into three stages. The first is the local processing time, which wastes some time but is relatively stable. The second is the white screen time, during which the system's UIWebView is requesting resources and rendering. The third is the time the loading spinner keeps spinning after the page has appeared, which is generally shorter and beyond our control.

We also know that UIWebView exposes very few APIs. From the start of the request to the end of page loading it is a complete black box, and there is almost nowhere to hook in. But as a programmer with pursuits, ideals, ambitions, and skills, how can I give up easily?

WireShark

Charles is the most commonly used tool for debugging network traffic on the client side, but it only handles HTTP/HTTPS requests and cannot see anything at the TCP layer. To understand the details of the HTTP request process, we need a more powerful (and certainly more complex) weapon, which is the protagonist of this article: WireShark.

Generally speaking, the more powerful a tool is, the uglier it looks, and WireShark is no exception.

But don't worry, we don't need much of it. In the red box at the top, the blue shark-fin icon starts capturing network data, and the red button, as you can guess, stops the capture. Unlike Charles, which only monitors HTTP requests, WireShark can inspect traffic down to the IP layer or even lower, so it captures far more packets, and you will be overwhelmed by thousands of them within a few seconds. I suggest keeping the capture time short, or entering filter conditions in the second red box to reduce noise, as described in detail below.

WireShark can capture from both the local network interfaces and a mobile phone. When debugging a real device with WireShark you do not need to set up a proxy; just connect the phone to the computer via USB, which also makes it possible to debug traffic on the 4G network, something a proxy cannot do. Use the rvictl -s <device UDID> command to create a virtual network interface:

rvictl -s 902a6a449af014086dxxxxxx346490aaa0a8739

Of course, looking up a phone's UDID is rather tedious. As a lazy person, how could I not automate it with a one-liner on the command line?

instruments -s | awk '{print $NR}' | sed -n 3p | awk '{print substr($0,2,length($0)-2)}' | xargs rvictl -s

This way, as long as the phone is connected, the UDID is obtained and passed to rvictl automatically.

After running the command, you will see a message that the rvi0 virtual network interface was created successfully. Double-click the rvi0 row in WireShark to start capturing.

Packet capture interface

We mainly focus on two areas. The large red box at the top is the packet list, which includes TCP, DNS, ICMP, HTTP and other protocols in a variety of colors. Generally speaking, black rows mean an error has occurred and deserve special attention, while the other colors are there to aid understanding. After a few rounds of debugging you will basically remember what each color means.

The small red box below shows the detailed data of the selected packet, broken down by protocol layer. For example, packet No. 99, which I selected, is a TCP packet, and its IP header, TCP header and TCP payload can be clearly seen. These can be analyzed in more detail when necessary, but usually do not need attention.

Generally speaking, a single page load produces a great many packets, possibly thousands. How do you find the ones you are interested in? Use the filter function mentioned above. WireShark filters use their own syntax. If you are not familiar with it, look it up online or rely on auto-completion and guess the meaning from the field names.

Since we want to examine the details of an HTTP request, we must first find the requested URL and then use the ping command to get the corresponding IP address. This approach is generally fine, but some domain names are optimized, for example by returning different IP addresses depending on the IP of the client making the DNS query, to ensure the speed of ***. In other words, the DNS results on the phone are not always the same as those on the computer. In that case, we can confirm the address by inspecting the DNS packets in the capture.

For example, from the figure we can see that the domain name res.wx.qq.com resolves to a large number of IP addresses, but only the first two are actually used.
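For comparison with what the phone sees, the addresses returned by the computer's own resolver can be listed with a short script, since ping reports only one of them. This is a sketch built on Python's standard socket.getaddrinfo; the function name resolve_ipv4 and its injectable resolver parameter are illustrative assumptions.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the unique IPv4 addresses for hostname, in resolver order."""
    infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (address, port) for IPv4
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

Calling resolve_ipv4("res.wx.qq.com") on the computer and comparing the result with the DNS answers in the capture shows whether the two devices were given the same addresses. The result reflects only the computer's view of DNS.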

After resolving the address, we can do simple filtering, enter ip.addr == 220.194.203.68:

This shows only the communication with the host 220.194.203.68. Note the Source Port in the red box, which is the client port. HTTP clients can issue requests concurrently, and concurrent requests on separate connections must use different client ports. So the two packets seen in the figure are not necessarily a request and its response. They may belong to two different ports and have nothing to do with each other, and just happen to be adjacent in time.

If you only want to display the traffic for a certain port, you can use: ip.addr == 220.194.203.68 and tcp.port == 58854. (Filtering on tcp.dstport == 58854 instead would show only the packets sent to that port, that is, one direction of the connection.)

If you only want to see the GET requests and responses of the HTTP protocol, you can use ip.addr == 220.194.203.68 and (http.request.method == "GET" || http.response.code == 200) to filter.

If you want to see data on packet loss, you can use ip.addr == 220.194.203.68 and (tcp.analysis.fast_retransmission || tcp.analysis.retransmission)

The above are the filters I used most frequently during debugging, for reference only. Interested readers can capture packets and experiment on their own, so I won't post screenshots for each one.
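The same display filters can also be applied offline to a saved capture. The sketch below builds a command line for tshark, the terminal companion that ships with WireShark, using its -r (read a capture file) and -Y (display filter) options. The helper names and the file name capture.pcapng are hypothetical, and the IP address is simply the one from the examples above.

```python
import subprocess

# Filter fragment for lost packets, as used earlier in this section.
RETRANSMISSIONS = "tcp.analysis.fast_retransmission || tcp.analysis.retransmission"


def host_filter(ip, extra=None):
    """Build a display filter limited to one host, optionally narrowed."""
    base = f"ip.addr == {ip}"
    return f"{base} and ({extra})" if extra else base


def tshark_command(capture_file, display_filter):
    """Return the argv list that prints matching packets from a capture."""
    return ["tshark", "-r", capture_file, "-Y", display_filter]


def run_filter(capture_file, display_filter):
    """Run tshark (it must be installed) and return its text output."""
    result = subprocess.run(
        tshark_command(capture_file, display_filter),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_filter("capture.pcapng", host_filter("220.194.203.68", RETRANSMISSIONS)) would list the retransmitted packets exchanged with that host.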

Case 1: DNS resolution

After capturing packets many times, I started to analyze the packets corresponding to those web pages with long white screens, and indeed found many problems, such as here:

You can clearly see a pile of black error rows, but if you start debugging those packets you will fall into a trap. DNS runs over UDP, which has no TCP-style retransmission, so these black packets must be retransmissions of packets lost earlier on other TCP connections and can be ignored here. If you look only at the blue DNS requests, you will find that several requests were sent in succession without any response, and the resolved IP address was not obtained until the 12th second.

The DNS requests are addressed to an IP starting with 172.24, which tells us this is an intranet DNS server. I don't know why it stalled for so long.
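The reasoning is that 172.24.x.x falls inside 172.16.0.0/12, one of the private ranges reserved by RFC 1918. A quick way to check any address is Python's standard ipaddress module. The function name is_intranet is hypothetical, and the concrete address in the usage note is a made-up example, since only the 172.24 prefix is known from the capture.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_intranet(address):
    """Return True if the IPv4 address lies in an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)
```

Here is_intranet("172.24.0.1") is true, while the public server address 220.194.203.68 used earlier is not in any private range.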

Case 2: Handshake response delay

The following figure shows a typical TCP handshake. You can see that after the SYN packet in the first figure was sent, it took one second for the acknowledgment to arrive. The reason is unclear and can only be put down to network jitter.

Then I captured the packet again on the 4G network:

This time things are even more outrageous. The SYN packet sent at the 2-second mark is lost again and again (it may also be that the server did not respond, or that the ACK was lost). In short, the client keeps retransmitting the SYN packet.

What is more interesting is the TSval field, the TCP timestamp value set when the packet is sent. Comparing these values, we find that the retransmission interval was 1s for the first few attempts, and then became 2s, 4s and 8s. This reminds me of the concept of RTO.

RTT is the time from sending a packet to receiving its response, and it changes dynamically with the network environment. TCP uses a sliding window: if the first packet in the window has not been acknowledged, the window cannot slide forward. The client treats the receipt of an ACK as the sign that a packet was delivered successfully, but what if the ACK never arrives? The client will not wait forever. It sets a timeout, and once that time is exceeded the packet is considered lost and is retransmitted.

This timeout is called the RTO (retransmission timeout). Obviously it must be somewhat larger than the RTT, otherwise packets would be wrongly declared lost. But it cannot be too large either, otherwise time is wasted. A reasonable RTO is therefore adjusted dynamically with the RTT, always staying larger than the RTT but not by too much. In the screenshot above, the RTT is sometimes very small, only a few milliseconds. Setting the RTO to a few milliseconds as well would be unreasonable, since it would increase the load on the client and on the routers along the way. The RTO therefore has a lower limit, which varies between operating systems, for example 200ms on Linux. It also has an upper limit. For specific algorithms, please refer to this article and this article.

Note that the RTO normally follows the RTT, but once the RTO expires and a timeout retransmission occurs, the RTO stops tracking the RTT (which cannot be measured at that point) and grows exponentially instead. This is why the interval in the screenshot above changes from 2s to 4s and then to 8s.
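As one concrete example of such an algorithm, the sketch below follows the standard calculation from RFC 6298: a smoothed RTT plus four times the RTT variance. It is not necessarily what the referenced articles describe. The class name is hypothetical, the 200ms floor is the Linux value mentioned above, the 60s ceiling is an assumed value, and the clock-granularity term of the RFC is omitted for brevity.

```python
class RtoEstimator:
    """Retransmission timeout estimate following RFC 6298 (simplified)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto=0.2, max_rto=60.0):
        self.min_rto = min_rto  # lower limit in seconds (Linux-style 200ms)
        self.max_rto = max_rto  # upper limit in seconds (assumed value)
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # RFC 6298 initial value before any RTT sample

    def on_rtt_sample(self, rtt):
        """Feed one measured RTT in seconds and return the updated RTO."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # The variance is updated first, using the previous SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        raw = self.srtt + self.K * self.rttvar
        self.rto = min(max(raw, self.min_rto), self.max_rto)
        return self.rto
```

With an RTT sample of 5ms the raw value would be only 15ms, so the lower limit takes over and the RTO stays at 200ms, which is the situation described above.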
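A minimal sketch of this exponential backoff, assuming a plain doubling rule with an upper cap. The function names, the 1s starting value, and the 60s cap are assumptions; as the capture shows, a real stack may repeat the first interval a few times before it starts doubling.

```python
def backoff_intervals(initial_rto, retries, max_rto=60.0):
    """Return the wait before each retransmission, doubling after every timeout."""
    intervals = []
    rto = initial_rto
    for _ in range(retries):
        intervals.append(rto)
        rto = min(rto * 2, max_rto)  # double, but never exceed the cap
    return intervals


def total_wait(intervals):
    """Total time spent waiting across all retransmission timeouts."""
    return sum(intervals)
```

Starting from 1s, four timeouts give intervals of 1s, 2s, 4s and 8s, which already add up to 15s of waiting before the connection can make progress.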

In this capture the handshake took 20 seconds, but again we cannot give an exact reason and can only attribute it to network jitter.

Conclusion

Through TCP-level packet capture, we not only learned how to use WireShark, but also reviewed the TCP protocol and analyzed the problem in greater depth. Starting from a vague suspicion of a network problem, we narrowed it down step by step: from slow page loading, to the overly long white screen, and finally to the number of HTTP requests and the time spent in each stage, such as DNS resolution, TCP handshake, and TCP data transmission. From this point of view, the culprit for the slow loading is not the quality of the advertiser's web page but the instability of the network. Although no effective solution was found in the end, at least the cause of the problem was clarified and a convincing explanation was given.
