1. Background

Measuring and simulating a "weak network" matters a great deal in mobile app development: it saves testing costs, makes problems easier to reproduce, and speeds up product launches. The usual approach is to define and measure a "weak network" in terms of packet loss rate and network delay.

2. Process of mobile phone access to server

To discuss this issue, we must first understand how a mobile phone reaches a server.

- First, the mobile phone must obtain a wireless link allocation from the base station through the wireless network protocol before it can communicate with the network.
- Wireless network base stations and base station controllers will distribute signals to mobile phones to complete mobile phone connections and interactions.
- After obtaining the wireless link, the phone performs network attachment, encryption, and authentication. The core network checks whether you are allowed to connect to this network, whether you have an active data plan, whether you are roaming, and so on. The core network contains the SGSN and GGSN; at this step, the conversion between the wireless network protocol and wired Ethernet is completed.
- In the next step, the core network will help you select the APN, allocate the IP, and start billing.
- Further down are the steps of a traditional network: DNS query and response, establishing the TCP connection, HTTP GET, HTTP RESPONSE 200 OK, HTTP RESPONSE DATA, LAST HTTP RESPONSE DATA, and starting UI display.

This is the whole process of a mobile phone accessing a server through a wireless network. Several questions bother developers along the way:

- How does a wireless network assign wireless links to mobile phones?
- The core network has access points (APN). What is the difference between CMNET and CMWAP? Is it just a protocol difference? How does data forwarding differ?
- Is there any difference when a data packet is transmitted on different networks?
- How can users find the right server as quickly as possible?
- How can content be loaded quickly and effectively and displayed as early as possible?

The focus of these questions lies in several connection points:

- Radio link allocation. This is a real physical connection.
- IP layer link. This is a logical virtual connection.
- TCP layer link. This is a logical virtual connection.
- HTTP layer link. This is a logical virtual connection.
- The user is online. This is a logical virtual connection.

3.2 One-second rule

Based on the above situation, a major feature of wireless networks emerges: state management and state transitions both happen on the scale of seconds. Both take between hundreds of milliseconds and several seconds, which is too short for holding a connection open, yet too long for switching from disconnected to connected. In contrast, state management in wired networks, such as IP allocation and TCP connection release, happens on the scale of minutes, while state transitions take milliseconds.

These communication mechanisms, coupled with the high latency and high packet loss of wireless networks, make it a huge challenge to ensure that mobile Internet products provide stable and predictable service quality:

- The data transmission delay of the wireless part is several hundred ms on a 2G network and drops to tens of ms on a 4G network.
- Core network state conversion and protocol conversion take 30-100 ms.
- The delay on the IP backbone network depends on the physical distance and the quality of operator interconnection: 50-400 ms across operators and 5-80 ms within the same operator. It also depends on network congestion.
- The bit error rate of wireless networks is two orders of magnitude higher than that of wired networks, and it fluctuates widely across time periods.

How can services be optimized for the characteristics of mobile networks? This is the one-second rule we have summarized, the prescribed actions to be completed within one second:

- 2G network: complete DNS query and establish connection with backend server within 1 second
- 3G network: first word display within 1 second (first word time)
- Wi-Fi network: first screen display completed within 1 second (first screen time)

These indicators need to be measured on the terminal and must relate to user experience: the first word time and the first screen time must be something the user perceives directly.
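
To see why the 2G target stops at DNS plus connection setup, the delay figures quoted above can be added up into a rough budget. The sketch below is only an illustration under stated assumptions: the 300-800 ms range stands in for "several hundred ms", each quoted figure is treated as that segment's contribution to one round trip, and the DNS query and the TCP handshake are each counted as one round trip. None of these assumptions come from the original measurements.

```python
# Delay ranges in milliseconds as (best, worst).
WIRELESS_2G_MS = (300, 800)       # ASSUMPTION: stand-in for "several hundred ms"
CORE_NETWORK_MS = (30, 100)       # from the text
BACKBONE_SAME_OP_MS = (5, 80)     # from the text
BACKBONE_CROSS_OP_MS = (50, 400)  # from the text

# ASSUMPTION: one round trip for the DNS query plus one for the TCP handshake.
ROUND_TRIPS_DNS_AND_CONNECT = 2


def budget_ms(segments, round_trips):
    """Return (best, worst) total delay for the given number of round trips."""
    best = sum(low for low, _ in segments) * round_trips
    worst = sum(high for _, high in segments) * round_trips
    return best, worst


same_op = budget_ms(
    [WIRELESS_2G_MS, CORE_NETWORK_MS, BACKBONE_SAME_OP_MS],
    ROUND_TRIPS_DNS_AND_CONNECT,
)
cross_op = budget_ms(
    [WIRELESS_2G_MS, CORE_NETWORK_MS, BACKBONE_CROSS_OP_MS],
    ROUND_TRIPS_DNS_AND_CONNECT,
)
print("same operator :", same_op)   # (670, 1960)
print("cross operator:", cross_op)  # (760, 2600)
```

Under these assumptions, only the best case fits inside one second on 2G, which is consistent with the rule asking for nothing more than DNS resolution and connection setup on that network.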

4. Optimization ideas

4.1 Service Guarantee Principles

From the above analysis, it is clearly very challenging to ensure that mobile Internet products provide stable and predictable service quality. The following principles may help:

- Interface design optimization. Strictly speaking, interface optimization is not part of weak-network optimization for an app, but API performance problems do get exposed when network conditions are poor. Everyone talks about server quality and hardware performance, but on a good server most request latency actually comes from I/O, including disk reads and writes, SQL queries, and so on. Common optimization points: slow query monitoring, multi-query optimization, caching commonly used interfaces, etc.
- Image-related strategies.
  - Use a faster image format. Strictly speaking this is not a weak-network optimization, but a faster image format really matters. Here we recommend WebP. (WebP is an image format developed by Google to speed up image loading. A compressed image is only about 2/3 the size of the equivalent JPEG, which saves a lot of server bandwidth and storage. WebP supports both lossy and lossless compression, and encoding a WebP file takes more computing resources than encoding a JPEG file of the same quality.)
  - Send different images on different networks. For example, for an original 600x480 image: 2G/3G gets a low-resolution image (300x240 at quality 80); 4G gets a normal-resolution image (600x480 at quality 80); Wi-Fi gets a high-resolution image (600x480 at quality 100), ideally decided by measured network speed, since Wi-Fi can also be slow.
- Reconnect after disconnection. This may be the most important feature, because there are too many reasons that cause data connection to be disconnected in wireless networks. CDN can be used here. (CDN is a distributed content distribution network built on a data network. The role of CDN is to use streaming media server cluster technology to overcome the shortcomings of insufficient output bandwidth and concurrent capacity of a single-machine system, which can greatly increase the number of concurrent streams supported by the system and reduce or avoid the adverse effects of single point failures.)
- Since creating a connection is a very expensive operation, the number of data connection creations should be minimized, and tasks should be performed in batches in one request. If small data packets are sent multiple times, they should be sent within 2 seconds. When accessing different servers in a short period of time, reuse wireless connections as much as possible.
- Optimize DNS queries. DNS queries should be reduced as much as possible, domain name hijacking and DNS pollution should be avoided, and users should be dispatched to the "optimal access point".
- Reduce data packet size and optimize packet volume. Reduce data packet size and packet volume by compression, header reduction, message merging, etc.
- Keep the packet size at or below 1500 bytes to avoid fragmentation. This includes logical link control fragmentation, GGSN fragmentation, and IP fragmentation. When a packet exceeds the maximum size allowed by the GGSN, the GGSN handles it in one of three ways: fragmentation, discard, or rejection.
- Optimize TCP socket parameters, including whether to disable fast recycling, the initial RTO, the initial congestion window, socket buffer size, Delay-ACK, Selective-ACK, TCP_CORK, and the congestion algorithm (westwood/TLP/cubic). This matters because the QoS of access networks such as 2G/3G/4G/Wi-Fi/company intranet varies greatly, so the parameter values that give good service quality on each network may differ just as much.
- Optimize ACK packets. On a weak network, ACK packets in the TCP protocol are very expensive, and their delay can even reach the level of seconds. TCP's congestion control, fast retransmission, and fast recovery all depend heavily on the ACK packets fed back by the receiver, so if ACKs reach the sender too late, TCP efficiency suffers badly. On the other hand, sending too many ACKs consumes precious wireless resources. In mobile network communications, "how to reduce data packet delay while reducing ACK packets on a reliable connection" is a hot research topic. The basic idea is to balance the number of redundant packets against the number of ACK packets in order to reduce delay and improve throughput. One example is how the SGSN and GGSN communicate: they use UDP, and when there is no new data packet, the sender retransmits sent packets at a fixed interval; once the maximum number of retries is reached, the packet is discarded.
- TCP's congestion control algorithm is designed on the assumption that "packet loss means network congestion". This assumption clearly does not fit a wireless network environment. But can congestion control be dropped entirely when designing a reliable UDP protocol for wireless networks? Several articles propose TCP-friendly congestion control algorithms for wireless network environments; interested readers can look them up.
- Flexible use of long connections/short connections, support for different protocols (TCP/UDP, http, binary protocols, etc.), support for different ports, etc.
- Let users feel fast. This is no longer a technical method, but a psychological game, a way to improve user experience. For example:
  - A progress bar that does not start at 0. Regardless of the actual loading progress or the network conditions, the progress bar always starts at 50% and holds at around 98%.
  - Display text first and then load images. In a WebView, images and multimedia load much more slowly than text. Since different WebViews render differently, we can have the WebView display the text first and the images afterwards, which gives users the feeling of seeing an overview of the whole page right away.
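
The per-network image strategy described in this section can be sketched as a small lookup function. This is an illustrative sketch, not the authors' implementation: `ImageVariant`, `choose_variant`, and the `SLOW_WIFI_KBPS` threshold are hypothetical names and values, and the 80/100 figures from the example are treated here as an encoder quality setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageVariant:
    width: int
    height: int
    quality: int


# ASSUMPTION: below this measured speed, Wi-Fi is treated like 4G.
SLOW_WIFI_KBPS = 500


def choose_variant(network_type: str,
                   measured_kbps: Optional[float] = None) -> ImageVariant:
    """Pick the variant of a 600x480 original to send for a network type."""
    net = network_type.lower()
    if net in ("2g", "3g"):
        return ImageVariant(300, 240, 80)
    if net == "4g":
        return ImageVariant(600, 480, 80)
    if net == "wifi":
        # Wi-Fi can be slow, so prefer the measured speed when it is known.
        if measured_kbps is not None and measured_kbps < SLOW_WIFI_KBPS:
            return ImageVariant(600, 480, 80)
        return ImageVariant(600, 480, 100)
    # Unknown network type: fall back to the cheapest variant.
    return ImageVariant(300, 240, 80)


print(choose_variant("3G"))
print(choose_variant("wifi", measured_kbps=120.0))
```

Falling back to the smallest variant for an unknown network is a conservative design choice; a product might equally choose the normal variant there.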

4.2 Access Scheduling Optimization

The first thing to consider when optimizing access scheduling is reducing the impact of DNS. DNS in mobile networks has the following characteristics:

- The backbone network cannot identify which city a mobile user is in, so region-based scheduling across the country is not fully utilized. Currently, some nationwide DNS servers carry more than 40% of all network users.
- Many knockoff phones have incorrect local DNS settings.
- There are also some problems that wired networks will encounter, such as terminal DNS resolution abuse, domain name hijacking, DNS pollution, aging, fragility, etc. However, for these problems, the desktop will have better self-healing properties, while it is more difficult to solve them on mobile phones.

There are two main solutions to DNS problems:

- Reduce DNS requests, queries, and updates, that is, use DNS caching
- Configure the server list in the terminal and access the IP directly without DNS

But this is not enough. Users may come from different operators at home and abroad, so the scheduling strategy needs further optimization:

- DNS caching requires establishing multiple access points and using different domain names to distinguish them
- The IP list needs to be updated to adapt to different network conditions, which requires active scheduling. For example, at the beginning we served only China Mobile users well and prioritized their access quality, because most users were concentrated on China Mobile. Now there are three operators in China and the user distribution among them is slowly evening out, so we need to distinguish them clearly. Smartphones also use Wi-Fi, and we do not know whether that Wi-Fi is connected to China Telecom, China Unicom, or another operator, so you cannot enumerate the scenarios in advance with if/then rules; this must be solved through backend scheduling capabilities.

Further optimization produces a combined approach:

- Resolve the domain name first, then have the client connect directly to the resolved IP, using either HTTP or a TCP socket
- Multi-port, multi-protocol combination: different environments impose different restrictions; some allow only HTTP, some only TCP sockets. The client must adapt to all of them and cannot support just one protocol
- Terminal speed test: as the number of access points grows, the terminal can run a speed test to choose the fastest one. You could run a speed test every time a new connection is created, but that may make connection setup slow. Instead, establish a connection for the user first, then schedule dynamically in the background based on long-term speed monitoring and the current speed test results. In other words, the first connection may not be optimal; once it is established, the speed is measured dynamically and the client is moved to the fastest access point. A further step is to build a network profile and let the terminal learn.
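
Combining long-term speed monitoring with the current speed test can be sketched as a weighted blend of the two measurements. The function below is a minimal illustration, not the scheduling logic actually deployed: `pick_access_point`, the `alpha` weight, and the use of round-trip time as the score are all assumptions.

```python
def pick_access_point(history_ms, probe_ms, alpha=0.3):
    """Blend long-term and current round-trip times and pick the fastest.

    history_ms: access point -> long-term average RTT in ms
    probe_ms:   access point -> latest measured RTT in ms, or None if the
                probe failed
    alpha:      weight of the latest probe (hypothetical tuning knob)
    Returns (best_access_point_or_None, updated_scores).
    """
    scores = {}
    for ap, latest in probe_ms.items():
        if latest is None:
            continue  # an unreachable access point is never a candidate
        past = history_ms.get(ap, latest)  # no history: trust the probe
        scores[ap] = alpha * latest + (1 - alpha) * past
    if not scores:
        return None, scores
    best = min(scores, key=scores.get)
    return best, scores


history = {"203.0.113.10": 100.0, "203.0.113.20": 200.0}
probe = {"203.0.113.10": 400.0, "203.0.113.20": 150.0, "203.0.113.30": None}
best, scores = pick_access_point(history, probe, alpha=0.5)
print(best, scores)
```

In this example the access point with the better history loses because its current probe is much slower. A small `alpha` would make the choice more stable, a large one more reactive.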

Regarding the granularity of speed test sampling: IP segments are useless for mobile Internet. A better granularity is the network element level. For example, there are more than 20 WAP gateways in Guangdong, and each gateway behaves differently, so the gateway is a more appropriate granularity.

Finally, one principle applies to all access scheduling: do not hard-code the scheduling logic on the client side; it must be handled by the backend.

4.3 Protocol Optimization

Here is a brief list of protocol parameter optimizations, distilled from long-term operational experience. When launching mobile Internet services, it can serve as an operating standard that avoids many detours:

- Disable TCP fast recycling
- Initial RTO no less than 3 seconds
- Initial congestion window no less than 10. Because most pages are smaller than 10 KB, many requests finish while still in the slow start phase, so raising the window to 10 reduces the transmission delay of small page resources. The larger the content, the less noticeable the effect.
- Socket buffer > 64 KB
- Variable TCP sliding window
- Control the packet size below 1400 bytes to avoid fragmentation
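
Two items in this list can be applied from application code: the socket buffer size and the packet size limit. The sketch below shows one way to do that with Python's standard `socket` module. The helper names and the 128 KB buffer value are illustrative assumptions. The remaining items (initial RTO, initial congestion window, fast recycling) are typically configured at the operating-system level rather than per socket, so they are not shown.

```python
import socket

MAX_PAYLOAD = 1400           # bytes, the limit given in the list above
SOCKET_BUFFER = 128 * 1024   # bytes; ASSUMPTION: any value above 64 KB


def tune_socket(sock):
    """Request send and receive buffers larger than 64 KB.

    The operating system may round or cap the requested size.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SOCKET_BUFFER)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, SOCKET_BUFFER)


def split_payload(data, limit=MAX_PAYLOAD):
    """Split application data into chunks that each fit in one packet."""
    return [data[i:i + limit] for i in range(0, len(data), limit)]


chunks = split_payload(b"x" * 3000)
print([len(c) for c in chunks])  # [1400, 1400, 200]
```

`split_payload` is most useful with a datagram protocol, where the application decides the size of each packet; with TCP the kernel segments the stream itself.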

The principles of protocol optimization can be summarized as follows:

- Connection reuse
- Concurrent connection control
- Timeout control
- Header reduction
- Content compression
- Choose a more efficient protocol. Whether it is TCP, HTTP, UDP, long connections, GZIP, SPDY, WUP, or WebP, each protocol and solution exists for a reason. There is no single best choice; what matters is whether it fits your product and service characteristics, which you have to verify during operation.

4.4 WAP access point optimization

Regarding WAP access point optimization, some may say that our app is a high-end product, so WAP optimization is unnecessary. In fact, our statistics show that 5%-20% of users currently choose a *WAP access point (CMWAP, 3GWAP, CTWAP), and this even includes some iPhone terminals. The WAP gateway is essentially a proxy; it is not simply outdated, and it keeps evolving as technology advances. In the future, integrated gateways and content billing gateways may replace today's WAP gateways in the network architecture, so we recommend taking them into account as well. The following issues need attention when doing WAP optimization:

- Fee reminder page
- 302 redirect processing
- X-Online-Host Usage and Processing
- Packet size limit
- Hijacking and Caching
- Correctly obtain the resource package size

4.5 Business Logic Optimization

1. Simplify logic: For content with complicated update interactions, use flags wherever possible. For example, we tested an old version of mobile QQ: with 100 friends, logging in and updating the friend list took 3.5 minutes, which is clearly unreasonable. It is better to use a signaling status to notify the client whether an update is needed, and to use the cache sensibly. Another example: in a game, if your friends send you a lot of stars, should the user collect them one by one or in batches? From an optimization perspective, collecting in batches is clearly better, and it is also more comfortable for the user. Extending the cache time of the domain icon also effectively reduces the number of requests: after we extended the cache time of the Tencent mobile icon from 120 minutes to 2 days, the number of visits was reduced by about 35%.

2. Flexible availability: When the network quality is good, provide a large high-definition image. When it is poor, show the user a small image first and pull the original only after a click. To give an extreme example, if an earthquake destroys 20% of the base stations, users need to tell their families they are safe. At that point the product must be optimized, for example by sending only text, reasonably reducing network consumption by 3%. In addition, when responses are very slow, give the user a reasonable on-page prompt, such as telling the user that the message will be sent in 5 seconds, so there is no need to keep refreshing; this also reduces the load that repeated access puts on the backend service and the network.

5. Practical demonstrations

5.1 An example of optimal scheduling design

Having said so much above, here is an example to make it more intuitive.
Here is a DNS system design for achieving optimal scheduling. Its topology is as follows:

[Figure: system topology]

TGCP SDK responsibilities:

- Use HTTP GET/POST to obtain from DNSvr the optimal access point lists for the server and for DNSvr itself. The query parameters include uin/openid, client version number, MD5 of the IP list (note that the order of the IPs matters), domain name list, VIP, ServiceID, etc.
- Cache the access IP lists of the server and DNSvr, as well as other metadata (such as IP list, etc.), with APN as the primary key.
- Actively update the cached IP list when certain conditions are met, such as cache expiration.

Tconnd's responsibilities:

- Route query requests to the active DNSvr.

DNSvr's responsibilities:

- Determine the client's "optimal access point" based on static and dynamic strategies. Static strategy: determine the IP list based on uin/openid, client version number, or mandatory rules. Dynamic strategy: the lighthouse dynamically determines the user's server access point based on speed test data.
- Support blacklisting certain IPs manually or automatically. Automatic method: the tconnd in front of the server reports to DNSvr whether it is alive (it needs to report to multiple points, including over public IPs). If no report is received within a certain period, or the report explicitly states that all logical servers are down, the corresponding IP is automatically blacklisted. When the service recovers, the IP is automatically reactivated. If the project team uses TGW, deciding whether a given IP and port are available requires considering the mapping between the process and the VIP.
- Cache the lighthouse's calculation results in tcaplus. This requires DNSvr to determine the country, province, operator, and gateway from the client IP (this can be done by querying the MIG IP library). If the lighthouse's results are cached, the corresponding data must be pulled from the lighthouse again when the cache times out.

Responsibilities of the lighthouse:

- Based on the client IP and the server access point IPs, return the optimal access point list, including the IP ordering, as well as the country, province, operator, APN, and gateway of the client.

Tcaplus's responsibilities:

- Store the access IP list and ports and the static policy, or cache the lighthouse's calculation results.

The main process:

Client batch resolution process for domain names

1. TGCP queries the cache using the APN and the domain name list as keys. If an entry exists and has not expired, the IP address is returned to the user directly. If a forced-resolution domain name list is specified, this step is skipped.
2. TGCP uses a pre-configured or cached IP to send a query request to DNSvr. If a result is returned successfully, go to step 3. Otherwise, retry with the other IPs in the IP list. If all of them fail, access DNSvr using the domain name. Note: if the result format is incorrect, retry with the same IP rather than switching to another one.
3. DNSvr compares the MD5 of the client's IP list with that of the latest IP list. If they are equal, it tells the client that the local cache does not need updating. Otherwise, TGCP writes the IP lists of the access server and DNSvr to local storage. Note: when accessing the server, these IPs take priority over the IPs statically configured on the client.
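
The MD5 comparison and the APN-keyed cache from the batch resolution process can be sketched as follows. This is a simplified illustration under assumptions: the comma-joined encoding of the IP list, the `ApnCache` class, and renewing the expiry time when the list is unchanged are all hypothetical choices, and the network call to DNSvr is replaced by a plain function.

```python
import hashlib
import time


def ip_list_md5(ips):
    """MD5 over the ordered IP list; a different order gives a different hash."""
    return hashlib.md5(",".join(ips).encode("ascii")).hexdigest()


class ApnCache:
    """Client-side cache of access IP lists, keyed by APN."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # apn -> (ips, expires_at)

    def put(self, apn, ips):
        self._entries[apn] = (list(ips), self._clock() + self._ttl)

    def stored(self, apn):
        """Return the stored list even if expired (used for the MD5)."""
        ips, _ = self._entries.get(apn, ([], 0.0))
        return list(ips)

    def get(self, apn):
        """Return the list only while it is still valid, else None."""
        ips, expires_at = self._entries.get(apn, ([], 0.0))
        if not ips or self._clock() >= expires_at:
            return None
        return list(ips)


def dnsvr_answer(client_md5, latest_ips):
    """Server side: None means the client's list is already current."""
    if client_md5 == ip_list_md5(latest_ips):
        return None
    return list(latest_ips)


def refresh(cache, apn, latest_ips):
    """Client side: send the MD5, update the cache; True if the list changed."""
    stored = cache.stored(apn)
    reply = dnsvr_answer(ip_list_md5(stored), latest_ips)
    # ASSUMPTION: an unchanged list still gets its expiry time renewed.
    cache.put(apn, stored if reply is None else reply)
    return reply is not None


cache = ApnCache(ttl_seconds=60)
print(refresh(cache, "cmnet", ["203.0.113.1", "203.0.113.2"]))  # True
print(refresh(cache, "cmnet", ["203.0.113.1", "203.0.113.2"]))  # False
```

Sending only the hash keeps the query small, which matters on a weak network; the full list travels only when it actually changed.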

The process of the client using the domain name to access the server

- If there is a valid local IP (that is, an IP list exists for the APN and has not expired), use that IP to access the server.
- Otherwise, run the "client batch resolution process for domain names" and then access the server.

Process of the server's access tconnd actively reporting its status:

- Tconnd periodically sends heartbeat messages to DNSvr, which state whether the access point is available.
- If DNSvr does not receive a heartbeat message within a certain period of time or the corresponding access point is unavailable, the corresponding IP and port will be blacklisted, and the blacklisted IP will no longer be sent to the client.
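
The heartbeat-driven blacklist can be sketched as a small registry that records the last report for each access point. The class and method names are hypothetical, and the timeout value is left to the caller, since the text only says "a certain period of time".

```python
import time


class AccessPointRegistry:
    """Tracks heartbeats and hides dead access points from clients."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last = {}  # (ip, port) -> (timestamp, available)

    def heartbeat(self, ip, port, available):
        """Record a report from the tconnd in front of a server."""
        self._last[(ip, port)] = (self._clock(), available)

    def is_blacklisted(self, ip, port):
        seen = self._last.get((ip, port))
        if seen is None:
            return True  # never reported: not handed out to clients
        timestamp, available = seen
        expired = self._clock() - timestamp > self._timeout
        return expired or not available

    def client_list(self):
        """Access points that may be sent to clients."""
        return sorted(ap for ap in self._last if not self.is_blacklisted(*ap))


registry = AccessPointRegistry(timeout_seconds=30)
registry.heartbeat("203.0.113.1", 8080, available=True)
registry.heartbeat("203.0.113.2", 8080, available=False)
print(registry.client_list())  # [('203.0.113.1', 8080)]
```

Because the blacklist is derived from the latest heartbeat rather than stored separately, an access point reappears automatically as soon as it reports itself available again.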

Note: in actual deployment, the access tconnd needs to report to the access tconnds of multiple DNSvr instances.

The process of actively pushing the access point list to the client

- When TGCP connects to the tconnd in front of the server, tconnd sends a request to DNSvr to verify the quality and freshness of the current access IP. If the IP list has changed, tconnd sends the latest IP list to the client to cache.
- The next time TGCP accesses the server, the latest IP list is used.

Process when the client fails to access DNSvr

- If access to DNSvr fails (both by IP and by domain name) and a local IP is configured, access the server directly by IP; otherwise access it by domain name.

Optimize transport layer protocol design

Based on the reliable UDP supported by tconnd, the following logic is added:

- Data compression;
- Data encryption;
- Merge multiple packets;
- Supports streaming data transmission, which is convenient for controlling the size of each UDP packet and facilitating data encryption and compression;
- Optionally supports improved congestion control algorithms;
- Even if no ACK packet is received, it is necessary to actively retry the sent data packet;
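
Two of these additions, merging multiple packets and compressing the data, can be illustrated together. The sketch below is not tconnd's wire format: the 2-byte length prefix, the use of `zlib`, and the 1400-byte datagram limit (taken from the protocol parameter list in section 4.3) are assumptions chosen to keep the example short. Encryption, streaming, and retransmission are left out.

```python
import struct
import zlib

MAX_DATAGRAM = 1400  # bytes


def _encode(batch):
    """Length-prefix each message, concatenate, then compress."""
    framed = b"".join(struct.pack("!H", len(m)) + m for m in batch)
    return zlib.compress(framed)


def pack_messages(messages, limit=MAX_DATAGRAM):
    """Greedily merge messages into as few datagrams as fit the limit."""
    datagrams, batch = [], []
    for msg in messages:
        if len(msg) > 0xFFFF:
            raise ValueError("message too long for a 2-byte length prefix")
        if len(_encode(batch + [msg])) <= limit:
            batch.append(msg)
            continue
        # The message does not fit into the current batch.
        if not batch or len(_encode([msg])) > limit:
            raise ValueError("message does not fit in one datagram")
        datagrams.append(_encode(batch))
        batch = [msg]
    if batch:
        datagrams.append(_encode(batch))
    return datagrams


def unpack_datagram(datagram):
    """Reverse of _encode: decompress and split by length prefix."""
    framed = zlib.decompress(datagram)
    messages, offset = [], 0
    while offset < len(framed):
        (length,) = struct.unpack_from("!H", framed, offset)
        offset += 2
        messages.append(framed[offset:offset + length])
        offset += length
    return messages


msgs = [b"login", b"heartbeat", b"chat:hello"]
datagrams = pack_messages(msgs)
print(len(datagrams), unpack_datagram(datagrams[0]))
```

Merging before compressing gives the compressor more data to work with and spends one packet header instead of three, which is the point of batching on a link where every packet is expensive.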

5.2 Some optimizations under Hybrid development

To improve loading speed on a weak network, we must first determine where the app loads its content, how fast each part loads, and which loading path is the longest, so that we can optimize in a targeted way.

5.2.1 WebView

For a WebView page embedded in an app, web page experience optimization is a long-established practice. We can open Chrome's developer tools, switch to the Network panel, and set the network condition to 3G before requesting the page; this shows where the page load time is mainly spent, as shown in the following figure:

Of course, there are many ways to speed up HTML.

- Use gulp/grunt for packaging and compression: JS/CSS resource compression, CSS sprite merging, etc.
- Use Font Awesome to replace images: icon fonts are widely compatible, scale without loss of quality, and cover commonly used icons