How to speed up deep learning on mobile apps? You will know after reading this article

Today, mobile applications that use deep learning usually rely on cloud servers to perform all of their DNN computation. The drawback of this approach is that data transfer between the mobile device and the cloud server is not free: it shows up as system latency and as power drain on the mobile device. At the same time, modern mobile devices have a certain amount of DNN computing power of their own; it falls short of a cloud server, but using it avoids the data-transfer overhead entirely.

The authors of the paper propose a partitioning scheme at the granularity of the model's network layers: the DNN's computational workload is split so that the hardware resources of both the cloud server and the mobile device are used, optimizing latency and power consumption. The name Neurosurgeon captures the idea vividly: the DNN model is cut apart the way a surgeon would cut.

For personal intelligent assistants that use deep learning to process image, video, voice, and text data, the common industry practice today is to run all of the application's computation on powerful GPU clusters in the cloud (hereinafter referred to as the existing method).

This is also the approach used by current personal intelligent assistants running on mobile devices (such as Siri, Google Now, and Cortana). Nevertheless, we still wonder whether we can also use the computing power of the mobile device itself (rather than relying entirely on cloud services) while ensuring that the latency of the application and the power consumption of the mobile device are within a reasonable range.

The current situation is that the computing power of intelligent applications is completely dependent on high-end cloud servers provided by Web service providers.

Neurosurgeon breaks this conventional wisdom (i.e., computing power for smart applications is entirely dependent on cloud services)! In this excellent paper, the authors show us a new way of thinking: splitting the computation required by the application and using both cloud services and the hardware resources of mobile devices for computing. Here are the results of using Neurosurgeon's approach, which will benefit all of us:

  • System latency was reduced by an average of 3.1 times (up to 40.7 times), making applications more responsive and agile.

  • The average power consumption of mobile devices was reduced by 59.5% (with a maximum reduction of 94.7%). (You may be wondering, can you really reduce power consumption while performing more computing on mobile devices? The answer is a resounding yes!)

  • The data center throughput on the cloud server increased by an average of 1.5 times (up to 6.7 times), which is a huge improvement over existing methods.

Below we will first give an example to see how interesting it is to "split the computational workload", then learn how Neurosurgeon automatically detects the "best split point" in different DNN models, and finally show the corresponding experimental results to confirm Neurosurgeon's claimed capabilities.

(It is worth mentioning that mainstream platforms already support deep learning computation on mobile devices: Apple added deep-learning development tools in iOS 10, Facebook released Caffe2Go last year to run deep learning models on mobile devices, and Google recently released TensorFlow Lite, a deep learning development tool for Android.)

▌ Data transfer is not without cost

The latest mobile SoCs are impressive. This paper uses NVIDIA's Jetson TK1 platform (a 4-core ARM processor plus a Kepler GPU; see Table 1 below), and the combined computing power of hundreds of thousands of such devices is staggering.

If you compare the configuration on the server (as shown in Table 2 below), you will find that the configuration of mobile devices is still far inferior to that of the server.

Next, we will analyze an AlexNet model (a deep CNN network model) commonly used for image classification tasks. The input of this model is a 152KB image. First, we will compare the operation of the model in two environments: performing all computing operations on a cloud server and performing all computing operations on a mobile device.

If we only look at the latency of the application (see Figure 3 below), we can see that as long as the mobile device has an available GPU, performing all computing operations on the local GPU can bring the best experience (shortest latency). When using a cloud server for all computing, statistics show that computing only accounts for 6% of the total time, while the remaining 94% is consumed in data transmission.
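
To see why transfer dominates, here is a minimal back-of-the-envelope sketch in Python. The bandwidth and compute-time figures below are illustrative assumptions chosen for the example, not measurements from the paper.

    # Rough latency model: cloud-only = time to upload the input + server compute time;
    # local = on-device compute time only. All numbers are illustrative assumptions.
    INPUT_SIZE_KB = 152  # AlexNet input image used in the article

    def cloud_only_latency_ms(uplink_mbps, server_compute_ms):
        transfer_ms = (INPUT_SIZE_KB * 8) / (uplink_mbps * 1000) * 1000
        return transfer_ms + server_compute_ms, transfer_ms

    for network, uplink_mbps in [("3G", 1.0), ("LTE", 6.0), ("Wi-Fi", 20.0)]:
        total, transfer = cloud_only_latency_ms(uplink_mbps, server_compute_ms=6.0)
        print(f"{network:5s} cloud-only: {total:7.1f} ms ({transfer / total:.0%} spent on transfer)")

    print("local GPU only:  25.0 ms (assumed, no transfer at all)")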

Compared with the existing method, performing all calculations on the mobile device's own GPU achieves lower system latency under LTE and 3G network conditions; under LTE and Wi-Fi conditions, however, the existing method has lower system latency than performing all calculations on the mobile device's CPU alone.

Figure 4 below shows the power consumption of cloud servers and mobile phone CPU/GPU under different network conditions:

If the mobile device is connected to Wi-Fi, the lowest-power option is to send the data to the cloud server and let it perform all the calculations. Over 3G or LTE, however, if the mobile device has an available GPU, performing all the calculations on the local GPU consumes less power than transmitting the data and computing everything on the cloud server.

Although cloud servers have stronger computing power than mobile devices, they need to transmit data (which results in considerable system latency and power consumption in some network environments). Therefore, from a system perspective, using cloud servers entirely for computing is not necessarily optimal.
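
The same back-of-the-envelope reasoning explains the energy numbers: sending data keeps the radio powered, while computing locally keeps the SoC powered. A minimal sketch, again with assumed power and timing figures:

    # Energy ≈ power draw × active time. The radio and SoC power figures below
    # are assumptions for illustration only.
    def transfer_energy_mj(size_kb, uplink_mbps, radio_power_mw):
        seconds = (size_kb * 8) / (uplink_mbps * 1000)
        return radio_power_mw * seconds          # mW × s = mJ

    def compute_energy_mj(compute_ms, soc_power_mw):
        return soc_power_mw * (compute_ms / 1000)

    # Cloud-only over 3G: upload 152 KB at ~1 Mbps with the radio drawing ~1200 mW.
    print("cloud-only (3G):", round(transfer_energy_mj(152, 1.0, 1200)), "mJ")
    # All-local: ~25 ms on a mobile GPU drawing ~3500 mW.
    print("local GPU      :", round(compute_energy_mj(25, 3500), 1), "mJ")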

▌ The data transmission and computing requirements of various network layers are not the same

So, is there an optimal split point between the two relatively extreme approaches of doing all the computation on the cloud server or all of it on the mobile device? In other words, there may be a worthwhile trade-off between transmitting data and computing locally. The following figure is a simple sketch to help you understand what comes next.

An intuitive segmentation method is to use the network layer in DNN as the boundary. Taking AlexNet as an example, if we can calculate the output data volume and computational complexity of each layer, we will get the following statistical chart (Figure 5 below):

As can be seen from Figure 5, for the earlier convolutional layers of the AlexNet model, the amount of data output decreases rapidly as the number of layers increases. However, the amount of computation begins to increase gradually in the middle and later parts of the model, reaching the highest level in the fully connected layers.

Next, it is worth considering what happens if we split the AlexNet model between its layers: the first n layers are processed on the mobile device to get the output of the nth layer, that output is transferred to the cloud server, which finishes the remaining calculations, and the final result is sent back to the mobile device.
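
To make a layer-granularity split concrete, here is a minimal sketch of that execution flow, under the assumption that the model is available as an ordered list of layer functions; the serialization step and the send_to_cloud helper are hypothetical placeholders, not the paper's implementation.

    import pickle

    def run_split(layers, x, split_point, send_to_cloud):
        """Run layers[0:split_point] on the device, ship the intermediate
        activation to the server, and let it run layers[split_point:]."""
        for layer in layers[:split_point]:
            x = layer(x)                     # on-device computation
        payload = pickle.dumps(x)            # intermediate output; for CNNs this is
                                             # often far smaller than the raw input
        return send_to_cloud(payload, split_point)   # server returns the final result

With split_point equal to zero this degenerates to the existing cloud-only method, and with split_point equal to the number of layers it becomes fully local execution.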

In Figure 6 below, you can see the latency and power consumption of each layer of the AlexNet model, with the best split points represented by stars.

Splitting at the granularity of network layers can greatly reduce both latency and power consumption. For the AlexNet model, when the mobile device's GPU is available and the device is on Wi-Fi, the best split point lies in the middle of the model.

The above only covers the AlexNet model. Does this kind of split also work for other DNN models used to process images, video, speech, and text? The authors ran the corresponding analysis on the various DNN models listed in Table 3:

For models with convolutional layers used in the field of Computer Vision (CV), the best split point is usually in the middle of the model. For the ASR, POS, NER, and CHK models (used mainly in Automatic Speech Recognition (ASR) and Natural Language Processing (NLP)), which usually consist only of fully connected layers and activation layers, the best split point is usually found at the beginning or the end of the model.

The best way to split a DNN model depends on its topology and the types of layers it contains. DNN models in the CV field are often best split in the middle, while DNN models in the ASR and NLP fields are often better split at the beginning or the end. Since the best split point varies from model to model, we need a system that can automatically split the model and use the cloud server and the device's GPU for the corresponding parts of the computation.

▌ How Neurosurgeon works

For a DNN model, two kinds of factors affect where the optimal split point lies: static factors, such as the structure of the model, and dynamic factors, such as the condition of the wireless network connection, the load on the cloud data center, and the remaining battery of the device.

Due to the existence of the above dynamic factors, we need an intelligent system to automatically select the best split point in the DNN to ensure that the final system latency and mobile device battery consumption are optimal. Therefore, we designed such a system for intelligently splitting DNN models, namely Neurosurgeon.

Neurosurgeon consists of two parts. The first is a one-time profiling step that creates and deploys performance-prediction models (for latency and power consumption) covering the various layer types and their parameters (convolutional layers, pooling layers, fully connected layers, activation functions, and regularization terms) on both the mobile device and the server. This step is independent of the specific DNN model being used: the prediction models estimate latency and power consumption from the number and types of layers in a DNN model, without having to execute that model.

The prediction model is stored in the mobile device and is then used to predict the latency and power consumption of each layer in the DNN model.
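
A minimal sketch of what such a per-layer-type prediction model could look like, assuming a simple linear regression from an approximate multiply-accumulate count to measured latency; the profiling data and the single-feature choice are simplifications made up for this example.

    import numpy as np

    # (approx. MACs in millions, measured latency in ms) from one-time profiling
    # of convolution layers on this device -- made-up numbers for illustration.
    conv_profile = np.array([[75.0, 1.4], [105.0, 1.9], [150.0, 2.7], [224.0, 3.8]])

    # Fit latency ≈ a * MACs + b for this layer type on this device.
    a, b = np.polyfit(conv_profile[:, 0], conv_profile[:, 1], deg=1)

    def predict_conv_latency_ms(macs_millions):
        return a * macs_millions + b

    print(round(predict_conv_latency_ms(180.0), 2), "ms")   # estimate for an unseen conv layer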

When a DNN model is about to run, Neurosurgeon dynamically finds the best split point. It first analyzes the type and parameters of each layer in the model, then uses the prediction model to estimate each layer's latency and power consumption on the mobile device and on the cloud server. Based on these predictions, together with the current network condition and the load on the data center, Neurosurgeon selects the best split point, optimizing either end-to-end latency or power consumption.
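
Putting the pieces together, the runtime decision can be viewed as a small search over candidate split points. The sketch below assumes the per-layer latency predictions and an uplink-bandwidth estimate are already available, optimizes latency only, and ignores the small download of the final result; it illustrates the idea rather than Neurosurgeon's exact algorithm.

    def best_split(mobile_ms, server_ms, output_kb, input_kb, uplink_mbps):
        """mobile_ms[i] / server_ms[i]: predicted latency of layer i on each side.
        output_kb[i]: size of layer i's output. Split k runs layers 0..k-1 on the
        device; k = 0 is cloud-only, k = len(mobile_ms) is mobile-only."""
        n = len(mobile_ms)
        best_k, best_total = 0, float("inf")
        for k in range(n + 1):
            if k == n:
                transfer_ms = 0.0                      # mobile-only: nothing is uploaded
            else:
                data_kb = input_kb if k == 0 else output_kb[k - 1]
                transfer_ms = (data_kb * 8) / (uplink_mbps * 1000) * 1000
            total = sum(mobile_ms[:k]) + transfer_ms + sum(server_ms[k:])
            if total < best_total:
                best_k, best_total = k, total
        return best_k, best_total

    # Example with three layers, a device GPU, and an assumed 6 Mbps LTE uplink.
    k, t = best_split([4.0, 3.0, 2.0], [0.5, 0.4, 0.3],
                      output_kb=[120.0, 20.0, 4.0], input_kb=152.0, uplink_mbps=6.0)
    print(f"best split after layer {k}: ~{t:.1f} ms end-to-end")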

▌ Practical applications of Neurosurgeon

Table 3 above shows the eight DNN models used to evaluate Neurosurgeon. The experiments also covered three network environments (Wi-Fi, LTE, and 3G) and two kinds of mobile hardware (CPU and GPU). Across these models and configurations, Neurosurgeon finds split points that achieve latency within 98.5% of the best achievable.

Experimental results show that Neurosurgeon can reduce application latency by an average of 3.1 times (and up to 40.7 times) compared to current methods that only use cloud servers.

In terms of power consumption, compared with existing methods, Neurosurgeon can reduce the power consumption of mobile devices by an average of 59.5%, and up to 94.7%.

Figure 14 below shows the results of Neurosurgeon's adaptive segmentation and optimization as the network environment changes (i.e., LTE bandwidth changes) (the blue solid line in the figure below). It can be seen that the delay time can be greatly reduced compared to the existing method (the red dotted line in the figure below).

Neurosurgeon also maintains periodic communication with the cloud server to obtain the load of its data center. When the server's data center is heavily loaded, it will reduce the amount of data transmitted to the server and increase the amount of local computing on the mobile device. In short, Neurosurgeon can make appropriate adjustments based on the server's load to achieve the lowest system latency.
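
One simple way to fold the reported data-center load into that search (an assumption for illustration, not necessarily the paper's exact policy) is to inflate the predicted server-side latencies before re-running best_split from the sketch above:

    def load_adjusted_server_ms(server_ms, load_factor):
        """load_factor = 1.0 when the data center is idle, > 1.0 as queueing grows."""
        return [ms * load_factor for ms in server_ms]

    # Under heavy load the server side looks slower, so best_split() naturally
    # keeps more layers on the mobile device.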

We use BigHouse (a data-center simulation system) to compare the existing method with Neurosurgeon. In the experiment, queries are distributed evenly across the eight DNN models in Table 3 above, and the query inter-arrival rate is derived from the measured mean response time of all the models combined with the query distribution of Google web search.

Figure 16 above shows that when the mobile device uses a Wi-Fi network, the data throughput brought by Neurosurgeon is 1.04 times that of the existing method. As the quality of the connection network deteriorates, Neurosurgeon will make the mobile device bear more computing workload, and the throughput of the data center on the cloud server will increase: compared with the existing method, the throughput of the data center increases to 1.43 times when connected to the LTE network, and increases to 2.36 times under 3G network conditions.

Source

Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. Kang et al., ASPLOS '17.

http://web.eecs.umich.edu/~jahausw/publications/kang2017neurosurgeon.pdf
