Inference is completed in less than 1ms on iPhone 12, Apple proposes MobileOne, an efficient mobile backbone network

Efficient neural network backbones for mobile devices are usually optimized for metrics such as FLOPs or parameter count. However, when deployed on mobile devices, these metrics may not correlate well with network latency.

Based on this, researchers from Apple deployed multiple mobile-friendly networks on real mobile devices, conducted an extensive analysis of different metrics, identified the architectural and optimization bottlenecks of existing efficient neural networks, and proposed ways to alleviate them. The study designed an efficient backbone architecture, MobileOne, one variant of which achieves an inference time of less than 1 ms on an iPhone 12 with 75.9% top-1 accuracy on ImageNet.

Paper address: https://arxiv.org/abs/2206.04040

The MobileOne architecture not only achieves SOTA performance, but is also many times faster on mobile devices. Its best variant matches MobileFormer's performance on ImageNet while being 38× faster, and its top-1 accuracy on ImageNet is 2.3% higher than EfficientNet's at similar latency.

In addition, the study demonstrated that MobileOne generalizes to multiple tasks (image classification, object detection, and semantic segmentation) with significantly improved accuracy and significantly reduced latency compared to existing efficient architectures deployed on mobile devices.

Method Overview

The researchers first analyzed how well the common cost metrics (FLOPs and parameter count) correlate with latency on mobile devices, and then analyzed how different architectural design choices affect mobile latency.

Metric correlation

The most commonly used cost metrics for comparing the size of two or more models are parameter count and FLOPs. However, these may not correlate well with latency in real mobile applications, so the study conducts an in-depth analysis to benchmark efficient neural networks.

The study used PyTorch implementations of recent models and converted them to the ONNX format. Each model was then converted to a Core ML package using Core ML Tools, and an iOS application was developed to measure model latency on an iPhone 12.
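A rough sketch of this export path, assuming a toy PyTorch model. The study went through ONNX; recent versions of coremltools instead convert directly from a TorchScript trace, so the Core ML step is shown only as a hedged comment (its exact API depends on the installed coremltools version):

```python
import torch
import torch.nn as nn

# Toy stand-in for one of the benchmarked backbones (not a real MobileOne model).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
model.eval()

example = torch.randn(1, 3, 224, 224)

# Step 1: freeze the model into a traced TorchScript graph.
traced = torch.jit.trace(model, example)

# Step 2 (sketch only, requires the coremltools package):
#   import coremltools as ct
#   mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example.shape)])
#   mlmodel.save("model.mlpackage")
# The saved package is then loaded by an iOS app to measure on-device latency.

with torch.no_grad():
    out = traced(example)
print(out.shape)  # torch.Size([1, 10])
```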

As shown in Figure 2 below, the study plotted latency against FLOPs and against parameter count. The researchers observed that many models with higher parameter counts can still have lower latency, and that at similar FLOPs and parameter counts, convolutional models such as MobileNets have lower latency than the corresponding transformer-based models.

The researchers also estimated the Spearman rank correlation, shown in Table 1 (a) below, and found that for efficient architectures on mobile devices, latency is only moderately correlated with FLOPs and weakly correlated with parameter count; the correlation is even lower on a desktop CPU.
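Spearman rank correlation is simply the Pearson correlation of the rank-transformed data; it captures whether one metric rises monotonically with the other. A minimal pure-Python sketch with illustrative values (not the paper's measurements):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks within tied groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

flops   = [0.3, 0.6, 1.0, 1.8, 4.1]   # GFLOPs (illustrative)
latency = [0.9, 1.1, 1.0, 2.4, 3.0]   # ms on device (illustrative)
print(f"Spearman rho: {spearman(flops, latency):.3f}")  # Spearman rho: 0.900
```

A rho near 1 would mean FLOPs perfectly predicts the latency ordering; the paper's point is that the observed correlations are far weaker than commonly assumed.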

Activation functions as a key bottleneck

To analyze the impact of activation functions on latency, the study built a 30-layer convolutional neural network and benchmarked it on an iPhone 12 with different activation functions commonly used in efficient CNN backbones. All models in Table 3 below share the same architecture except for the activation function, yet their latencies differ widely.
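The setup can be sketched as follows: the same stack of convolutions with only the activation swapped, timed on a fixed input. This runs on a desktop CPU rather than an iPhone, and the depth, channel count, and input size are illustrative stand-ins, not the paper's configuration:

```python
import time
import torch
import torch.nn as nn

def make_net(act_cls, depth=30, ch=16):
    """A conv stack of fixed depth with a swappable activation (sizes illustrative)."""
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, ch, 3, padding=1), act_cls()]
        in_ch = ch
    return nn.Sequential(*layers).eval()

def cpu_latency_ms(net, runs=10):
    """Average CPU forward-pass time in milliseconds after one warm-up run."""
    x = torch.randn(1, 3, 32, 32)
    with torch.no_grad():
        net(x)  # warm-up
        t0 = time.perf_counter()
        for _ in range(runs):
            net(x)
    return (time.perf_counter() - t0) / runs * 1e3

for act in (nn.ReLU, nn.GELU, nn.SiLU):
    print(f"{act.__name__:5s}: {cpu_latency_ms(make_net(act)):.2f} ms")
```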

This difference is mainly caused by recently proposed activation functions such as SE-ReLU, Dynamic Shift-Max, and DynamicReLU, so MobileOne uses only the ReLU activation function.

Architecture blocks

The two key factors affecting runtime performance are memory access cost and degree of parallelism.

In multi-branch architectures, memory access cost increases significantly because the activations from each branch must be stored in order to compute the next tensor in the graph. Such memory bottlenecks can be avoided if the network has a small number of branches. Architecture blocks that force synchronization, such as the global pooling operation used in Squeeze-Excite blocks, also add to the overall running time due to synchronization costs. To demonstrate these hidden memory access and synchronization costs, the study added skip connections and Squeeze-Excite blocks throughout a 30-layer convolutional neural network; Table 1 (b) shows their impact on latency.

Based on these findings, the study adopted an architecture with no branches at inference time to reduce memory access cost, and used Squeeze-Excite blocks only in the largest variants of MobileOne to improve accuracy. The resulting MobileOne architecture is shown in the figure below.

To improve performance, the model is scaled along several dimensions: width, depth, and resolution. The study did not scale up the input resolution, since doing so increases FLOPs and memory consumption, which is detrimental to runtime performance on mobile devices.
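Width scaling typically multiplies each stage's base channel count by a factor and rounds to a hardware-friendly multiple. A minimal sketch of that idea; the base widths, multipliers, and divisor below are illustrative, not MobileOne's actual configuration:

```python
def scale_width(base_channels, multiplier, divisor=8):
    """Scale per-stage channel counts by a width multiplier,
    rounding to the nearest multiple of `divisor` (floor at `divisor`)."""
    scaled = []
    for c in base_channels:
        v = max(divisor, int(c * multiplier + divisor / 2) // divisor * divisor)
        scaled.append(v)
    return scaled

base = [64, 128, 256, 512]  # illustrative per-stage widths
for m in (0.75, 1.0, 1.5):
    print(m, scale_width(base, m))
```

Deeper or wider variants trade latency for accuracy; the branchless inference design is what keeps that trade-off favorable.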

Since the new model has no multi-branch architecture at inference time, it avoids the associated data movement costs. Compared with multi-branch architectures (such as MobileNet-V2 and EfficientNets), this allows Apple's new model to aggressively increase model parameters without incurring significant latency costs.
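The reason branches can be removed at inference time is that convolution is linear: the outputs of parallel conv branches sum to the output of a single conv whose kernel is the sum of the branch kernels. A minimal single-channel NumPy sketch of this folding (a hand-rolled "valid" convolution on illustrative shapes, not MobileOne's actual reparameterization code):

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k1 = rng.standard_normal((3, 3))  # train-time branch 1
k2 = rng.standard_normal((3, 3))  # train-time branch 2

two_branch = conv2d(x, k1) + conv2d(x, k2)  # train time: two parallel branches
fused      = conv2d(x, k1 + k2)             # inference time: one fused kernel

print(np.allclose(two_branch, fused))  # True
```

The fused network produces the same outputs but stores one intermediate tensor instead of two, which is exactly the memory access saving discussed above.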

Increasing the number of parameters allows the model to generalize well to other computer vision tasks such as object detection and semantic segmentation. Table 4 compares the new model with recent work on train-time overparameterization and shows that the MobileOne-S1 variant outperforms RepVGG-B0, which is about 3× larger.

Experiments and Results

Getting accurate latency measurements on a mobile device can be difficult. On the iPhone 12, there is no command-line access and no way to reserve the entire compute fabric for model execution alone, nor any way to break round-trip latency down into categories such as network initialization, data movement, and network execution. To measure latency, the study developed an iOS app in Swift that runs the models using Core ML.

During the benchmark, the app runs the model multiple times (1000 by default) and accumulates statistics. To achieve the lowest latency and highest consistency, all other apps on the phone are closed.

As shown in Table 8 below, the study reports full round-trip latency. Much of this time does not come from model execution itself, but since these overheads are unavoidable in real applications, the study includes them in the reported latency. To filter out interruptions from other processes, the minimum latency over all runs is reported for each model.
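The measurement protocol described above (many runs, report the minimum to suppress interference from other processes) can be sketched with the standard library alone; the lambda workload below is just a stand-in for a model's forward pass:

```python
import time
import statistics

def benchmark(fn, runs=1000):
    """Time fn over many runs; the minimum filters out interruptions
    from other processes, as in the study's protocol."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # ms
    return {"min": min(samples),
            "median": statistics.median(samples),
            "mean": statistics.fmean(samples)}

# Illustrative workload standing in for a model's forward pass.
stats = benchmark(lambda: sum(i * i for i in range(10_000)), runs=200)
print({k: round(v, 3) for k, v in stats.items()})
```

Reporting the minimum (rather than the mean) is a common choice for latency benchmarks, since OS scheduling noise can only add time, never remove it.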

In addition, the study reports the performance of several models on object detection with the MS COCO dataset and on semantic segmentation with the Pascal VOC and ADE20K datasets. MobileOne generally outperforms the other models; the specific results are shown in Table 9 below.

Interested readers can read the original paper for more research details.
