Huawei is the first to implement local natural-language image search on mobile phones

We take web search engines for granted, yet we are often helpless when it comes to finding local files on our phones: most people now store thousands of photos on their smartphones, and finding a specific one can feel like looking for a needle in a haystack.

At this year's Huawei P60 series launch event, however, a new feature appeared: Smart Image Search. Built on a lightweight, on-device application of multimodal large-model technology, it gives the phone the ability to search photos with natural language for the first time. Natural language means you can simply describe, in plain words, the picture you are looking for.

What if you can't remember when or where the photo was taken, and only vaguely recall the people or objects in it? Just type whatever words come to mind into the search box, and Smart Search will find it for you:

Or you can wake up Xiaoyi by voice and describe the photos you want in a single sentence. For example, search for "photos of skiing in Changbai Mountain last year" and all the photos from that trip turn up directly on your phone:

Going a step further, you can also search for descriptive concepts like "cyclists" and "outdoor gatherings."

Compared with the old tag-based photo search, Smart Image Search makes the phone smarter, more responsive, and quicker to return results. Built on multimodal semantic model technology, Huawei's Smart Image Search is pre-trained in the cloud on hundreds of millions of image-text pairs, giving it a much more general semantic understanding. More importantly, the model is deployed on the device and the search computation is completed locally, further protecting privacy and security.

It almost makes you wonder whether Huawei's phones could already crack image CAPTCHAs.

Why is it so difficult to search for images using natural language on mobile phones?

In the past, many phones let you find specific photos by entering keywords such as a time, a person, or a place. This was achieved by AI algorithms that recognize image categories and text, plus built-in file metadata such as geographic information, but such systems can generally recognize only a limited set of categories.

With this "conventional" approach, you search through short tags and their combinations, such as "scenery", "cat", or "food". The number of tags a phone supports is limited and covers only a small fraction of your intentions; most of the time you still end up scrolling through the album by hand, which is very inefficient.

(Image: the tags are preset for you, but the selection is limited.)

This is not surprising: the backend is at most a simple image-recognition model, the search offers almost no freedom, and it certainly cannot understand human intent. With such a system, we often end up guessing which tags the model happens to recognize.

To achieve "smart image search", the AI model needs to understand both natural language and image data at the same time. On a phone, we also need a series of compression algorithms to make the most of limited computing power, and engineering work to speed up inference as much as possible.
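Huawei has not disclosed the details of its compression pipeline, but as one common illustration of this class of techniques, here is a minimal sketch of post-training int8 weight quantization (all names are hypothetical and not Huawei's implementation):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: store one int8 per weight plus a float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor for computation."""
    return q.astype(np.float32) * scale

# Toy check: a 256x256 weight matrix shrinks 4x with a small reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```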

Specifically, setting aside the difficulty of on-device deployment, achieving semantic-level understanding of images and text, so that the phone can "understand photos" on its own, requires a three-step process.

In AI algorithms, unstructured data from the physical world, such as images, speech, and text, is converted into structured multidimensional vectors. Vectors capture relationships, and retrieval is the process of computing distances between vectors; generally, the smaller the distance, the higher the similarity.

To build smart image search, we first need to train a multimodal semantic model. Contrastive learning pulls text and images with the same semantics close together and pushes data with different semantics far apart, so that multimodal data such as natural language and pictures are converted into vectors in a shared semantic space. Next, the multimodal model encodes the images to be retrieved. Finally, when the user enters a sentence, the retrieval system quickly locates the images that match it.
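The last two steps boil down to nearest-neighbor search in that shared space. A minimal sketch, assuming the embeddings have already been produced by some multimodal encoder (the random vectors below are only stand-ins):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def rank_images(text_vec: np.ndarray, image_vecs: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k gallery images closest to the text query vector."""
    scores = normalize(image_vecs) @ normalize(text_vec)  # one cosine score per image
    return np.argsort(-scores)[:top_k]

# Toy usage: 1000 photos and one query, each as a 512-dim embedding.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512))
query = rng.standard_normal(512)          # e.g. "skiing in Changbai Mountain", encoded
print(rank_images(query, gallery, top_k=5))
```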

Across the whole semantic image search pipeline, the primary challenge is matching images and text that share the same semantics. Because the multimodal model must encode users' personal photos, it is best deployed on the device itself, and on-device deployment means the model has to be compressed and accelerated, which demands a great deal of engineering work.

The industry's first multimodal semantic model for mobile phones

Behind "Smart Image Search" is Huawei's lightweight multimodal semantic model, the industry's first that can be deployed on-device, which lets the phone efficiently understand both natural language and the meaning of photos. Compared with the traditional tagging approach, the experience is far better: we no longer need to guess a picture's label and can simply type natural language to retrieve it. It is no exaggeration to say that this takes local image search on phones from barely usable to genuinely easy to use.

Foundation: Multimodal Models

In artificial intelligence, the Transformer is a milestone technology. It has not only powered NLP breakthroughs such as ChatGPT but also delivers excellent results in vision. By representing text and images with Transformers, and then using weakly supervised contrastive learning to pull images and text with the same semantics closer together and push those with different semantics apart, we can obtain a strong multimodal model.

The key here is contrastive learning. As shown in the figure below, conventional contrastive methods encode images and text into separate vectors and then map them into a joint multimodal semantic space. Because representations from different modalities cannot be compared directly, mapping them into the same space first is what makes subsequent training possible.

(Figure from arXiv:2102.12092.)

In multimodal contrastive learning, the objective is to give positive sample pairs (the blue diagonal in the figure: I1T1, I2T2, ...) high similarity and negative sample pairs (the white cells) low similarity. Trained this way, natural language naturally matches images, and data from different modalities become aligned.
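Huawei's exact objective is not published; the sketch below shows the widely used CLIP-style symmetric contrastive loss that the figure describes, treating the diagonal of the similarity matrix as positives and everything else as negatives:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Diagonal entries of the (B, B) similarity matrix are the positive pairs
    (I1T1, I2T2, ...); all off-diagonal entries serve as negatives.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # pairwise cosine similarities
    targets = torch.arange(len(img), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # and each text to its image
    return (loss_i2t + loss_t2i) / 2
```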

To improve alignment between modalities, Huawei strengthens the relevance of positive samples, denoises negative samples at the algorithm level, and uses larger, higher-quality data sources to sharpen the model's representations, thereby improving the precision and recall of semantic search. Compared with labeling images from a limited tag set and then searching by tag, searching on semantic representations greatly improves the flexibility of image retrieval.

Optimization: Extreme compression of the model

Multimodality is one of the hottest areas of AI research right now, yet apart from Huawei's "Smart Image Search", no one has deployed the entire inference pipeline on a phone. The engineering difficulty lies in compressing the multimodal model so that it runs on edge devices such as phones with little loss in quality.

Here, we need to consider the parameter efficiency of the whole model architecture and optimize its structure to get the best results from the least computation. EfficientNet among convolutional networks and Multi Query Attention in Transformer models are both attempts to restructure models for better parameter efficiency. The same holds for "Smart Image Search": by optimizing the multimodal model architecture, training as a whole achieves better results.
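As an illustration of the Multi Query Attention idea mentioned above (not Huawei's actual architecture, which is undisclosed), here is a minimal PyTorch sketch in which all query heads share a single key/value head, shrinking the K/V projections and cache by a factor of the head count:

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi Query Attention: many query heads, one shared key/value head."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)            # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)   # single shared K and V
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.h, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv_proj(x).split(self.d_head, dim=-1)                   # (b, t, d) each
        scores = q @ k.unsqueeze(1).transpose(-2, -1) / self.d_head ** 0.5  # (b, h, t, t)
        out = torch.softmax(scores, dim=-1) @ v.unsqueeze(1)                # (b, h, t, d)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```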

Beyond the algorithms, the bigger challenge of deploying to mobile devices is engineering. Unlike the training and inference we usually run on GPUs, mobile platforms lack convenient, efficient operator implementations, and optimization is hard. For example, phone CPUs are mostly built on the Arm reduced instruction set architecture, so the machine-learning compiler has to exploit a great deal of instruction-level parallelism to squeeze the most out of limited computing power.

To adapt to the underlying hardware, Huawei splits the model's large matrix multiplications across the phone's CPU, NPU, and other units, and builds low-level operators that run efficiently on the device through techniques such as graph-operator fusion, thereby supporting efficient inference of the whole model.

In summary, Huawei's lightweight "Smart Image Search" model deploys a multimodal model on phones for the first time through a combination of more data, better algorithms, and model-lightweighting techniques, delivering a better image search experience.

Practical: Vector search engine

We want to find the photos we have in mind quickly, using different clues such as image content and spatiotemporal information, and both photos and natural-language queries live as vectors in the multimodal semantic space. Huawei has therefore developed a lightweight vector search engine for on-device scenarios. It supports building vector indexes over massive amounts of data and one-stop fused search across spatiotemporal (time, location) and semantic dimensions, so that matching photos can be found easily and efficiently from the query's semantic features.

A simple vector search engine works like this: assume the on-device multimodal model has already encoded the photos in the album into vectors and persisted them in a vector database. Whenever the user issues a search request, the request passes through the embedding model, that is, the text-encoder half of the multimodal model, which turns the text into a vector; that vector is then used to search the vector database for the batch of closest images.
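Huawei's engine is self-developed and proprietary; purely to illustrate the flow, the sketch below uses the open-source FAISS library as a stand-in vector database, with random vectors standing in for the encoders' output:

```python
import numpy as np
import faiss  # open-source stand-in for the on-device vector database

d = 512                                              # embedding dimension
gallery = np.random.rand(10000, d).astype("float32") # pretend photo embeddings
faiss.normalize_L2(gallery)                          # inner product == cosine similarity

index = faiss.IndexFlatIP(d)                         # exact inner-product index
index.add(gallery)                                   # "persist" all photo vectors

query = np.random.rand(1, d).astype("float32")       # pretend text-encoder output
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)                # ids of the 10 closest photos
print(ids[0])
```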

In order to achieve better image search results on mobile devices, Huawei has made a series of innovations and optimizations to its self-developed lightweight vector search engine.

For index construction, the periodic full offline rebuild commonly used in the cloud would significantly increase power consumption, so on the phone Huawei writes to the index incrementally in real time, and for reliability the incrementally written data is persisted into the index.

At the same time, to make index loading and retrieval more efficient, the index format is specially customized: semantic vector search keeps information such as location and time as part of the index, so conditional filtering can be applied quickly during retrieval and the results most relevant to the query returned. Whenever the query contains common conditions such as time or location, this customized index format can be more than ten times faster than plain database retrieval.
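The index layout itself is not public; the sketch below only illustrates the general idea of fused search, filtering candidates by time and location metadata before ranking them by vector similarity (the record fields are hypothetical):

```python
import numpy as np
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PhotoRecord:
    vector: np.ndarray   # embedding from the multimodal model
    taken_at: datetime   # capture time
    place: str           # coarse location tag, e.g. "Changbai Mountain"

def fused_search(query_vec, records, place=None, after=None, top_k=10):
    """Filter by metadata first, then rank the survivors by cosine similarity."""
    candidates = [r for r in records
                  if (place is None or r.place == place)
                  and (after is None or r.taken_at >= after)]
    if not candidates:
        return []
    mat = np.stack([r.vector / np.linalg.norm(r.vector) for r in candidates])
    scores = mat @ (query_vec / np.linalg.norm(query_vec))
    order = np.argsort(-scores)[:top_k]
    return [candidates[i] for i in order]
```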

However, the customized index format also brings a difficulty: newly written data does not necessarily land at the end of the index. For example, if you take a new photo at the Forbidden City, its vector has to be inserted into the part of the index for that location, which means everything after it must be rewritten. As the library grows this gets worse: with 100,000 photos, should more than 100,000 entries really be rewritten every time a photo is taken?

Huawei's answer is another piece of engineering: index segmentation combined with compaction and merging. Segmentation sharply reduces the time of a single insertion, while regular compaction and merging reclaims the memory and disk occupied by deleted data, significantly reducing resource overhead.
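Again, Huawei's exact on-device format is undisclosed; as a toy illustration of segmentation plus periodic compaction, new vectors below go into a small active segment so a single insert never rewrites the whole index, and a merge pass later drops deleted entries:

```python
import numpy as np

class SegmentedIndex:
    """Toy segmented vector index with periodic compaction (illustrative only)."""

    def __init__(self, max_segment_size: int = 1024):
        self.segments = [[]]            # list of segments, each a list of vectors
        self.deleted = set()            # (segment_id, offset) pairs marked deleted
        self.max_segment_size = max_segment_size

    def add(self, vec: np.ndarray) -> None:
        """Append to the active segment; seal it and start a new one when full."""
        if len(self.segments[-1]) >= self.max_segment_size:
            self.segments.append([])
        self.segments[-1].append(vec)

    def compact(self) -> None:
        """Merge all segments into one, skipping vectors marked as deleted."""
        merged = [v for sid, seg in enumerate(self.segments)
                  for off, v in enumerate(seg) if (sid, off) not in self.deleted]
        self.segments, self.deleted = [merged], set()

    def search(self, query: np.ndarray, top_k: int = 10):
        """Brute-force scan of all live vectors, ranked by dot-product score."""
        scored = [(float(v @ query), sid, off)
                  for sid, seg in enumerate(self.segments)
                  for off, v in enumerate(seg) if (sid, off) not in self.deleted]
        return sorted(scored, reverse=True)[:top_k]
```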

After a series of optimizations, the ability of smart image search is no longer limited to high-configuration flagship phones. In addition to the current P60 series and Mate X3, more devices will gradually gain this capability with the upgrade of HarmonyOS 3.1 in the future.

Smart Search: Creating a system-level entry point for the HarmonyOS ecosystem

Of course, in the latest HarmonyOS 3.1, smart image search is only one of many new capabilities. In search alone, Huawei has packed in plenty of cutting-edge technology.

Beyond smart image search, Huawei Smart Search keeps crossing the boundaries between apps, devices, cloud and local storage to achieve truly global search. Combined with Huawei's long-standing practice of integrating software, hardware, chips, and cloud, the pre-installed on-device AI model delivers millisecond-level response, removes the latency of cross-device coordination, and provides a search experience in which multiple devices behave as one.

In today's app-dominated mobile internet, a large share of search has shifted from the open web to closed apps. Huawei Smart Search, however, surfaces global content from a single one-stop entry point, breaking down information silos.

Having broken through these boundaries, Huawei also uses AI to enable efficient service flow and smarter "intent search": the search engine understands what people mean and offers the most appropriate intelligent services, so the search box on the phone is no longer just a query tool.

Do you remember the first version of HarmonyOS that Huawei released at its Developer Conference four years ago? Back then, HarmonyOS was defined as a full-scenario distributed system. HarmonyOS has since developed a rich ecosystem, and the next step is to unify it: to use improved system-level capabilities as a single framework that ties the whole experience together.

Applied to search, this integration means a wide range of capabilities that can be invoked and respond seamlessly, without the user even noticing. Huawei internally calls this capability "full search." Smart Search may become the system-level entry point of the HarmonyOS ecosystem, offering functions and services that go far beyond what a search box is supposed to do.

When needs are no longer limited by devices and forms, and everything is centered on people, this is what the Internet of Everything era should look like. This also makes us full of expectations for the next HarmonyOS product.
