AI makes mobile phone tasks run themselves! New research from a Chinese university simplifies mobile device operation


AI frees the hands of us carbon-based life forms, and can even let your phone operate itself!

You heard that right: mobile task automation.

With the rapid development of AI, this has become an increasingly active research field.

Mobile task automation uses AI to accurately capture and analyze human intent, then efficiently executes a variety of tasks on mobile devices (phones, tablets, and in-car terminals), providing convenience and support for users with cognitive or physical limitations, or for users in special situations:

  • Helping visually impaired users navigate, read, or shop online
  • Helping the elderly use smartphones and bridging the digital divide
  • Helping drivers send text messages or adjust in-car settings while driving
  • Completing common repetitive daily tasks for users

Mom will no longer be annoyed by having to set up multiple calendar events.

Recently, the team of Professor Cai Zhongmin and Associate Professor Song Yunpeng at the Key Laboratory of Intelligent Networks and Network Security of the Ministry of Education (MOE KLINNS Lab) of Xi'an Jiaotong University proposed VisionTasker, a vision-based task automation solution for mobile devices built on the team's latest AI research. (The team's main research directions include intelligent human-computer interaction, hybrid-augmented intelligence, and intelligent power systems.)

This research not only provides ordinary users with a smarter mobile device experience, but also demonstrates care and empowerment for groups with special needs.

Vision-based automation of tasks on mobile devices

The team proposed VisionTasker, a two-stage framework that combines vision-based UI understanding and LLM task planning to automate mobile tasks step by step.

This approach removes the dependence of UI representation on the view hierarchy and improves adaptability to different application interfaces.

Notably, VisionTasker does not require large amounts of data to train a large model.

VisionTasker starts working as soon as the user states a task in natural language; the agent then interprets and executes the instruction.

The specific implementation is as follows:

1. User interface understanding

VisionTasker uses a vision-based method to parse and interpret the user interface.

First, the agent identifies and analyzes the elements and layout on the user interface, such as buttons, text boxes, and text labels.

Then, the recognized visual information is converted into natural language descriptions to explain the interface content.
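To make this concrete, a detected element could be represented by a small structure like the one below. This is a minimal sketch for illustration; the field names are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UIElement:
    """One detected interface element (hypothetical schema for illustration)."""
    kind: str                                 # e.g. "button", "text_box", "text_label"
    bbox: tuple[int, int, int, int]           # (left, top, right, bottom) in screen pixels
    text: Optional[str] = None                # OCR text, if the element carries any
    inferred_function: Optional[str] = None   # filled in later for icon-only buttons

# Example: a "Send" button detected near the bottom-right of the screen
send_btn = UIElement(kind="button", bbox=(980, 2200, 1060, 2280), text="Send")
```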

2. Task planning and execution

Next, the agent uses a large language model (LLM) to navigate and plan the task, based on the user's instruction and the interface description.

It breaks the user's task into actionable steps, such as taps or swipes, to complete the task automatically.
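As a rough illustration of this planning step, the sketch below builds a prompt from the task, the interface description, and the action history, then asks an LLM for the next action. The prompt wording, the JSON action format, and the `call_llm` helper are hypothetical assumptions, not taken from the paper.

```python
import json

def plan_next_action(task: str, ui_description: str, history: list[str], call_llm) -> dict:
    """Ask an LLM for the next single action (hypothetical prompt and format)."""
    prompt = (
        f"Task: {task}\n"
        f"Current interface:\n{ui_description}\n"
        f"Actions taken so far: {history or 'none'}\n"
        "Reply with one JSON object such as "
        '{"action": "tap", "target": "Send button"}, '
        '{"action": "swipe", "direction": "up"}, or {"action": "finish"}.'
    )
    reply = call_llm(prompt)      # any chat-completion style function
    return json.loads(reply)      # e.g. {"action": "tap", "target": "Send button"}
```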

3. Iterating the above process

After each step is completed, the agent will update its dialogue and task planning based on the latest interface and historical actions to ensure that the decision at each step is based on the current context.

This is an iterative process that continues until the task is judged complete or the preset limits are reached.
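Putting the two stages together, the overall loop might look like the following sketch. Here `capture_screenshot`, `describe_ui`, and `execute` are assumed placeholder helpers, `plan_next_action` is the sketch above, and the step limit is arbitrary.

```python
def run_task(task: str, call_llm, max_steps: int = 20) -> bool:
    """Iterate perceive -> plan -> act until the LLM reports completion (sketch)."""
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()             # grab the current screen
        ui_description = describe_ui(screenshot)      # stage 1: vision-based UI understanding
        action = plan_next_action(task, ui_description, history, call_llm)  # stage 2: LLM planning
        if action["action"] == "finish":
            return True                               # task judged complete
        execute(action)                               # e.g. tap/swipe via an automation backend
        history.append(str(action))                   # keep context for the next iteration
    return False                                      # preset step limit reached
```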

Users not only free their hands from the interaction; they can also monitor task progress through visible prompts and interrupt the task at any time, retaining control over the entire process.

The first step is to identify widgets and text in the interface, detecting elements such as buttons, text boxes and their positions.

For buttons without text labels, the CLIP model is used to infer their possible functions based on their visual design.
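For intuition, zero-shot matching of an icon against candidate function labels can be done with an off-the-shelf CLIP model, roughly as below. The label set, the checkpoint, and the Hugging Face pipeline are illustrative choices and may differ from what the paper actually uses.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def infer_button_function(icon: Image.Image, candidates: list[str]) -> str:
    """Return the candidate function label that best matches the icon image."""
    inputs = processor(text=candidates, images=icon, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over the label set
    return candidates[probs.argmax().item()]

# Example: classify a cropped icon-only button
labels = ["search", "settings", "send message", "go back", "share"]
# icon = Image.open("button_crop.png")
# print(infer_button_function(icon, labels))
```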

Next, the system partitions the interface into blocks with different functions based on the visual layout, and generates a natural language description for each block.

This process also includes matching text to widgets, ensuring the functionality of each element is properly understood.

Ultimately, all of this information is converted into natural language descriptions, providing clear and semantically rich interface information for the large language model, enabling it to effectively perform task planning and automated operations.
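A minimal sketch of this final conversion is shown below, rendering grouped elements (using the `UIElement` structure sketched earlier) into a text description for the LLM. The block structure and the phrasing are illustrative assumptions, not the paper's exact format.

```python
def blocks_to_text(blocks: dict[str, list[UIElement]]) -> str:
    """Render detected, grouped elements as a natural language interface description."""
    lines = []
    for block_name, elements in blocks.items():
        lines.append(f"Block '{block_name}':")
        for el in elements:
            label = el.text or el.inferred_function or "unlabeled"
            lines.append(f"  - a {el.kind} labeled '{label}' at {el.bbox}")
    return "\n".join(lines)

# Example output fed to the LLM:
# Block 'bottom bar':
#   - a button labeled 'Send' at (980, 2200, 1060, 2280)
```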

Experimental Evaluation

In the experimental evaluation, the paper provides a comparative analysis of three UI understanding methods, namely:

  • GPT-4V
  • VH (View Hierarchy)
  • VisionTasker Method

△ Comparative analysis of three UI understanding methods

The comparison shows that VisionTasker has significant advantages over other methods in multiple dimensions.

It also generalizes well to cross-language applications.

△ Common UI layouts used in Experiment 1

The results show that VisionTasker's vision-based UI understanding method has clear advantages in understanding and interpreting UIs, especially when faced with diverse and complex user interfaces.

△ Single-step prediction accuracy across four datasets

The paper also conducted a single-step prediction experiment: predicting the next action to perform based on the current task state and user interface.

The results show that VisionTasker achieves an average accuracy of 67% across all datasets, which is more than 15% higher than the baseline method.

Real-world tasks: VisionTasker vs humans

During the experiment, the researchers designed 147 real multi-step tasks to test the performance of VisionTasker, covering 42 commonly used applications in China.

The team also set up a human comparison test, in which 12 human evaluators performed these tasks manually and their results were compared with VisionTasker's.

The results show that VisionTasker can achieve completion rates comparable to humans in most tasks and outperform humans in some unfamiliar tasks.

△ Results of the real-world task automation experiment. "Ours-qwen" denotes the VisionTasker framework implemented with the open-source Qwen model; "Ours" uses Wenxin Yiyan as the LLM

The team also evaluated VisionTasker's performance under different conditions, including different large language models (LLMs) and a Programming by Demonstration (PBD) mechanism.

VisionTasker achieves human-comparable completion rates on most intuitive tasks: slightly below humans on familiar tasks, but better than humans on unfamiliar ones.

△ VisionTasker completing a task step by step

Conclusion

As a mobile task automation framework based on vision and large models, VisionTasker overcomes the current reliance of mobile task automation on the view hierarchy.

A series of comparative experiments shows that it surpasses traditional programming-by-demonstration and view-hierarchy-based methods in UI representation.

It demonstrates effective UI representation across four different datasets, indicating broad applicability; and in 147 real-world tasks on Android phones, it shows task completion ability that can surpass humans, especially on complex tasks.

In addition, integrating the Programming by Demonstration (PBD) mechanism brings a further significant improvement in task automation performance.

This work has been published as a full paper at UIST 2024 (The ACM Symposium on User Interface Software and Technology), a top human-computer interaction conference held in Pittsburgh, USA, October 13-16, 2024.

UIST is a CCF Class A conference in the field of human-computer interaction, focusing on innovation in human-computer interface software and technology.

Original link: https://dl.acm.org/doi/10.1145/3654777.3676386
Project link: https://github.com/AkimotoAyako/VisionTasker
