AI makes mobile phone tasks run themselves! New research from a Chinese university simplifies mobile device operation


AI frees the hands of us carbon-based life forms, and can even let your phone operate itself!

You heard that right: mobile task automation.

With the rapid development of AI, this has become an increasingly active research field.

Mobile task automation uses AI to accurately capture and analyze human intent, then efficiently executes a variety of tasks on mobile devices (phones, tablets, and in-car terminals), providing convenience and support for users with cognitive or physical limitations, or for users in special situations:

  • Helping visually impaired users navigate, read, or shop online
  • Helping the elderly use smartphones and bridging the digital divide
  • Helping drivers send text messages or adjust in-car settings while driving
  • Completing common repetitive daily tasks for users

Mom will no longer be annoyed by having to set up multiple calendar events.

Recently, the team of Professor Cai Zhongmin and Associate Professor Song Yunpeng at the Key Laboratory of Intelligent Networks and Network Security of the Ministry of Education (MOE KLINNS Lab) of Xi'an Jiaotong University proposed VisionTasker, a vision-based task automation solution for mobile devices built on the team's latest AI research. (The team's main research directions include intelligent human-computer interaction, hybrid-augmented intelligence, and intelligent power systems.)

This research not only provides ordinary users with a smarter mobile device experience, but also demonstrates care and empowerment for groups with special needs.

Vision-based automation of tasks on mobile devices

The team proposed VisionTasker, a two-stage framework that combines vision-based UI understanding and LLM task planning to automate mobile tasks step by step.

This approach removes the dependence of UI representation on the view hierarchy and improves adaptability to different application interfaces.

Notably, VisionTasker does not require large amounts of data to train a large model.

VisionTasker starts working as soon as the user states a task in natural language; the agent then interprets and executes the instruction.

The specific implementation is as follows:

1. User interface understanding

VisionTasker uses a vision-based method to parse and interpret the user interface.

First, the agent identifies and analyzes the elements and layout on the user interface, such as buttons, text boxes, and text labels.

Then, the recognized visual information is converted into natural language descriptions to explain the interface content.
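To make this concrete, a detected element could be represented by a small structure like the one below. This is a minimal sketch for illustration; the field names are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UIElement:
    """One detected interface element (hypothetical schema for illustration)."""
    kind: str                                 # e.g. "button", "text_box", "text_label"
    bbox: tuple[int, int, int, int]           # (left, top, right, bottom) in screen pixels
    text: Optional[str] = None                # OCR text, if the element carries any
    inferred_function: Optional[str] = None   # filled in later for icon-only buttons

# Example: a "Send" button detected near the bottom-right of the screen
send_btn = UIElement(kind="button", bbox=(980, 2200, 1060, 2280), text="Send")
```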

2. Task planning and execution

Next, the agent uses a large language model (LLM) to navigate and plan the task, based on the user's instruction and the interface description.

It breaks the user's task into actionable steps, such as taps or swipes, to complete the task automatically.
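As a rough illustration of this planning step, the sketch below builds a prompt from the task, the interface description, and the action history, then asks an LLM for the next action. The prompt wording, the JSON action format, and the `call_llm` helper are hypothetical assumptions, not taken from the paper.

```python
import json

def plan_next_action(task: str, ui_description: str, history: list[str], call_llm) -> dict:
    """Ask an LLM for the next single action (hypothetical prompt and format)."""
    prompt = (
        f"Task: {task}\n"
        f"Current interface:\n{ui_description}\n"
        f"Actions taken so far: {history or 'none'}\n"
        "Reply with one JSON object such as "
        '{"action": "tap", "target": "Send button"}, '
        '{"action": "swipe", "direction": "up"}, or {"action": "finish"}.'
    )
    reply = call_llm(prompt)      # any chat-completion style function
    return json.loads(reply)      # e.g. {"action": "tap", "target": "Send button"}
```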

3. Iterating the above process

After each step is completed, the agent will update its dialogue and task planning based on the latest interface and historical actions to ensure that the decision at each step is based on the current context.

This is an iterative process that continues until the task is judged complete or the preset limits are reached.
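Putting the two stages together, the overall loop might look like the following sketch. Here `capture_screenshot`, `describe_ui`, and `execute` are assumed placeholder helpers, `plan_next_action` is the sketch above, and the step limit is arbitrary.

```python
def run_task(task: str, call_llm, max_steps: int = 20) -> bool:
    """Iterate perceive -> plan -> act until the LLM reports completion (sketch)."""
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()             # grab the current screen
        ui_description = describe_ui(screenshot)      # stage 1: vision-based UI understanding
        action = plan_next_action(task, ui_description, history, call_llm)  # stage 2: LLM planning
        if action["action"] == "finish":
            return True                               # task judged complete
        execute(action)                               # e.g. tap/swipe via an automation backend
        history.append(str(action))                   # keep context for the next iteration
    return False                                      # preset step limit reached
```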

Users not only free their hands from the interaction; they can also monitor task progress through visible prompts and interrupt the task at any time, retaining control over the entire process.

The first step is to identify widgets and text in the interface, detecting elements such as buttons, text boxes and their positions.

For buttons without text labels, the CLIP model is used to infer their possible functions based on their visual design.
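For intuition, zero-shot matching of an icon against candidate function labels can be done with an off-the-shelf CLIP model, roughly as below. The label set, the checkpoint, and the Hugging Face pipeline are illustrative choices and may differ from what the paper actually uses.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def infer_button_function(icon: Image.Image, candidates: list[str]) -> str:
    """Return the candidate function label that best matches the icon image."""
    inputs = processor(text=candidates, images=icon, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over the label set
    return candidates[probs.argmax().item()]

# Example: classify a cropped icon-only button
labels = ["search", "settings", "send message", "go back", "share"]
# icon = Image.open("button_crop.png")
# print(infer_button_function(icon, labels))
```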

Next, the system partitions the interface into blocks with different functions based on the visual layout, and generates a natural language description for each block.

This process also includes matching text to widgets, ensuring the functionality of each element is properly understood.

Ultimately, all of this information is converted into natural language descriptions, providing clear and semantically rich interface information for the large language model, enabling it to effectively perform task planning and automated operations.
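A minimal sketch of this final conversion is shown below, rendering grouped elements (using the `UIElement` structure sketched earlier) into a text description for the LLM. The block structure and the phrasing are illustrative assumptions, not the paper's exact format.

```python
def blocks_to_text(blocks: dict[str, list[UIElement]]) -> str:
    """Render detected, grouped elements as a natural language interface description."""
    lines = []
    for block_name, elements in blocks.items():
        lines.append(f"Block '{block_name}':")
        for el in elements:
            label = el.text or el.inferred_function or "unlabeled"
            lines.append(f"  - a {el.kind} labeled '{label}' at {el.bbox}")
    return "\n".join(lines)

# Example output fed to the LLM:
# Block 'bottom bar':
#   - a button labeled 'Send' at (980, 2200, 1060, 2280)
```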

Experimental Evaluation

In the experimental evaluation, the paper provides a comparative analysis of three UI understanding methods, namely:

  • GPT-4V
  • VH (View Hierarchy)
  • VisionTasker Method

△ Comparative analysis of three UI understanding methods

The comparison shows that VisionTasker has significant advantages over other methods in multiple dimensions.

It also generalizes well to cross-language applications.

△ Common UI layouts used in Experiment 1

The results show that VisionTasker's vision-based UI understanding method has clear advantages in understanding and interpreting UIs, especially when faced with diverse and complex user interfaces.

△ Single-step prediction accuracy across four datasets

The paper also conducted a single-step prediction experiment: predicting the next action to perform based on the current task state and user interface.

The results show that VisionTasker achieves an average accuracy of 67% across all datasets, which is more than 15% higher than the baseline method.

Real-world tasks: VisionTasker vs humans

During the experiment, the researchers designed 147 real multi-step tasks to test the performance of VisionTasker, covering 42 commonly used applications in China.

The team also set up a human comparison test, in which 12 human evaluators performed these tasks manually and their results were compared with VisionTasker's.

The results show that VisionTasker can achieve completion rates comparable to humans in most tasks and outperform humans in some unfamiliar tasks.

△ Results of the real-world task automation experiment. "Ours-qwen" denotes the VisionTasker framework implemented with the open-source Qwen model; "Ours" uses Wenxin Yiyan as the LLM

The team also evaluated VisionTasker's performance under different conditions, including different large language models (LLMs) and a Programming by Demonstration (PBD) mechanism.

VisionTasker achieves human-comparable completion rates on most intuitive tasks: slightly below humans on familiar tasks, but better than humans on unfamiliar ones.

△ VisionTasker completing a task step by step

Conclusion

As a mobile task automation framework based on vision and large models, VisionTasker overcomes the current reliance of mobile task automation on the view hierarchy.

A series of comparative experiments shows that it surpasses traditional programming-by-demonstration and view-hierarchy-based methods in UI representation.

It demonstrates effective UI representation across four different datasets, indicating broad applicability; and in 147 real-world tasks on Android phones, it shows task completion ability that can surpass humans, especially on complex tasks.

In addition, integrating the Programming by Demonstration (PBD) mechanism brings a further significant improvement in task automation performance.

This work has been published as a full paper at UIST 2024 (The ACM Symposium on User Interface Software and Technology), a top human-computer interaction conference held in Pittsburgh, USA, October 13-16, 2024.

UIST is a CCF Class A conference in the field of human-computer interaction, focusing on innovation in human-computer interface software and technology.

Original link: https://dl.acm.org/doi/10.1145/3654777.3676386
Project link: https://github.com/AkimotoAyako/VisionTasker
