How to improve the interactive experience of artificial intelligence? Let’s first understand this ternary theory

Abstract: There is no doubt that AI products will steadily permeate people's work, life, and entertainment, bringing revolutionary change to every industry. In the future, the boundaries between products, between products and the environment, and between products and users will blur. People will move seamlessly and stay closely connected across multiple devices, forming a deeply intertwined whole. In the era of artificial intelligence, "native hardware", the "AI engine", and the "smart app" are the three elements that together form a closed loop of intelligent experience and service.

Figure 1: The ternary theory of artificial intelligence

Introduction

It has been more than 60 years since artificial intelligence was formally proposed at Dartmouth in 1956. Yet it was not until AlphaGo defeated Lee Sedol, and Ke Jie lost three games to AlphaGo, that "artificial intelligence" became a buzzword and entered the public eye. In fact, over the past year or two the major technology giants have already made deep investments in artificial intelligence. From virtual assistants such as Siri and Microsoft XiaoIce to the smart speakers and smart-driving efforts of the various giants, AI products are gradually becoming part of our lives. In this era in which artificial intelligence is seen as upending everything, what are the pain points of these products? How will interaction change? What kind of interaction design gives users the best possible experience with AI products?

Hands-on experience with AI products on the market, together with an analysis of the implementation of the "AI Tour Guide" project (a smart tour guide that NetDragon customized for the first Digital China Summit, providing guests with indoor wayfinding, conference information queries, encyclopedia Q&A, photo taking, and other smart services), reveals several pain points:

Pain points of current AI product experience

1. Heavy reliance on native hardware

Intelligent interaction can be understood as a loop of perception -> computation -> execution and feedback. Unlike input in a graphical user interface (GUI), delivered via mouse or touch, perception is the defining feature of AI interaction. Limited by permissions, processes, and device capabilities, neither apps nor AI engines can seamlessly access the underlying sensors and computing units at any time. Without hardware-level sensors perceiving people and the surrounding environment as input, the experience cannot be optimized.
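
As an illustration, this loop can be sketched in a few lines; `read_sensors`, `infer_intent`, and `respond` are hypothetical stand-ins for whatever sensing hardware, model, and output channel a given device exposes:

```python
import time

def perception_computation_feedback_loop(read_sensors, infer_intent, respond):
    """A minimal sketch of the perceive -> compute -> act loop described above.

    All three callables are hypothetical: read_sensors() returns raw signals
    (camera frame, audio, distance, ...), infer_intent() runs the model, and
    respond() drives the screen, speaker, or actuator.
    """
    while True:
        signals = read_sensors()          # perception: hardware-level input
        intent = infer_intent(signals)    # computation: the AI engine
        if intent is not None:
            respond(intent)               # execution and feedback to the user
        time.sleep(0.05)                  # poll at roughly 20 Hz
```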

2. Lack of initiative and spontaneity

At present, smart home hardware is the most widespread application of artificial intelligence, the smart speakers launched by major manufacturers being typical examples. To start a conversation with such a device, the user has to press a button on it, and every command requires waking the device anew before carrying out a one-to-one, single-threaded dialogue. It is not hard to see that this is "unnatural voice interaction"; in essence it merely swaps one manual control method for another. For several existing smart speakers (Xiaomi's Xiao AI, Tmall Genie, Himalaya's Xiaoya, Baidu's Xiaodu, and the DingDong 2nd generation), we collected consumer feedback on voice interaction from Tmall and JD. It is clear that users are dissatisfied with having to wake the device so frequently:

Figure 2: Pain points of smart speakers

In the early stages of the "AI Tour Guide" project, two puzzles arose, one technical and one experiential:

  • Technology: because the venue is noisy, the success rate of waking the guide by voice to interact drops sharply;
  • Experience: why wait until users ask for help before responding? As the venue's service provider, can we proactively discover and attend to every user who needs help?

After re-examining the scenario, the guide machine dropped the voice-wake solution. Instead, it captures images of people, uses depth distance to determine whether a user has entered the near-field interaction trigger zone, and uses face recognition (combined with dwell time, to filter out people merely passing by) to determine whether the user intends to interact. Only then does it proactively ask: "Dear guest, how can I help you?"
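
A minimal sketch of this wake-free trigger, assuming hypothetical helpers `get_depth_frame()` and `detect_faces()` for the depth camera and face detector, and illustrative thresholds (a 1.5 m trigger zone and 1 s of dwell time):

```python
import time

TRIGGER_DISTANCE_M = 1.5   # assumed near-field interaction zone
MIN_DWELL_SECONDS = 1.0    # assumed dwell time separating intent from passers-by

def wait_for_guest(get_depth_frame, detect_faces, greet):
    """Wait for a user to show interaction intent, then greet proactively.

    get_depth_frame() -> (frame, distance_m) and detect_faces(frame) -> list
    are hypothetical stand-ins for the guide machine's depth camera and
    face-recognition module.
    """
    entered_at = None
    while True:
        frame, distance_m = get_depth_frame()
        in_zone = distance_m <= TRIGGER_DISTANCE_M and len(detect_faces(frame)) > 0
        if in_zone:
            entered_at = entered_at or time.time()   # start the dwell timer
            if time.time() - entered_at >= MIN_DWELL_SECONDS:
                greet("Dear guest, how can I help you?")
                return                               # hand off to the dialogue flow
        else:
            entered_at = None                        # person left or was just passing by
        time.sleep(0.1)
```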

Understanding users and serving them proactively are the strengths of AI products, and also a gap that designers need to close. Upgrading from a passive, command-accepting product to a proactive, service-oriented intelligent one, from user-led to proactively serving, is more in line with the "natural interaction" of future AI.

3. Accuracy and efficiency of information acquisition

The voice user interface (VUI) lets people interact with computers through natural language and is the mainstream interaction method of current AI products.

From the perspective of the human senses, vision takes in far more information than hearing. In terms of content form, the graphical user interface (GUI) is mainly pictures and text and relies on vision, while the voice user interface (VUI) is mainly speech and text and relies on hearing.

The brain can only receive up to 100 Mbps of information through the eyes and 1 Mbps through the cochlea.[1]

If images are used as the information carrier, visual reading can convey five times as much information as listening. Another special property of the eyes is that they can scan three different places in one second. [2]

On the other hand, artificial intelligence still lacks context awareness, that is, human-like cognition: it cannot yet understand context well enough to make accurate predictions about the next step based on who the user is, the user's emotions, the current environment, and prior memories.

Pure voice interaction is flawed in terms of user experience, and the efficiency and accuracy of information acquisition need to be further improved.

The core of AI product interaction

From the PC Internet era to the mobile Internet era, product interaction has remained centered on the graphical user interface (GUI). In the era of artificial intelligence, however, the relationship between people and products (smart apps, wearables, smart hardware) becomes closer and deeper. Human-computer interaction will expand from the simple single-threaded mode between a person and a screen to multi-threaded modes such as voice, gesture, and augmented-reality interaction, entering an era of "natural interaction". The natural user interface (NUI) is an emerging paradigm in human-computer interfaces: by studying real-world environments and situations, and applying emerging technical capabilities and perception solutions to achieve more accurate, better-optimized interaction between physical and digital objects, the interface itself, or the process of learning to use it, can become invisible. Its core focus is on traditional human abilities (such as touch, vision, speech, handwriting, and motion) and on more important, higher-level processes (such as cognition, creativity, and exploration) [3].

Based on the pain points of current AI experiences and the core of future human-computer interaction, we propose the three elements of AI interaction: "native hardware", the "AI engine", and the "smart app". Integrated and linked, the three make the experience more natural.

The Ternary Theory of AI Interaction

1. Native Hardware

There are two requirements for image capture in the "AI Tour Guide" project's PRD document:

  • Recognize faces for photos with virtual characters, determine the user's gender, and apply some additional decorative processing;
  • Capture user actions for interaction with the virtual tour guide.

Measured against these two requirements, the conventional front camera of the guide machine falls short:

  • The usable imaging range is limited;
  • It cannot obtain depth values, as a depth camera can;
  • It cannot capture user actions.

Therefore, the developers fitted the guide machine with an RGB camera and a depth/IR camera of the same specification as the Kinect v2, forming a field of view (FOV) that meets the requirements across a large space:

Figure 3: Camera FOV perspective view
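
As a rough sanity check on whether such a FOV covers the trigger zone, the visible width at a given distance follows from simple trigonometry, width = 2·d·tan(FOV/2). The 70.6° horizontal FOV below is the published Kinect v2 depth-camera figure, taken here as an assumption:

```python
import math

def coverage_width(distance_m: float, fov_deg: float) -> float:
    """Horizontal width (meters) visible at a given distance for a camera FOV."""
    return 2 * distance_m * math.tan(math.radians(fov_deg) / 2)

# Kinect v2 depth camera: ~70.6 degrees horizontal FOV (assumed spec)
for d in (1.0, 1.5, 3.0):
    print(f"at {d} m the depth camera sees a strip ~{coverage_width(d, 70.6):.2f} m wide")
```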

Chips, sensors, computing units, and execution units handle the perception, processing, and feedback of intelligent interaction very well. Today's sensing devices can accurately detect distance, light, volume, faces, movement, temperature, humidity, and other environmental information. The information collected by the sensors forms an information space, a virtual space connecting people and physical space. The "New Generation Artificial Intelligence Development Plan" issued by the State Council [4] likewise emphasizes building and using this space.

Automatically recording usage data, automatically analyzing usage habits, and automatically giving users the best recommendations all rely on native hardware. This is precisely why high-stickiness hardware close to everyday life, such as phones, watches, in-car devices, speakers, headphones, TVs, and refrigerators, has become the giants' preferred entry point for deploying smart products.

Of course, future hardware also needs an upgrade. Relying solely on a graphical interface or on voice for input and output reduces the accuracy and efficiency of information acquisition. Hardware needs to support multi-dimensional input and output spanning hearing, vision, touch, and imagery. A graphical interface combined with voice, or even with mixed reality and holographic projection, can make AI interaction more three-dimensional and instinctive, and none of this is possible without the native hardware's more efficient on-device chips for execution and processing and its sensors of more dimensions.

2. AI Engine

Here, "AI engine" refers specifically to the application of core AI algorithms (deep learning, memory-prediction models, etc.) in various fields: speech recognition, image recognition, natural language processing, and user profiling.

Speech recognition: converting natural human speech into text or commands that can be acted on, and, in the other direction, converting text into speech and reading it aloud as required.

Image recognition: what we usually call computer vision, commonly applied to printed-text recognition, face recognition, facial-landmark positioning, face comparison and verification, face retrieval, ID-card optical character recognition (OCR), business-card OCR, and so on.
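
As a small illustration of the face-detection building block, the sketch below uses OpenCV's stock Haar-cascade detector, a classical method chosen here for brevity rather than what any particular product actually ships; the image path is a placeholder:

```python
import cv2  # pip install opencv-python

# Load OpenCV's bundled frontal-face Haar cascade (classical face detection).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("guest.jpg")                    # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # cascades operate on grayscale
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                       # one rectangle per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
print(f"detected {len(faces)} face(s)")
```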

Natural language processing: Since understanding natural language requires extensive knowledge about the outside world and the ability to manipulate this knowledge, natural language cognition is also considered an AI-complete problem. Natural language processing (NLP) is one of the most difficult problems in artificial intelligence.

User portrait: a labeled user model abstracted from information and data such as the user's social attributes, living habits, and consumption behavior. It is the crystallization of content and big data.
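
A toy sketch of such a labeled user model; the tags and thresholds are invented purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class UserPortrait:
    """A labeled user model: tags abstracted from attributes and behavior."""
    user_id: str
    tags: set = field(default_factory=set)

def build_portrait(user_id, visits_per_week, avg_order_value, night_sessions):
    """Derive tags from behavioral data; all thresholds here are illustrative."""
    p = UserPortrait(user_id)
    if visits_per_week >= 5:
        p.tags.add("high-frequency user")
    if avg_order_value >= 200:
        p.tags.add("high spending power")
    if night_sessions > visits_per_week / 2:
        p.tags.add("night owl")
    return p

print(build_portrait("u42", visits_per_week=7, avg_order_value=260, night_sessions=5).tags)
```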

The AI engine provides the core computing technology for AI products and is an indispensable "unit". Speech recognition and natural language processing power the intelligent dialogue of the "AI Tour Guide":

Figure 4: Voice dialogue framework
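
A minimal sketch of one turn through such a voice dialogue framework; all four stages are hypothetical placeholders for a real ASR SDK, NLU model, dialogue manager, and TTS engine:

```python
def voice_dialogue_turn(audio, asr, nlu, dialogue_manager, tts):
    """One turn of a voice dialogue: ASR -> NLU -> dialogue management -> TTS.

    asr, nlu, dialogue_manager, and tts are hypothetical callables standing in
    for the respective modules of the framework in Figure 4.
    """
    text = asr(audio)                             # speech recognition: audio -> text
    intent, slots = nlu(text)                     # natural language understanding
    reply_text = dialogue_manager(intent, slots)  # choose a response or action
    return tts(reply_text)                        # text-to-speech: reply -> audio
```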

Speech recognition technology has matured, and many third-party platforms provide SDKs. Natural language understanding, by contrast, is an AI-hard problem [5] and the core difficulty of today's conversational interaction. Machines face five challenges in understanding natural language (a toy illustration of the last one follows the list):

  • The diversity of language
  • The polysemy of language
  • Errors in language expression
  • The knowledge dependence of language
  • The context dependence of language
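
To make the context challenge concrete, here is a toy sketch in which the same utterance resolves to different intents depending on dialogue state; the intent names and state keys are invented for illustration:

```python
def resolve_intent(utterance, dialogue_state):
    """Toy context-dependent intent resolution: "play it again" is meaningless
    without knowing what "it" refers to from earlier turns."""
    if utterance == "play it again":
        last = dialogue_state.get("last_played")   # context from earlier turns
        if last is None:
            return ("clarify", "What would you like me to play?")
        return ("play", last)
    return ("fallback", utterance)

print(resolve_intent("play it again", {}))                               # no context
print(resolve_intent("play it again", {"last_played": "Jingle Bells"}))  # resolved
```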

Thanks to deep learning algorithms, technology in all of the above problem areas has advanced rapidly. I believe that once cognitive computing (communication, decision-making, and discovery) achieves greater breakthroughs, AI engines will help humans in even more areas.

3. Smart App

The smart app represents the human-machine interface. People are the ultimate perceivers of interaction, so the medium through which users obtain smart experiences and services is crucial. The traditional app interface is confined to the mobile device's screen, while the emerging smart speakers remove the graphical interface altogether; both have limitations.

During the implementation of the "AI Tour Guide", multiple application services (smart apps) were installed on the guide machine so that users could experience the characteristics of the Silk Road and feel the charm of the summit through sight, hearing, and touch.

Figure 5: Tour guide AI virtual photo

Apps in the intelligent era must accept input in multiple dimensions, recognizing voice, gestures, images, the physical environment, and so on. They must also present information in multiple dimensions, including hearing, vision, touch, and holographic imagery, making the form of interaction more emotional and "human-like".

In the future, artificial intelligence will surely bring breakthroughs to human-computer interaction. Traditional interaction technologies (mouse, keyboard, touch screen, etc.) make it difficult for people to interact with computers as efficiently and naturally as they do with each other. As native hardware capabilities improve and AI technologies such as speech recognition, image analysis, gesture recognition, semantic understanding, and big data analysis mature, AI products will better perceive human intent and drive human-computer interaction forward. The combined use of the three elements of artificial intelligence, "native hardware", the "AI engine", and the "smart app", can also guide the development of interaction in future AI products.

Figure 6: The ternary theoretical framework of artificial intelligence

Maybe in the future there will be a scene like this:

  • On Christmas Eve, you drive home. As you reach the basement, the in-car device asks: it's a bit cold, would you like a cup of coffee when you get home? You tell it the flavor you want, park the car, and go upstairs. The moment you open the door, the smart speaker starts playing "Jingle Bells" and tells you the coffee will be ready in two minutes.

References

[1] Answer on Zhihu (neuroscience and brain science topic) to the question "Which receives information faster, the ears or the eyes?"

[2] From the book The Future of Artificial Intelligence.

[3] Glonek G, Pietruszka M. Natural user interfaces (NUI): review. Journal of Applied Computer Science, 2012, 20: 27–45.

[4] Notice of the State Council on Issuing the New Generation Artificial Intelligence Development Plan. http://www.gov.cn/zhengce/content/2017-07/20/content_5211996.htm

[5] Natural language understanding. Wikipedia. https://en.wikipedia.org/wiki/Natural_language_understanding

[6] Baidu Artificial Intelligence Interaction Design Institute. http://aiid.baidu.com/
