Abstract: There is no doubt that AI products will gradually penetrate people's work, life, and entertainment, bringing revolutionary changes to every industry. In the future, the boundaries between products, between products and the environment, and between products and users will blur. People will move seamlessly across multiple devices, which will connect into a whole in which each contains the other. In the era of artificial intelligence, "native hardware", the "AI engine", and the "smart app" are the three elements that together form a complete closed loop of intelligent experience and service.

Introduction

It has been more than 60 years since artificial intelligence was formally proposed at Dartmouth in 1956. Yet it was not until AlphaGo defeated Lee Sedol, and then beat Ke Jie three games in a row, that "artificial intelligence" became a buzzword and entered the public eye. In fact, over the past year or two the major technology giants have already been making deep investments in AI. From virtual assistants such as Siri and Microsoft Xiaobing to the giants' smart speakers and smart driving systems, AI products are gradually integrating into our lives. In this era in which artificial intelligence is seen as disrupting everything, what are the pain points of today's products? How will interaction change? What kind of interaction design lets users get the best possible experience from AI products? By trying out AI products on the market and analyzing the implementation of the "AI Tour Guide" project (a smart tour guide NetDragon customized for the first Digital China Summit, offering guests indoor wayfinding, conference information queries, encyclopedia Q&A, photo taking, and other smart services), we identified several pain points.

Pain points of current AI product experience

1. Heavy reliance on native hardware

Intelligent interaction can be understood as a perception -> computation -> execution and feedback loop. Unlike input in a graphical user interface (GUI), via mouse or touch, perception is the defining feature of AI interaction. Limited by permissions, processes, and device capabilities, neither apps nor AI engines can seamlessly access the underlying sensors and computing units at any time. Without hardware-level sensors perceiving people and the surrounding environment as input, the experience cannot be optimized.

2. Lack of initiative and spontaneity

Smart home hardware is currently the most widely adopted application of AI, for example the smart speakers launched by major manufacturers. To start a conversation with such a robot, the user presses a button on the device, and every command requires waking it again before a one-to-one, single-threaded conversation can proceed. It is not hard to see that this is "unnatural voice interaction"; in essence it merely swaps one manual control method for another. For several existing smart speakers (Xiaomi Xiaoai, Tmall Genie, Himalaya Xiaoya, Baidu Xiaodu, Dingdong 2nd Generation), we collected consumer feedback on voice interaction from Tmall and JD. It is clear that users are dissatisfied with having to wake the device so frequently. In the early stages of the AI Tour Guide project, we faced the same technical and experience dilemmas.
After re-examining the scenario, the guide machine dropped the voice wake-word approach. Instead, it captures images of people, uses depth distance to determine whether a user has entered the near-field interaction trigger zone, uses face recognition (together with dwell time, to filter out people passing by) to determine whether the user intends to interact, and then proactively asks: "Dear guest, how can I help you?" Understanding users and serving them proactively are the strengths of AI products, and a gap designers need to close. Upgrading from a passive, command-accepting model to a proactive, service-oriented one, from user-led to service-led, is more in line with the "natural interaction" of future AI.

3. Accuracy and efficiency of information acquisition

A voice user interface (VUI) lets people interact with computers through natural language, and it is the mainstream interaction method of current AI products. From the perspective of human senses, vision takes in far more information than hearing. In content form, the GUI is mainly pictures and text and relies on vision, while the VUI is mainly speech and text and relies on hearing. The brain can receive at most about 100 Mbps of information through the eyes but only about 1 Mbps through the cochlea [1]. With images as the carrier, visual reading can convey five times more information than listening; the eyes can also scan three different places within a single second [2]. On the other hand, lacking context awareness, that is, human-like cognition, artificial intelligence still cannot understand context well enough to accurately predict the next step from who the user is, the user's emotions, the current environment, and prior memories. Pure voice interaction is therefore flawed in terms of user experience, and the efficiency and accuracy of information acquisition still need to improve.

The core of AI product interaction

From the PC Internet era to the mobile Internet era, product interaction has remained centered on the GUI. In the AI era, however, the relationship between people and products (smart apps, wearables, smart hardware) becomes closer and deeper. Human-computer interaction will expand from the single-threaded mode between a person and a screen to multi-threaded modes such as voice, gesture, and augmented reality, entering an era of "natural interaction". The natural user interface (NUI) is an emerging paradigm in human-computer interaction. By studying real-world environments and situations, and using emerging technical capabilities and perception solutions to make interaction between physical and digital objects more accurate and better optimized, a NUI can make the interface, or at least the process of learning it, effectively invisible. Its core focus is on innate human abilities (such as touch, vision, speech, handwriting, and motion) and on higher-level processes (such as cognition, creativity, and exploration) [3]. Based on the pain points of today's AI experience and the core of future human-computer interaction, we propose three elements of AI interaction: "native hardware", the "AI engine", and the "smart app". Integrated and linked together, the three make the experience more natural.
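The wake-free, near-field trigger the AI Tour Guide used can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the distance threshold, dwell time, and all names are assumptions.

```python
# Sketch of a wake-free proactive trigger: greet the user only when they
# stand inside the near-field zone, facing the screen, for long enough to
# rule out passers-by. Thresholds are illustrative assumptions.

from dataclasses import dataclass

NEAR_FIELD_M = 1.5    # assumed near-field trigger distance (metres)
DWELL_SECONDS = 1.0   # assumed dwell time to filter people passing by

@dataclass
class Observation:
    distance_m: float   # depth-camera distance to the person
    face_frontal: bool  # face detector reports the face turned toward the screen
    timestamp: float    # seconds

class ProactiveTrigger:
    def __init__(self):
        self._entered_at = None  # when the person entered the trigger zone

    def update(self, obs: Observation) -> bool:
        """Return True when the device should proactively greet the user."""
        in_zone = obs.distance_m <= NEAR_FIELD_M and obs.face_frontal
        if not in_zone:
            self._entered_at = None  # person left, or is just walking past
            return False
        if self._entered_at is None:
            self._entered_at = obs.timestamp
            return False
        return obs.timestamp - self._entered_at >= DWELL_SECONDS
```

Feeding the trigger one observation per camera frame means a person who only crosses the zone briefly never accumulates enough dwell time, which is exactly the passer-by filtering the text describes.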
The triadic theory of AI interaction

1. Native hardware

The "AI Tour Guide" project's PRD set two requirements for image capture: recognize faces and take photos with virtual characters; and determine the user's gender and apply additional decorative processing.
Therefore, the developers fitted the guide machine with an RGB camera and a depth/IR camera in the same configuration as Kinect2, forming an RGB field of view (FOV) that meets the requirement in a large space. Chips, sensors, computing units, and execution units handle the perception, processing, and feedback of intelligent interaction very well. Today's sensing devices can accurately detect distance, light, volume, faces, motion, temperature, humidity, and other environmental information. The information the sensors collect forms an information space, a virtual space connecting people and the physical world; the "New Generation Artificial Intelligence Development Plan" issued by the State Council [4] likewise emphasizes building and using this space. Automatically recording usage data, automatically analyzing usage habits, and automatically giving users the best recommendations all depend on native hardware. This is why high-engagement hardware close to everyday life, such as phones, watches, in-car devices, speakers, headphones, TVs, and refrigerators, has become the giants' preferred entry point for smart products. Of course, future hardware also needs an upgrade. Relying solely on a graphical interface, or on voice, as input and output reduces the accuracy and efficiency of information acquisition. Hardware needs to support multi-dimensional input and display across hearing, vision, touch, and imagery. A graphical interface combined with voice, or even mixed reality and holographic projection, can make AI interaction more three-dimensional and more instinctive, and none of this is possible without native hardware's more efficient terminal chips for execution and processing and its richer sensors.

2. AI engine

Here, "AI engine" refers specifically to the application of core AI algorithms (deep learning, memory-prediction models, etc.) in various fields: speech recognition, image recognition, natural language processing, and user profiling.

Speech recognition: converting natural human speech into responsive text or commands, and converting text into speech read aloud as required.

Image recognition: what we usually call computer vision, commonly applied to printed-text recognition, face recognition, facial-landmark localization, face comparison and verification, face retrieval, ID-card optical character recognition (OCR), business-card OCR, and so on.

Natural language processing: because understanding natural language requires extensive knowledge of the outside world and the ability to manipulate that knowledge, natural language cognition is considered an AI-complete problem; NLP is one of the most difficult problems in artificial intelligence.

User profiling: a labeled user model abstracted from data such as a user's social attributes, living habits, and consumption behavior; it is also the crystallization of content and big data.

The AI engine provides the core computing technology for AI products and is an indispensable unit. The "AI Tour Guide" uses speech recognition and natural language processing in its intelligent dialogue. Speech recognition technology is mature, and many third-party platforms provide SDKs. Natural language understanding, by contrast, is an AI-hard problem [5] and the core difficulty of today's conversational interaction; machines face five major challenges in understanding natural language.
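The intelligent dialogue described above, speech recognition feeding natural language understanding, can be sketched as a toy pipeline. This is an illustrative sketch only: the fake transcription step, the keyword-based intent table, and the reply wording are all assumptions standing in for real ASR SDKs and NLU models.

```python
# Toy ASR -> NLU -> response pipeline. Real systems would call an ASR SDK
# and an NLU model; here both are replaced by trivial stand-ins.

def recognize_speech(audio: bytes) -> str:
    """Stand-in for a third-party ASR SDK: audio in, text out."""
    return audio.decode("utf-8")  # pretend the audio is already transcribed

# Hypothetical intent table: keyword matching as a stand-in for real NLU.
INTENTS = {
    "wayfinding": ["where", "how do i get", "find"],
    "schedule": ["when", "agenda", "session"],
}

def classify_intent(text: str) -> str:
    lowered = text.lower()
    for intent, keywords in INTENTS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "fallback"

def respond(audio: bytes) -> str:
    """Full pipeline: transcribe, classify, pick a canned reply."""
    intent = classify_intent(recognize_speech(audio))
    replies = {
        "wayfinding": "Let me show you the route on the map.",
        "schedule": "Here is today's conference agenda.",
        "fallback": "Sorry, could you rephrase that?",
    }
    return replies[intent]
```

The gap between this sketch and a production system is precisely the NLU step: keyword matching breaks down as soon as context, ellipsis, or ambiguity enters the utterance, which is why the text calls natural language understanding the core difficulty.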
Thanks to deep learning algorithms, technology in these problem areas has advanced rapidly. Once cognitive computing (communication, decision-making, and discovery) achieves greater breakthroughs, AI engines will help humans in still more areas.

3. Smart app

The smart app represents the human-machine interface. People are the ultimate perceivers of interaction, so the medium through which users obtain intelligent experiences and services is crucial. The traditional app interface is confined to the mobile device's screen, while emerging smart speakers remove the graphical interface altogether; both have limitations. During the implementation of the "AI Tour Guide", multiple application services (smart apps) were installed on the guide machine so that users could experience the character of the Silk Road and feel the charm of the summit through sight, hearing, and touch.
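A smart app that accepts several input channels needs some way to route each one to the right handler. The sketch below shows one minimal way to structure that; the modality names and handlers are illustrative, not the guide machine's design.

```python
# Minimal dispatcher routing multi-modal inputs (voice, gesture, image, ...)
# to registered handlers. Channel names and handlers are hypothetical.

from typing import Callable, Dict

class MultiModalRouter:
    def __init__(self):
        self._handlers: Dict[str, Callable[[object], str]] = {}

    def register(self, modality: str, handler: Callable[[object], str]) -> None:
        """Attach a handler for one input channel."""
        self._handlers[modality] = handler

    def dispatch(self, modality: str, payload: object) -> str:
        """Route an incoming event; degrade gracefully on unknown channels."""
        handler = self._handlers.get(modality)
        if handler is None:
            return "unsupported modality"
        return handler(payload)

router = MultiModalRouter()
router.register("voice", lambda text: f"heard: {text}")
router.register("gesture", lambda g: f"gesture: {g}")
```

Keeping channels behind a uniform interface like this is what lets a single app grow from screen-plus-touch to the multi-dimensional input the text argues for, without rewriting its core logic per sensor.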
Apps in the intelligent era must accept multi-dimensional input, recognizing voice, gestures, images, and the physical environment, and must also present information in multiple dimensions, including hearing, vision, touch, and holographic imagery, making interaction more emotional and "human-like". Artificial intelligence will certainly bring breakthroughs to human-computer interaction. Traditional technologies (mouse, keyboard, touch screen, etc.) make it hard for people to interact with computers as efficiently and naturally as they do with each other. As native hardware capability improves and AI technologies such as speech recognition, image analysis, gesture recognition, semantic understanding, and big data analysis develop, AI products will perceive human intentions better and drive human-computer interaction forward. The combined use of the three elements, "native hardware", the "AI engine", and the "smart app", should also offer some guidance for the interaction design of future AI products. Perhaps the future will hold a scene like this:
References

[1] Answer on Zhihu (neuroscience and brain science topic) to the question "Which receives information faster, the ears or the eyes?"
[2] The Future of Artificial Intelligence (book).
[3] Glonek G, Pietruszka M. Natural user interfaces (NUI): review. J Appl Comput Sci, 2012, 20: 27–45.
[4] Notice of the State Council on Issuing the New Generation Artificial Intelligence Development Plan. http://www.gov.cn/zhengce/content/2017-07/20/content_5211996.htm
[5] https://en.wikipedia.org/wiki/Natural_language_understanding
[6] Baidu Artificial Intelligence Interaction Design Institute. http://aiid.baidu.com/