A look at the development history of speech recognition technology

Author: Yang Jun, Unit: China Mobile Xiong'an Industrial Research Institute

Labs Guide

I believe everyone is familiar with speech recognition. In recent years, applications of speech recognition technology have appeared one after another and grown ever more intelligent. From the early days, when all we could ask was "Who are you?", to today, when a system can hold a multi-turn conversation with us, understand our meaning, and even sense our mood, speech recognition has come a long way. Many people assume it is a technology of only the past few years, but that is not the case. Let's take a look at the history of speech technology.

Part 01 The 70-year history of speech recognition

In 1952, Bell Labs built "Audrey", an automatic digit recognition machine, and scientists began to form a first, still vague, concept of intelligent speech. Perhaps even then they were imagining everything we have achieved today.

In 1964, IBM demonstrated a digital speech recognition system at the World's Fair. Speech technology thereby stepped out of the laboratory and became known to a wider public, and the dream of Bell Labs became the dream of many more people.

In 1990, Dragon Systems launched DragonDictate, the first speech recognition product aimed at consumers. Although the dream had come true for the first time, the product's $9,000 price tag made it far harder for intelligent voice technology to reach a mass audience.

In 1997, IBM released its speech recognition product ViaVoice. For the Chinese market, IBM adapted it to local dialects such as Sichuanese, Shanghainese, and Cantonese, and ViaVoice genuinely reached, and was used by, a much wider group of consumers.

In 2011, Apple first added the intelligent voice assistant Siri to the iPhone 4S. Since then, intelligent voice has been deeply integrated into mobile phones and has entered consumers' daily lives. Major domestic handset makers soon followed suit, offering phone users a variety of voice recognition features.

Since then, voice recognition has not stayed confined to phones but has spread to all kinds of scenarios: smart-home devices such as robots, TVs, and humidifiers, and now smart cars, where traditional automakers and new carmakers alike are actively building intelligent cockpits. Intelligent voice technology has clearly worked its way into every aspect of how we eat, dress, live, and travel.

Part 02 Introduction to Speech Recognition Technology

Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input. It is an important branch of artificial intelligence, drawing on disciplines such as signal processing, computer science, linguistics, acoustics, physiology, and psychology, and it is a key link in natural human-computer interaction.

Part 03 Basic Process of Speech Recognition

ASR: Automatic Speech Recognition, a technology that converts human speech into text.

NLU: Natural Language Understanding, a general term for the methods, models, and tasks that enable machines to understand the content of text.

NLG: Natural Language Generation, the automated process by which a computer produces natural-language text for a given interaction goal. Its main purpose is to automatically construct high-quality text that humans can understand.

The basic process of speech recognition runs as follows. After the user issues a command, the microphone captures the audio and converts the sound into a waveform. By comparing this waveform against reference waveforms of human pronunciation, the system identifies the specific syllables spoken; syllables are combined into words and sentences, and the best-matching words are selected with the help of statistics over large amounts of data. The NLU module then goes to work, analyzing the sentence's intent, domain, and other information. Once the intent is known, the dialog manager (DM) queries backend data to determine what feedback to give the user. The NLG module then generates natural language from the retrieved information, and finally the TTS module converts that text back into a waveform and plays the sound.
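
To make the flow concrete, here is a minimal sketch in Python. Every function name here is a hypothetical placeholder standing in for a whole subsystem, not a real API:

```python
# Minimal sketch of the voice-interaction pipeline described above.
# Every function is a hypothetical placeholder for an entire subsystem.

def asr(audio: bytes) -> str:
    """Speech-to-text: turn the recorded waveform into a sentence."""
    raise NotImplementedError

def nlu(text: str) -> dict:
    """Extract intent, domain, and other slots from the recognized text."""
    raise NotImplementedError

def dialog_manager(intent: dict) -> dict:
    """Query backend data and decide what to answer."""
    raise NotImplementedError

def nlg(answer: dict) -> str:
    """Turn the structured answer into a natural-language reply."""
    raise NotImplementedError

def tts(text: str) -> bytes:
    """Synthesize the reply text back into an audio waveform."""
    raise NotImplementedError

def handle_command(audio: bytes) -> bytes:
    text = asr(audio)                # e.g. "What's the weather tomorrow?"
    intent = nlu(text)               # e.g. {"domain": "weather", "date": "tomorrow"}
    answer = dialog_manager(intent)  # e.g. {"condition": "sunny", "high": 25}
    reply = nlg(answer)              # e.g. "Tomorrow will be sunny, high of 25."
    return tts(reply)                # waveform played back to the user
```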

The process above touches many disciplines and bodies of knowledge. Due to space constraints, I will not describe them one by one; instead, I will single out ASR for a relatively detailed look.

Part 04 A brief analysis of the implementation principle of ASR

Let's first look at the sound source for ASR. When a user issues a command, such as "I love you", the microphone records the audio to storage. Opening the file in audio-processing software (such as Audacity) shows that the audio is a waveform.

However, this waveform carries no directly meaningful information: its height represents only the loudness of the sound, and the horizontal axis is only time. Speech recognition is at heart an analysis technique built on large amounts of data, and the analysis is only as good as the features behind it. Raw loudness and the duration of a pronunciation have little statistical significance on their own, so the audio must be processed further. (The waveform in question is the sentence "I love you" spoken four times.)
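
To see this concretely, here is a small sketch using only Python's standard library (the file name is hypothetical). It shows that the raw audio is nothing more than a sequence of amplitude samples over time:

```python
import wave

with wave.open("i_love_you.wav", "rb") as f:    # hypothetical file name
    sample_rate = f.getframerate()              # e.g. 16000 samples per second
    n_samples = f.getnframes()
    raw = f.readframes(n_samples)               # amplitude values, packed as bytes

print(f"{n_samples} samples at {sample_rate} Hz "
      f"= {n_samples / sample_rate:.2f} s of audio")
# Each sample only says "how loud at this instant"; by itself it says
# nothing about which syllable was spoken.
```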

A commonly used processing method is the Fourier transform, which converts the waveform from the time dimension into a representation in the frequency dimension.
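
As a minimal sketch of that idea, the snippet below builds a short synthetic signal (two pure tones standing in for the frequency content of real speech) and applies NumPy's real FFT:

```python
import numpy as np

sample_rate = 16000                          # assumed sampling rate (Hz)
t = np.arange(0, 0.025, 1 / sample_rate)     # one 25 ms slice: 400 samples
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

magnitude = np.abs(np.fft.rfft(signal))                  # energy per frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)  # bin -> frequency in Hz

# In the frequency domain the dominant components stand out plainly,
# which is far more useful for recognition than raw loudness over time.
top = sorted(zip(freqs, magnitude), key=lambda p: -p[1])[:2]
for f, m in top:
    print(f"{f:.0f} Hz (magnitude {m:.1f})")   # the 200 Hz and 440 Hz tones
```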

Why do we need to move to the frequency dimension?

Because the sounds humans produce, and can hear, fall roughly within a fixed frequency band; this is a matter of biology and acoustics. Human anatomy is broadly the same, so let us take it as given here that, despite individual and gender differences, the frequencies of the sounds we produce do not vary greatly. In this way, a waveform with little statistical meaning is turned into a frequency representation that has some.

However, we cannot throw away the time dimension. After segmenting the sound into short frames (this involves preprocessing steps such as framing, which we will not expand on yet; see the sketch below), we can compare each frame against a local acoustic model to determine which phonemes it contains. In Chinese, a phoneme corresponds roughly to a single pinyin sound rather than a written character: for example, "我" (wǒ, "I") is composed of two phonemes, w and o.
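
Here is a sketch of that framing step, under the common (assumed) choice of 25 ms frames with a 10 ms hop and a Hamming window:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split audio into short overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)                   # taper the frame edges
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

signal = np.random.randn(16000)          # 1 s of dummy audio at 16 kHz
frames = frame_signal(signal, 16000)
print(frames.shape)                      # (98, 400): one row per ~25 ms slice
```

Each row can then be transformed to the frequency domain and matched against the acoustic model frame by frame, preserving the order in which sounds occurred.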

We now know how to process the audio file into phonemes. Then, using linguistics, statistics, and related techniques, and taking the specific context into account, phonemes are combined into words and words into sentences, recognizing the sentence the user actually spoke. With that, the general ASR process is complete.
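
As a toy sketch of the language-model side of this step, suppose the acoustic model offers two near-homophone candidate readings and a simple bigram model picks between them. The probabilities below are invented purely for illustration:

```python
# Invented bigram probabilities for two candidate readings of the same sounds.
BIGRAM_PROB = {
    ("我", "爱"): 0.20, ("爱", "你"): 0.30,     # "I love you"
    ("我", "挨"): 0.001, ("挨", "你"): 0.001,   # acoustically similar homophone
}

def score(sentence, floor=1e-6):
    """Product of bigram probabilities; unseen pairs get a tiny floor value."""
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= BIGRAM_PROB.get((a, b), floor)
    return p

candidates = [["我", "爱", "你"], ["我", "挨", "你"]]
best = max(candidates, key=score)
print("".join(best))   # 我爱你 -- the contextually likely reading wins
```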

The method above is in fact a relatively simple slice of the full range of speech recognition techniques. Practical applications involve many more, such as MFCC for acoustic feature extraction, and noise reduction, framing, windowing, and endpoint detection for the audio preprocessing mentioned earlier.
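
For the feature-extraction step, one common open-source route is the librosa library. A minimal sketch (the file name is hypothetical, and 13 coefficients is just a conventional choice):

```python
import librosa

y, sr = librosa.load("i_love_you.wav", sr=16000)    # load and resample to 16 kHz
y_trimmed, _ = librosa.effects.trim(y, top_db=20)   # crude endpoint detection:
                                                    # drop leading/trailing silence
mfcc = librosa.feature.mfcc(y=y_trimmed, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, n_frames): 13 coefficients per frame
```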

Part 05 The outlook for speech recognition and related technologies, and what we can do

With improvements in hardware and the spread of 5G, massive amounts of data can be processed in the backend, and the stability and low latency of 5G can deliver more reliable and smoother services to users. It is foreseeable that in the near future, speech recognition and related technologies will become smarter and more stable. As the domestic telecom operator with by far the largest subscriber base, China Mobile can draw on its 5G and scale advantages to serve users better, provide strong support for smart cities, and contribute more to national development.
