How does the game audio and video SDK solve the problem of echo cancellation?

Social networks have come a long way: first text and emoticons, then pictures and short videos, and now audio and video social networking, which has risen strongly and become the trend. After all, audio and video are the most natural way for humans to socialize. In such a natural social setting, echo greatly degrades the communication experience, and echo cancellation has always been a difficult point in audio and video technology.

Game voice is a typical application of audio and video social networking in the gaming field. If you don’t want users who hate wearing headphones to give up your game, it is necessary to implement echo cancellation in real-time game voice.

Whether it is a competitive or casual game, real-time game voice has become a standard feature. Competitive games, including MMORPG, MOBA, and FPS, require fast-paced and high-frequency teamwork, and game voice is as indispensable as wartime communication equipment. Casual games, including chess and cards and Werewolf, require slow-paced but zero-distance online social interaction. Players can chat with multiple people in real time in the social ties established through the game. Real-time game voice allows players to play cards as if they were sitting in the same room.

In the industry, echo cancellation technology is recognized as a tough nut to crack. It is essentially the engineering of a complex mathematical problem. Audio engineers often come from mathematics or physics backgrounds rather than computer science, and engineers without relevant experience have no idea where to start. Products that do a good job in echo cancellation include Tencent QQ and Microsoft Skype, and open source projects include WebRTC and Speex. Before these open source projects appeared, echo cancellation was a unique skill of large companies, and other teams could only rely on their own exploration and accumulation. Since then, the open source AEC modules of WebRTC and Speex have become good teaching materials in the industry.

How AEC Works

The principle of echo cancellation has been introduced in many articles, so here I will only describe it briefly, based on my own practice at work. In simple terms, the sound signal from the far end is first played out through the speaker, then passes through multiple propagation and reflection paths in the room, and is finally collected by the microphone together with the sound at the near end. If echo cancellation is not performed, the re-collected far-end sound is sent back and played at the far end, delayed relative to the original far-end sound. This is how echo is generated.

It is actually very difficult to eliminate echo. It is a bit like pouring red ink into blue ink, mixing them together, and then asking to separate the red ink from the blue ink. For the acquisition end, whether it is the sound at the near end or the sound played by the speaker, they are all sounds collected from the air without distinction. For the machine, there is no difference between the sound played by the far-end signal and the near-end sound, just like there is no difference between red ink and blue ink for water. The work of echo cancellation is to separate the far-end echo and the near-end sound, which are indistinguishable. This work is actually much more difficult than you think.

Fortunately, we are not without any way to find the boundary between the far-end echo and the near-end sound.

The far-end sound signal and the echo are correlated. Some readers may conclude that we can simply subtract the far-end sound from the collected sound. However, it is not that simple.

The far-end sound signal is not equivalent to the echo. On its way from the speaker to the microphone, the far-end sound passes through the loudspeaker-room-microphone (LRM) echo feed path. While propagating along this path, the far-end sound is reflected many times and superimposed on itself many times, so what finally reaches the microphone differs from the original far-end sound signal. We use a function to represent this difference:

fe=f(fs)

where

fs=far-end signal;

fe=far-end echo;

If this function can be solved, then a model can be built based on the correlation between the far-end sound signal and the far-end echo. This model simulates the LRM echo feed path and approximates it closely.

When this model is stable, feeding in the far-end sound signal fs yields an output that is very close to the far-end echo fe. The filter generates the inverted version of this estimate, which is superimposed on the collected sound signal to cancel the echo. This is the basic principle of acoustic echo cancellation (AEC).

The solution obtained by this function is unlikely to be exactly the same as the far-end echo, but can only be highly approximated. The closer the solution obtained by this function is to the far-end echo, the better the echo cancellation effect.
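The idea above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the echo feed path behaves like a short FIR filter; the `echo_path` coefficients and the sample values are hypothetical, not measurements. The sketch shows why subtracting the raw far-end signal fails, while subtracting the modelled echo recovers the near-end sound.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution, truncated to len(signal) samples."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

# Hypothetical echo path: a delayed direct sound plus two weaker reflections.
echo_path = [0.0, 0.6, 0.0, 0.3, -0.1]

far_end = [1.0, -0.5, 0.25, 0.8, -0.3, 0.0, 0.4, -0.7]    # fs
near_end = [0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.05]    # local talker

echo = convolve(far_end, echo_path)                        # fe = f(fs)
mic = [e + s for e, s in zip(echo, near_end)]              # what the microphone collects

# Naive idea: subtract the raw far-end signal. The echo is not equal to fs, so this fails.
naive = [m - x for m, x in zip(mic, far_end)]

# AEC idea: subtract the modelled echo. The model is exact here, so the near end is recovered.
estimated_echo = convolve(far_end, echo_path)
cleaned = [m - e for m, e in zip(mic, estimated_echo)]

naive_error = max(abs(a - b) for a, b in zip(naive, near_end))
cleaned_error = max(abs(a - b) for a, b in zip(cleaned, near_end))
```

In a real system the echo path is unknown and changes over time, so the model has to be estimated continuously rather than written down as it is here.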

Mute, single talk and double talk

Although real-time voice calls are in duplex mode, they can be divided into different situations: mute, single talk and double talk. Different echo cancellation strategies should be adopted for different situations.

1) Mute

That is, a situation where no one speaks.

Echo cancellation is only required in the voice segment. There will be no echo in the non-voice segment, so echo cancellation is not required, and even voice information does not need to be sent, which can reduce the bit rate and save bandwidth costs.

Therefore, it is very important to accurately detect voice activity. The voice detection algorithm is called VAD (Voice Activity Detection). Different manufacturers have different VAD implementation methods. We use the pitch period to implement VAD, which effectively improves the accuracy of VAD judgment and avoids misjudging non-speech segments as speech segments.
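As a rough illustration of pitch-based VAD, the sketch below looks for a strong peak in the normalized autocorrelation of a frame within the lag range of typical speech pitch. It is not the implementation described above: the function name, the 60 Hz to 400 Hz search range, the 0.5 threshold and the 8 kHz sample rate are all assumptions chosen for the example, and a pure tone stands in for a voiced segment.

```python
import math
import random

def is_voiced(frame, sample_rate=8000, min_hz=60, max_hz=400, threshold=0.5):
    """Return True if the frame shows a clear pitch period."""
    energy = sum(x * x for x in frame)
    if energy < 1e-9:
        return False                      # silence: nothing to analyse
    min_lag = sample_rate // max_hz       # shortest pitch period, in samples
    max_lag = min(sample_rate // min_hz, len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a = frame[:-lag]
        b = frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0.0:
            best = max(best, num / den)   # normalized autocorrelation at this lag
    return best >= threshold

# 60 ms test frames at 8 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(480)]  # periodic, like voiced speech
rng = random.Random(1)
noise = [rng.uniform(-1.0, 1.0) for _ in range(480)]                 # no pitch structure
silence = [0.0] * 480

results = (is_voiced(tone), is_voiced(noise), is_voiced(silence))
```

The point of using the pitch period is visible in the noise frame: it has plenty of energy, so an energy-only detector would flag it as speech, but it has no periodic structure.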

2) Single talk

That is, only the far end is speaking.

Since only the far end is speaking, the voice signal collected from the microphone contains only the far end echo, but not the near end voice. Echo cancellation in the single-talk situation is relatively easy to handle, and a more aggressive processing strategy can be adopted.

If it is determined that single talk is a high probability event, all speech signals can be directly eliminated, and then comfort noise can be appropriately filled in. Generally speaking, in the case of single talk, using a linear adaptive filter to track the echo feed path can effectively eliminate the echo, and can suppress about 18dB of echo.

3) Double talk

That is, multiple parties are speaking at the same time.

Since multiple parties are talking at the same time, the voice signal collected from the microphone contains the far-end echo and the near-end voice, which are mixed together. Echo cancellation in a double-talking situation is very difficult: on the one hand, the near-end voice signal must be protected from being damaged, and on the other hand, the echo must be eliminated as cleanly as possible.

There is not only the problem of "separating red ink from blue ink", but also the dilemma of wanting to hit the rat while fearing to break the vase beside it. Generally speaking, when the far-end echo is about 6dB~8dB higher than the near-end voice, eliminating the far-end echo will inevitably damage the near-end voice to some extent.

In addition, if the far-end echo is more than 18dB higher than the near-end voice, for example, the speaker is too close to the microphone, and the far-end echo completely covers the near-end voice, then the echo cancellation effect will definitely be poor. In this case, a more radical strategy can be adopted to eliminate the far-end echo and the near-end voice together, and then fill in the comfort noise appropriately.

Therefore, the echo cancellation module must be able to distinguish these three situations so that different algorithms can be used for each. VAD can distinguish between non-speech segments and speech segments. How to distinguish between single talk and double talk will be discussed below.

AEC Implementation

Echo cancellation mainly includes two steps: linear adaptive filtering and nonlinear processing.

Linear adaptive filtering is to solve fe=f(fs), establish a speech model of the far-end echo, and perform the first round of echo cancellation.

Nonlinear processing is divided into two steps: residual echo processing and nonlinear clipping. Residual echo processing performs a second round of echo cancellation to deal with the residual echo; nonlinear clipping is a relatively aggressive step that cuts speech signals whose attenuation reaches a threshold.

Linear adaptive filtering and nonlinear clipping can be learned from academic papers and open source projects. Residual echo processing is difficult, and generally requires the team to explore, accumulate and innovate on their own. It is precisely because of this that the threshold for voice technology is so high.

(Figure: The Principle and Implementation of Echo Cancellation)

Linear Adaptive Filtering

Based on the correlation between the far-end sound signal and the far-end echo, a speech model of the far-end echo is established and used to estimate the far-end echo, with the goal of making the estimate as close as possible to the real echo. We can regard the echo feed path LRM as an "environmental filter".

After its processing, the far-end sound signal is transformed into a far-end echo. Echo cancellation is to build an "algorithm filter" and continuously adjust the coefficients of the filter based on the voice model of the far-end echo to make the estimated value closer to the real echo. The closer the estimated value is to the real echo, the better the echo cancellation effect.

After the adaptive filter converges, the echo feed function fe=f(fs) that needs to be solved is obtained. When the filter converges and stabilizes, the far-end sound signal fs is input, and a relatively accurate estimate of the far-end echo signal fe can be output. Subtract the estimated value fe of the far-end echo signal from the collected signal to obtain the actual voice signal to be sent.
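One widely used linear adaptive filter is NLMS (normalized least mean squares). The sketch below is a minimal NLMS canceller under simplifying assumptions: far-end single talk only, no noise, and a hypothetical three-tap `true_path` standing in for the echo feed path. It is an illustration of the general technique, not the specific filter used in any particular SDK; the tap count and step size `mu` are example values.

```python
import random

def nlms_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Run an NLMS adaptive filter; return (residual signal, final coefficients)."""
    w = [0.0] * taps             # adaptive estimate of the echo path
    x_buf = [0.0] * taps         # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x_buf))        # estimated echo
        e = d - y                                           # signal after cancellation
        norm = sum(xi * xi for xi in x_buf) + eps           # far-end power in the buffer
        step = mu * e / norm
        w = [wi + step * xi for wi, xi in zip(w, x_buf)]    # move toward the true path
        residual.append(e)
    return residual, w

rng = random.Random(0)
true_path = [0.5, 0.3, -0.2]                                # hypothetical echo feed path
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [
    sum(h * far_end[n - k] for k, h in enumerate(true_path) if n - k >= 0)
    for n in range(len(far_end))
]                                                           # echo only, no near-end voice

residual, w = nlms_cancel(far_end, mic)
tail_residual = sum(e * e for e in residual[-200:])
tail_mic = sum(d * d for d in mic[-200:])
```

After convergence the first three coefficients of `w` approximate `true_path` and the rest stay near zero, which is the "algorithm filter" matching the "environmental filter". With near-end speech or noise mixed into `mic`, convergence would be slower and less exact than in this idealized run.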

There are two difficulties in implementing a linear adaptive filter:

1) Fast convergence

In the convergence phase, the collected sound signal should contain only the far-end echo signal and must not be mixed with the near-end voice signal. The near-end voice signal is uncorrelated with the far-end reference signal, so it disturbs the convergence process of the adaptive filter.

Therefore, our strategy is to make the convergence time of the adaptive filter as short as possible, so short that the only signal collected during convergence is the far-end echo signal. The filter then converges very well. After convergence, the filter is stable and can be used to filter out the far-end echo signal.

2) Dynamic adaptation

After converging and stabilizing, the adaptive filter must automatically adapt to changes in the echo feed path at any time. It must be able to determine whether the echo feed path has changed, relearn and remodel it, continuously adjust its coefficients, enter a new convergence process, and quickly approach the new echo feed path.

This situation is very common in mobile gaming scenarios. Users play games on their phones while walking around, so the echo feed path seen by the game voice is constantly changing, and the adaptive filter must automatically reconverge to adapt to the new echo feed path.

These two difficulties are a pair of contradictory characteristics, requiring the adaptive filter to be able to maintain a high degree of coefficient stability after rapid convergence on the one hand, and to be able to keep the status updated at any time to track the changes in the echo feed path on the other hand.

Nonlinear processing

Residual echo processing

An adaptive filter alone cannot eliminate the echo completely, so the residual echo needs to be eliminated further.

Generally speaking, the strategy of residual echo elimination is to use the correlation between the residual echo processed by the adaptive filter and the far-end reference speech signal to further eliminate the residual echo. The greater the correlation, the more residual echo there is, and the greater the degree of further elimination of the residual echo is needed; conversely, the smaller the correlation, the less residual echo there is, and the less the degree of further elimination of the residual echo is needed.

Therefore, firstly, a correlation matrix between the residual echo and the reference signal is calculated to obtain an attenuation factor reflecting the degree of elimination; then the residual echo is multiplied by the attenuation factor to further eliminate the residual echo.
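The two steps above can be sketched in a very simplified form. Instead of a full correlation matrix, the sketch below uses a single normalized correlation value per frame, which is an assumption made to keep the example short. The function names, the mapping `1 - correlation`, and the `floor` value are likewise hypothetical choices, not a description of any vendor's private algorithm.

```python
import math

def suppression_gain(residual, far_ref, floor=0.05):
    """Map the residual/reference correlation to an attenuation factor in [floor, 1]."""
    num = sum(r * x for r, x in zip(residual, far_ref))
    den = math.sqrt(sum(r * r for r in residual) * sum(x * x for x in far_ref))
    if den == 0.0:
        return 1.0                       # nothing to compare: leave the frame alone
    corr = min(abs(num) / den, 1.0)
    return max(1.0 - corr, floor)        # strong correlation -> strong attenuation

def suppress(residual, far_ref, floor=0.05):
    """Multiply the residual frame by its attenuation factor."""
    gain = suppression_gain(residual, far_ref, floor)
    return [gain * r for r in residual]

far_ref = [1.0, -1.0, 1.0, -1.0]
leaked_echo = [0.1 * x for x in far_ref]     # residual that is still pure echo
unrelated = [1.0, 1.0, 1.0, 1.0]             # residual with no correlation to the reference

gain_echo = suppression_gain(leaked_echo, far_ref)
gain_unrelated = suppression_gain(unrelated, far_ref)
```

A small attenuation factor means heavy suppression, matching the convention used in the text: the frame that is still pure echo is scaled down to the floor, while the uncorrelated frame passes unchanged.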

After the linear adaptive filtering is completed, the correlation between the residual echo and the far-end echo signal collected by the microphone can be used to detect whether it is in a single-talk or double-talk state. According to the single-talk or double-talk state, the attenuation factor can be further adjusted.

In the far-end single-talk state, there is no sound signal at the near end (no one is talking), so the echo can be suppressed as much as possible and the attenuation factor can be made as small as possible. In the double-talk state, the linear adaptive filter must eliminate the echo while damaging the near-end voice quality as little as possible, so the amount of echo suppression should not be too large and the attenuation factor is relatively large.
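A minimal sketch of this kind of detection follows. It assumes the linear filter has already converged, and it uses the ratio between the residual/microphone correlation and the microphone energy as the decision statistic. That statistic, the 0.3 threshold, and the two attenuation values are illustrative assumptions rather than the method of any specific product.

```python
def talk_state(mic, residual, threshold=0.3):
    """Classify a speech frame as 'single' (far end only) or 'double' talk."""
    mic_energy = sum(d * d for d in mic)
    if mic_energy == 0.0:
        return "single"                  # no energy, so no near-end voice to protect
    # If the filter removed nearly everything, the residual shares little with the mic signal.
    ratio = sum(e * d for e, d in zip(residual, mic)) / mic_energy
    return "double" if ratio > threshold else "single"

def attenuation_for(state, single_gain=0.05, double_gain=0.7):
    """Small factor (heavy suppression) for single talk, larger factor for double talk."""
    return single_gain if state == "single" else double_gain

echo = [0.5, -0.4, 0.3, -0.2]
near = [0.4, 0.5, 0.4, 0.5]

# Far-end single talk: the mic holds only echo and the filter leaves a tiny residual.
single_state = talk_state(echo, [0.01 * x for x in echo])

# Double talk: the mic holds echo plus near-end voice, and the residual is the near-end voice.
double_mic = [e + s for e, s in zip(echo, near)]
double_state = talk_state(double_mic, near)
```

The attenuation factor returned by `attenuation_for` would then scale the residual, so the near-end voice is preserved during double talk.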

The algorithm for eliminating residual echo is very difficult. There is little reference in papers or open source projects. Each manufacturer implements it through a private algorithm, and many manufacturers even choose not to implement it.

Nonlinear clipping

After completing the above processing, the remaining echoes are generally smaller, but it is possible that there are still some residual perceptible small echoes. In order to further eliminate these small echoes, further suppression processing should be performed based on the attenuation obtained in the previous processing.

Here you need to set a threshold for the attenuation. Generally speaking, this attenuation threshold should be set conservatively (higher).

If the attenuation reaches or exceeds the set threshold, it means that a large amount of echo has been cancelled, and the collected voice signal is likely to be all echo. In this case, the whole voice signal is eliminated directly and replaced with comfort noise to prevent the sound from fluctuating. Such a large attenuation is generally reached in the far-end single-talk state, or in a double-talk state where the far-end echo signal is much larger than the near-end voice signal.

In normal double talk, in order to protect the sound quality of near-end speech, the adaptive filter will not perform heavy echo cancellation, so the attenuation stays below the threshold. Therefore, whenever the attenuation does reach or exceed the set threshold, eliminating all of the collected speech signal will not affect the normal listening experience.

If the attenuation does not reach the set threshold, then no further echo cancellation is performed. This situation may be a double-talk state, and the sound quality of the local voice must be protected to prevent the local voice from being mistakenly removed as echo.
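The threshold logic can be sketched as follows. The 20 dB threshold, the comfort noise level, and the function names are hypothetical values chosen for the example. Uniform white noise is used only as a stand-in for comfort noise, which in practice is usually shaped to resemble the real background noise.

```python
import math
import random

def attenuation_db(mic, processed):
    """How many dB the earlier stages removed from the microphone frame."""
    mic_energy = sum(d * d for d in mic)
    out_energy = sum(e * e for e in processed)
    if mic_energy == 0.0:
        return 0.0
    if out_energy == 0.0:
        return float("inf")
    return 10.0 * math.log10(mic_energy / out_energy)

def clip_frame(mic, processed, threshold_db=20.0, noise_level=1e-3, rng=None):
    """Replace the frame with comfort noise when the attenuation reaches the threshold."""
    rng = rng or random.Random()
    if attenuation_db(mic, processed) >= threshold_db:
        # Almost everything was echo: drop the frame and fill in low-level noise.
        return [rng.uniform(-noise_level, noise_level) for _ in processed]
    return list(processed)               # possibly double talk: keep the near-end voice

mic = [0.5, -0.4, 0.3, -0.2]
heavily_cancelled = [0.01 * d for d in mic]   # about 40 dB removed
lightly_cancelled = [0.5 * d for d in mic]    # about 6 dB removed

clipped = clip_frame(mic, heavily_cancelled, rng=random.Random(0))
kept = clip_frame(mic, lightly_cancelled, rng=random.Random(0))
```

Setting `threshold_db` higher makes the clipping more conservative: fewer frames are replaced with comfort noise, which protects the near-end voice at the cost of letting some small echo through.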

There are generally two approaches in the industry: one is to allow some damage to the near-end sound and to completely eliminate the far-end echo, and the other is to allow some far-end echo to remain without damaging the near-end sound. If the echo is eliminated too much, it will cause a discontinuous listening experience. Echo cancellation is to find a balance between these two approaches.

The author's work experience shows that in the audio and video social industry, echo cancellation is a technical feature that customers pay close attention to. At the same time, echo cancellation is also the most difficult technology in audio and video social networking, without a doubt. Even top games like Honor of Kings attach great importance to the effect of echo cancellation. In the gaming industry, where user experience is the lifeline, especially in today's increasingly important mobile games, the quality of echo cancellation technology often determines whether users continue to play your game.

About the Author

Xian Niu, a technical expert at Zego Technology, holds a master's degree in computer science from Beijing University of Posts and Telecommunications and a master's degree in business administration from the University of Hong Kong. He has been engaged in the research of voice and video cloud service technology for many years, focusing on interactive live broadcast technology and real-time game voice.

