Social networks have come a long way: first text and emoticons, then pictures and short videos, and now audio and video, which have risen strongly and look set to define where social networking goes next. After all, voice and video are the most natural ways for humans to socialize. In a natural social setting, echo badly degrades the communication experience, and echo cancellation has always been one of the hard problems in audio and video technology.

Game voice is a typical application of audio and video social networking in gaming. If you don’t want users who hate wearing headphones to give up on your game, real-time game voice needs echo cancellation.

Whether a game is competitive or casual, real-time voice has become a standard feature. Competitive games, including MMORPG, MOBA, and FPS titles, demand fast-paced, high-frequency teamwork, and game voice is as indispensable to them as communication equipment is in wartime. Casual games, including chess, card games, and Werewolf, are slower-paced but call for close social interaction online: players chat with several people in real time within the social ties the game creates, and real-time voice lets them play cards as if they were sitting in the same room.

Within the industry, echo cancellation is recognized as a tough nut to crack. It is essentially the engineering of a complex mathematical problem. Audio engineers often come from mathematics or physics rather than computer science, and engineers without relevant experience have little idea where to start. Products known for good echo cancellation include Tencent QQ and Microsoft Skype, and open source projects include WebRTC and Speex. Before these open source projects appeared, echo cancellation was a proprietary skill of large companies, and other teams could only rely on their own exploration and accumulated experience.
Since these projects appeared, the open source AEC modules in WebRTC and Speex have become good teaching material for the industry.

How AEC Works

The principle of echo cancellation has been covered in many articles, so here I only introduce it briefly, drawing on my own practice at work.

In simple terms, the far-end sound signal is played out through the loudspeaker, travels along multiple propagation and reflection paths in the room, and is finally picked up by the microphone together with the near-end sound. Without echo cancellation, this re-captured far-end sound is sent back and played at the far end, delayed relative to the original. That is how echo arises.

Eliminating echo is genuinely hard. It is a bit like pouring red ink into blue ink, mixing them, and then being asked to separate the two again. At the capture end, near-end speech and loudspeaker output are both just sound picked up from the air. To the machine there is no difference between them, just as water makes no distinction between red ink and blue ink. The job of echo cancellation is to separate the far-end echo from the near-end sound even though the two look alike, which is much harder than it might seem.

Fortunately, there is a way to find the boundary between them: the far-end sound signal and the echo are correlated. Some readers may conclude that we can simply subtract the far-end signal from the captured signal. It is not that simple, because the far-end signal is not the same as the echo. Between the loudspeaker and the microphone, the far-end sound passes through the loudspeaker-room-microphone (LRM) echo path.
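The point that the echo is not simply a copy of the far-end signal can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the echo path is modelled as a short FIR filter, and the 440 Hz tone, the 8 kHz sample rate, the tap values in `room`, and the helper names `convolve` and `energy` are all hypothetical, chosen for illustration. The "model-based" subtraction cheats by using the true path, which a real canceller has to estimate.

```python
import math

def convolve(signal, impulse_response):
    """FIR filtering: out[n] = sum over k of impulse_response[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if h != 0.0 and n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)

# Hypothetical far-end signal: a 440 Hz tone sampled at 8 kHz.
far_end = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]

# Hypothetical LRM path: a delayed direct sound plus two weaker reflections.
room = [0.0] * 40
room[10], room[22], room[37] = 0.6, -0.3, 0.15

# With no near-end speech, the microphone picks up only the echo.
mic = convolve(far_end, room)

# Naive idea: subtract the raw far-end signal from the microphone signal.
naive = [m - x for m, x in zip(mic, far_end)]

# Model-based idea: subtract the far-end signal after it has passed through the path.
echo_estimate = convolve(far_end, room)
modelled = [m - e for m, e in zip(mic, echo_estimate)]

print("mic energy:     ", energy(mic))
print("naive residual: ", energy(naive))
print("model residual: ", energy(modelled))
```

In this toy setup the naive subtraction leaves more energy than the microphone signal had to begin with, because the delayed and reflected echo no longer lines up with the original, while subtracting the path-filtered signal removes the echo.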
Along the LRM echo path, the far-end sound is reflected many times and the reflections are superimposed, so what reaches the microphone differs from the original far-end signal. We use a function to represent this difference:

fe=f(fs)

where fs is the far-end signal and fe is the far-end echo.

If this function can be solved, a model can be built from the correlation between the far-end signal and the far-end echo. The model simulates the LRM echo path and approximates it closely. Once the model is stable, feeding it the far-end signal fs yields an output that is very close to the far-end echo fe. A filter then produces the inverted signal and adds it to the captured signal, which cancels the echo. This is the basic principle of acoustic echo cancellation (AEC). The solution will never match the far-end echo exactly, only approximate it closely; the closer the approximation, the better the cancellation.

Mute, single talk and double talk

Although real-time voice calls are full duplex, they fall into three situations: mute, single talk, and double talk. Each calls for a different echo cancellation strategy.

1) Mute

No one is speaking. Echo cancellation is needed only in speech segments. Non-speech segments contain no echo, so they need no cancellation, and the audio need not even be sent, which lowers the bit rate and saves bandwidth costs. Accurate detection of voice activity is therefore very important. The detection algorithm is called VAD (Voice Activity Detection), and different vendors implement it in different ways.
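To make voice activity detection concrete, here is a minimal sketch of one possible approach: look for a pitch period by checking whether the frame's autocorrelation has a strong peak at a lag in the typical range of human pitch. It is an illustration only, not the detector used in any particular product; the function name `pitch_vad`, the 60 to 400 Hz search range, the 0.5 threshold, and the test signals are all assumptions.

```python
import math
import random

def pitch_vad(frame, sample_rate=8000, min_hz=60, max_hz=400, threshold=0.5):
    """Return True when the frame has a clear pitch period (voiced speech)."""
    total = sum(v * v for v in frame)
    if total < 1e-9:                       # digital silence
        return False
    min_lag = sample_rate // max_hz        # shortest pitch period, in samples
    max_lag = min(sample_rate // min_hz, len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalised by the frame energy.
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame))) / total
        best = max(best, r)
    return best >= threshold

rng = random.Random(0)
n_samples = 320                            # one 40 ms frame at 8 kHz

# Hypothetical voiced frame: a 200 Hz fundamental plus one harmonic.
voiced = [math.sin(2 * math.pi * 200 * n / 8000)
          + 0.5 * math.sin(2 * math.pi * 400 * n / 8000) for n in range(n_samples)]
# Hypothetical background hiss with no pitch structure.
hiss = [rng.gauss(0.0, 0.01) for _ in range(n_samples)]
silence = [0.0] * n_samples

print(pitch_vad(voiced), pitch_vad(hiss), pitch_vad(silence))
```

Because the decision rests on periodicity rather than loudness, a noisy but unvoiced frame is not mistaken for speech just because it carries energy. A real detector would also need to handle unvoiced consonants and smooth its decisions across frames.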
We use the pitch period to implement VAD, which effectively improves the accuracy of the decision and avoids misjudging non-speech segments as speech.

2) Single talk

Only the far end is speaking, so the signal captured by the microphone contains only far-end echo and no near-end speech. Echo cancellation in single talk is relatively easy, and a more aggressive strategy can be used: if single talk is judged highly probable, the whole captured signal can be removed and comfort noise filled in as appropriate. Generally speaking, in single talk a linear adaptive filter that tracks the echo path removes echo effectively and can suppress it by about 18dB.

3) Double talk

Several parties are speaking at the same time, so the signal captured by the microphone contains far-end echo and near-end speech mixed together. Echo cancellation in double talk is very difficult: the near-end speech must be protected from damage, yet the echo must be removed as cleanly as possible. On top of the problem of "separating red ink from blue ink" there is the dilemma of "striking the rat without breaking the vase beside it".

Generally speaking, once the far-end echo is about 6dB~8dB louder than the near-end speech, removing the echo will inevitably damage the near-end speech to some extent. And if the far-end echo is more than 18dB louder than the near-end speech, for example because the loudspeaker is too close to the microphone and the echo completely masks the near-end speech, echo cancellation will perform poorly.
In this case, a more radical strategy can be adopted: remove the far-end echo and the near-end speech together, then fill in comfort noise as appropriate.

The echo cancellation module must therefore be able to tell these three situations apart so that each can be handled with a suitable algorithm. VAD distinguishes non-speech segments from speech segments; how to distinguish single talk from double talk is discussed below.

AEC Implementation

Echo cancellation consists of two main steps: linear adaptive filtering and nonlinear processing. Linear adaptive filtering solves fe=f(fs), builds a speech model of the far-end echo, and performs the first round of echo cancellation. Nonlinear processing is itself split into two steps: residual echo processing and nonlinear clipping. Residual echo processing performs a second round of cancellation on whatever echo remains; nonlinear clipping is a relatively aggressive step that cuts speech signals whose attenuation has reached a threshold.

Linear adaptive filtering and nonlinear clipping can be learned from academic papers and open source projects. Residual echo processing is harder, and a team generally has to explore, accumulate experience, and innovate on its own. This is precisely why the barrier to entry in voice technology is so high.

[Figure: The Principle and Implementation of Echo Cancellation]

Linear Adaptive Filtering

Based on the correlation between the far-end sound signal and the far-end echo, a speech model of the far-end echo is built and used to estimate the echo, with the goal of making the estimate as close to the real echo as possible.

We can regard the LRM echo path as an "environmental filter": after passing through it, the far-end sound signal becomes the far-end echo.
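A filter that learns to imitate this "environmental filter" can be sketched with the normalised least-mean-squares (NLMS) algorithm. NLMS is used here as an assumption, being a standard textbook adaptive filter; the article does not say which adaptation rule the author's system uses. The function names, tap count, step size `mu`, and the eight-tap `room` response are hypothetical, and the simulation covers only far-end single talk with no noise.

```python
import random

def simulate_echo(far_end, room):
    """Echo at the microphone: the far-end signal filtered by the room response."""
    return [sum(h * far_end[n - k] for k, h in enumerate(room) if n - k >= 0)
            for n in range(len(far_end))]

def nlms(far_end, mic, taps=16, mu=0.5, eps=1e-6):
    """Adapt an FIR filter so that its output tracks the echo in `mic`.

    Returns (error, weights), where error[n] = mic[n] - estimated echo[n].
    """
    w = [0.0] * taps        # coefficients of the adaptive filter
    x_buf = [0.0] * taps    # latest far-end samples, newest first
    error = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        echo_hat = sum(wi * xi for wi, xi in zip(w, x_buf))
        e = d - echo_hat                      # what is left after cancellation
        power = eps + sum(xi * xi for xi in x_buf)
        # Normalised step: move the coefficients along the input, scaled by the error.
        w = [wi + mu * e * xi / power for wi, xi in zip(w, x_buf)]
        error.append(e)
    return error, w

rng = random.Random(1)
far_end = [rng.gauss(0.0, 1.0) for _ in range(3000)]    # hypothetical far-end signal
room = [0.0, 0.5, 0.0, -0.3, 0.2, 0.0, 0.1, -0.05]      # hypothetical echo path
mic = simulate_echo(far_end, room)                       # single talk: echo only

error, weights = nlms(far_end, mic)
tail_in = sum(v * v for v in mic[-500:])
tail_out = sum(v * v for v in error[-500:])
print("learned coefficients:", [round(v, 3) for v in weights[:8]])
print("echo energy before/after (last 500 samples):", tail_in, tail_out)
```

In this idealised setting the learned coefficients approach the taps of `room`, and the error signal, which is what would be sent to the far end, carries almost no echo. With near-end speech or noise in `mic`, the same update would be disturbed, which is why real cancellers slow or freeze adaptation during double talk.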
Echo cancellation builds a matching "algorithm filter" and continuously adjusts its coefficients, based on the speech model of the far-end echo, so that its estimate moves closer to the real echo. The closer the estimate is to the real echo, the better the cancellation.

Once the adaptive filter has converged, we have the echo path function fe=f(fs) that we set out to solve. With the filter converged and stable, feeding in the far-end signal fs yields a fairly accurate estimate of the far-end echo fe. Subtracting this estimate from the captured signal gives the speech signal that is actually sent.

There are two difficulties in implementing a linear adaptive filter.
The two pull against each other: the adaptive filter must hold its coefficients highly stable once it has converged quickly, yet it must also keep updating at all times so that it can track changes in the echo path.

Nonlinear processing

Residual echo processing

An adaptive filter cannot remove echo completely, so the residual echo needs further suppression. Generally speaking, the strategy is to exploit the correlation between the residual echo left by the adaptive filter and the far-end reference speech signal. The greater the correlation, the more residual echo remains and the more suppression is needed; the smaller the correlation, the less residual echo remains and the less suppression is needed.

So the first step is to compute a correlation matrix between the residual echo and the reference signal and derive from it an attenuation factor that reflects how much to suppress; the residual echo is then multiplied by this attenuation factor.

After linear adaptive filtering is complete, the correlation between the residual echo and the far-end echo signal captured by the microphone can be used to detect whether the call is in single talk or double talk, and the attenuation factor can be further adjusted according to that state.
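The correlation-to-attenuation idea can be sketched in a heavily simplified form. The version below is an assumption-laden illustration, not a vendor's algorithm: it replaces the correlation matrix with a single time-domain normalised correlation searched over a few lags, and maps it to a factor with the hypothetical rule `max(floor, 1 - rho)`. The names `max_correlation` and `suppress_residual`, the lag range, and the floor value are all invented for this example.

```python
import math
import random

def max_correlation(residual, reference, max_lag=32):
    """Largest normalised |correlation| between residual[n] and reference[n - lag]."""
    best = 0.0
    for lag in range(max_lag + 1):
        a = residual[lag:]
        b = reference[:len(reference) - lag]
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if den > 0.0:
            best = max(best, abs(sum(x * y for x, y in zip(a, b))) / den)
    return min(best, 1.0)

def suppress_residual(residual, reference, floor=0.05):
    """Scale the frame by an attenuation factor derived from the correlation."""
    rho = max_correlation(residual, reference)
    factor = max(floor, 1.0 - rho)     # high correlation -> small factor
    return [factor * v for v in residual], factor

rng = random.Random(2)
reference = [rng.gauss(0.0, 1.0) for _ in range(400)]            # far-end reference
leftover_echo = [0.0] * 5 + [0.1 * v for v in reference[:-5]]    # delayed, scaled copy
near_speech = [rng.gauss(0.0, 1.0) for _ in range(400)]          # unrelated signal

_, echo_factor = suppress_residual(leftover_echo, reference)
_, speech_factor = suppress_residual(near_speech, reference)
print(echo_factor, speech_factor)
```

A frame that is mostly leftover echo correlates strongly with the reference and is pushed down to the floor, while a frame unrelated to the reference is left largely intact. Practical suppressors typically work per frequency band and smooth the factor over time to avoid audible pumping.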
In far-end single talk there is no near-end sound (no one is talking locally), so the echo can be suppressed as far as possible and the attenuation factor made as small as possible. In double talk, the linear adaptive filter cancels echo while trying to avoid damaging the near-end speech, so the amount of suppression cannot be too large and the attenuation factor is relatively large.

Residual echo suppression algorithms are very difficult. Papers and open source projects offer little to go on, each vendor implements its own private algorithm, and many vendors choose not to implement one at all.

Nonlinear clipping

After the processing above, the remaining echo is generally small, but some faint yet perceptible echo may still be left. To remove it, further suppression is applied based on the attenuation achieved in the previous stages, which requires an attenuation threshold. Generally speaking, this threshold should be set conservatively (that is, high). If the attenuation reaches or exceeds the threshold, a large amount of echo has been cancelled and the captured signal is very likely all echo. In that case the whole signal is removed and replaced with comfort noise so that the sound level does not fluctuate audibly.

Attenuation this large generally occurs in far-end single talk, or in double talk where the far-end echo is much stronger than the near-end speech. In normal double talk, the adaptive filter does not cancel much echo, in order to protect the quality of the near-end speech.
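The threshold rule for nonlinear clipping can be sketched as follows. This is a minimal illustration under stated assumptions: attenuation is measured per frame as the energy ratio between the microphone frame and the frame after the earlier stages, and the 20 dB threshold, the comfort-noise level, and the function names are hypothetical values picked for the example rather than recommended settings.

```python
import math
import random

def attenuation_db(mic_frame, processed_frame, eps=1e-12):
    """How much energy the earlier stages removed from this frame, in dB."""
    e_in = sum(v * v for v in mic_frame) + eps
    e_out = sum(v * v for v in processed_frame) + eps
    return 10.0 * math.log10(e_in / e_out)

def nonlinear_clip(mic_frame, processed_frame, threshold_db=20.0,
                   noise_rms=1e-3, rng=None):
    """Replace the frame with comfort noise when attenuation reaches the threshold."""
    rng = rng or random.Random()
    if attenuation_db(mic_frame, processed_frame) >= threshold_db:
        # Almost everything was echo: drop the frame and fill in comfort noise.
        return [rng.gauss(0.0, noise_rms) for _ in processed_frame], True
    # Possibly double talk: pass the frame through to protect near-end speech.
    return list(processed_frame), False

rng = random.Random(3)
mic_frame = [0.5 * math.sin(2 * math.pi * 300 * n / 8000) for n in range(160)]
single_talk = [0.01 * v for v in mic_frame]    # earlier stages removed 40 dB
double_talk = [0.7 * v for v in mic_frame]     # earlier stages removed about 3 dB

out_single, clipped_single = nonlinear_clip(mic_frame, single_talk, rng=rng)
out_double, clipped_double = nonlinear_clip(mic_frame, double_talk, rng=rng)
print(clipped_single, clipped_double)
```

Filling the clipped frame with low-level noise instead of digital silence keeps the background from switching on and off, which listeners tend to find more distracting than a steady faint hiss.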
It follows that whenever the attenuation reaches or exceeds the set threshold, removing the whole captured signal does not harm normal listening. If the attenuation does not reach the threshold, no further echo cancellation is performed: the call may be in double talk, and the local speech must be protected so that it is not mistakenly removed as echo.

The industry generally takes one of two approaches: accept some damage to the near-end sound in order to remove the far-end echo completely, or leave some far-end echo in place so that the near-end sound is not damaged. Removing too much echo makes the audio sound choppy. Echo cancellation is a matter of finding the balance between the two.

In my work experience, echo cancellation is a technical feature that customers in the audio and video social industry pay close attention to. It is also, without a doubt, the most difficult technology in audio and video social networking. Even top games like Honor of Kings attach great importance to the quality of echo cancellation. In the gaming industry, where user experience is the lifeline, and especially now that mobile games matter more and more, the quality of echo cancellation often determines whether users keep playing your game.

Author

Xian Niu, a technical expert at Zego Technology, holds a master's degree in computer science from Beijing University of Posts and Telecommunications and a master's degree in business administration from the University of Hong Kong. He has been engaged in the research of voice and video cloud service technology for many years, focusing on interactive live broadcast technology and real-time game voice.