Author: Wu Xinshuang, Family Operations Center Labs

In App development, we often need to turn speech into text while showing a wave-style animation as the user speaks. So how is this actually implemented? This article introduces the Speech framework solution in detail, covering both the technical framework and the visual implementation.

1. Speech framework and usage process

At present, speech recognition in an App falls into two categories: local recognition and online recognition. Online recognition relies on a platform's server-side processing of the speech data; its recognition accuracy is high and that advantage is obvious, but its stability and efficiency are somewhat lower. Local recognition is stable and efficient, but its accuracy is not as good as the online approach. The Speech framework introduced in this article is a mature framework for local speech recognition, suited to scenarios where recognition efficiency matters more than the highest possible accuracy.

To make the capability easy to maintain and call, it is recommended to encapsulate the framework's recognition capability in a modular manager, which can be declared as follows:

@interface HYSpeechVoiceDetectManager : NSObject

As the code above suggests, the encapsulation covers the entire speech recognition process:

1. Determine the usage permission for the microphone and the Speech framework (isHasVoiceRecodingAuthority and isHasSpeechRecognizerAuthority)
2. Initialize the speech recognition parameters and related classes (setupTimer)
3. Start speech recognition and receive the recognized text in real time (setupTimer and startTransfer)
4. Forcefully destroy the Speech framework data and reset the audio configuration data (endTransfer)

The four steps are described in turn below.

1.1 Determining permission to use the microphone and the Speech framework

Because the first condition for successful recognition is having permission to use the microphone and the Speech framework, the permissions must be obtained when the framework functionality is initialized. The code for requesting Speech framework authorization begins as follows:

-(void)isHasSpeechRecognizerAuthority{

Once SFSpeechRecognizerAuthorizationStatus has been obtained in the code above, the authorization state of the Speech framework is known. For obtaining the microphone permission, the relevant code begins as follows:

-(void)isHasVoiceRecodingAuthority:(authorityReturnBlock)hasAuthorityBlock{

The microphone permission can be read from videoAuthStatus in the code above. In practice, the microphone permission should be requested first, and the Speech framework authorization obtained on top of that. Only when the user grants both permissions can the next step, initializing the speech recognition parameters and related classes, be entered.
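Since the article quotes these methods only by their signatures, here is a minimal, self-contained sketch of the two checks. It is an illustration under assumptions, not the original HYSpeechVoiceDetectManager source: the class name SpeechAuthSketch, the completion block on isHasSpeechRecognizerAuthority:, and the use of AVCaptureDevice for the microphone status are choices made for this sketch. It also assumes that NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription entries exist in Info.plist.

```objectivec
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>
#import <Speech/Speech.h>

typedef void (^authorityReturnBlock)(BOOL granted);

@interface SpeechAuthSketch : NSObject
- (void)isHasVoiceRecodingAuthority:(authorityReturnBlock)hasAuthorityBlock;
- (void)isHasSpeechRecognizerAuthority:(authorityReturnBlock)hasAuthorityBlock;
@end

@implementation SpeechAuthSketch

// Microphone permission: read the current status, requesting access if it is still undetermined.
- (void)isHasVoiceRecodingAuthority:(authorityReturnBlock)hasAuthorityBlock {
    AVAuthorizationStatus status =
        [AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeAudio];
    if (status == AVAuthorizationStatusNotDetermined) {
        [AVCaptureDevice requestAccessForMediaType:AVMediaTypeAudio
                                 completionHandler:^(BOOL granted) {
            dispatch_async(dispatch_get_main_queue(), ^{ hasAuthorityBlock(granted); });
        }];
    } else {
        hasAuthorityBlock(status == AVAuthorizationStatusAuthorized);
    }
}

// Speech framework permission: SFSpeechRecognizerAuthorizationStatus tells us whether
// recognition may be used; anything other than "authorized" is treated as a refusal.
- (void)isHasSpeechRecognizerAuthority:(authorityReturnBlock)hasAuthorityBlock {
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        dispatch_async(dispatch_get_main_queue(), ^{
            hasAuthorityBlock(status == SFSpeechRecognizerAuthorizationStatusAuthorized);
        });
    }];
}

@end
```

In practice the two checks would be chained in the order the article describes: request the microphone permission first, and only when it is granted ask for Speech authorization.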
1.2 Initializing the speech recognition parameters and related classes

Before initializing the Speech framework, an audio stream input channel needs to be established. AVAudioEngine is an indispensable node of this input channel: through AVAudioEngine, audio signals can be generated and processed, and audio input can be performed. AVAudioInputNode is the splicing channel of the audio stream; it splices segments of speech together in real time to build the dynamic audio stream data. AVAudioEngine and AVAudioInputNode are used together to prepare for obtaining the audio stream data, and the initialization only takes effect after AVAudioEngine executes its - (void)prepare method. The relevant code begins as follows:

AVAudioEngine *bufferEngine = [[AVAudioEngine alloc]init];

As the code above shows, the complete real-time audio stream for SFSpeechAudioBufferRecognitionRequest is obtained through the bufferInputNode callback. The Speech framework also requires a key class when used: the recorder (AVAudioRecorder). It is used to configure important information such as the capture format, audio channel, bit rate and data cache path, and to perform the voice capture itself. Speech recognition can only start normally after the - (BOOL)record method of AVAudioRecorder is called. Refer to the following code:

[self endTransfer];

As shown above, the first line executed when initializing speech recognition should be endTransfer. Its main purpose is to return the audio parameters to their default state: forcefully destroy the audio stream input channel, clear the voice recorder, and clear the audio recognition task. Resetting the audio parameters is an important step; it prevents recognition anomalies caused by other modules in the project having modified the audio input configuration. The details are described below.

1.3 Starting speech recognition and receiving recognized text in real time

The recognition method -(void)startTransfer continuously outputs the text produced by speech recognition. This capability depends mainly on the SFSpeechRecognizer class: the converted text can be read from the real-time parameter SFSpeechRecognitionResult * _Nullable result returned by the callback. Note that the text output is only valid when the callback carries no error. The reference code begins as follows:

SFSpeechAudioBufferRecognitionRequest *bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];

Here bufferRec is the speech recognizer of the Speech framework, the class that converts the audio data into localized text, and voiceTextCache holds the text produced by the real-time conversion.

1.4 Forcefully destroying Speech framework data and resetting the audio configuration

When recognition ends, the -(void)endTransfer method must be called to force the audio stream channel closed, delete the voice buffer file, stop the voice monitor, and so on. The audio mode and parameters also need to be reset to their defaults. The relevant code begins as follows:

-(void)endTransfer{
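To make sections 1.2 through 1.4 concrete, here is a hedged sketch of one possible start/stop flow. It is not the article's startTransfer/setupTimer/endTransfer source: the property names bufferEngine, bufferRequest, bufferRec, monitor and voiceTextCache follow the text, but the session handling, recorder settings, locale and all other details are assumptions.

```objectivec
#import <Speech/Speech.h>
#import <AVFoundation/AVFoundation.h>

@interface SpeechTransferSketch : NSObject
@property (nonatomic, strong) AVAudioEngine *bufferEngine;
@property (nonatomic, strong) SFSpeechAudioBufferRecognitionRequest *bufferRequest;
@property (nonatomic, strong) SFSpeechRecognizer *bufferRec;
@property (nonatomic, strong) SFSpeechRecognitionTask *bufferTask;
@property (nonatomic, strong) AVAudioRecorder *monitor;   // used later for volume metering (section 2.2)
@property (nonatomic, copy)   NSString *voiceTextCache;   // latest recognized text
@end

@implementation SpeechTransferSketch

- (void)startTransfer {
    [self endTransfer];   // reset the audio state first, as the article recommends

    NSError *error = nil;
    AVAudioSession *session = [AVAudioSession sharedInstance];
    [session setCategory:AVAudioSessionCategoryPlayAndRecord error:&error];
    [session setActive:YES error:&error];

    // 1.2 Build the audio input channel: engine + input node tap feeding the request.
    self.bufferEngine  = [[AVAudioEngine alloc] init];
    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    // self.bufferRequest.requiresOnDeviceRecognition = YES;  // iOS 13+, if strictly local recognition is wanted
    AVAudioInputNode *bufferInputNode = self.bufferEngine.inputNode;
    AVAudioFormat *format = [bufferInputNode outputFormatForBus:0];
    __weak typeof(self) weakSelf = self;
    [bufferInputNode installTapOnBus:0 bufferSize:1024 format:format
                               block:^(AVAudioPCMBuffer *buffer, AVAudioTime *when) {
        [weakSelf.bufferRequest appendAudioPCMBuffer:buffer];  // splice speech segments in real time
    }];
    [self.bufferEngine prepare];
    [self.bufferEngine startAndReturnError:&error];

    // 1.2 Recorder used only to meter the input level; settings and cache path are assumptions.
    NSURL *cacheURL = [NSURL fileURLWithPath:
        [NSTemporaryDirectory() stringByAppendingPathComponent:@"speech_monitor.caf"]];
    NSDictionary *settings = @{ AVFormatIDKey: @(kAudioFormatLinearPCM),
                                AVSampleRateKey: @44100.0,
                                AVNumberOfChannelsKey: @1 };
    self.monitor = [[AVAudioRecorder alloc] initWithURL:cacheURL settings:settings error:&error];
    self.monitor.meteringEnabled = YES;
    [self.monitor record];

    // 1.3 Recognizer: text is only taken from error-free callbacks.
    self.bufferRec = [[SFSpeechRecognizer alloc]
        initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh-CN"]];
    self.bufferTask = [self.bufferRec recognitionTaskWithRequest:self.bufferRequest
        resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable taskError) {
        if (result != nil && taskError == nil) {
            weakSelf.voiceTextCache = result.bestTranscription.formattedString;
        }
    }];
}

// 1.4 Forcefully tear everything down and put the audio session back in a default state.
- (void)endTransfer {
    [self.bufferEngine stop];
    [self.bufferEngine.inputNode removeTapOnBus:0];
    [self.bufferRequest endAudio];
    [self.bufferTask cancel];
    [self.monitor stop];
    [self.monitor deleteRecording];          // delete the voice buffer file
    self.bufferEngine  = nil;
    self.bufferRequest = nil;
    self.bufferTask    = nil;
    self.monitor       = nil;
    [[AVAudioSession sharedInstance] setActive:NO
        withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation error:nil];
}

@end
```

Deactivating and re-categorizing the AVAudioSession in endTransfer mirrors the article's point that other modules may have changed the audio configuration before recognition starts.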
2. Principle of the speech recognition wave effect

2.1 The speech recognition animation effect

iOS speech recognition requirements often call for a real-time animation of the recognition process. A common effect is a travelling wave whose amplitude follows the volume of the recognized speech in real time. The final effect is shown in the figure below: there are 32 wave dots in total, of which the first 6 and the last 6 are fixed, static dots; only the middle 20 dots are long bars that float up and down like a wave with the volume.

2.2 Principles of the speech recognition animation

There are two ways to implement the dynamic effect in the figure above: 1. the traditional CoreAnimation framework of the iOS system, or 2. dynamically updating the dots' frames. With CoreAnimation, a floating animation would have to be built for every single dot, which is not worth the cost in either implementation effort or system overhead; the simpler approach of dynamically updating the dot frames is more effective, and it is the one described in this article.

The animation is easily reminiscent of the sine curve in mathematics, so the horizontal direction of the wave can be treated as the sine X axis (iOS coordinates are in fact laid out this way) and the vertical direction as the sine Y axis (the height being the volume mapped into coordinates). First, the 32 floating dots are initialized with 32 mapped X-axis coordinates x and a fixed gap between them. When the real-time volume data arrives, the mapped amplitude y can be computed with the sine function y = F * sin(pi*x - pi*t), where the volume, capped at a maximum value, is strongly correlated with the sine amplitude F. The implementation principle is illustrated in the diagram below.

So how is the real-time volume obtained? The recorder (AVAudioRecorder) introduced earlier provides an interface for it, - (float)peakPowerForChannel:(NSUInteger)channelNumber, where channelNumber is the index of the audio channel. The value returned is a decibel value in the range -160 to 0; the larger the sound peak, the closer it is to 0. Once the real-time volume has been obtained, it can be quantized to produce the mapped amplitude y. The relevant code begins as follows:

AVAudioRecorder *monitor
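Since only the first line of the metering code is quoted, here is a hedged sketch of the quantization step. The helper names, the linear mapping of the [-160, 0] dB range into [0, 1] and the maxF parameter are assumptions; the fixed points taken from the text are the peakPowerForChannel: call (the recorder must have meteringEnabled set to YES and updateMeters called before reading) and the y = F * sin(pi*x - pi*t) mapping.

```objectivec
#import <UIKit/UIKit.h>
#import <AVFoundation/AVFoundation.h>

// Map the recorder's peak power to the sine amplitude F; maxF is the largest visual
// amplitude allowed by the design (an assumed parameter, not from the article).
static CGFloat AmplitudeFFromRecorder(AVAudioRecorder *monitor, CGFloat maxF) {
    [monitor updateMeters];                              // refresh the metering data
    float peak = [monitor peakPowerForChannel:0];        // decibels in [-160, 0]; louder -> nearer 0
    CGFloat normalized = (peak + 160.0f) / 160.0f;       // quantize into [0, 1]
    return maxF * normalized;                            // amplitude F for the sine mapping
}

// y = F * sin(pi*x - pi*t): x is the dot's mapped X coordinate and t a formatted timestamp,
// so the whole wave drifts horizontally as time advances.
static CGFloat AmplitudeYForDot(CGFloat x, NSTimeInterval t, CGFloat F) {
    return F * sin(M_PI * x - M_PI * t);
}
```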
2.3 Implementing the dynamic dot-frame updates in code

First, generate the 32 dot views and add them to the current layer. The relevant code begins as follows:

self.sinXViews = [NSMutableArray new];

sinXViews is an array that caches the 32 dot views; it exists to make resetting the dots' frames convenient. offsetX is the concrete coordinate of each dot in the layer and needs to be calculated according to the design requirements. Once the 32 dots have been added to the layer, the core problem becomes how to obtain each dot's amplitude, because the required effect is not only a sinusoidal motion of the dots but also a rightward translation of the whole wave. The inputs to the sine function y = F * sin(pi*x - pi*t) therefore have to include the dot coordinate x and formatted timestamp data t. The specific implementation begins as follows:

-(double)fSinValueWithOriginX:(CGFloat)originX timeS:(NSTimeInterval)timeStamp volume:(CGFloat)voluem{

Here sinF is the visual Y-axis amplitude of the dot, sinX is computed from the input parameter originX and the X-axis spacing of the dots, and timeStamp is the formatted timestamp at which the real-time volume was sampled.

After the steps above, the only thing left is to update the dot frames, which is relatively simple:

for (NSUInteger i = 0; i < self.sinXViews.count; i++) {

As the code shows, after traversing self.sinXViews to obtain the current dot view, we first take out its frame, then compute the dot's amplitude viewHeight at the current moment, and finally use viewHeight to update the dot's frame. At this point the sinusoidal floating effect of the 20 dots at a given moment is complete. (A consolidated sketch of this pipeline appears at the end of the article, after the references.)

The above is my experience implementing this technique in my own project, summarized briefly here. If there are any errors, please feel free to point them out. Thank you for reading!

References
[1] Speech framework documentation: https://developer.apple.com/documentation/speech
[2] iOS audio recording documentation: https://developer.apple.com/documentation/avfaudio
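To close, here is the consolidated, hedged sketch of the dot-view creation and per-frame update described in section 2.3. It is not the original project's code: the WaveDotsViewSketch class, the CADisplayLink driver, the dot sizes, the 0.2 X-axis spacing and the fabs() clamp are all assumptions, and currentF is expected to be refreshed elsewhere from the volume helper sketched in section 2.2.

```objectivec
#import <UIKit/UIKit.h>
#import <QuartzCore/QuartzCore.h>

@interface WaveDotsViewSketch : UIView
@property (nonatomic, strong) NSMutableArray<UIView *> *sinXViews;  // the 32 cached dot views
@property (nonatomic, strong) CADisplayLink *displayLink;
@property (nonatomic, assign) CGFloat currentF;                      // volume-mapped amplitude F
@end

@implementation WaveDotsViewSketch

// Call once after the view has been laid out.
- (void)setupDots {
    self.sinXViews = [NSMutableArray new];
    CGFloat dotWidth = 4.0, gap = 6.0;
    for (NSUInteger i = 0; i < 32; i++) {
        CGFloat offsetX = i * (dotWidth + gap);                      // x coordinate of this dot
        UIView *dot = [[UIView alloc] initWithFrame:
            CGRectMake(offsetX, CGRectGetMidY(self.bounds) - 2.0, dotWidth, 4.0)];
        dot.layer.cornerRadius = dotWidth / 2.0;
        dot.backgroundColor = [UIColor whiteColor];
        [self addSubview:dot];
        [self.sinXViews addObject:dot];
    }
    self.displayLink = [CADisplayLink displayLinkWithTarget:self selector:@selector(refreshDots:)];
    [self.displayLink addToRunLoop:[NSRunLoop mainRunLoop] forMode:NSRunLoopCommonModes];
    // Remember to call [self.displayLink invalidate] when the view is torn down.
}

// Called every frame: only the middle 20 dots float; the first 6 and last 6 stay fixed.
- (void)refreshDots:(CADisplayLink *)link {
    NSTimeInterval t = CACurrentMediaTime();                         // formatted timestamp data
    for (NSUInteger i = 0; i < self.sinXViews.count; i++) {
        if (i < 6 || i >= self.sinXViews.count - 6) { continue; }
        UIView *dot = self.sinXViews[i];
        CGRect frame = dot.frame;                                    // take out the current frame
        CGFloat sinX = (CGFloat)i * 0.2;                             // mapped X-axis coordinate
        CGFloat viewHeight = 4.0 +
            fabs(self.currentF * sin(M_PI * sinX - M_PI * t));       // amplitude at this moment
        frame.size.height = viewHeight;
        frame.origin.y = CGRectGetMidY(self.bounds) - viewHeight / 2.0;
        dot.frame = frame;                                           // update the dot's frame
    }
}

@end
```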