Protecting Audio Privacy During Cell Phone Calls with Sound Masking

Background

Malware can violate the privacy of mobile users by recording their voices without authorization. Any installed application that holds the audio permission may secretly record at any time. Existing fixes for the permission problem on audio devices such as microphones require major modifications to the platform's privacy control system. The paper therefore proposes SafeChat, which needs no changes to the operating system and protects call privacy through sound masking alone. After masking, an authorized recording application can recover far more of the secret information than an unauthorized one. In the experiments SafeChat produced a signal strength difference of up to 26 dB between the two, which sharply reduces the accuracy of speech recognition engines on unauthorized recordings.

Traditional solutions

Smartphone users face privacy risks from unauthorized audio recording, which can disclose personal information. Existing defenses either feed false audio data to untrusted apps or restrict recording permissions, and both require system modifications. SafeChat is instead an application-level solution that protects audio privacy through sound masking without modifying the existing system. Sound masking is an obfuscation technique: it adds noise to thwart eavesdroppers while keeping the speech intelligible to the intended recipient. SafeChat does this by generating a special masking sound and removing it at the receiving end. Unlike conventional masking systems such as ceiling-mounted speaker arrays, SafeChat distinguishes authorized from unauthorized applications and gives them different levels of audio privacy protection. It thus addresses the risk of unauthorized recording in mobile scenarios without large-scale system changes.

Solution in this article

Figure 1. Model of the sound masking app SafeChat

The scenario assumes two parties to the conversation, Alice and Bob, and an eavesdropper, Eve. Bob initiates an audio call with Alice. Alice turns on her phone's loudspeaker to play the masking sound sent by Bob, and Bob and Alice then talk at the same time. The microphone on Alice's phone records a mixture of Bob's masking sound and Alice's private speech. Only Bob, who knows how the masking sound was generated and obfuscated, can restore the speech from this mixture. The design problem is therefore to choose a suitable masking sound and to make sure it can be eliminated at the intended receiver.

This is not trivial. Because of the multipath and resonance characteristics of audio propagation and the distortion introduced by the device's speaker and microphone, the masking sound recorded by Alice's microphone is not identical to the masking sound Bob sent. Once it is mixed with Alice's secret speech, simply subtracting the original does not guarantee that it is eliminated at Bob's end. The paper's approach is to estimate the channel response from the speaker to the microphone through adaptive filtering, and then to remove the additional masking interference from the residual by successive interference cancellation (SIC), recovering the secret information.

The work is motivated by limitations of existing solutions: they require additional user-defined policies, authorized and unauthorized applications receive the same copy of the recording during a call, and an application has no way to declare that the content of a call is confidential.
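The channel-estimation step can be illustrated with a small sketch. The summary only says "adaptive filtering"; the normalized LMS (NLMS) update, the tap count, the step size and the three-tap `true_channel` below are all assumptions chosen for illustration, not the paper's actual parameters.

```python
import random

def nlms_cancel(reference, recorded, taps=8, mu=0.5, eps=1e-8):
    """Estimate the speaker-to-microphone channel with NLMS (assumed variant).

    Returns (residual, weights). The residual is what remains of the
    recording after the estimated masker has been subtracted.
    """
    w = [0.0] * taps
    residual = []
    for n in range(len(recorded)):
        # Most recent `taps` reference samples, newest first, zero-padded.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        predicted = sum(wk * xk for wk, xk in zip(w, x))  # masker as heard at the mic
        err = recorded[n] - predicted
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * err * xk / norm for wk, xk in zip(w, x)]
        residual.append(err)
    return residual, w

# Hypothetical demo: a white-noise masker passed through a made-up multipath channel.
rng = random.Random(0)
masker = [rng.gauss(0.0, 1.0) for _ in range(3000)]
true_channel = [0.6, 0.3, -0.1]
recorded = [
    sum(h * masker[n - k] for k, h in enumerate(true_channel) if n - k >= 0)
    for n in range(len(masker))
]
residual, weights = nlms_cancel(masker, recorded)
print([round(v, 3) for v in weights[:3]])
```

In this noise-free toy case the learned weights approach the made-up channel and the residual decays toward zero. With speech present in the recording, the residual would instead carry the speech plus whatever estimation error remains.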

System Design

Figure 2. System design

The masking sound and the secret message reach the microphone over different paths, so special signal processing is required to remove the masking sound from the audio at the intended receiver. Because sound travels through the air along multiple paths, the recorded sound is a combination of several delayed and attenuated copies of the sound that was originally played. The system has two parts: a mobile app responsible for mixing, and a server responsible for generating the masking sound.

Mobile App

Two sounds reach the microphone over separate paths: the masking sound and the secret sound. The channel response H(*) represents how each sound is transformed on its way to the microphone. The recorded sound is therefore the sum of the channel-filtered masking sound, the channel-filtered secret sound, and n, where n is Gaussian ambient noise.
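A minimal sketch of this recording model follows, assuming each channel response can be approximated by a short finite impulse response. The function names and all numeric values are hypothetical.

```python
import random

def apply_channel(signal, response):
    """Convolve a played signal with a channel impulse response.

    Each tap of `response` stands for one delayed, attenuated copy of the
    sound (one propagation path). Output is truncated to len(signal).
    """
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, h in enumerate(response):
            if n - k >= 0:
                out[n] += h * signal[n - k]
    return out

def record(masker, speech, h_masker, h_speech, noise_std=0.01, seed=0):
    """Microphone recording = H_m(masker) + H_s(speech) + Gaussian ambient noise."""
    rng = random.Random(seed)
    m = apply_channel(masker, h_masker)
    s = apply_channel(speech, h_speech)
    return [mi + si + rng.gauss(0.0, noise_std) for mi, si in zip(m, s)]

# Hypothetical example: the masker arrives over two paths, the speech over one.
mixed = record(
    masker=[1.0, 0.0, 0.0, 0.0],
    speech=[0.0, 0.2, 0.0, 0.0],
    h_masker=[0.5, 0.25],
    h_speech=[1.0],
)
print(len(mixed))
```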

Masker selection

The masking sound is a combination of two signal components. The first is the masking interference, which consists of several pre-recorded human-spoken sentences intended to confuse malware and prevent it from extracting confidential information. The second is the masking noise, which is generated as Gaussian noise and passed through a 16 kHz low-pass filter. The noise is kept below 16 kHz because this range covers most human voice frequencies. Adding the Gaussian noise effectively reduces the signal-to-noise ratio of the secret information. It also helps the receiver, Bob, recover the secret, and it prevents an attacker from separating out the masking sound with simple filtering. As a result, only sound within the user's pitch range is retained after recovery. Before the masking sound is played, pilot signals are played to correct the time offset between the speaker and the microphone.
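The masking-noise component can be sketched as follows. The 44.1 kHz sample rate, the Hamming-windowed sinc filter, the 101-tap length and the sine wave standing in for the pre-recorded sentences are all assumptions made for this illustration. Only the Gaussian source and the 16 kHz cutoff come from the text.

```python
import math
import random

def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass filter, normalized to unit gain at DC."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(signal, taps):
    """Direct-form FIR filtering (causal, zero initial state)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def make_masker(interference, noise_gain, sample_rate=44100, cutoff_hz=16000, seed=0):
    """Masker = masking interference + band-limited Gaussian masking noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in interference]
    noise = fir_filter(noise, lowpass_fir(cutoff_hz, sample_rate))
    return [i + noise_gain * n for i, n in zip(interference, noise)]

# Stand-in for pre-recorded speech: a 220 Hz tone (purely illustrative).
interference = [0.3 * math.sin(2 * math.pi * 220 * i / 44100) for i in range(500)]
masker = make_masker(interference, noise_gain=1.0)
print(len(masker))
```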

Eliminate noise and interference at the target receiver

Figure 3. Signal comparison diagram of eliminating noise and interference at the target receiver

Four parameters describe masking performance, each of them an energy ratio:

1) Masker-to-Noise Ratio (MNR): the strength of the added masker, i.e. the device's ability to play the masker. It is the ratio of the recorded masker energy to the energy of n, the background noise component.

2) Masker-to-Residual-Noise Ratio (MRR): the effectiveness of masker removal, i.e. the device's ability to remove the masker. It is the ratio of the recorded masker energy to the residual noise left in the recovered signal when the user's voice is absent.

3) Masker-to-Speech Ratio (MSR): the amount of secret information hidden from the malware, which governs how much of the secret leaks to it. It is the ratio of the recorded masker energy to the recorded speech energy.

4) Speech-to-Recovery-Noise Ratio (SRR): the amount of secret information recovered at the intended recipient. It is the ratio of the recorded speech energy to the residual noise after recovery.

MNR and MRR describe the hardware's ability to play and remove the masking sound, i.e. they are independent of the secret speech, while MSR and SRR capture SafeChat's performance in preventing unauthorized recording.
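As a rough sketch, the four ratios can be computed in decibels from sample buffers. The definitions below are read off the metric names (numerator energy over denominator energy) and are an assumption, since the original formulas are not reproduced in this summary. The argument names are hypothetical.

```python
import math

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def ratio_db(numerator, denominator):
    """Energy ratio of two sample buffers, in dB."""
    return 10.0 * math.log10(energy(numerator) / energy(denominator))

def masking_metrics(masker_at_mic, speech_at_mic, background_noise, residual_noise):
    """Return MNR, MRR, MSR and SRR in dB (assumed definitions)."""
    return {
        "MNR": ratio_db(masker_at_mic, background_noise),
        "MRR": ratio_db(masker_at_mic, residual_noise),
        "MSR": ratio_db(masker_at_mic, speech_at_mic),
        "SRR": ratio_db(speech_at_mic, residual_noise),
    }

# Made-up constant buffers, only to exercise the arithmetic.
metrics = masking_metrics(
    masker_at_mic=[10.0] * 4,
    speech_at_mic=[1.0] * 4,
    background_noise=[0.1] * 4,
    residual_noise=[0.5] * 4,
)
print({name: round(value, 2) for name, value in metrics.items()})
```

Under these assumed definitions SRR equals MRR minus MSR in dB, which is one way to see why, for a device with a given MRR, raising MSR necessarily lowers SRR.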

Figure 4. Explanation and energy ratio of four sound masking indicators

Experimental Summary

Ideally all four indicators would be as high as possible, but a high SRR implies a low MSR. The SNR of the secret information falls as the masking sound is played louder, so the masking sound must be loud enough (high MNR) to reduce that SNR. The volume cannot be raised without limit, however: when the masking sound is too loud, the assumption of a linear channel response no longer holds. SIC also has a well-known weakness here: playing the masking interference too loudly increases the residual error left after SIC removal. The experiments found that the system performs best when the signal removed first (the masking noise) is stronger than the signal removed later (the masking interference). The energy of the masking interference is therefore fixed at 10 dB below that of the masking noise, a choice that may rest on the experimental assumption that the secret voice is 13 dB below the masking sound. A higher volume gives a more accurate channel response estimate, but at maximum volume the channel response becomes nonlinear and leaves a large residual error.
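The removal order can be illustrated with a deliberately simplified sketch of successive cancellation. It assumes each masker component reaches the microphone with a single unknown gain (no multipath) and uses short, mutually orthogonal toy signals, so it shows the ordering idea only and is not the paper's algorithm. The 10 dB offset is converted to an amplitude factor of 10^(-10/20).

```python
def project_out(recorded, reference):
    """Least-squares estimate of the gain of `reference` in `recorded`, then subtract it."""
    gain = sum(r * x for r, x in zip(recorded, reference)) / sum(x * x for x in reference)
    return [r - gain * x for r, x in zip(recorded, reference)], gain

def successive_cancel(recorded, references):
    """Remove known references one at a time, in the order given (strongest first)."""
    residual = list(recorded)
    gains = []
    for ref in references:
        residual, gain = project_out(residual, ref)
        gains.append(gain)
    return residual, gains

# Toy signals (hypothetical): masking noise, masking interference, secret speech.
noise_ref = [1.0, -1.0, 1.0, -1.0]
interf_ref = [1.0, 1.0, -1.0, -1.0]
speech = [0.05, 0.05, 0.05, 0.05]

interf_gain = 10 ** (-10 / 20)  # interference energy 10 dB below the noise
recorded = [n + interf_gain * i + s for n, i, s in zip(noise_ref, interf_ref, speech)]

# Stronger component (masking noise) first, weaker (masking interference) second.
recovered, gains = successive_cancel(recorded, [noise_ref, interf_ref])
print([round(g, 3) for g in gains])
```

With real audio the components are not orthogonal, so any error in the first stage leaks into the second. That is consistent with the observation above that the first-removed signal should be the stronger one.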

Figure 5. Effect of masking noise and interfering sound volume

Equipment calibration

SafeChat needs to find a reasonable volume at which to play the masking sound, and a suitable volume ratio for the masking interference. It always chooses the microphone with the lowest MNR as the reference, because a low MNR means that microphone picks up less of the masking sound. Once the reference microphone is fixed, SafeChat searches for the playback volume with the highest MRR. Besides the device's ability to play and remove the masking sound, the volume of the secret message itself matters: the masking sound should adjust its volume dynamically to the secret message so that both MSR and SRR stay high. A user is considered successfully trained when the training recording has an MSR greater than 13 dB and an SRR greater than 3 dB. This setting ensures that the energy of the person speaking is 13 dB below the masking sound.
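The calibration rules above can be written down as a short sketch. The function names, the dictionary layout and the measurement values are hypothetical. Only the selection rules (lowest MNR, highest MRR) and the 13 dB and 3 dB thresholds come from the text.

```python
def pick_reference_mic(mnr_by_mic):
    """Choose the microphone with the lowest MNR (it hears the masker least)."""
    return min(mnr_by_mic, key=mnr_by_mic.get)

def pick_volume(mrr_by_volume):
    """Choose the playback volume that gives the highest MRR on the reference mic."""
    return max(mrr_by_volume, key=mrr_by_volume.get)

def training_passed(msr_db, srr_db, min_msr_db=13.0, min_srr_db=3.0):
    """A training recording passes when MSR > 13 dB and SRR > 3 dB."""
    return msr_db > min_msr_db and srr_db > min_srr_db

# Made-up measurements in dB, for illustration only.
reference_mic = pick_reference_mic({"bottom": 21.0, "top": 17.5})
best_volume = pick_volume({0.6: 18.2, 0.8: 24.9, 1.0: 16.3})
print(reference_mic, best_volume, training_passed(msr_db=14.2, srr_db=4.0))
```

In the made-up data the full-volume setting has a lower MRR than the 0.8 setting, mirroring the nonlinear distortion the text describes at maximum volume.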

Security Threat Model

Assumptions

The threat model assumes that the operating system of the target phone is intact. The malware can run a speech recognition engine to identify secrets and can apply noise processing to its recordings. Specifically, the Google Speech API is used to recognize traces played from a laptop under different masking-sound settings. The malware is also assumed to know common audio preprocessing techniques, including blind source separation. Because human speech is redundant, a machine can understand it merely by analyzing statistical features such as MFCCs or polynomial residuals, so the malware may use machine learning to recover the secret information.

Security

Non-cryptographic systems such as SafeChat usually model their security by ensuring that the signal-to-noise ratio of the packet received by the eavesdropper is below what the packet's coding/modulation requires, so that recovering the masked packet is information-theoretically impossible. To evaluate SafeChat's security, the paper likewise measures the difference in MSR between authorized and unauthorized recorders. However, speech has no guaranteed "coding/modulation capacity", so no absolute MSR benchmark can be specified. The paper therefore evaluates SafeChat's secrecy protection with the following criterion: the difference between the probability of guessing the secret information and the probability of guessing an arbitrary secret must be negligible.

Experimental Results

Figure 6. Calibrate SafeChat settings on different devices

Figure 7. Experimental setup

The evaluation of the threat model uses the TIDIGITS dataset. TIDIGITS speech is played from a laptop placed next to a mobile phone running SafeChat. Test participants are asked to read a randomly generated 8-digit number.

Figure 8. User learning app interface

After users completed the self-calibration phase, they were asked to go through a training phase, record multiple audio clips, and then fill out survey questions.

Figure 9. Effectiveness of masking sound on different devices

In the experiments, some devices masked better when playing the masking sound at high volume, while others could not remove the masking sound effectively once the volume was too high. (For ease of reading, the test results of the Galaxy S4/5 are omitted.) Overall, the voice can be hidden when the speaker is loud enough. As for masker removal, SafeChat produces a signal strength difference of up to 26 dB between authorized and unauthorized recording applications. This difference reduces the accuracy of state-of-the-art speech recognition engines (such as the Google Speech API) to less than 0.1% on unauthorized recordings, while authorized recordings are still recognized with high accuracy.

SafeChat is also resilient to common environmental noise. Such noise has little effect on masker removal because it is uncorrelated with the masking sound, and a noisy background actually helps hide the spoken secret.

The devices tested were the Note 4, Nexus 5X, Nexus 6P and Sony Z1. The Nexus 6P performs best at maximum volume, which may be due to differences in the dynamic range of the microphone and speaker. The volume of the masking sound cannot be set too high, however, because playing it at full volume causes nonlinear distortion and microphone saturation.

To evaluate SafeChat against human listeners, the experiment recruited more than 317 users to identify audio recorded by 6 test participants. Although some subjects could identify part of the content in clips mixed with the masking sound, their recognition accuracy remained low because of recognition errors.

In terms of usability, first-time users need 1.6 rounds on average to pass the initial training phase. Since each round takes about 6 seconds (including processing time), the total training overhead is under 10 seconds on average. After passing the training phase, users need only 1.3 rounds to read the 8-digit password at a volume compatible with SafeChat's settings. In the end, the accuracy of identifying the secret information in the masked and recovered recordings was 22% and 93% respectively.

One limitation remains: the echo cancellation built into the phone is implemented at the chip level, so it may cancel the masking sound, and the app cannot prevent this.

References

[1] Nazir Saleheen. mSieve: Differential Behavioral Privacy in Time Series of Mobile Sensor Data. In Proceedings of ACM UbiComp '16. 706–717.

[2] Yu-Chih Tung, Kang G. Shin, and Kyu-Han Kim. Analog Man-in-the-middle Attack Against Link-based Packet Source Identification. In Proceedings of ACM MobiHoc '16. 331–340.

[3] Souvik Sen, Naveen Santhapuri, Romit Roy Choudhury, and Srihari Nelakuditi. [nd]. Successive Interference Cancellation: A Back-of-the-envelope Perspective. In Proceedings of the 9th ACM Hotnets '10. 17:1–17:6.

[4] R. Gary Leonard and George Doddington. [nd]. TIDIGITS Dataset.
