
Empower Your System with Speech Enhancement Techniques

Published on 2024/01/23

1. Introduction

Speech enhancement techniques can improve the quality and clarity of speech by removing background noise. Background noise is present in many everyday settings, such as offices, streets, cafeterias, and natural environments. Clear speech is important for various speech-based systems and devices, such as video conferencing, telecommunication, automatic speech recognition (ASR) [1-2], and hearing aids [3]. To obtain clear speech, we can integrate speech enhancement techniques as a front-end step in these applications, effectively reducing the background noise.

This article generally introduces speech enhancement, including its fundamentals, challenges, and practical guidance for selecting the right methods for different needs.

2. What speech enhancement does

Speech enhancement refers to the process of improving the quality and clarity of speech, especially in situations where speech is degraded by background noise. It aims to make speech more intelligible by suppressing that noise. As shown in Fig. 1, clean speech becomes corrupted by noise during recording and can be restored using speech enhancement techniques. An underlying assumption here is that the primary audio source is human speech; handling other audio sources, such as music, is beyond the scope of this article. Speech enhancement techniques include traditional methods [4-5] and deep learning-based methods [6-7]. Deep learning-based methods, for example, employ neural networks to learn a mapping from noisy speech to clean speech.


Fig. 1: Overview of speech enhancement
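To make the "mapping from noisy speech to clean speech" idea concrete, below is a minimal, illustrative sketch of a mask-based deep learning enhancer in PyTorch: an LSTM estimates a time-frequency mask from the noisy magnitude spectrogram, and the mask is applied to the noisy spectrogram before resynthesis. The architecture and hyperparameters are assumptions for illustration, not a production system.

```python
import torch
import torch.nn as nn

class ToyMaskEnhancer(nn.Module):
    """Toy mask-based enhancer: estimate a [0, 1] mask per time-frequency bin."""

    def __init__(self, n_fft: int = 512, hidden: int = 256):
        super().__init__()
        self.n_fft = n_fft
        n_bins = n_fft // 2 + 1
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, noisy_wave: torch.Tensor) -> torch.Tensor:
        window = torch.hann_window(self.n_fft, device=noisy_wave.device)
        # Complex spectrogram of the noisy input: (batch, freq, time)
        spec = torch.stft(noisy_wave, self.n_fft, hop_length=self.n_fft // 2,
                          window=window, return_complex=True)
        # Predict a mask from the magnitude spectrogram: (batch, time, freq)
        feats, _ = self.rnn(spec.abs().transpose(1, 2))
        mask = torch.sigmoid(self.fc(feats)).transpose(1, 2)
        # Apply the mask and resynthesize a waveform of the original length
        return torch.istft(spec * mask, self.n_fft, hop_length=self.n_fft // 2,
                           window=window, length=noisy_wave.shape[-1])

# enhanced = ToyMaskEnhancer()(noisy_batch)  # noisy_batch: (batch, samples)
```

In practice such a model would be trained to minimize a loss (for example, L1 distance on waveforms or spectrograms) between its output and the clean reference speech.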

2.1. Audio samples

Here, audio samples are provided to illustrate the performance of recent speech enhancement technologies in diverse conditions, showcasing their ability to effectively improve speech quality and reduce background noise.

Noisy speech is generated by combining clean speech with background noise, characterized by a specific signal-to-noise ratio (SNR). The clean speech is recorded by both male and female speakers in Japanese. Background noise recordings are obtained from both indoor and outdoor environments. The SNR for these samples is deliberately set at 0 dB and 10 dB for severe and moderately noisy scenarios, respectively.
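For reference, a minimal sketch of how such noisy samples can be created with NumPy and SoundFile is shown below. The file names are placeholders, and the loop over 0 dB and 10 dB simply mirrors the two conditions described above; mono signals at a common sampling rate are assumed.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop or trim the noise so it matches the length of the clean speech.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, sr = sf.read("clean_speech_ja.wav")   # placeholder file names, mono audio
noise, _ = sf.read("cafeteria_noise.wav")
for snr in (0, 10):                          # severe and moderate conditions
    sf.write(f"noisy_snr{snr}dB.wav", mix_at_snr(clean, noise, snr), sr)
```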

2.2. Applications

Speech enhancement applications can be broadly categorized into two major groups: those designed for human listeners and those for machines processing spoken language.

For humans
Hearing aids: Speech enhancement technology is often used in hearing aids to improve the quality and intelligibility of speech for individuals with hearing impairments. It helps in reducing background noise and enhancing the clarity of spoken words.

Enhancement of recorded speech: This includes applications like video conferencing and telecommunication, where speech enhancement techniques are commonly used to ensure clearer speech in recorded or transmitted content, thereby facilitating better communication.

For machines
Preprocessors for automated systems like ASR: In the noisy environments encountered by voice assistants, meeting transcription, in-vehicle systems, and smart home devices, speech enhancement can be integrated as a pre-processing step to improve the quality of the input speech signal. For example, this can improve the word accuracy and overall performance of ASR systems, since the enhanced signal is easier to transcribe (see the sketch after this list).
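As a rough illustration of the front-end idea, the sketch below denoises a recording before handing it to an ASR system. The third-party noisereduce package is used here only as a stand-in for an off-the-shelf spectral-gating denoiser, and transcribe() is a hypothetical placeholder for whichever ASR engine you deploy.

```python
import soundfile as sf
import noisereduce as nr  # pip install noisereduce soundfile

def preprocess_for_asr(in_path: str, out_path: str) -> str:
    """Denoise a mono recording and write it out for the ASR back end."""
    noisy, sr = sf.read(in_path)
    enhanced = nr.reduce_noise(y=noisy, sr=sr)  # suppress background noise
    sf.write(out_path, enhanced, sr)
    return out_path

# enhanced_path = preprocess_for_asr("meeting_noisy.wav", "meeting_enhanced.wav")
# text = transcribe(enhanced_path)  # hypothetical ASR call, not part of this sketch
```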

3. Challenges in speech enhancement

When applied in real-world scenarios, speech enhancement usually faces several challenges: the variability of background noise across diverse environments, speaker variability, the need to balance noise reduction against preserving speech quality, and the real-time, low-latency processing requirements of many applications.

3.1. Variability of noise and speaker

Variability in both noise and speakers makes it challenging to create algorithms that consistently adapt to different real-world environments. The noise is often unknown in advance because its type and intensity vary, and non-stationary noise, such as traffic, office, and nature sounds, changes over time. Speaker voice characteristics, speaking styles, accents, and other individual traits also differ.

3.2. Balancing noise reduction and speech quality

When enhancing speech by removing noise, there is often a trade-off between reducing noise and preserving the original speech quality. Aggressive noise reduction can introduce artifacts or remove crucial speech information, making the enhanced speech sound unnatural and potentially degrading the performance of backend systems.

3.3. Latency

Latency refers to the delay between the input speech signal and the enhanced output. In real-time applications like voice calls or live broadcasts, speech enhancement algorithms need to operate with minimal latency. Minimizing latency is crucial for smooth and natural interactions, particularly in applications involving immediate, live, or interactive communication. Prioritizing lower latency usually means using simpler, smaller models and a shorter temporal context, which typically sacrifices some enhancement performance in exchange for real-time operation.
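As a back-of-the-envelope illustration, the algorithmic latency of a frame-based enhancer is bounded below by the amount of audio it must buffer before it can emit output. The frame size, hop size, and look-ahead below are assumed values for illustration, not recommendations.

```python
# Rough algorithmic-latency estimate for a frame-based enhancer (assumed numbers).
SAMPLE_RATE = 16_000      # Hz
FRAME_LEN = 512           # samples per analysis frame (32 ms at 16 kHz)
HOP_LEN = 256             # samples between frames (16 ms)
LOOKAHEAD_FRAMES = 2      # future frames the model looks at, if any

# Output for a frame cannot be produced until the frame plus its look-ahead has arrived.
latency_ms = 1000 * (FRAME_LEN + LOOKAHEAD_FRAMES * HOP_LEN) / SAMPLE_RATE
print(f"algorithmic latency ~ {latency_ms:.0f} ms")  # ~64 ms with these settings
```

Processing time per frame adds to this figure, which is why smaller models and shorter temporal context are favored for real-time use.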

4. Choosing the right solutions

Selecting appropriate speech enhancement solutions involves considering various factors.

Performance and metrics
Selecting performance metrics for speech enhancement involves careful consideration of the application's goals and constraints. For instance, if the primary objective is to enhance speech intelligibility in noisy conditions, metrics like the short-time objective intelligibility (STOI) measure may be more relevant. If the emphasis is on overall speech quality, metrics such as SNR, the perceptual evaluation of speech quality (PESQ), and the mean opinion score (MOS) might be preferred. Furthermore, if no clean reference speech is available, metrics like MOS can be good choices. When speech enhancement functions as front-end processing in an application, application-specific metrics, such as the word error rate (WER) of the ASR model, can serve as valuable indicators of speech enhancement performance. Using a combination of metrics is often beneficial for obtaining a comprehensive understanding of speech enhancement performance.
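For reference, a minimal sketch of computing some of these reference-based metrics with the third-party pesq and pystoi packages is shown below. The file names are placeholders, and mono, time-aligned signals at 16 kHz are assumed.

```python
import numpy as np
import soundfile as sf
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

ref, sr = sf.read("clean_reference.wav")   # clean reference speech (mono, 16 kHz)
deg, _ = sf.read("enhanced_output.wav")    # enhanced (or noisy) speech to evaluate
deg = deg[: len(ref)]                      # trim to a common length

print("PESQ (wideband):", pesq(sr, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, sr, extended=False))

# A simple SNR-style measure against the reference signal:
residual = ref[: len(deg)] - deg
print("SNR [dB]:", 10 * np.log10(np.sum(ref[: len(deg)] ** 2) / np.sum(residual ** 2)))
```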

Compatibility with your systems
Ensure that the solution you choose is compatible with your platform or programming environment. Check the framework and language compatibility, as well as any dependencies required for integration.

Real-time processing
If your application requires real-time processing, the selected method should operate with low latency, since delays in speech enhancement can be disruptive in applications like voice calls or live broadcasts. For non-real-time processing, the emphasis is on achieving the highest possible enhancement performance without time constraints. In these scenarios, it is feasible to employ more computationally intensive speech enhancement techniques, such as larger deep learning models with extended temporal context.

Performance under noise variability
In unpredictable noise scenarios, speech enhancement models should be robust against various noises. Conversely, when noise is known, a custom model optimized for the specific noise may be used. In such a case, it is unnecessary to adapt the model for various noise types.

Performance with speaker variability
When dealing with variable or unknown speaker characteristics, it may be necessary to assess a method's capacity to accommodate diverse speakers. For example, speech enhancement models trained on adult speech may not handle speech from children or the elderly well, due to their distinct acoustic characteristics. Similarly, a speech enhancement model trained on the same language as the target speech is likely to perform better than one trained on a different language.

5. Summary

This article explored how to use speech enhancement techniques in practical applications. While speech enhancement has made significant progress with the emergence of deep learning technologies, challenges remain in integrating it into applications, including adapting to noise and speaker variability, preserving speech quality while reducing noise, and achieving real-time processing. In practical applications, selecting appropriate speech enhancement techniques is crucial to meeting specific requirements, and this article also provided a brief overview of the considerations involved in making such decisions.

References

[1] Chris Donahue, Bo Li, and Rohit Prabhavalkar. "Exploring speech enhancement with generative adversarial networks for robust speech recognition." In ICASSP, pp. 5024-5028, 2018.

[2] Jun Du, Qing Wang, Tian Gao, Yong Xu, Li-Rong Dai, and Chin-Hui Lee. "Robust speech recognition with speech enhanced deep neural networks." In Fifteenth Annual Conference of the International Speech Communication Association (Interspeech), 2014.

[3] Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, and Paul N. Whatmough. "TinyLSTMs: Efficient neural speech enhancement for hearing aids." arXiv preprint arXiv:2005.11138, 2020.

[4] Kuldip Paliwal, Kamil Wójcicki, and Belinda Schwerin. "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain." Speech Communication 52, no. 5: 450-475, 2010.

[5] Yariv Ephraim and Harry L. Van Trees. "A signal subspace approach for speech enhancement." IEEE Transactions on Speech and Audio Processing 3, no. 4: 251-266, 1995.

[6] Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin. "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement." In International Conference on Machine Learning, pp. 2031-2041, 2019.

[7] Craig Macartney and Tillman Weyde. "Improved speech enhancement with the Wave-U-Net." arXiv preprint arXiv:1811.11307, 2018.

