🐫

【Wave Analytics】What is MFCCs?

2024/04/20に公開

1. MFCCs(Mel-Frequency Cepstral Coefficients)

・MFCCs

In coclution, MFCCs quantify the self-similarity of the high-pass filtered signal at different time scales (musical pitch removed, robust to bandwidth reduction).

It has been used for voice recognition and timbre description.

2. Process flow

  1. Input raw data (Time domain)
  2. DFT:Discrete Fourier Transfrom (Frequency domain)
  3. Mel filter bank processing (Frequency domain)
  4. Logarithmic transform (frequency domain)
  5. DCT:Discrete Cosine Transform (Time domain)

In this process, domain changes 2 times.
First time, domain changes into frequency from time and It create spectrum.
Second time, domain back to time, It create "spectrum of spectrum".

What is this? The easy answer is self-similarity, as showed behind.

So, Let's dive into deeper.

3. Concrete process

3.1 DFT

Convert to frequency components.
Also domain changes to frequency.

3.2 Mel filter bank processign

・Mel filter

Convert to Mel scale. This process extarct low-frequency components, but weve is in frequency domain now.
What happens then? Some ordinaly spectrum has high value and frequency in low frequency, so it means low-pass filter works like high pass filter in frequency domain.

This process extract high frequency component(with input signal), and it makes singal lost about pitch information.
Indeed, pitch information is irrelevant to both voice recognition and timbre description.

3.3 Logarithmic Transfrom

Apply log trans to signal.
Because, DCT of log(DFT and Mel filtered signal) similar to autocorrelation in high frequency range(which happens to be the case) like over about 800Hz.

So, we can interpret it as autocorrelation.
But why don't use autocorrelation itself(It's so easy transform than mfcc)? One possible correct answer is this.

it robust to the removal of the fundamental frequency, whereas autocorrelation isn’t. MFCCs were used in voice recognition tasks. Recognition should be possible when the speaker is on the phone for instance — a case in which it is likely that the fundamental will be missing due to bandpass filtering.
Quotation: Cepstrum and MFCC

3.4 DCT

Simply put, DCT is the real part of DFT.
It provide similar to autocorrelation in high frequency range. The higher frequencies of abs(normalised autocorrelation) and the MFCCs show a significant frame-by-frame correlation.

The result of DCT had been deleted pitch, and robust to reduction of frequency band.

In a word, MFCCs quantify the self-similarity of the high-pass filtered signal at different time scales (musical pitch deleted, robust to reduction of frequency band).

In mathematically, MFCCs are provided as coefficient of result of DCT(Coefficient of each decomposed cosine wave in sepcific time window size).
・MFCCs

Intuitive explaination of image of MFCCs

In an MFCC plot or image, the coefficients are typically arranged with the first coefficient (often called the zeroth coefficient) at the bottom and higher indices going upwards. The first few coefficients (starting from the zeroth) are generally the most significant in terms of capturing the bulk of the information about the signal's spectral envelope. Here’s how to understand their roles:

  1. Zeroth Coefficient
    This is often the overall energy of the frame, which some systems use while others discard, depending on the application's sensitivity to amplitude variations.
  2. First Few Coefficients
    These coefficients (e.g., MFCC1, MFCC2, MFCC3) capture the most dominant features of the signal’s spectral shape. They represent the largest variances within the spectral envelope, which generally correspond to the most perceptually important aspects of the sound—like the basic formant structure in speech.
  3. Higher-Order Coefficients: As you move to higher coefficients, they begin to represent finer and finer detail in the sound. These details are often less critical for tasks like speech recognition, where broad spectral features dominate the recognition process.

3.5 Usefullness

MFCCs quantify the self-similaryty of the high-pass filtered signal(pitch deleted), and it achieved many success of voice recognition tasks.

Thinking from this, we can think that it is able to identify a human's voice from MFCCs(self-similaryty of the high-pass filtered signalat different time scales).

So, is it can reflect 'timbre'? It's still mystery even now.

Summary

This time, I explained about MFCCs.
Thank you for reading.

Reference

(1) Mel-Frequency Cepstral Coefficients Explained Easily
(2) Intuitive understanding of MFCCs
(3) Cepstrum and MFCC

Discussion