
【PreProcessing】How to Normalize the Mel Spectrogram


This time, I'll explain how to normalize the Mel spectrogram.

1. What is a Mel spectrogram?

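A Mel spectrogram is a kind of spectrogram whose frequency axis is mapped to the Mel scale. A spectrogram is an image-like representation of sequential data, such as audio, that shows how its frequency content changes over time. It is computed with the FFT.
I explained the Mel spectrogram here; please refer to it if needed.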
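
As a rough illustration of the FFT step, here is a minimal sketch that computes a plain power spectrogram with torch.stft. The function name power_spectrogram, the test tone, and the parameter values are assumptions made for this example, and the Mel-scale mapping itself is not included.

```python
import math

import torch


def power_spectrogram(signal, n_fft=256, hop_length=128):
    """Return a [frequency, time] power spectrogram of a 1-D signal."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(
        signal,
        n_fft=n_fft,
        hop_length=hop_length,
        window=window,
        return_complex=True,
    )
    return spec.abs() ** 2


# One second of a 1000 Hz sine wave sampled at 8000 Hz (illustrative values).
sample_rate = 8000
t = torch.arange(sample_rate) / sample_rate
signal = torch.sin(2 * math.pi * 1000.0 * t)

spec = power_spectrogram(signal)
# Each frequency bin is sample_rate / n_fft = 31.25 Hz wide,
# so the energy of a 1000 Hz tone concentrates around bin 32.
peak_bin = spec.mean(dim=1).argmax().item()
```

Each column of spec is the spectrum of one short window of the signal, which is why the result can be viewed as an image with frequency on one axis and time on the other.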

2. Code

The main focus of this article is normalizing the Mel spectrogram.

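import torch


def normalize_melspec(X, eps=1e-6):
    # Standardize each spectrogram over its frequency and time axes.
    mean = X.mean((1, 2), keepdim=True)
    std = X.std((1, 2), keepdim=True)
    Xstd = (X - mean) / (std + eps)

    # Per-spectrogram minimum and maximum of the standardized values.
    norm_min, norm_max = (
        Xstd.min(-1)[0].min(-1)[0],
        Xstd.max(-1)[0].max(-1)[0],
    )
    # True where the value range is large enough to rescale safely.
    fix_ind = (norm_max - norm_min) > eps * torch.ones_like(
        (norm_max - norm_min)
    )
    V = torch.zeros_like(Xstd)
    if fix_ind.sum():
        V_fix = Xstd[fix_ind]
        norm_max_fix = norm_max[fix_ind, None, None]
        norm_min_fix = norm_min[fix_ind, None, None]
        # Clamp to [norm_min_fix, norm_max_fix], then scale to [0, 1].
        V_fix = torch.max(
            torch.min(V_fix, norm_max_fix),
            norm_min_fix,
        )
        V_fix = (V_fix - norm_min_fix) / (norm_max_fix - norm_min_fix)
        V[fix_ind] = V_fix
    return V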

The input is assumed to have the shape [batch_size, frequency, time].

X: The input Mel spectrogram.
mean = X.mean((1, 2), keepdim=True): Calculates the mean over dimension 1 (frequency) and dimension 2 (time). keepdim=True keeps those dimensions with size 1, which is useful for broadcasting.
std = X.std((1, 2), keepdim=True): Calculates the standard deviation over the same dimensions in the same way.
Xstd = (X - mean) / (std + eps): Standardizes each spectrogram so that it has a mean of zero and a standard deviation of approximately one. Adding the small constant eps to the denominator prevents division by zero when the standard deviation is zero.
norm_min: The minimum value of Xstd for each spectrogram, taken over the last dimension (time frames) and then over the second-to-last dimension (frequency bins).
norm_max: The maximum value of Xstd for each spectrogram, taken over the same two dimensions.
fix_ind: A boolean mask identifying which spectrograms have a valid range (where the difference between norm_max and norm_min is greater than eps).
V: An output tensor initialized to zeros with the same shape as Xstd.
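
The steps up to this point can be checked on a small batch. The following is a minimal sketch; the tensor sizes, the random seed, and the constant fourth spectrogram are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Illustrative batch: 4 spectrograms, 8 frequency bins, 16 time frames.
X = torch.randn(4, 8, 16) * 5.0 + 3.0
X[3] = 2.0  # a constant spectrogram, so its standard deviation is zero

eps = 1e-6
mean = X.mean((1, 2), keepdim=True)  # shape [4, 1, 1]
std = X.std((1, 2), keepdim=True)    # shape [4, 1, 1]
Xstd = (X - mean) / (std + eps)      # broadcasts back to [4, 8, 16]

# Reduce over time (last dim), then frequency: one value per spectrogram.
norm_min = Xstd.min(-1)[0].min(-1)[0]  # shape [4]
norm_max = Xstd.max(-1)[0].max(-1)[0]  # shape [4]

# The constant spectrogram has no usable range and is flagged False.
fix_ind = (norm_max - norm_min) > eps
```

Because mean and std keep their reduced dimensions, the subtraction and division broadcast over every frequency bin and time frame without any explicit reshaping.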

If there are any valid spectrograms (fix_ind.sum() is greater than 0):
V_fix: The subset of Xstd corresponding to the valid spectrograms.
norm_max_fix, norm_min_fix: The max and min values for the valid spectrograms, reshaped to [n_valid, 1, 1] for broadcasting.
V_fix is clamped to the range [norm_min_fix, norm_max_fix]. Because these bounds are each spectrogram's own minimum and maximum, this step does not change the values and only acts as a safeguard.
V_fix is then rescaled to the range [0, 1].
The normalized values are assigned back to the appropriate positions in V.
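
The masking and broadcasting in this branch can be traced on a tiny hand-written tensor. This is a sketch with made-up values; it uses amin and amax as a shorthand for the chained min and max calls, and it omits the clamp.

```python
import torch

# Three tiny "spectrograms" of shape [2, 2]; the second one is constant.
Xstd = torch.tensor([
    [[-1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-2.0, 2.0], [0.0, 1.0]],
])

norm_min = Xstd.amin(dim=(1, 2))        # per-spectrogram minimum, shape [3]
norm_max = Xstd.amax(dim=(1, 2))        # per-spectrogram maximum, shape [3]
fix_ind = (norm_max - norm_min) > 1e-6  # only the constant one is False

V_fix = Xstd[fix_ind]                          # shape [2, 2, 2]
norm_min_fix = norm_min[fix_ind, None, None]   # shape [2, 1, 1]
norm_max_fix = norm_max[fix_ind, None, None]   # shape [2, 1, 1]
V_fix = (V_fix - norm_min_fix) / (norm_max_fix - norm_min_fix)

# Write the rescaled spectrograms back; the constant one stays all zeros.
V = torch.zeros_like(Xstd)
V[fix_ind] = V_fix
```

Indexing with [fix_ind, None, None] selects the valid entries and adds two trailing axes in one step, so each spectrogram is scaled by its own minimum and maximum.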

The function returns the tensor V, which contains the normalized Mel spectrograms.

3. Summary

This function rescales each Mel spectrogram to the range [0, 1], which can help machine learning models work effectively with the spectrogram data. Spectrograms whose values are (almost) constant cannot be rescaled and are returned as all zeros.
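
For comparison, the same idea can be written without boolean indexing by using torch.where. This is an alternative sketch, not the function from this article; the name minmax_per_spectrogram and the test batch are hypothetical.

```python
import torch


def minmax_per_spectrogram(X, eps=1e-6):
    """Hypothetical variant: standardize, then min-max scale each item."""
    mean = X.mean((1, 2), keepdim=True)
    std = X.std((1, 2), keepdim=True)
    Xstd = (X - mean) / (std + eps)

    lo = Xstd.amin(dim=(1, 2), keepdim=True)
    hi = Xstd.amax(dim=(1, 2), keepdim=True)
    span = hi - lo
    valid = span > eps

    # Replace unusable spans with 1 so the division is always safe,
    # then zero out the spectrograms that had no usable range.
    safe_span = torch.where(valid, span, torch.ones_like(span))
    scaled = (Xstd - lo) / safe_span
    return torch.where(valid, scaled, torch.zeros_like(Xstd))


torch.manual_seed(0)
batch = torch.randn(3, 64, 100)  # [batch_size, frequency, time]
batch[2] = 0.5                   # constant spectrogram

out = minmax_per_spectrogram(batch)
```

Here the first two outputs span the full [0, 1] range, while the constant third spectrogram comes back as zeros, matching the behavior described above.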

This can also be applied to other image data, so please give it a try.
