
What is a Local Gaussian Model?


Purpose of this Article

Hello. This article is written as the 3rd day (?) of the Acoustical Society of Japan (ASJ) Student/Young Researchers Forum Advent Calendar 2023. I was wondering what to talk about, but I decided to discuss the Local Gaussian Model (LGM) [N. Q. K. Duong+ 2011]. Do you know about LGM? It looks like this:

\mathbf{x}_{tf} \sim \mathcal{N}_\mathbb{C}(\mathbf{0}, \lambda_{tf} \mathbf{H}_{f})

This model is a generative model often used in Blind Source Separation (BSS) and is deeply related to the Itakura-Saito divergence (IS-div.) [C. Févotte+ 2009] for monaural signals. In this article, I would like to explain the interpretation and validity of LGM, starting step-by-step from the generative model of time-domain waveforms.

Gaussian Process Representation of Monaural Time-Domain Waveforms

In LGM and IS-div., it is (implicitly) assumed that time-domain waveforms follow a Gaussian Process (GP; explanations can be found in [Akaho 2018] or [Motohashi+ 2019], etc.). In this model, we assume that the time-domain waveform \mathbf{y} \triangleq \{ y_n \in \mathbb{R} \}_{n=1}^N at N time points t_n \in \mathbb{R}_+ (n=1, \dots, N) follows the zero-mean Gaussian distribution below:

\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{R} \right)

Here, \mathbf{R} represents an N \times N positive semi-definite symmetric matrix (covariance matrix), and the (n_1, n_2) element of \mathbf{R} is generated from a function r(t_{n_1}, t_{n_2}) that takes the corresponding times as arguments. This function r: \mathbb{R}_+ \times \mathbb{R}_+ \rightarrow \mathbb{R} is called a kernel function and determines the shape of the time-domain waveform.

For example, a periodic kernel with angular velocity \omega is defined below and represents a periodic signal with period 1/\omega.

r(t_1, t_2) \propto \exp(-\sin(\pi\omega |t_1 - t_2|)^2)

Here is an example where \mathbf{R} was constructed from this kernel with a period of 200, and three samples of \mathbf{y} were actually drawn. You can see that various waveforms with periodicity are generated.
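The same experiment can be sketched in a few lines of NumPy (the grid of 512 integer time points and the small jitter term added for numerical positive definiteness are my assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 512                       # number of time points (assumed)
t = np.arange(N)              # sample times t_n
omega = 1 / 200               # angular velocity -> period of 200 samples

# Periodic kernel: r(t1, t2) ∝ exp(-sin(pi * omega * |t1 - t2|)^2)
d = np.abs(t[:, None] - t[None, :])
R = np.exp(-np.sin(np.pi * omega * d) ** 2)
R += 1e-6 * np.eye(N)         # small jitter keeps R positive definite

# Draw three waveforms y ~ N(0, R)
Y = rng.multivariate_normal(np.zeros(N), R, size=3)

# The kernel enforces (near-)periodicity: y[n] ≈ y[n + 200]
periodicity_gap = np.max(np.abs(Y[:, :N - 200] - Y[:, 200:]))
```

Every draw is a different waveform, but all of them repeat almost exactly every 200 samples — that much is guaranteed by the covariance structure alone.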

Additionally, the spectral kernel [Lázaro-Gredilla+ 2010] with angular velocity \omega is defined below and represents a trigonometric function with period 1/\omega.

r(t_1, t_2) \propto \cos(2\pi \omega (t_1 - t_2))

An example of \mathbf{R} and \mathbf{y} using this kernel is shown below. It is interesting how clean trigonometric functions emerge even though we are sampling from a Gaussian distribution.
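A sketch of why clean sinusoids emerge (again with an assumed period of 200 and a jitter term): the spectral kernel matrix has numerical rank 2, spanned by one cosine and one sine, so every draw must be a linear combination of the two.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 512
t = np.arange(N)
omega = 1 / 200

# Spectral kernel: r(t1, t2) ∝ cos(2 * pi * omega * (t1 - t2))
R = np.cos(2 * np.pi * omega * (t[:, None] - t[None, :]))
R += 1e-6 * np.eye(N)         # jitter for positive definiteness

y = rng.multivariate_normal(np.zeros(N), R)

# cos(w*(t1 - t2)) = cos(w*t1)cos(w*t2) + sin(w*t1)sin(w*t2), so any
# draw is a*cos + b*sin: fit that 2-column basis and check the residual
basis = np.stack([np.cos(2 * np.pi * omega * t),
                  np.sin(2 * np.pi * omega * t)], axis=1)
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
residual = y - basis @ coef
```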

It is also possible to generate complex signals. For example, if you prepare an appropriate speech signal \mathbf{y}' and calculate its covariance matrix, you can sample that speech signal from a Gaussian distribution.

\mathbf{R} = \mathbf{y}' \mathbf{y}'^{\mathsf{T}} + \epsilon \mathbf{I}

Where \epsilon \in \mathbb{R}_+ is a small positive number to make \mathbf{R} positive definite. Let's actually sample from this as well. First, I prepared the following speech signal as \mathbf{y}'.

The result of calculating \mathbf{R} from this signal and sampling \mathbf{y} is shown below. The time-domain waveform of the speech signal is correctly generated.

It's hard to see in this figure, but since the rank of \mathbf{R} is (nearly) 1, it reproduces the shape of the original signal exactly. However, since one degree of freedom remains, a signal scaled by an appropriate real number relative to the original signal \mathbf{y}' is generated.
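The rank-1 construction is easy to reproduce; in this sketch a synthetic chirp stands in for the speech signal \mathbf{y}' (the article uses a real recording):

```python
import numpy as np

rng = np.random.default_rng(2)

# A synthetic chirp stands in for the speech signal y'
N = 400
n = np.arange(N)
y_ref = np.sin(2 * np.pi * (0.01 + 0.0002 * n) * n)

# R = y' y'^T + eps * I  (eps makes R positive definite)
eps = 1e-8
R = np.outer(y_ref, y_ref) + eps * np.eye(N)

y = rng.multivariate_normal(np.zeros(N), R)

# Because rank(R) is (nearly) 1, y is y' times a random real scale
scale = (y @ y_ref) / (y_ref @ y_ref)
mismatch = np.max(np.abs(y - scale * y_ref))
```

The draw matches the original waveform up to the single remaining degree of freedom, the real scale factor.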

In this way, by using Gaussian processes (or covariance matrices), any time-domain waveform can be represented (allowing for scale indeterminacy).

Wiener Filtering is All You Need

The scale indeterminacy in the generative model of time-domain waveforms using Gaussian processes is not a problem in practice. Let's explain this using monaural source separation as an example. Here, we assume that the observed signal \mathbf{y} is represented as the sum of K source signals \mathbf{y}^{(k)} \in \mathbb{R}^N (k=1,\ldots,K).

\mathbf{y} = \sum_{k=1}^K \mathbf{y}^{(k)}

Furthermore, each source signal is assumed to follow a Gaussian distribution with covariance parameter \mathbf{R}^{(k)}. In this case, the mixed sound \mathbf{y} follows a Gaussian distribution with covariance \mathbf{R} \triangleq \sum_{k=1}^K \mathbf{R}^{(k)} due to the reproductive property of the Gaussian distribution.

\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{R}\right)

The source separation problem can be formulated as a problem of estimating the covariance parameters \mathbf{R}^{(k)} from the mixed sound \mathbf{y} under appropriate constraints. As explained in detail in the next section, source separation based on IS-div. can also be regarded as a special case of this framework.

When the covariance parameters are obtained by an appropriate inference method, the estimated value \hat{\mathbf{y}}^{(k)} of source k is obtained as the posterior expectation of \mathbf{y}^{(k)} calculated from \mathbf{R}^{(1,\ldots, K)} and \mathbf{y} as follows.

\hat{\mathbf{y}}^{(k)} = \mathbb{E}[\mathbf{y}^{(k)} \mid \mathbf{y}, \mathbf{R}^{(1, \ldots, K)}] = \mathbf{R}^{(k)}\mathbf{R} ^{-1} \mathbf{y}

Such a separation method is called Wiener filtering [A. Liutkus+ 2011]. The equation is easy to understand by analogy with the scalar case: it masks (allocates) the observed signal using positive semi-definite symmetric matrices, so the scale of each separated sound is uniquely determined by the observed signal \mathbf{y}.

Actually, I would like to restore the source signals \hat{\mathbf{y}}^{(1)} and \hat{\mathbf{y}}^{(2)} from the mixed sound \mathbf{y} of the following speech signals \mathbf{y}^{(1)} and \mathbf{y}^{(2)}.

This time, as an ideal condition, I performed Wiener filtering by calculating \mathbf{R}^{(1)} and \mathbf{R}^{(2)} from the ground-truth source signals. The following figure shows the ground-truth signals in the left column and the separation results in the right column. It can be seen that the individual source signals can be correctly restored, including their scales, from only \mathbf{R}^{(1, 2)} and \mathbf{y}.
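This oracle experiment can be sketched with two synthetic sources standing in for the speech signals; the covariances are the same rank-1 construction as in the previous section:

```python
import numpy as np

N = 400
n = np.arange(N)

# Two synthetic sources standing in for y^(1) and y^(2)
y1 = np.sin(2 * np.pi * 0.01 * n)
y2 = np.sign(np.sin(2 * np.pi * 0.003 * n))
y_mix = y1 + y2                           # observed mixture

# Oracle covariances computed from the ground-truth sources
eps = 1e-8
R1 = np.outer(y1, y1) + eps * np.eye(N)
R2 = np.outer(y2, y2) + eps * np.eye(N)
R = R1 + R2                               # covariance of the mixture

# Wiener filter: y_hat^(k) = R^(k) R^{-1} y
v = np.linalg.solve(R, y_mix)
y1_hat = R1 @ v
y2_hat = R2 @ v
```

Both estimates recover their sources including the scale, and they sum back exactly to the mixture, since (\mathbf{R}^{(1)} + \mathbf{R}^{(2)})\mathbf{R}^{-1} is the identity.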

In this way, if the covariance matrix of the source signals contained in the mixed sound can somehow be estimated, the source signals can be restored precisely. In other words, even if specific values at a certain time are not known, as long as the expected variance and the relationship with other times are known, the original signal can be uniquely restored. The Acoustic Scene Understanding Team at RIKEN AIP, which I belong to, presented deep source separation based on such Gaussian processes of time-domain waveforms and Wiener filtering at WASPAA 2023 [A. A. Nugraha+ 2023], so please read it if you are interested.

Frequency Representation of Time-Domain Waveforms

Next, let's consider the generative model of time-domain waveforms in the frequency domain. By multiplying the time-domain waveform \mathbf{y} by the discrete Fourier transform matrix \mathbf{F} \in \mathbb{C}^{N\times N}, we obtain its frequency representation \mathbf{x} \triangleq \mathbf{F}\mathbf{y} \in \mathbb{C}^N. If the original time-domain waveform \mathbf{y} follows a zero-mean Gaussian distribution with covariance matrix \mathbf{R}, its frequency representation \mathbf{x} follows the zero-mean complex Gaussian distribution below:

\mathbf{F}\mathbf{y} = \mathbf{x} \sim \mathcal{N}_\mathbb{C}\left(\mathbf{0}, \mathbf{F} \mathbf{R} \mathbf{F}^\mathsf{H}\right)

A signal sampled from this distribution remains complex even after the inverse transform, because the complex Gaussian does not enforce the conjugate symmetry of real signals. However, when applying Wiener filtering, the results match between the time and frequency representations, so there is no practical problem.

Let's also actually sample and look at this. I calculated \mathbf{Q} \triangleq \mathbf{FRF}^\mathsf{H} from the \mathbf{R} of the speech signal that appeared at the end of the section before last.

Applying the inverse discrete Fourier transform to a signal sampled from this distribution results in the following time-domain signal:

When the rank of the covariance matrix \mathbf{R} (and similarly \mathbf{Q}) is (nearly) 1, the sampled signal will be like the original time-domain waveform multiplied by a complex scalar. This part is a straightforward extension of the discussion on time-domain waveforms to complex numbers.
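A sketch of this frequency-domain sampling, with a synthetic sinusoid standing in for the speech signal and the unnormalized DFT matrix (for which \mathbf{F}\mathbf{F}^\mathsf{H} = N\mathbf{I}, so the \epsilon\mathbf{I} term maps to \epsilon N \mathbf{I}):

```python
import numpy as np

rng = np.random.default_rng(4)

N = 256
n = np.arange(N)
y_ref = np.sin(2 * np.pi * 0.02 * n)      # stand-in for the speech signal

F = np.fft.fft(np.eye(N))                 # unnormalized DFT matrix
x_ref = F @ y_ref

# Q = F R F^H with R = y' y'^T + eps I  =>  Q = x' x'^H + eps * N * I
eps = 1e-8
Q = np.outer(x_ref, x_ref.conj()) + eps * N * np.eye(N)

# Draw x ~ N_C(0, Q): x = L z with Q = L L^H, z standard circular normal
L = np.linalg.cholesky(Q)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
x = L @ z

y = np.fft.ifft(x)                        # back to the time domain: complex!

# Up to a random *complex* scale, y reproduces the original waveform
scale = np.vdot(y_ref, y) / np.vdot(y_ref, y_ref)
mismatch = np.max(np.abs(y - scale * y_ref))
```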

Itakura-Saito Divergence

In parameter estimation problems such as source separation, when estimating the covariance matrix \mathbf{R} or \mathbf{Q}, it is expected that if either can be written in a simpler form, the estimation problem becomes easier. Many (time-)frequency domain source separation methods assume that \mathbf{R} is diagonalized by the discrete Fourier transform matrix.

\mathbf{Q} = \mathbf{FRF}^\mathsf{H} \approx \mathrm{diag}(\mathbf{q})

Here, \mathbf{q} \in \mathbb{R}_+^N is a non-negative vector representing the diagonal elements. Such an assumption holds when the covariance \mathbf{R} of the time-domain waveform is a circulant matrix. Conceptually, this assumption holds when the signal is stationary and there is no correlation between frequency components (= ignoring the phase).
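The circulant claim is easy to check numerically: for a circulant \mathbf{R} (here built from a wrapped exponential decay chosen purely for illustration), \mathbf{F}\mathbf{R}\mathbf{F}^\mathsf{H} comes out diagonal with a positive diagonal.

```python
import numpy as np

N = 128
k = np.arange(N)

# First row of a circulant covariance: stationary, wrapped exponential decay
r = 0.9 ** np.minimum(k, N - k)
R = np.array([np.roll(r, s) for s in range(N)])   # circulant matrix

F = np.fft.fft(np.eye(N))                          # DFT matrix
Q = F @ R @ F.conj().T

# Off-diagonal entries vanish (up to roundoff); the diagonal is positive
off = Q - np.diag(np.diag(Q))
max_off = np.max(np.abs(off))
min_diag = np.min(np.diag(Q).real)
```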

The reason the Short-Time Fourier Transform (STFT) is frequently used in acoustic signal processing can be understood as follows: within a sufficiently short time frame the signal can be considered stationary, so this diagonalization approximation can be used. Hereafter, we let the time signal in the t-th time frame be \mathbf{y}_t, and its frequency representation be \mathbf{x}_{t} \triangleq [x_{t1}, \ldots, x_{tf}, \ldots, x_{tN}]^\mathsf{T}. If we write the diagonal components \mathbf{q}_t of the covariance of \mathbf{x}_t as [\lambda_{t1}, \ldots, \lambda_{tN}]^\mathsf{T} \in \mathbb{R}_+^N, it follows from the discussion so far that each spectrogram coefficient x_{tf} \in \mathbb{C} obeys a zero-mean complex Gaussian distribution:

\mathcal{N}_\mathbb{C}(x_{tf} \mid 0, \lambda_{tf}) = \frac{1}{\pi \lambda_{tf}} \exp\left(-\frac{|x_{tf}|^2}{\lambda_{tf}}\right) \propto \exp\left(-\log \lambda_{tf} - \frac{|x_{tf}|^2}{\lambda_{tf}}\right)

The negative log-likelihood of this Gaussian matches, up to terms independent of \lambda_{tf}, the Itakura-Saito divergence [C. Févotte+ 2009] between the power spectrogram |x_{tf}|^2 and the power spectral density (PSD) \lambda_{tf}. Since the troublesome matrix calculations required for full covariance matrices reduce to scalar calculations, PSD estimation of this form is widely solved, e.g., as a low-rank approximation problem of the power spectrogram. Conversely, Positive Semidefinite Tensor Factorization (PSDTF) [K. Yoshii+ 2013] is known as research that explicitly models the correlation between frequency bins (= a type of phase information).
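This correspondence is easy to verify numerically: the negative log-likelihood and the IS divergence differ only by a term that does not depend on \lambda_{tf}, so minimizing one over \lambda_{tf} minimizes the other. A small sketch with random values:

```python
import numpy as np

def neg_log_likelihood(x, lam):
    # -log N_C(x | 0, lam) for a circularly-symmetric complex Gaussian
    return np.log(np.pi * lam) + np.abs(x) ** 2 / lam

def itakura_saito(p, lam):
    # IS divergence between power p = |x|^2 and PSD lam
    return p / lam - np.log(p / lam) - 1

rng = np.random.default_rng(5)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
lam = rng.uniform(0.5, 2.0, 8)

# The difference is log(pi * |x|^2) + 1, constant with respect to lam
diff = neg_log_likelihood(x, lam) - itakura_saito(np.abs(x) ** 2, lam)
```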

Instantaneous Mixture Model in the Frequency Domain

Now, I haven't yet explained what kind of model the LGM actually is: it is a generative model for multi-channel acoustic signals. A multi-channel acoustic signal is a signal captured by multiple microphones m=1,\ldots,M located at different positions; such a group of microphones is called a microphone array. Volume and arrival-time differences arise between the microphones depending on the position of the sound source, so this spatial information can be used to estimate both the source position and the source signal.

Time differences and volume differences correspond to convolving a transfer function \mathbf{b}_m \in \mathbb{R}^N with the source signal \mathbf{y}_{t} in the time domain.

\mathbf{y}_{mt} = \mathbf{b}_m * \mathbf{y}_{t}

Here, \mathbf{y}_{mt} \in \mathbb{R}^N represents the signal observed by the m-th microphone, and * represents the convolution operation. When the length of the transfer function is sufficiently short (when reverberation is short), this convolution corresponds to multiplying a monaural source signal x_{tf} by the frequency representation of the transfer function \mathbf{a}_{f} \in \mathbb{C}^M in the time-frequency domain.

\mathbf{x}_{tf} = \mathbf{a}_f x_{tf}

Here, \mathbf{x}_{tf} \in \mathbb{C}^M represents the M-channel observed spectrogram. When the source signal follows a zero-mean complex Gaussian distribution with PSD \lambda_{tf}, the M-channel observed signal \mathbf{x}_{tf} follows the multivariate complex Gaussian distribution below.

\mathbf{x}_{tf} \sim \mathcal{N}_\mathbb{C}\left(\mathbf{0}, \lambda_{tf} \mathbf{a}_f \mathbf{a}_f^\mathsf{H} \right)

Such a sound propagation model is called a rank-1 spatial model and is widely used because it can be inferred stably and quickly. A typical example of a source separation method using the rank-1 spatial model is Independent Low-Rank Matrix Analysis (ILRMA) [D. Kitamura+ 2016]. ILRMA is a BSS method: it statistically and simultaneously estimates the source signals and propagation processes, so it requires neither position information of sound sources/microphones nor prior supervised learning. Implementations are also available as OSS, so you can easily try it out if you purchase a microphone array.
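The "rank-1" in the name is literal: the spatial covariance \lambda_{tf}\mathbf{a}_f\mathbf{a}_f^\mathsf{H} has exactly one non-zero eigenvalue. A sketch with a hypothetical far-field steering vector (the delays and the frequency value are made up for illustration):

```python
import numpy as np

M = 4                                   # number of microphones
f0 = 0.1                                # normalized frequency (hypothetical)
tau = np.array([0.0, 0.5, 1.0, 1.5])    # per-mic delays in samples (hypothetical)

# Steering vector a_f: pure delays at this frequency bin
a = np.exp(-2j * np.pi * f0 * tau)

lam = 2.0                               # source PSD lambda_tf at this bin
Sigma = lam * np.outer(a, a.conj())     # covariance of x_tf

# Exactly one non-zero eigenvalue: lam * ||a||^2
eigvals = np.linalg.eigvalsh(Sigma)     # ascending order
```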

Local Gaussian Model

The rank-1 spatial model is based on the convolution of a time-invariant transfer function, so its performance degrades if the sound source moves. The LGM [N. Q. K. Duong+ 2011] is known as a model that alleviates this problem. The LGM can be interpreted as using multiple transfer functions \mathbf{a}_{lf} \in \mathbb{C}^M (l=1,\ldots, L) to represent a single sound source. If we interpret the observed signal as the sum of source signals following each transfer function, the observed signal follows the multivariate complex Gaussian distribution below.

\mathbf{x}_{tf} \sim \mathcal{N}_\mathbb{C}\left(\mathbf{0}, \lambda_{tf} \mathbf{H}_f \right)

Here, \mathbf{H}_f \triangleq \sum_{l=1}^L \mathbf{a}_{lf}\mathbf{a}_{lf}^\mathsf{H} is a full-rank spatial correlation matrix: the sum of the outer products of the transfer functions. We have finally reached the LGM in the title. In practice, even if the sound source arrives from only one specific direction at each moment, a moving source can be interpreted as a source arriving from multiple directions over the long term. While this model cannot accommodate significant movement of the sound source, it is known to exhibit high performance for movements on the scale of a speaker's head. Furthermore, bustling noise like a crowd, called diffuse noise, can be interpreted as sound coming from countless directions, and this model is also effective for representing such noise sources.
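In the same sketch style as before, summing the outer products of several (hypothetical, randomly delayed) steering vectors makes the spatial covariance full rank, in contrast to the rank-1 model:

```python
import numpy as np

rng = np.random.default_rng(7)

M, L = 4, 6                   # microphones, assumed number of directions
f0 = 0.1                      # normalized frequency (hypothetical)

# Hypothetical steering vectors a_lf for L directions (random delays)
taus = rng.uniform(0, 3, size=(L, M))
A = np.exp(-2j * np.pi * f0 * taus)      # row l is a_lf

# H_f = sum_l a_lf a_lf^H: a full-rank spatial correlation matrix
H = sum(np.outer(A[l], A[l].conj()) for l in range(L))

lam = 1.5
Sigma = lam * H               # LGM covariance lambda_tf * H_f

# All M eigenvalues are now strictly positive (full rank)
eigvals = np.linalg.eigvalsh(Sigma)
```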

I am developing a method called Deep Full-rank Spatial Correlation Analysis (Neural FCA) [Y. Bando+ 2021], which learns source separation in an unsupervised manner by combining LGM and Variational Autoencoders (VAE). I have written an explanatory article about this method and VAE below, so please read it if you like.

https://zenn.dev/yoshipon/articles/2d478c63c477e8

LGM and its BSS applications such as FCA [N. Q. K. Duong+ 2011] and Multichannel Non-negative Matrix Factorization (MNMF) [H. Sawada+ 2013] have good peak performance. However, when they were first proposed, they were known for the difficulty of inference within a realistic timeframe due to huge computational costs, as well as their strong dependency on initial values. Nevertheless, with recent Neural FCA based on GPGPU and amortized inference, it has become possible to achieve high performance and realistic inference times. I hope that research integrating deep learning and statistical signal processing will be discussed more broadly and actively in the future.

Conclusion

To explain the Local Gaussian Model (LGM), I have provided a step-by-step explanation starting from the generative model of time-domain waveforms. It was very educational for me, as points I had misunderstood around the transition to complex signals became clear. Since there are still many areas where my explanation or understanding might be insufficient, please feel free to point out any questions or mistakes. Although it is 6 days late for the Advent Calendar, I hope you enjoy it as a companion for your weekend teatime. The Jupyter Notebook used to generate the figures in this article can be viewed below.

https://gist.github.com/yoshipon/145767a5af9ea67457166e584d633130

Also, the Social Intelligence Research Team at AIST (National Institute of Advanced Industrial Science and Technology), to which I belong, is recruiting short-term (and long-term) interns. We are promoting research such as the integration of statistical signal processing and deep learning, and audiovisual integration processing, so please feel free to contact us if you are interested.

http://onishi-lab.jp/joinus/intern/index.html

Acknowledgments

I would like to thank Dr. Ryo Nishikimi for providing many meaningful comments during the writing of this article. I would also like to thank Associate Professor Kazuyoshi Yoshii for his extensive guidance in understanding the content of this article from my master's course to the present.
