To everyone who participated in the Computational Physics Spring School 2023: thank you for your hard work. In this article, we consider running a four-dimensional theory with the GomalizingFlow.jl package. The idea I came away with was to change the structure of the convolutional network that makes up the affine coupling layer. This increased how quickly the acceptance rate improves during training, meaning that training can be done more efficiently, in less time, than before.
Review from Part 1
Let's review Part 1. https://zenn.dev/terasakisatoshi/articles/introduction-to-gomalizingflow-part1
In a field theory on a d-dimensional lattice, we take the standpoint of discretizing spacetime and cutting it off at a finite size. Let L be the number of lattice points along each axis. If we assume periodic boundary conditions, the lattice \mathcal{L} can be identified with (\mathbb{Z}/L\mathbb{Z})^d.
A field \phi on the lattice \mathcal{L} = (\mathbb{Z}/L\mathbb{Z})^d is a map \phi : \mathcal{L} \to \mathbb{R}, but since \mathcal{L} is a finite set, \phi can be regarded as a multi-dimensional array. For example, if d=2, it is a matrix. Hereafter, please think of \phi as a multi-dimensional array. We will also call \phi a configuration.
Flattened into a 1D vector, a configuration \phi has D = L^d components (assume D is even for convenience). We divide \phi into two parts, \phi_e and \phi_o, according to whether the index is even or odd. Each can be thought of as a D/2-dimensional vector.
Normalizing Flow reduces the sampling problem from data with a complex distribution to sampling from a simple distribution (e.g., Gaussian distribution) using a differentiable and invertible map (flow).
We define \varphi = T^{o}(\phi):
\begin{aligned}
\varphi_e &= \exp(s(\phi_o)) \odot \phi_e + t(\phi_o), \\
\varphi_o &= \phi_o.
\end{aligned}
Here, \odot denotes the element-wise product, and \exp(s(\phi_o)) applies \exp to each element of s(\phi_o). s and t are machine learning models with parameters \theta = (\theta^{(s)}, \theta^{(t)}); mathematically, each is a (generally non-linear) map from D/2 dimensions to D/2 dimensions. The transformation T^e, which changes the even-indexed values while keeping the odd-indexed values unchanged, is defined similarly. This kind of flow is called an affine coupling layer.
As discussed in Part 1, the above transformation is differentiable and invertible. Moreover, its Jacobian (the determinant of the Jacobian matrix) can be computed efficiently: the even entries are scaled element-wise by \exp(s(\phi_o)) and the odd entries are untouched, so the log-Jacobian is simply the sum of the components of s(\phi_o).
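To make this concrete, here is a minimal NumPy sketch of T^{o}, its inverse, and the log-Jacobian. The small linear-plus-tanh maps `s_net` and `t_net` are arbitrary stand-ins for the trained networks, not the actual GomalizingFlow.jl models:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # flattened configuration size (even by assumption)

# Stand-ins for the learned networks s, t : R^{D/2} -> R^{D/2}.
W_s = rng.normal(size=(D // 2, D // 2)) * 0.1
W_t = rng.normal(size=(D // 2, D // 2)) * 0.1
s_net = lambda x: np.tanh(W_s @ x)
t_net = lambda x: W_t @ x

def T_o(phi):
    """Affine coupling layer: update the even entries, keep the odd ones."""
    phi_e, phi_o = phi[0::2], phi[1::2]
    s = s_net(phi_o)
    varphi = phi.copy()
    varphi[0::2] = np.exp(s) * phi_e + t_net(phi_o)
    # log|det J| is just sum(s): only the even entries are scaled.
    return varphi, s.sum()

def T_o_inv(varphi):
    """Inverse map: recover phi_e from varphi_e using the unchanged phi_o."""
    phi_o = varphi[1::2]
    s = s_net(phi_o)
    phi = varphi.copy()
    phi[0::2] = (varphi[0::2] - t_net(phi_o)) * np.exp(-s)
    return phi

phi = rng.normal(size=D)
varphi, logdet = T_o(phi)
assert np.allclose(T_o_inv(varphi), phi)  # invertibility holds exactly
```

The inverse only needs \exp(-s), never an inverse of the networks themselves, which is exactly why the coupling-layer construction is convenient.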
In GomalizingFlow.jl, we prepare multiple such transformations (T_i^o, T_i^e, i = 1,2,\dots, M). We aim to construct a flow with rich representational power using a composite map that applies T_i^o and T_i^e alternately.
Network Structure
The s=s_i and t=t_i that appeared in the definition of T_i^o are machine learning models, more specifically, neural networks, having \theta_{i} = (\theta_{i}^{(s)}, \theta_{i}^{(t)}) as parameters. For readers interested in the specific implementation, I will describe the model structure.
For simplicity, let's consider the case of a two-dimensional theory, i.e., d = 2. In this case, the configuration \phi is a matrix of shape L \times L. If we avoid the word "matrix", it can be regarded as 1-channel image data whose spatial shape is (L, L) and whose intensity values are real numbers \mathbb{R}. For convenience, we treat data with spatial size (L, L) and 1 channel as a 3D array of shape (L, L, 1).
When thinking about a transformation for this, convolutional networks come to mind. Let \mathrm{Conv_{1 \to 2}} be a transformation from 1 channel to 2-channel features.
\mathrm{Conv_{1 \to 2}}: \phi \mapsto C(\phi)
As a map m_o=m_o(x) on D dimensions, consider the following map:
\mathbb{R}^D \ni x = (x_1, x_2, \dots, x_D) \mapsto (x_1, 0, x_3, 0, \dots) \in \mathbb{R}^D
This keeps the odd-indexed entries of the vector and zeroes out the even-indexed ones. Using this m_o, \phi_o \in \mathbb{R}^{D/2} can be identified with m_o(\phi) \in \mathbb{R}^D. Computing \mathrm{Conv_{1 \to 2}}(m_o(\phi)) yields an array of shape (L, L, 2). Denote the first channel's data, of shape (L, L, 1) \sim (L, L), by s. We mask it with m_o and identify m_o(s) \in \mathbb{R}^{D} with a D/2-dimensional vector; we write the result as s(\phi_o). t(\phi_o) is defined in the same way from the other channel. This allows the affine coupling layer to be computed.
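The masking step above can be sketched in a few lines of NumPy. The parity mask follows the 1-based odd/even indexing described in the text, and a random (L, L, 2) array stands in for the output of \mathrm{Conv_{1 \to 2}}:

```python
import numpy as np

L = 4
rng = np.random.default_rng(1)
phi = rng.normal(size=(L, L))  # a 2D configuration

# m_o: keep odd-indexed entries (1-based) of the flattened field, zero the rest.
idx = np.arange(L * L)
mask_o = ((idx + 1) % 2 == 1).astype(float).reshape(L, L)  # pattern 1,0,1,0,...

masked = mask_o * phi  # m_o(phi), identified with phi_o embedded in R^D

# Stand-in for Conv_{1->2}(m_o(phi)): a 2-channel feature map of shape (L, L, 2).
conv_out = rng.normal(size=(L, L, 2))
s = mask_o * conv_out[:, :, 0]  # m_o applied to channel 0, i.e. s(phi_o)
t = mask_o * conv_out[:, :, 1]  # likewise t(phi_o) from the other channel
```

The point is that the D/2-dimensional vectors \phi_o, s(\phi_o), t(\phi_o) never need to be extracted explicitly; everything is done on full (L, L) arrays with zeros at the unused positions.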
Regarding the data layout of features
Readers familiar with deep learning frameworks might expect the last dimension to represent channels in convolution calculations. This looks similar to the (H, W, C) layout adopted by TensorFlow, but it is really PyTorch's (C, H, W) layout in reverse order, i.e., (W, H, C), read from right to left. This matches the data layout adopted by Flux.jl, the machine learning framework used by GomalizingFlow.jl. Including the mini-batch size B, the layout is (W, H, C, B), whereas PyTorch uses (B, C, H, W). The order is reversed because Flux.jl's implementation is inspired by PyTorch (discussions in its issues compare against PyTorch) while Julia stores arrays in column-major order, so the natural loop order, and hence the axis order, is flipped.
The topic of memory layout is also briefly mentioned in Mr. Nagai's book, "Learn Julia in One Week," so please refer to it if necessary.
Regarding convolution padding
The padding used in the convolution calculations is circular padding, which corresponds to the periodic boundary conditions of the field. In other words, when computing the convolution near the left edge, information from the right edge is used by wrapping around, and the top edge likewise uses information from the bottom edge. This allows transformations that preserve the spatial dimensions of the image. In PyTorch this is easy to use via the padding_mode option. However, since it was not implemented in Flux.jl, I implemented it with array operations, cutting out the right region of the feature map and attaching it to the left, and so on, so as to define a function differentiable with Zygote.jl. I would be happy to hear of a better way; not only I but also the Flux.jl developers would appreciate it.
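The slice-and-concatenate idea described above can be sketched in NumPy (this is an illustrative stand-in, not the Zygote.jl-differentiable Julia implementation). For reference, NumPy's own `np.pad(..., mode="wrap")` computes the same thing:

```python
import numpy as np

def circular_pad2d(x, p):
    """Pad a 2D array by p cells on every side with wrap-around values,
    built purely from slicing and concatenation."""
    # Attach the right edge to the left and the left edge to the right.
    x = np.concatenate([x[:, -p:], x, x[:, :p]], axis=1)
    # Attach the bottom edge to the top and the top edge to the bottom
    # (corners wrap in both directions automatically).
    x = np.concatenate([x[-p:, :], x, x[:p, :]], axis=0)
    return x

a = np.arange(9).reshape(3, 3)
padded = circular_pad2d(a, 1)
assert np.array_equal(padded, np.pad(a, 1, mode="wrap"))
```

A convolution with "valid" padding applied to the padded array then preserves the original spatial size while respecting the periodic boundary conditions.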
Generalizing the 2D Example
The example above was explained for d = 2, but d = 3 can be implemented in the same way using 3D convolutions. Although the required GPU memory increases and training takes longer than in the d = 2 case, I have confirmed that training progresses, with the ESS and acceptance_rate improving. Naturally, I wanted to run the d = 4 case as well. However, 4D convolutions are not provided in Flux.jl. Since it is not strictly necessary to use a convolution whose kernel/filter has that exact dimension (in principle, any map computable within an automatic differentiation framework is acceptable), I looked for a way to build one from existing components. I summarized the resulting findings here:
https://twitter.com/TomiyaAkio/status/1583304573208645633
Combinational-convolution for flow-based sampling algorithm
Regarding Combinational-convolution
In a given feature space, we select k < d of the d axes and perform the convolution only on those k axes. This is best explained with the following example.
Suppose the mini-batch size is B, the number of channels is inC, and the spatial size of the feature space is (L1, L2, L3, L4). This setting assumes each channel has 4-dimensional data. For data with the shape of a tuple (L1, L2, L3, L4, inC, B), we create data with the shape (L1, L2, L3, L4, outC, B) having outC channels through the following procedure. Explaining step-by-step focusing on the transition of the data shape:
(L1, L2, L3, L4, inC, B) # We want to perform convolution on the L1 and L2 axes
->
(L1, L2, inC, L3, L4, B) # Swap the axes of the data
->
(L1, L2, inC, (L3, L4, B)) # Treat L3, L4, B as a single axis
->
(L1, L2, inC, (L3 * L4 * B)) # Combine L3, L4, B into one axis. This corresponds to a reshape operation
->
(L1, L2, outC, (L3 * L4 * B)) # Calculate 2D convolution. This yields features with `outC` channels.
->
(L1, L2, outC, L3, L4, B) # Decompose the last axis from (L3 * L4 * B) back to (L3, L4, B). Perform a reshape operation
->
(L1, L2, L3, L4, outC, B) # Move the L3 and L4 axes back.
This executes a 4D convolution where the filter size for the unselected axes is 1. In the above example, it executes a 4D convolution with a filter of (f1, f2, 1, 1), and this operation is realized using 2D convolution.
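The shape bookkeeping above can be followed step by step in NumPy. For brevity, a 1×1 filter (pure channel mixing via `einsum`) stands in for the real (f1, f2) 2D convolution, since only the transpose/reshape pipeline is being illustrated:

```python
import numpy as np

rng = np.random.default_rng(2)
L1, L2, L3, L4, inC, outC, B = 4, 4, 3, 3, 2, 5, 6

x = rng.normal(size=(L1, L2, L3, L4, inC, B))

# (L1, L2, L3, L4, inC, B) -> (L1, L2, inC, L3, L4, B):
# move the channel axis next to the axes selected for convolution (L1, L2).
y = x.transpose(0, 1, 4, 2, 3, 5)

# (L1, L2, inC, L3, L4, B) -> (L1, L2, inC, L3*L4*B):
# fold the untouched axes into one big "batch" axis (a reshape).
y = y.reshape(L1, L2, inC, L3 * L4 * B)

# 2D convolution over (L1, L2). Here a 1x1 kernel, i.e. channel mixing,
# stands in for a real (f1, f2) filter; the shapes are what matter.
W = rng.normal(size=(outC, inC))
y = np.einsum("oc,abcn->abon", W, y)  # (L1, L2, outC, L3*L4*B)

# Undo the reshape, then undo the transpose.
y = y.reshape(L1, L2, outC, L3, L4, B)
y = y.transpose(0, 1, 3, 4, 2, 5)

assert y.shape == (L1, L2, L3, L4, outC, B)
```

In the actual architecture the `einsum` step would be a 2D convolution with circular padding over the (L1, L2) axes; the surrounding transposes and reshapes are exactly the six steps listed above.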
How to Choose Axes
Originally, my motivation was simply that it would be great if 4D calculations were possible, but implementing the method for d = 2, 3 for debugging purposes revealed that it also contributes to training efficiency.
In GomalizingFlow.jl, we compute a trivializing map with a normalizing flow and use it as the proposal distribution in a Metropolis-Hastings algorithm to generate samples. For this, it is desirable to train so that the acceptance probability becomes high. Empirically, as the lattice size L and the dimension d grow, training tends to become harder and the computation time needed to reach a given acceptance probability tends to increase, so a way to train more efficiently was also desired.
With the Combinational-convolution architecture, in the d = 3 case, using low-dimensional convolutions instead of standard convolutions reduces the number of epochs required to reach a 50% acceptance rate. (The horizontal-axis labels in the figure are bugged; please read them as running up to 500 epochs.)
Interestingly, in the cases of d = 3, 4, it can be seen that choosing only one axis (corresponding to k=1) allows for more efficient training compared to other selection methods (see the figures below).


Roughly speaking, the k = 1 case corresponds to performing convolution with a filter shaped like a cross or a tetrapod. Although I have not been able to discuss sufficiently why this works well, it may be related to the discrete Laplacian that appears in the action S, which determines the distribution of configurations:
\partial^2\phi(n) = \sum_{\mu=1}^d(\phi(n + \hat{\mu}) + \phi(n - \hat{\mu}) - 2\phi(n)).
In other words, with n fixed, the points n \pm \hat{\mu} are the coordinates extending up, down, left, and right, and that shape resembles the k = 1 filter. A normal convolution also uses pixel values at diagonal positions, so one can speculate that those were causing problems, but I have not been able to verify this sufficiently given the computational resources available to me (in that sense, it is an open problem).
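The cross-shaped stencil of the discrete Laplacian can be seen concretely in a short NumPy sketch; `np.roll` implements the shifts n \pm \hat{\mu}, and its wrap-around behavior gives the periodic boundary conditions for free:

```python
import numpy as np

def discrete_laplacian(phi):
    """Compute sum_mu (phi(n + mu) + phi(n - mu) - 2 phi(n)) on a periodic
    lattice of any dimension. np.roll supplies the wrap-around shifts."""
    lap = np.zeros_like(phi)
    for mu in range(phi.ndim):
        lap += np.roll(phi, +1, axis=mu) + np.roll(phi, -1, axis=mu) - 2 * phi
    return lap

phi = np.random.default_rng(3).normal(size=(4, 4))
lap = discrete_laplacian(phi)
# Sanity check: a constant field has zero Laplacian.
assert np.allclose(discrete_laplacian(np.ones((4, 4))), 0)
```

Each output site depends only on its 2d axis-aligned neighbors and itself, which is exactly the cross/tetrapod support of a k = 1 filter.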
How to Run
A GPU environment with at least 16 GB of memory is desirable.
$ git clone https://github.com/AtelierArith/GomalizingFlow.jl
$ cd GomalizingFlow.jl
$ julia --project=@. -e 'using Pkg; Pkg.instantiate()'
$ cp playground/notebook/julia/4d_flow_4C2.md .
# Convert from md format to jl format using jupytext.
$ jupytext --to jl 4d_flow_4C2.md
$ julia --project=@. -e 'include("4d_flow_4C2.jl")'
Hardware with abundant memory has appeared recently as devices advance. More people are buying GPUs to run generative models and large language models, but unfortunately, for financial reasons, I have not been able to ride that wave.
Summary
- Touched upon the details of the affine coupling layer in GomalizingFlow.jl.
- Introduced ideas for the flow-based sampling algorithm in 4D theory.
- Explained Combinational-convolution.
- Summarized insights gained from the aforementioned architecture.
- Described how to run it.