【ML】Pointwise Convolution and Depthwise Convolution
0. Normal Convolution
The structure of a normal convolution is as follows:
A normal convolution applies a filter (a block of weights with the same number of channels as the input) across the input, and the per-channel convolution results are combined (added) pixel by pixel to generate a single feature map.
Then, if you apply multiple filters, you get that many feature maps. This is the output of a normal convolution calculation.
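Here is a minimal PyTorch sketch of this (the sizes, 3 input channels, 8 filters, and a 32x32 input, are hypothetical, chosen just for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # batch=1, 3 input channels, 32x32 image

# Each of the 8 filters has shape (3, 3, 3): one 3x3 kernel per input channel.
# The 3 per-channel results are summed pixel by pixel into one feature map.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

y = conv(x)
print(y.shape)            # torch.Size([1, 8, 32, 32]): 8 feature maps
print(conv.weight.shape)  # torch.Size([8, 3, 3, 3])
```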
1. Pointwise Convolution
1.1 What is Pointwise Convolution
Pointwise Convolution (also known as 1x1 convolution) is a slightly special convolution whose kernel size is only 1x1.
A 1x1 kernel means that no surrounding pixels are convolved. This is equivalent to the forward propagation of a fully connected layer in the channel direction at each pixel.
(I find this way of thinking easier to understand.)
Quote: 1 x 1 畳み込み (1 x 1 Convolution, 点単位畳み込み層) [3]
1.2 How to Use It
It is mainly used to reduce the dimensionality in the channel direction.
As the model becomes larger and the number of dimensions in the channel direction increases, the number of kernels a convolution needs grows as [number of input channels * number of output feature maps]. By inserting a pointwise convolution first, the number of feature maps at that point can be reduced (for example, down to 10), which shrinks that kernel count in the layers that follow.
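As a sketch of this channel reduction in PyTorch (the 256-to-10 channel sizes are hypothetical):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)

# kernel_size=1: no surrounding pixels are used; at each pixel this acts like
# a fully connected layer over the 256 input channels.
pointwise = nn.Conv2d(in_channels=256, out_channels=10, kernel_size=1)

y = pointwise(x)
print(y.shape)                 # torch.Size([1, 10, 32, 32])
print(pointwise.weight.shape)  # torch.Size([10, 256, 1, 1])
```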
2. Depthwise Convolution
Depthwise Convolution computes each channel separately, so each output layer has no relation to the other channels.
In a normal convolution, the sum over the channel direction is computed at the same time as the sum of products within each kernel (that is how a feature map is made). Depthwise convolution, however, only computes the sum of products with one kernel per input channel, without summing over the channel direction, so it produces as many output layers as there are input channels (the elemental layers that a normal convolution would have summed into a single feature map). A PyTorch sketch follows below.
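In PyTorch this corresponds to setting groups equal to the number of input channels. A minimal sketch (the channel and image sizes are hypothetical):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# groups=in_channels gives one 3x3 kernel per channel; the channels are never
# summed, so the output has exactly as many layers as the input has channels.
depthwise = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3,
                      padding=1, groups=3)

y = depthwise(x)
print(y.shape)                 # torch.Size([1, 3, 32, 32])
print(depthwise.weight.shape)  # torch.Size([3, 1, 3, 3]): one single-channel kernel each
```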
3. Depthwise Separable Convolution
Combining the two gives Depthwise Separable Convolution.
Quote: 深さ単位分離可能畳み込み (Depthwise Separable Convolution) [4]
Using Pointwise Convolution after applying Depthwise Convolution works as well as a normal convolution (or better), because the pointwise step can learn how to sum the elemental layers, whereas a normal convolution can only use a plain sum. It also reduces the number of parameters, because each input channel needs only a single kernel.
If pointwise layers are stacked consecutively, the network may learn the relationships between the filters more deeply.
This method is based on the hypothesis that learning the correlation in the spatial direction and the correlation in the channel direction in two separate layers will not only make the calculation more efficient, but also improve image recognition accuracy.
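Here is a minimal sketch of such a two-layer block in PyTorch, with a parameter-count comparison against a normal convolution (the 64-to-128-channel sizes are hypothetical):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Spatial correlation: one kernel per input channel, no channel summation.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # Channel correlation: a learned combination of the elemental layers.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(m):
    return sum(p.numel() for p in m.parameters())

separable = DepthwiseSeparableConv(64, 128)
normal = nn.Conv2d(64, 128, kernel_size=3, padding=1)

print(count_params(normal))     # 73856
print(count_params(separable))  # 8960, roughly 1/8 of the normal convolution
```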
4. Impact
Depthwise separable convolution was also used in MobileNet [1]. Its authors reduced the number of parameters in the convolutional layers to about one-ninth while keeping object recognition performance at roughly the same level.
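As a quick sanity check on that figure: a K x K normal convolution with C_in input channels and N output maps has K^2 * C_in * N weights, while the depthwise separable version has K^2 * C_in + C_in * N, a ratio of 1/N + 1/K^2. For K = 3 and a large N, this is close to 1/9 (for example, 1/128 + 1/9 is about 0.12).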
References
[1] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv, 2017.
[2] Masaki Hayashi, "畳み込み層 (Convolution Layer)とその発展型"
[3] Masaki Hayashi, "1 x 1 畳み込み (1 x 1 Convolution, 点単位畳み込み層)"
[4] Masaki Hayashi, "深さ単位分離可能畳み込み (Depthwise Separable Convolution)"