Introduction
As the title suggests, while studying neural networks (NNs) I wanted to see exactly what computations take place inside one, so this article records the process I went through to understand them.
Assumptions
- The sample data is the Iris dataset
(4 features, predicting one of 3 flower species)
- The activation function is the sigmoid function
- The loss function is mean squared error
- Input layer: 4 nodes, hidden layer: 2 nodes, output layer: 3 nodes
- Biases are included
Computation flow
- Initialize the hyperparameters
- Initialize the weights
- Forward pass: linear sum at the hidden layer
- Forward pass: activation at the hidden layer
- Forward pass: linear sum at the output layer
- Forward pass: activation at the output layer
- Loss at the output layer
- Backward pass: partial derivatives at the output layer (common factor)
- Backward pass: partial derivative of the loss with respect to the output-layer inputs
- Backward pass: partial derivative of the loss with respect to the output-layer weights
- Backward pass: partial derivative of the loss with respect to the output-layer biases
- Backward pass: partial derivatives at the hidden layer (common factor)
- Backward pass: partial derivative of the loss with respect to the hidden-layer weights
- Backward pass: partial derivative of the loss with respect to the hidden-layer biases
- Optimization
Overview of the backward-pass computation
In the backward pass, we compute the partial derivatives of the loss (the output error) with respect to the weights and the biases, and use those values to optimize the weights and biases.
We also compute the partial derivative of the loss with respect to the inputs, which feeds the backward pass (the partial derivatives) of the preceding layer.
\begin{align}
\frac{\partial{L}}{\partial{w}} &= \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}} \times \frac{\partial{y}}{\partial{w}} \\
\frac{\partial{L}}{\partial{b}} &= \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}} \times \frac{\partial{y}}{\partial{b}} \\
\frac{\partial{L}}{\partial{x}} &= \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}} \times \frac{\partial{y}}{\partial{x}}
\end{align} \\
\begin{bmatrix}
L & : & \text{loss function} \\
a & : & \text{activation function} \\
y & : & \text{linear sum} \\
w & : & \text{weight} \\
b & : & \text{bias} \\
x & : & \text{input}
\end{bmatrix}
The loss function is defined as follows:
Loss=\frac{(A-Truth)^2}{2}
Here A equals the output of the activation function a, and Truth denotes the true value.
The gradient expressions (1)-(3) can be interpreted as follows:
- \frac{\partial{L}}{\partial{w}} is the slope of the loss with respect to each layer's weights
- \frac{\partial{L}}{\partial{b}} is the slope of the loss with respect to each layer's biases
- \frac{\partial{L}}{\partial{x}} is the slope of the loss with respect to each layer's inputs
By looking at these gradients, we change the parameters little by little to move them toward their optimal values.
The weights and biases are optimized by adjusting each parameter in the direction with the opposite sign to its gradient, as follows:
\begin{align*}
w &\leftarrow w-\eta\frac{\partial{L}}{\partial{w}} \\\\
b &\leftarrow b-\eta\frac{\partial{L}}{\partial{b}}
\end{align*}
Here \eta \ (0<\eta<1) is called the learning rate.
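This update rule can be sketched in NumPy; the weight, bias, and gradient values below are made-up placeholders for illustration, not quantities from the network:

```python
import numpy as np

# One gradient-descent step: move each parameter against the sign of its gradient.
eta = 0.1                        # learning rate, 0 < eta < 1
w = np.array([[0.5, -0.3]])      # placeholder weights
b = np.array([0.2])              # placeholder bias
dL_dw = np.array([[0.2, -0.4]])  # placeholder gradient dL/dw
dL_db = np.array([0.1])          # placeholder gradient dL/db
w = w - eta * dL_dw
b = b - eta * dL_db
```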
Since the gradient expressions (1)-(3) above share the factor \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}}, we first compute \frac{\partial{L}}{\partial{y}} = \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}}.
\begin{align*}
\frac{\partial{L}}{\partial{a}} &= \left(\frac{(A - Truth)^2}{2}\right)' \\
&= (A - Truth) \cdot (A - Truth)' \\
&= (A - Truth) \cdot 1 \\
&= (A - Truth) \\
\\
\frac{\partial{a}}{\partial{y}} &= sigmoid(y)' \\
&= \{sigmoid(y) \times (1 - sigmoid(y)) \} \\
\\
& \text{ by the chain rule for composite functions: } \\
\frac{\partial{L}}{\partial{y}} &= \frac{\partial{L}}{\partial{a}} \times \frac{\partial{a}}{\partial{y}} \\
&= (A - Truth) \times \{sigmoid(y) \times (1 - sigmoid(y)) \} \\
& \text{ and since } A \text{ equals the output of the activation function } a\text{: } \\
\frac{\partial{L}}{\partial{y}} &= (sigmoid(y) - Truth) \cdot \{sigmoid(y) \times (1 - sigmoid(y)) \} \\
\end{align*}
Alternatively, computing it directly:
\begin{align*}
\frac{\partial{L}}{\partial{y}} &= (\frac{(sigmoid(y)-Truth)^2}{2})' \\
&= (sigmoid(y) - Truth) \cdot (sigmoid(y) - Truth)' \\
&= (sigmoid(y) - Truth) \cdot sigmoid(y)' \\
&= (sigmoid(y) - Truth) \cdot \{sigmoid(y) \times (1 - sigmoid(y))\}
\end{align*}
Step-by-step computation
Let's deepen our understanding by coding each step of the computation in Python.
Preliminaries
Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Load the dataset
data = load_iris()
X = data.data
y = data.target
Split the dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20)
Define the functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def mean_squared_error(y_pred, y_true):
    return ((y_pred - y_true)**2).sum() / (2*y_pred.size)

def accuracy(y_pred, y_true):
    acc = y_pred.argmax(axis=1) == y_true.argmax(axis=1)
    return acc.mean()

The derivative of the sigmoid, used in the backward pass, satisfies
\frac{\partial\,sigmoid(x)}{\partial{x}} = sigmoid(x) \times (1-sigmoid(x))
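As a quick numerical sanity check of this identity, the analytic derivative can be compared against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
analytic = sigmoid(x) * (1 - sigmoid(x))                     # the identity above
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
```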
1. Initialize the hyperparameters
learning_rate = 0.1
iterations = 5000
N = y_train.size
input_size = 4
hidden_size = 2
output_size = 3
results = pd.DataFrame(columns=['mse', 'accuracy'])
2. Initialize the weights
np.random.seed(10)
W1 = np.random.normal(scale=0.5, size=(input_size, hidden_size))
B1 = np.random.normal(scale=0.5, size=hidden_size)
W2 = np.random.normal(scale=0.5, size=(hidden_size, output_size))
B2 = np.random.normal(scale=0.5, size=output_size)
Hidden-layer weights
\begin{align*}
W1 &\leftarrow \begin{pmatrix}
w_{11}^1 & w_{12}^1 & w_{13}^1 & w_{14}^1 \\
w_{21}^1 & w_{22}^1 & w_{23}^1 & w_{24}^1
\end{pmatrix}^\top
= \begin{pmatrix}
w_{11}^1 & w_{21}^1 \\
w_{12}^1 & w_{22}^1 \\
w_{13}^1 & w_{23}^1 \\
w_{14}^1 & w_{24}^1
\end{pmatrix} \\
B1 &\leftarrow \begin{pmatrix}b_1^1 & b_2^1\end{pmatrix}
\end{align*}
Output-layer weights
\begin{align*}
W2 &\leftarrow \begin{pmatrix}
w_{11}^2 & w_{12}^2 \\
w_{21}^2 & w_{22}^2 \\
w_{31}^2 & w_{32}^2 \\
\end{pmatrix}^\top
= \begin{pmatrix}
w_{11}^2 & w_{21}^2 & w_{31}^2 \\
w_{12}^2 & w_{22}^2 & w_{32}^2 \\
\end{pmatrix} \\\\
B2 &\leftarrow \begin{pmatrix}b_1^2 & b_2^2 & b_3^2\end{pmatrix}
\end{align*}
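A quick shape check (repeating the initialization above with the same seed) confirms that the arrays match the matrix layouts shown:

```python
import numpy as np

input_size, hidden_size, output_size = 4, 2, 3
np.random.seed(10)
W1 = np.random.normal(scale=0.5, size=(input_size, hidden_size))   # 4 x 2
B1 = np.random.normal(scale=0.5, size=hidden_size)                 # length 2
W2 = np.random.normal(scale=0.5, size=(hidden_size, output_size))  # 2 x 3
B2 = np.random.normal(scale=0.5, size=output_size)                 # length 3
```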
3. Forward pass: linear sum at the hidden layer
Z1 = np.dot(X_train, W1) + B1
Input
\begin{align*}
X\_train &= \begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} \\
x_{21} & x_{22} & x_{23} & x_{24} \\
x_{31} & x_{32} & x_{33} & x_{34} \\
x_{41} & x_{42} & x_{43} & x_{44} \\
\vdots & \vdots & \vdots & \vdots \\
\end{pmatrix}
\end{align*}
Linear sum
\begin{align*}
Z1 &\leftarrow X\_train \cdot W1 + B1 \\\\
&= \begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} \\
x_{21} & x_{22} & x_{23} & x_{24} \\
x_{31} & x_{32} & x_{33} & x_{34} \\
x_{41} & x_{42} & x_{43} & x_{44} \\
\vdots & \vdots & \vdots & \vdots \\
\end{pmatrix}
\cdot
\begin{pmatrix}
w_{11}^1 & w_{21}^1 \\
w_{12}^1 & w_{22}^1 \\
w_{13}^1 & w_{23}^1 \\
w_{14}^1 & w_{24}^1 \\
\end{pmatrix}
+ \begin{pmatrix}b_1^1 & b_2^1\end{pmatrix}
\\
&= \scriptsize{
\begin{pmatrix}
x_{11} \cdot w_{11}^1 + x_{12} \cdot w_{12}^1 + x_{13} \cdot w_{13}^1 + x_{14} \cdot w_{14}^1 + b_1^1
& x_{11} \cdot w_{21}^1 + x_{12} \cdot w_{22}^1 + x_{13} \cdot w_{23}^1 + x_{14} \cdot w_{24}^1 + b_2^1 \\
x_{21} \cdot w_{11}^1 + x_{22} \cdot w_{12}^1 + x_{23} \cdot w_{13}^1 + x_{24} \cdot w_{14}^1 + b_1^1
& x_{21} \cdot w_{21}^1 + x_{22} \cdot w_{22}^1 + x_{23} \cdot w_{23}^1 + x_{24} \cdot w_{24}^1 + b_2^1 \\
x_{31} \cdot w_{11}^1 + x_{32} \cdot w_{12}^1 + x_{33} \cdot w_{13}^1 + x_{34} \cdot w_{14}^1 + b_1^1
& x_{31} \cdot w_{21}^1 + x_{32} \cdot w_{22}^1 + x_{33} \cdot w_{23}^1 + x_{34} \cdot w_{24}^1 + b_2^1 \\
x_{41} \cdot w_{11}^1 + x_{42} \cdot w_{12}^1 + x_{43} \cdot w_{13}^1 + x_{44} \cdot w_{14}^1 + b_1^1
& x_{41} \cdot w_{21}^1 + x_{42} \cdot w_{22}^1 + x_{43} \cdot w_{23}^1 + x_{44} \cdot w_{24}^1 + b_2^1 \\
\vdots & \vdots \\
\end{pmatrix}
} \\
&= \begin{pmatrix}
y_{11}^1 & y_{12}^1 \\
y_{21}^1 & y_{22}^1 \\
y_{31}^1 & y_{32}^1 \\
y_{41}^1 & y_{42}^1 \\
\vdots & \vdots \\
\end{pmatrix} \\
\end{align*}
4. Forward pass: activation at the hidden layer
Apply the sigmoid to obtain the hidden-layer output.
\begin{align*}
A1 &\leftarrow \frac{1}{1+e^{-Z1}} \\\\
&= \begin{pmatrix}
a_{11}^1 & a_{12}^1 \\
a_{21}^1 & a_{22}^1 \\
a_{31}^1 & a_{32}^1 \\
a_{41}^1 & a_{42}^1 \\
\vdots & \vdots
\end{pmatrix}
\end{align*}
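No code line is shown for this step, but it presumably amounts to a single call to the `sigmoid` defined earlier; a self-contained sketch with a stand-in `Z1`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Stand-in for the Z1 of the previous step (2 samples, 2 hidden nodes)
Z1 = np.array([[0.0, 1.0],
               [-1.0, 2.0]])
A1 = sigmoid(Z1)   # applied element-wise, so A1 has the same shape as Z1
```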
5. Forward pass: linear sum at the output layer
Input
A1= \begin{pmatrix}
a_{11}^1 & a_{12}^1 \\
a_{21}^1 & a_{22}^1 \\
a_{31}^1 & a_{32}^1 \\
a_{41}^1 & a_{42}^1 \\
\vdots & \vdots
\end{pmatrix}
Linear sum
\begin{align*}
Z2 &\leftarrow A1 \cdot W2 + B2 \\\\
&= \begin{pmatrix}
a_{11}^1 & a_{12}^1 \\
a_{21}^1 & a_{22}^1 \\
a_{31}^1 & a_{32}^1 \\
a_{41}^1 & a_{42}^1 \\
\vdots & \vdots
\end{pmatrix}
\cdot
\begin{pmatrix}
w_{11}^2 & w_{21}^2 & w_{31}^2 \\
w_{12}^2 & w_{22}^2 & w_{32}^2 \\
\end{pmatrix}
+ \begin{pmatrix}b_1^2 & b_2^2 & b_3^2\end{pmatrix} \\
&= \scriptsize{\begin{pmatrix}
a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + 1 \cdot b_{1}^2
& a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + 1 \cdot b_{2}^2
& a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + 1 \cdot b_{3}^2 \\
a_{21}^1 \cdot w_{11}^2 + a_{22}^1 \cdot w_{12}^2 + 1 \cdot b_{1}^2
& a_{21}^1 \cdot w_{21}^2 + a_{22}^1 \cdot w_{22}^2 + 1 \cdot b_{2}^2
& a_{21}^1 \cdot w_{31}^2 + a_{22}^1 \cdot w_{32}^2 + 1 \cdot b_{3}^2 \\
a_{31}^1 \cdot w_{11}^2 + a_{32}^1 \cdot w_{12}^2 + 1 \cdot b_{1}^2
& a_{31}^1 \cdot w_{21}^2 + a_{32}^1 \cdot w_{22}^2 + 1 \cdot b_{2}^2
& a_{31}^1 \cdot w_{31}^2 + a_{32}^1 \cdot w_{32}^2 + 1 \cdot b_{3}^2 \\
a_{41}^1 \cdot w_{11}^2 + a_{42}^1 \cdot w_{12}^2 + 1 \cdot b_{1}^2
& a_{41}^1 \cdot w_{21}^2 + a_{42}^1 \cdot w_{22}^2 + 1 \cdot b_{2}^2
& a_{41}^1 \cdot w_{31}^2 + a_{42}^1 \cdot w_{32}^2 + 1 \cdot b_{3}^2 \\
\vdots & \vdots & \vdots
\end{pmatrix}} \\
&= \begin{pmatrix}
y_{11}^2 & y_{12}^2 & y_{13}^2 \\
y_{21}^2 & y_{22}^2 & y_{23}^2 \\
y_{31}^2 & y_{32}^2 & y_{33}^2 \\
y_{41}^2 & y_{42}^2 & y_{43}^2 \\
\vdots & \vdots & \vdots
\end{pmatrix}
\end{align*}
6. Forward pass: activation at the output layer
Apply the sigmoid to obtain the output-layer output; this output A2 is the prediction.
\begin{align*}
A2 &\leftarrow \frac{1}{1+e^{-Z2}} \\
&= \begin{pmatrix}
a_{11}^2 & a_{12}^2 & a_{13}^2 \\
a_{21}^2 & a_{22}^2 & a_{23}^2 \\
a_{31}^2 & a_{32}^2 & a_{33}^2 \\
a_{41}^2 & a_{42}^2 & a_{43}^2 \\
\vdots & \vdots & \vdots
\end{pmatrix}
&\Rightarrow \text{Prediction}
\end{align*}
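Steps 5 and 6 likewise reduce to a matrix product plus a bias, followed by the sigmoid; a self-contained sketch with stand-in values of the shapes used in this article:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

A1 = np.array([[0.5, 0.6],
               [0.4, 0.7]])       # stand-in hidden activations (2 samples)
W2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])  # (hidden_size, output_size) = (2, 3)
B2 = np.array([0.01, 0.02, 0.03])
Z2 = np.dot(A1, W2) + B2          # linear sum at the output layer
A2 = sigmoid(Z2)                  # predictions, one row per sample
```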
7. Loss at the output layer
E2 = A2 - np.identity(output_size)[y_train]
As an intermediate step, compute Prediction - Truth.
\begin{align*}
E2 &\leftarrow (Prediction - Truth) \\
&= A2 - Truth \\\\
&= \begin{pmatrix}
a_{11}^2 & a_{12}^2 & a_{13}^2 \\
a_{21}^2 & a_{22}^2 & a_{23}^2 \\
a_{31}^2 & a_{32}^2 & a_{33}^2 \\
a_{41}^2 & a_{42}^2 & a_{43}^2 \\
\vdots & \vdots & \vdots
\end{pmatrix}
- \begin{pmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
\vdots & \vdots & \vdots
\end{pmatrix} \\
&= \begin{pmatrix}
a_{11}^2 - 1 & a_{12}^2 - 0 & a_{13}^2 - 0 \\
a_{21}^2 - 0 & a_{22}^2 - 0 & a_{23}^2 - 1 \\
a_{31}^2 - 0 & a_{32}^2 - 1 & a_{33}^2 - 0 \\
a_{41}^2 - 1 & a_{42}^2 - 0 & a_{43}^2 - 0 \\
\vdots & \vdots & \vdots
\end{pmatrix}
\end{align*}
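The `np.identity(output_size)[y_train]` indexing in the code above builds the one-hot Truth matrix by picking one row of the identity matrix per label; a small demonstration with made-up labels:

```python
import numpy as np

output_size = 3
y = np.array([0, 2, 1, 0])            # made-up class labels, as in y_train
truth = np.identity(output_size)[y]   # one-hot: row y[i] of the 3x3 identity
```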
8. Backward pass: partial derivatives at the output layer (common factor)
Compute the common factor of the gradient expressions shown in the backward-pass overview above.
\begin{align*}
\frac{\partial{L}}{\partial{y}} &= (sigmoid(y) - Truth) \cdot \{sigmoid(y) \times (1 - sigmoid(y)) \} \\
&= (A2 - Truth) \times A2 \times (1-A2) \\
&= E2 \times A2 \times (1-A2) \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\frac{\partial{L_2}}{\partial{y_{21}^2}} & \frac{\partial{L_2}}{\partial{y_{22}^2}} & \frac{\partial{L_2}}{\partial{y_{23}^2}} \\
\frac{\partial{L_3}}{\partial{y_{31}^2}} & \frac{\partial{L_3}}{\partial{y_{32}^2}} & \frac{\partial{L_3}}{\partial{y_{33}^2}} \\
\frac{\partial{L_4}}{\partial{y_{41}^2}} & \frac{\partial{L_4}}{\partial{y_{42}^2}} & \frac{\partial{L_4}}{\partial{y_{43}^2}} \\
\vdots & \vdots & \vdots
\end{pmatrix} \\
&\Rightarrow dE2
\end{align*}
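In code this common factor is an element-wise product; since no line is shown for this step, the sketch below uses stand-in values for a single sample:

```python
import numpy as np

A2 = np.array([[0.8, 0.3, 0.2]])     # stand-in prediction
Truth = np.array([[1.0, 0.0, 0.0]])  # stand-in one-hot true value
E2 = A2 - Truth
dE2 = E2 * A2 * (1 - A2)             # (A2 - Truth) * A2 * (1 - A2), element-wise
```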
9. Backward pass: partial derivative of the loss with respect to the output-layer inputs
Compute the partial derivative of the loss function L with respect to the input A1.
\begin{align*}
\frac{\partial{L}}{\partial{A1}}
&= \frac{\partial{L_i}}{\partial{a_{ij}^1}}
= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{a_{11}^1}} & \frac{\partial{L_1}}{\partial{a_{12}^1}} \\
\frac{\partial{L_2}}{\partial{a_{21}^1}} & \frac{\partial{L_2}}{\partial{a_{22}^1}} \\
\vdots & \vdots
\end{pmatrix} \\
&= \frac{\partial{L_i}}{\partial{f(a_{i1}^1,a_{i2}^1)}} \cdot \frac{\partial{f(a_{i1}^1,a_{i2}^1)}}{\partial{a_{ij}^1}}
= \frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{a_{ij}^1}}
= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{11}^1}} & \frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{12}^1}} \\
\frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{21}^1}} & \frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{22}^1}} \\
\frac{\partial{L_3}}{\partial{Y_{3}}} \cdot \frac{\partial{Y_{3}}}{\partial{a_{31}^1}} & \frac{\partial{L_3}}{\partial{Y_{3}}} \cdot \frac{\partial{Y_{3}}}{\partial{a_{32}^1}} \\
\frac{\partial{L_4}}{\partial{Y_{4}}} \cdot \frac{\partial{Y_{4}}}{\partial{a_{41}^1}} & \frac{\partial{L_4}}{\partial{Y_{4}}} \cdot \frac{\partial{Y_{4}}}{\partial{a_{42}^1}} \\
\vdots & \vdots
\end{pmatrix} \\
\end{align*}
Here, from the Jacobian form of the chain rule and the fact that
\begin{bmatrix}
y_{11}^2 &= a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2 \\
y_{12}^2 &= a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2 \\
y_{13}^2 &= a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2 \\
\end{bmatrix}
it follows that
\begin{align*}
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{11}^1}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{a_{11}^1}} \\
\frac{\partial{y_{12}^2}}{\partial{a_{11}^1}} \\
\frac{\partial{y_{13}^2}}{\partial{a_{11}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{a_{11}^1}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{a_{11}^1}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{a_{11}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{a_{11}^1}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{a_{11}^1}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{a_{11}^1}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot w_{11}^2 +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot w_{21}^2 +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot w_{31}^2 \\
\end{pmatrix} \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{12}^1}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{a_{12}^1}} \\
\frac{\partial{y_{12}^2}}{\partial{a_{12}^1}} \\
\frac{\partial{y_{13}^2}}{\partial{a_{12}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{a_{12}^1}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{a_{12}^1}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{a_{12}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{a_{12}^1}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{a_{12}^1}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{a_{12}^1}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot w_{12}^2 +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot w_{22}^2 +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot w_{32}^2 \\
\end{pmatrix} \\
\frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{21}^1}}
&= \begin{pmatrix}
\frac{\partial{L_2}}{\partial{y_{21}^2}} & \frac{\partial{L_2}}{\partial{y_{22}^2}} & \frac{\partial{L_2}}{\partial{y_{23}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{21}^2}}{\partial{a_{21}^1}} \\
\frac{\partial{y_{22}^2}}{\partial{a_{21}^1}} \\
\frac{\partial{y_{23}^2}}{\partial{a_{21}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_2}}{\partial{y_{21}^2}} \cdot \frac{\partial{y_{21}^2}}{\partial{a_{21}^1}} +
\frac{\partial{L_2}}{\partial{y_{22}^2}} \cdot \frac{\partial{y_{22}^2}}{\partial{a_{21}^1}} +
\frac{\partial{L_2}}{\partial{y_{23}^2}} \cdot \frac{\partial{y_{23}^2}}{\partial{a_{21}^1}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_2}}{\partial{y_{21}^2}} \cdot \frac{\partial{}}{\partial{a_{21}^1}}(a_{21}^1 \cdot w_{11}^2 + a_{22}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_2}}{\partial{y_{22}^2}} \cdot \frac{\partial{}}{\partial{a_{21}^1}}(a_{21}^1 \cdot w_{21}^2 + a_{22}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_2}}{\partial{y_{23}^2}} \cdot \frac{\partial{}}{\partial{a_{21}^1}}(a_{21}^1 \cdot w_{31}^2 + a_{22}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_2}}{\partial{y_{21}^2}} \cdot w_{11}^2 +
\frac{\partial{L_2}}{\partial{y_{22}^2}} \cdot w_{21}^2 +
\frac{\partial{L_2}}{\partial{y_{23}^2}} \cdot w_{31}^2 \\
\end{pmatrix} \\
\frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{22}^1}}
&= \dots
\end{align*}
Substituting these back, we get
\begin{align*}
\frac{\partial{L}}{\partial{A^1}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{11}^1}} & \frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{a_{12}^1}} \\
\frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{21}^1}} & \frac{\partial{L_2}}{\partial{Y_{2}}} \cdot \frac{\partial{Y_{2}}}{\partial{a_{22}^1}} \\
\frac{\partial{L_3}}{\partial{Y_{3}}} \cdot \frac{\partial{Y_{3}}}{\partial{a_{31}^1}} & \frac{\partial{L_3}}{\partial{Y_{3}}} \cdot \frac{\partial{Y_{3}}}{\partial{a_{32}^1}} \\
\frac{\partial{L_4}}{\partial{Y_{4}}} \cdot \frac{\partial{Y_{4}}}{\partial{a_{41}^1}} & \frac{\partial{L_4}}{\partial{Y_{4}}} \cdot \frac{\partial{Y_{4}}}{\partial{a_{42}^1}} \\
\vdots & \vdots
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot w_{11}^2 +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot w_{21}^2 +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot w_{31}^2 &
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot w_{12}^2 +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot w_{22}^2 +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot w_{32}^2 \\
\frac{\partial{L_2}}{\partial{y_{21}^2}} \cdot w_{11}^2 +
\frac{\partial{L_2}}{\partial{y_{22}^2}} \cdot w_{21}^2 +
\frac{\partial{L_2}}{\partial{y_{23}^2}} \cdot w_{31}^2 &
\frac{\partial{L_2}}{\partial{y_{21}^2}} \cdot w_{12}^2 +
\frac{\partial{L_2}}{\partial{y_{22}^2}} \cdot w_{22}^2 +
\frac{\partial{L_2}}{\partial{y_{23}^2}} \cdot w_{32}^2 \\
\frac{\partial{L_3}}{\partial{y_{31}^2}} \cdot w_{11}^2 +
\frac{\partial{L_3}}{\partial{y_{32}^2}} \cdot w_{21}^2 +
\frac{\partial{L_3}}{\partial{y_{33}^2}} \cdot w_{31}^2 &
\frac{\partial{L_3}}{\partial{y_{31}^2}} \cdot w_{12}^2 +
\frac{\partial{L_3}}{\partial{y_{32}^2}} \cdot w_{22}^2 +
\frac{\partial{L_3}}{\partial{y_{33}^2}} \cdot w_{32}^2 \\
\frac{\partial{L_4}}{\partial{y_{41}^2}} \cdot w_{11}^2 +
\frac{\partial{L_4}}{\partial{y_{42}^2}} \cdot w_{21}^2 +
\frac{\partial{L_4}}{\partial{y_{43}^2}} \cdot w_{31}^2 &
\frac{\partial{L_4}}{\partial{y_{41}^2}} \cdot w_{12}^2 +
\frac{\partial{L_4}}{\partial{y_{42}^2}} \cdot w_{22}^2 +
\frac{\partial{L_4}}{\partial{y_{43}^2}} \cdot w_{32}^2 \\
\vdots & \vdots
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\frac{\partial{L_2}}{\partial{y_{21}^2}} & \frac{\partial{L_2}}{\partial{y_{22}^2}} & \frac{\partial{L_2}}{\partial{y_{23}^2}} \\
\frac{\partial{L_3}}{\partial{y_{31}^2}} & \frac{\partial{L_3}}{\partial{y_{32}^2}} & \frac{\partial{L_3}}{\partial{y_{33}^2}} \\
\frac{\partial{L_4}}{\partial{y_{41}^2}} & \frac{\partial{L_4}}{\partial{y_{42}^2}} & \frac{\partial{L_4}}{\partial{y_{43}^2}} \\
\vdots & \vdots & \vdots
\end{pmatrix}
\begin{pmatrix}
w_{11}^2 & w_{12}^2 \\
w_{21}^2 & w_{22}^2 \\
w_{31}^2 & w_{32}^2 \\
\end{pmatrix} \\
&= dE2 \cdot W_2^\top \\
&\Rightarrow E1
\end{align*}
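The final line, dL/dA1 = dE2 · W2^T, is a single matrix product in code; a sketch with stand-in values for one sample:

```python
import numpy as np

dE2 = np.array([[-0.032, 0.063, 0.032]])  # stand-in dL/dy at the output layer
W2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])          # (hidden_size, output_size) = (2, 3)
E1 = np.dot(dE2, W2.T)                    # dL/dA1, shape (1, hidden_size)
```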
10. Backward pass: partial derivative of the loss with respect to the output-layer weights
dW2 = np.dot(A1.T, dE2) / N
Compute the partial derivative of the loss function L with respect to the weights W2.
\begin{align*}
\frac{\partial{L_i}}{\partial{W2}}
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{w_{11}^2}} & \frac{\partial{L_i}}{\partial{w_{21}^2}} & \frac{\partial{L_i}}{\partial{w_{31}^2}} \\
\frac{\partial{L_i}}{\partial{w_{12}^2}} & \frac{\partial{L_i}}{\partial{w_{22}^2}} & \frac{\partial{L_i}}{\partial{w_{32}^2}} \\
\end{pmatrix} \\
\frac{\partial{L_i}}{\partial{w_{11}^2}} &= \frac{\partial{L_i}}{\partial{f(w_{11}^2,w_{21}^2,w_{31}^2,w_{12}^2,w_{22}^2,w_{32}^2)}} \cdot \frac{\partial{f(w_{11}^2,w_{21}^2,w_{31}^2,w_{12}^2,w_{22}^2,w_{32}^2)}}{\partial{w_{11}^2}} \\
&= \frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{w_{11}^2}}
\end{align*}
Here, from the Jacobian form of the chain rule and the fact that
\begin{bmatrix}
y_{11}^2 &= a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2 \\
y_{12}^2 &= a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2 \\
y_{13}^2 &= a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2 \\
\end{bmatrix}
it follows that
\begin{align*}
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{w_{11}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{w_{11}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{w_{11}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{w_{11}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{w_{11}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{w_{11}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{w_{11}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{w_{11}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{w_{11}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{w_{11}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot a_{11}^1 + 0 + 0 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot a_{11}^1 \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{w_{21}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{w_{21}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{w_{21}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{w_{21}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{w_{21}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{w_{21}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{w_{21}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{w_{21}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{w_{21}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{w_{21}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
0 + \frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot a_{11}^1 + 0 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot a_{11}^1 \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{w_{12}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{w_{12}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{w_{12}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{w_{12}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{w_{12}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{w_{12}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{w_{12}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{w_{12}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{w_{12}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{w_{12}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot a_{12}^1 + 0 + 0 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot a_{12}^1 \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{w_{22}^2}}
&= \dots
\end{align*}
Substituting these back, we get
\begin{align*}
\frac{\partial{L_i}}{\partial{W2}}
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{w_{11}^2}} & \frac{\partial{L_i}}{\partial{w_{21}^2}} & \frac{\partial{L_i}}{\partial{w_{31}^2}} \\
\frac{\partial{L_i}}{\partial{w_{12}^2}} & \frac{\partial{L_i}}{\partial{w_{22}^2}} & \frac{\partial{L_i}}{\partial{w_{32}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{11}^2}}
& \frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{21}^2}}
& \frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{31}^2}} \\
\frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{12}^2}}
& \frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{22}^2}}
& \frac{\partial{L_i}}{\partial{Y_{i}}} \cdot \frac{\partial{Y_{i}}}{\partial{w_{32}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{y_{i1}^2}} \cdot a_{i1}^1 & \frac{\partial{L_i}}{\partial{y_{i2}^2}} \cdot a_{i1}^1 & \frac{\partial{L_i}}{\partial{y_{i3}^2}} \cdot a_{i1}^1 \\
\frac{\partial{L_i}}{\partial{y_{i1}^2}} \cdot a_{i2}^1 & \frac{\partial{L_i}}{\partial{y_{i2}^2}} \cdot a_{i2}^1 & \frac{\partial{L_i}}{\partial{y_{i3}^2}} \cdot a_{i2}^1 \\
\end{pmatrix} \\
&= \begin{pmatrix}
a_{i1}^1 \\
a_{i2}^1 \\
\end{pmatrix}
\begin{pmatrix}
\frac{\partial{L_i}}{\partial{y_{i1}^2}} & \frac{\partial{L_i}}{\partial{y_{i2}^2}} & \frac{\partial{L_i}}{\partial{y_{i3}^2}} \\
\end{pmatrix} \\
&= A1_{i}^\top \cdot dE2_{i} \\
\frac{\partial{L}}{\partial{W2}}
&= \begin{pmatrix}
a_{11}^1 & a_{21}^1 & a_{31}^1 & a_{41}^1 & \dots \\
a_{12}^1 & a_{22}^1 & a_{32}^1 & a_{42}^1 & \dots \\
\end{pmatrix}
\begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\frac{\partial{L_2}}{\partial{y_{21}^2}} & \frac{\partial{L_2}}{\partial{y_{22}^2}} & \frac{\partial{L_2}}{\partial{y_{23}^2}} \\
\frac{\partial{L_3}}{\partial{y_{31}^2}} & \frac{\partial{L_3}}{\partial{y_{32}^2}} & \frac{\partial{L_3}}{\partial{y_{33}^2}} \\
\frac{\partial{L_4}}{\partial{y_{41}^2}} & \frac{\partial{L_4}}{\partial{y_{42}^2}} & \frac{\partial{L_4}}{\partial{y_{43}^2}} \\
\vdots & \vdots & \vdots
\end{pmatrix} \\
&= A1^\top \cdot dE2 \\
\frac{1}{N} \cdot \frac{\partial{L}}{\partial{W2}} &= \frac{1}{N} \cdot A1^\top \cdot dE2 \\
&\Rightarrow dW2
\end{align*}
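As a sanity check on this result, the analytic dW2 = A1^T · dE2 / N can be compared against a central-difference gradient on small random stand-in data; the loss below sums (A - Truth)^2 / 2 over the batch and divides by N, matching the 1/N factor in dW2:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
N = 5
A1 = rng.normal(size=(N, 2))                    # stand-in hidden activations
W2 = rng.normal(size=(2, 3))
B2 = rng.normal(size=3)
T = np.identity(3)[rng.integers(0, 3, size=N)]  # stand-in one-hot targets

def loss(W):
    A2 = sigmoid(A1 @ W + B2)
    return ((A2 - T) ** 2).sum() / (2 * N)      # Loss = (A - Truth)^2 / 2, over N samples

# Analytic gradient, as derived above
A2 = sigmoid(A1 @ W2 + B2)
dE2 = (A2 - T) * A2 * (1 - A2)
dW2 = A1.T @ dE2 / N

# Central finite differences, element by element
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```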
11. Backward pass: partial derivative of the loss with respect to the output-layer biases
dB2 = dE2.sum(axis=0) / N
Compute the partial derivative of the loss function L with respect to the bias B2.
\begin{align*}
\frac{\partial{L_i}}{\partial{b^2}}
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{b_{1}^2}} & \frac{\partial{L_i}}{\partial{b_{2}^2}} & \frac{\partial{L_i}}{\partial{b_{3}^2}} \\
\end{pmatrix} \\
\frac{\partial{L_i}}{\partial{b_{1}^2}} &= \frac{\partial{L_i}}{\partial{f(b_{1}^2,b_{2}^2,b_{3}^2)}} \cdot \frac{\partial{f(b_{1}^2,b_{2}^2,b_{3}^2)}}{\partial{b_{1}^2}}
= \frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{b_{1}^2}}
\end{align*}
Here, from the Jacobian form of the chain rule and the fact that
\begin{bmatrix}
y_{11}^2 &= a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2 \\
y_{12}^2 &= a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2 \\
y_{13}^2 &= a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2 \\
\end{bmatrix}
it follows that
\begin{align*}
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{b_{1}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{b_{1}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{b_{1}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{b_{1}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{b_{1}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{b_{1}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{b_{1}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{b_{1}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{b_{1}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{b_{1}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot 1 + 0 + 0 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{11}^2}} \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{b_{2}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{b_{2}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{b_{2}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{b_{2}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{b_{2}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{b_{2}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{b_{2}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{b_{2}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{b_{2}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{b_{2}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
0 + \frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot 1 + 0 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{12}^2}} \\
\frac{\partial{L_1}}{\partial{Y_{1}}} \cdot \frac{\partial{Y_{1}}}{\partial{b_{3}^2}}
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} & \frac{\partial{L_1}}{\partial{y_{12}^2}} & \frac{\partial{L_1}}{\partial{y_{13}^2}} \\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{\partial{y_{11}^2}}{\partial{b_{3}^2}} \\
\frac{\partial{y_{12}^2}}{\partial{b_{3}^2}} \\
\frac{\partial{y_{13}^2}}{\partial{b_{3}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{y_{11}^2}}{\partial{b_{3}^2}} +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{y_{12}^2}}{\partial{b_{3}^2}} +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{y_{13}^2}}{\partial{b_{3}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_1}}{\partial{y_{11}^2}} \cdot \frac{\partial{}}{\partial{b_{3}^2}}(a_{11}^1 \cdot w_{11}^2 + a_{12}^1 \cdot w_{12}^2 + b_{1}^2) +
\frac{\partial{L_1}}{\partial{y_{12}^2}} \cdot \frac{\partial{}}{\partial{b_{3}^2}}(a_{11}^1 \cdot w_{21}^2 + a_{12}^1 \cdot w_{22}^2 + b_{2}^2) +
\frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot \frac{\partial{}}{\partial{b_{3}^2}}(a_{11}^1 \cdot w_{31}^2 + a_{12}^1 \cdot w_{32}^2 + b_{3}^2) \\
\end{pmatrix} \\
&= \begin{pmatrix}
0 + 0 + \frac{\partial{L_1}}{\partial{y_{13}^2}} \cdot 1 \\
\end{pmatrix} \\
&= \frac{\partial{L_1}}{\partial{y_{13}^2}}
\end{align*}
Substituting these results gives:
\begin{align*}
\frac{\partial{L_i}}{\partial{b^2}}
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{b_{1}^2}} & \frac{\partial{L_i}}{\partial{b_{2}^2}} & \frac{\partial{L_i}}{\partial{b_{3}^2}} \\
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{b_{1}^2}}
& \frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{b_{2}^2}}
& \frac{\partial{L_i}}{\partial{Y_i}} \cdot \frac{\partial{Y_i}}{\partial{b_{3}^2}}
\end{pmatrix} \\
&= \begin{pmatrix}
\frac{\partial{L_i}}{\partial{y_{i1}^2}} & \frac{\partial{L_i}}{\partial{y_{i2}^2}} & \frac{\partial{L_i}}{\partial{y_{i3}^2}} \\
\end{pmatrix} \\
&= dE2_{i} \\
\frac{\partial{L}}{\partial{b^2}}
&= \sum_{i=1}^{N}{\begin{pmatrix}
\frac{\partial{L_i}}{\partial{y_{i1}^2}} & \frac{\partial{L_i}}{\partial{y_{i2}^2}} & \frac{\partial{L_i}}{\partial{y_{i3}^2}} \\
\end{pmatrix}} \\
&= \begin{pmatrix}
\displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i1}^2}}} & \displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i2}^2}}} & \displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i3}^2}}} \\
\end{pmatrix} \\
\frac{1}{N} \cdot \frac{\partial{L}}{\partial{b^2}}
&= \frac{1}{N}
\begin{pmatrix}
\displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i1}^2}}} & \displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i2}^2}}} & \displaystyle\sum_{i=1}^{N}{\frac{\partial{L_i}}{\partial{y_{i3}^2}}} \\
\end{pmatrix} \\
&\Rightarrow dB2
\end{align*}
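As a sanity check (not part of the original article), the result dB2 = Σᵢ dE2ᵢ / N can be verified against a central finite difference of the loss. A minimal sketch, assuming the sigmoid activation and the per-sample loss (A − Truth)²/2 summed over outputs and averaged over N, as used in the derivation; the data, sizes, and the `loss` helper below are illustrative stand-ins:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
N = 5
A1 = rng.random((N, 2))               # hidden-layer activations (stand-in values)
W2 = rng.normal(size=(2, 3))          # output-layer weights
B2 = rng.normal(size=3)               # output-layer biases
T = np.eye(3)[rng.integers(0, 3, N)]  # one-hot targets

def loss(B2):
    # L = sum over samples/outputs of (a - t)^2 / 2, averaged over N samples
    A2 = sigmoid(np.dot(A1, W2) + B2)
    return ((A2 - T) ** 2).sum() / (2 * N)

# analytic gradient, as derived above
A2 = sigmoid(np.dot(A1, W2) + B2)
dE2 = (A2 - T) * A2 * (1 - A2)
dB2 = dE2.sum(axis=0) / N

# central finite-difference gradient, one bias component at a time
eps = 1e-6
dB2_num = np.zeros_like(B2)
for j in range(3):
    e = np.zeros_like(B2)
    e[j] = eps
    dB2_num[j] = (loss(B2 + e) - loss(B2 - e)) / (2 * eps)

print(np.allclose(dB2, dB2_num, atol=1e-8))  # analytic and numerical gradients agree
```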
12. Backpropagation - Partial Derivatives in the Hidden Layer (Common Part)
The partial derivative of the loss function L with respect to the hidden layer is:
\begin{align*}
\frac{\partial{L}}{\partial{y}}
&=\frac{\partial{L}}{\partial{a}} \cdot \frac{\partial{a}}{\partial{y}} \\
&=\frac{\partial{L}}{\partial{a}} \cdot \{sigmoid(y) \times (1 - sigmoid(y)) \}
\end{align*}
Since a = A^1 in this case,
\begin{align*}
\frac{\partial{L}}{\partial{y}}
&= \frac{\partial{L}}{\partial{A^1}} \cdot \{A^1 \times (1 - A^1) \} \\
\end{align*}
From [9. Backpropagation - Partial Derivative of the Loss with Respect to the Output-Layer Input],
\begin{align*}
\frac{\partial{L}}{\partial{y}}
&= E1 \cdot \{A^1 \times (1 - A^1) \} \\
\end{align*}
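In NumPy this formula is just two lines: propagate the output-layer error back through W2, then multiply elementwise by the sigmoid derivative A^1(1 − A^1). A sketch with stand-in values (dE2 and W2 play the roles of the output-layer quantities from the previous sections):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
A1 = rng.random((N, 2))        # hidden-layer activations (stand-in values)
W2 = rng.normal(size=(2, 3))   # output-layer weights
dE2 = rng.normal(size=(N, 3))  # dL/dy^2 from the output-layer common part

E1 = np.dot(dE2, W2.T)         # dL/dA^1: error propagated back through W2
dE1 = E1 * A1 * (1 - A1)       # multiplied by the sigmoid derivative A^1(1 - A^1)

print(E1.shape, dE1.shape)     # one row per sample, one column per hidden node
```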
13. Backpropagation - Partial Derivative of the Loss with Respect to the Hidden-Layer Weights
dW1 = np.dot(X_train.T, dE1) / N
Same as [10. Backpropagation - Partial Derivative of the Loss with Respect to the Output-Layer Weights].
14. Backpropagation - Partial Derivative of the Loss with Respect to the Hidden-Layer Biases
dB1 = dE1.sum(axis=0) / N
Same as [11. Backpropagation - Partial Derivative of the Loss with Respect to the Output-Layer Biases].
15. Optimization
W2 = W2 - learning_rate * dW2
B2 = B2 - learning_rate * dB2
W1 = W1 - learning_rate * dW1
B1 = B1 - learning_rate * dB1
\begin{align*}
w &\leftarrow w-\eta\frac{\partial{L}}{\partial{w}} \\\\
b &\leftarrow b-\eta\frac{\partial{L}}{\partial{b}}
\end{align*}
These update rules optimize the weights and biases.
Summary
That concludes this walkthrough of the calculations, which had me revisiting math I learned more than 20 years ago.
If you spot any mistakes, please let me know!
Putting the code together gives the loop below; as the training iterations accumulate, the partial derivatives approach zero and the optimization converges.
for itr in range(iterations):
    Z1 = np.dot(X_train, W1) + B1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + B2
    A2 = sigmoid(Z2)
    E2 = A2 - np.identity(3)[y_train]
    dE2 = E2 * A2 * (1 - A2)
    E1 = np.dot(dE2, W2.T)
    dE1 = E1 * A1 * (1 - A1)
    dW2 = np.dot(A1.T, dE2) / N
    dB2 = dE2.sum(axis=0) / N
    dW1 = np.dot(X_train.T, dE1) / N
    dB1 = dE1.sum(axis=0) / N
    W2 = W2 - learning_rate * dW2
    B2 = B2 - learning_rate * dB2
    W1 = W1 - learning_rate * dW1
    B1 = B1 - learning_rate * dB1
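The claim that the partial derivatives approach zero as training proceeds can be observed directly by logging a gradient magnitude each iteration. A small self-contained sketch of the same loop (the clustered 4-D data here is a synthetic stand-in for Iris, not the article's actual dataset):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
# three well-separated clusters in 4-D as a synthetic stand-in for Iris
y = np.repeat(np.arange(3), 10)
X = np.array([-1.0, 0.0, 1.0])[y][:, None] + 0.2 * rng.normal(size=(30, 4))
T = np.eye(3)[y]                              # one-hot targets
N = y.size

W1 = rng.normal(scale=0.5, size=(4, 2)); B1 = rng.normal(scale=0.5, size=2)
W2 = rng.normal(scale=0.5, size=(2, 3)); B2 = rng.normal(scale=0.5, size=3)

loss_log, grad_log = [], []
for itr in range(3000):
    A1 = sigmoid(X @ W1 + B1)                 # forward pass
    A2 = sigmoid(A1 @ W2 + B2)
    loss_log.append(((A2 - T) ** 2).sum() / (2 * N))
    dE2 = (A2 - T) * A2 * (1 - A2)            # backward pass
    dE1 = (dE2 @ W2.T) * A1 * (1 - A1)
    dW2 = A1.T @ dE2 / N; dB2 = dE2.sum(axis=0) / N
    dW1 = X.T @ dE1 / N;  dB1 = dE1.sum(axis=0) / N
    grad_log.append(np.abs(dW2).mean())       # track a gradient magnitude
    W2 -= 0.1 * dW2; B2 -= 0.1 * dB2          # gradient-descent update
    W1 -= 0.1 * dW1; B1 -= 0.1 * dB1

print(grad_log[0], grad_log[-1])  # the gradient shrinks as the loss flattens out
```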
The full code is shown below.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20, random_state=4)

# Hyperparameters
learning_rate = 0.1
iterations = 5000
N = y_train.size
input_size = 4
hidden_size = 2
output_size = 3
results = pd.DataFrame(columns=['mse', 'accuracy'], dtype=float)

# Weight initialization
np.random.seed(10)
W1 = np.random.normal(scale=0.5, size=(input_size, hidden_size))
B1 = np.random.normal(scale=0.5, size=hidden_size)
W1_bias = np.vstack([W1, B1])  # weights and bias stacked for reference (unused below)
W2 = np.random.normal(scale=0.5, size=(hidden_size, output_size))
B2 = np.random.normal(scale=0.5, size=output_size)
W2_bias = np.vstack([W2, B2])  # weights and bias stacked for reference (unused below)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def mean_squared_error(y_pred, y_true):
    return ((y_pred - y_true) ** 2).sum() / (2 * y_pred.size)

def accuracy(y_pred, y_true):
    acc = y_pred.argmax(axis=1) == y_true.argmax(axis=1)
    return acc.mean()

for itr in range(iterations):
    # Forward pass
    Z1 = np.dot(X_train, W1) + B1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + B2
    A2 = sigmoid(Z2)

    mse = mean_squared_error(A2, np.identity(3)[y_train])
    acc = accuracy(A2, np.identity(3)[y_train])
    # DataFrame.append was removed in pandas 2.0; assign by label instead
    results.loc[itr] = [mse, acc]

    # Backward pass
    E2 = A2 - np.identity(3)[y_train]
    dE2 = E2 * A2 * (1 - A2)
    E1 = np.dot(dE2, W2.T)
    dE1 = E1 * A1 * (1 - A1)
    dW2 = np.dot(A1.T, dE2) / N
    dB2 = dE2.sum(axis=0) / N
    dW1 = np.dot(X_train.T, dE1) / N
    dB1 = dE1.sum(axis=0) / N

    # Parameter update
    W2 = W2 - learning_rate * dW2
    B2 = B2 - learning_rate * dB2
    W1 = W1 - learning_rate * dW1
    B1 = B1 - learning_rate * dB1

results.mse.plot(title='Mean Squared Error')
plt.show()
results.accuracy.plot(title='Accuracy')
plt.show()
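Note that the script above never uses the held-out split (X_test, y_test) it creates. As a hedged addition not in the original article, evaluation needs only the forward pass; the `predict` helper below is hypothetical, and the weights and data here are stand-ins for the trained parameters and the real test split:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(X, W1, B1, W2, B2):
    # forward pass only: no loss or gradients are needed at evaluation time
    A1 = sigmoid(np.dot(X, W1) + B1)
    A2 = sigmoid(np.dot(A1, W2) + B2)
    return A2.argmax(axis=1)  # index of the largest output = predicted class

# stand-in weights and data; in the script these would be the trained
# W1, B1, W2, B2 and the held-out X_test / y_test
rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 2)); B1 = rng.normal(size=2)
W2 = rng.normal(size=(2, 3)); B2 = rng.normal(size=3)
X_test = rng.random((20, 4))
y_test = rng.integers(0, 3, 20)

preds = predict(X_test, W1, B1, W2, B2)
test_acc = (preds == y_test).mean()
print(test_acc)  # fraction of correctly classified test samples
```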
References (many thanks!)
https://www.ccn.yamanashi.ac.jp/~tmiyamoto/img/matrixform_mlp.pdf
https://ieyasu03.web.fc2.com/Deep_Learning/7-BackProp.pdf
https://imagingsolution.net/deep-learning/backpropagation/
https://www.bigdata-navi.com/aidrops/2810/
https://blog.regonn.tokyo/data-science/2017-10-27-deep-learning/
https://www2.kaiyodai.ac.jp/~takenawa/learning/
https://atmarkit.itmedia.co.jp/ait/articles/2202/09/news027.html