PyTorch Tutorial Live Playthrough 2: Tensors and Automatic Differentiation

Continued from "PyTorch Tutorial Live Playthrough 1: Quickstart".

Official tutorial

Japanese version

Josh Nobus

Tensors and automatic differentiation

Practice notebook

General impressions

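Tensors themselves behave much like numpy arrays or TensorFlow tensors and are easy to work with. As for automatic differentiation, TensorFlow leaves it off and you activate recording with with tf.GradientTape(), whereas PyTorch has it on by default and you disable it with with torch.no_grad().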

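The computational graph feels more tightly bound to the Tensor in PyTorch than in TensorFlow, and a lot seems to happen implicitly. That is convenient in that things work without the implementer having to think about them, but it also means things can happen that you were not aware of, so I suspect PyTorch needs closer attention to its behavior than TensorFlow does.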

Tensors

The tutorial is short and the API does not seem to have any quirks, so there is nothing in particular to write about.

Automatic differentiation

import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

No surprises here.

You can set the value of requires_grad when creating a tensor, or later by using x.requires_grad_(True) method.

In other words, whether gradients are tracked for a tensor can be set by passing requires_grad at creation time, or changed afterwards with the requires_grad_() method.
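A minimal sketch of the two ways of setting the flag. The variable names flag_before and flag_after are only there for illustration.

```python
import torch

# Option 1: set the flag when the tensor is created.
w = torch.randn(5, 3, requires_grad=True)

# Option 2: create the tensor first, then flip the flag in place.
b = torch.randn(3)
flag_before = b.requires_grad   # tensors do not require grad by default
b.requires_grad_(True)          # in-place; returns the same tensor
flag_after = b.requires_grad

print(w.requires_grad, flag_before, flag_after)
```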

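The next part was a bit of a surprise: each operation applied to a Tensor is an object of the Function class, which holds both the forward computation and the computation used for backpropagation. The backpropagation function can apparently be obtained through grad_fn.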

print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

output

Gradient function for z = <AddBackward0 object at 0x7f1358513e80>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f1358513340>

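With automatic differentiation based on dual numbers, the derivative falls out by itself during the forward computation, so perhaps PyTorch prepares dedicated derivative functions and computes them explicitly for reasons such as numerical stability. (Dual numbers give forward-mode differentiation, while backward() is reverse-mode, which needs a separate backward function for each operation in any case.)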
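For comparison, here is a rough sketch of what dual-number differentiation looks like. The Dual class and the function f are hypothetical and written only for this illustration; they are not part of PyTorch.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """Hypothetical dual number val + eps * e, where e * e == 0."""
    val: float  # function value
    eps: float  # derivative carried along with the value

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        # (a + b e)(c + d e) = ac + (ad + bc) e
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)


def f(x):
    return x * x + x


# Seed the derivative part with 1 to differentiate with respect to x.
out = f(Dual(3.0, 1.0))
print(out.val, out.eps)
```

The derivative is available as soon as the forward computation finishes, with no separate backward step.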

Looking at the documentation for Function, forward and backward are defined separately, so it does seem to compute the derivative explicitly.

>>> class Exp(Function):
>>>     @staticmethod
>>>     def forward(ctx, i):
>>>         result = i.exp()
>>>         ctx.save_for_backward(result)
>>>         return result
>>>
>>>     @staticmethod
>>>     def backward(ctx, grad_output):
>>>         result, = ctx.saved_tensors
>>>         return grad_output * result
>>>
>>> # Use it by calling the apply method:
>>> output = Exp.apply(input)

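ctx is presumably short for context. For readers who are not used to programming: a "context" is a place where information needed by later steps in a sequence of processing is stored. All sorts of things get stored there, so knowing that something is a context does not tell you what it holds until you read the implementation. Here, though, save_for_backward() lets you save Tensors for use during backward, which means the forward computation and backpropagation together are being treated as one sequence of processing.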
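The documentation snippet above does not run as is, because Function is not imported and input is not defined. Below is a self-contained version. The input tensor x_in and the comparison against torch.exp are my own additions for checking the result.

```python
import torch
from torch.autograd import Function


class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        # Stash the forward result so that backward can reuse it.
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        # d/di exp(i) = exp(i), which is exactly the saved forward result.
        result, = ctx.saved_tensors
        return grad_output * result


x_in = torch.tensor([0.0, 1.0, 2.0], requires_grad=True)
out = Exp.apply(x_in)
out.sum().backward()

# The custom backward should agree with the analytic derivative exp(x).
matches_builtin = torch.allclose(x_in.grad, torch.exp(x_in.detach()))
print(out.grad_fn)
print(matches_builtin)
```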

grad_fn appears to be the property that tells you, from the computational graph, which backward will be called for that Tensor.

The grad_fn attribute of a torch.Tensor holds a torch.autograd.graph.Node if the tensor is the output of a operation that was recorded by autograd (i.e., grad_mode is enabled and at least one of the inputs required gradients), or None otherwise.
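A small check of the quoted rule, as far as I understand it: a tensor created directly by the user has no grad_fn, and neither does the result of an operation none of whose inputs require gradients. The tensor u is made up for this example.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)

z = torch.matmul(x, w)  # recorded, because w requires grad
u = x * 2               # not recorded, because no input requires grad

print(w.grad_fn)  # leaf tensor created by the user
print(z.grad_fn)
print(u.grad_fn)
```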

Backpropagation is run with the Tensor's backward method.

print(w.grad)
print(b.grad)

loss.backward()

print(w.grad)
print(b.grad)

output

None
None
tensor([[0.2498, 0.1345, 0.0555],
        [0.2498, 0.1345, 0.0555],
        [0.2498, 0.1345, 0.0555],
        [0.2498, 0.1345, 0.0555],
        [0.2498, 0.1345, 0.0555]])
tensor([0.2498, 0.1345, 0.0555])

There is a note here.

We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.

For performance reasons, backward() can apparently be run only once on a given computational graph. Let's try it.

loss.backward()

output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-52a0569421b1> in <cell line: 1>()
----> 1 loss.backward()

1 frames
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    198     # some Python versions print out the first line of a multi-line function
    199     # calls in the traceback and some print out the last line
--> 200     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    201         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    202         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
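As the error message suggests, passing retain_graph=True to the first call keeps the saved tensors, so a second call goes through. The sketch below rebuilds the same graph from scratch. Note that gradients accumulate in .grad, so the second call should leave twice the first gradient there.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward(retain_graph=True)  # keep the saved intermediate values
first = w.grad.clone()

loss.backward()                   # second pass over the same graph works

# .grad accumulates across backward calls rather than being overwritten.
accumulated = torch.allclose(w.grad, 2 * first)
print(accumulated)
```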

Turning off automatic differentiation

Tensors created inside with torch.no_grad() apparently do not have gradients computed for them.

z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

output

True
False

PyTorch computes gradients explicitly at backward() time, and until then the gradient held by a Tensor is None and consumes no memory, so what torch.no_grad() disables is probably the following two things.

Saving information for backward(), such as ctx.save_for_backward()
Building the computational graph

I say "probably" because I could not tell for sure even after reading the documentation and the source code.
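The graph-building half of that guess can at least be observed from the outside. In this sketch the result computed under torch.no_grad() has no grad_fn and is reported as a leaf, which is consistent with no graph node having been recorded for it. The names z_tracked and z_untracked are mine.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z_tracked = torch.matmul(x, w) + b
with torch.no_grad():
    z_untracked = torch.matmul(x, w) + b

print(z_tracked.grad_fn)      # a backward node
print(z_untracked.grad_fn)    # nothing was recorded
print(z_untracked.is_leaf)
```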

Apparently you can also turn off gradient tracking by calling detach().

z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

Didn't something somewhere say to do this with requires_grad_(False)? The Tensor documentation contains the following passage.

torch.tensor() always copies data. If you have a Tensor data and just want to change its requires_grad flag, use requires_grad_() or detach() to avoid a copy. If you have a numpy array and want to avoid a copy, use torch.as_tensor().

Going by that, either one looks fine, so I went and read the documentation for each.

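requires_grad_(False): rewrites the requires_grad attribute of that tensor in place.
- It rewrites information on a tensor that may belong to some computational graph, so it can affect the computation that graph represents. It only works on leaf tensors; calling it on an intermediate result such as z raises a RuntimeError.
- It feels close to torch.no_grad().
detach(): returns a new tensor with requires_grad=False, detached from the graph. The new tensor shares storage with the original rather than copying the data.
- Because the returned tensor is cut off from the computational graph, operations on it are not recorded in the original graph. In-place modifications are still visible through the original tensor, since the storage is shared.
- Not equivalent to torch.no_grad().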

[ requires_grad_() ]
Change if autograd should record operations on this tensor: sets this tensor’s requires_grad attribute in-place. Returns this tensor.

[ detach() ]
Returns a new Tensor, detached from the current graph.

So that is how it is. Their uses are completely different.
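A quick experiment to make the difference concrete. As far as I can tell, detach() hands back a tensor over the same memory, while requires_grad_(False) is refused on a non-leaf tensor such as z and only works on leaves such as w. The names shares_storage and raised are made up for this check.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

# detach(): a new tensor object over the same underlying memory.
z_det = z.detach()
shares_storage = z_det.data_ptr() == z.data_ptr()

# requires_grad_(False) on a non-leaf tensor is rejected.
try:
    z.requires_grad_(False)
    raised = False
except RuntimeError:
    raised = True

# On a leaf tensor the flag can be flipped freely.
w.requires_grad_(False)

print(shares_storage, raised, z.requires_grad, z_det.requires_grad, w.requires_grad)
```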

requires_grad_(False) also rewrites an attribute of an existing tensor, which I think makes it different from torch.no_grad(), whose effect is that tensors created from then on are created with requires_grad=False.

z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    print(z.requires_grad)

output

True
True

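As expected. with torch.no_grad() does not go as far as rewriting the requires_grad attribute of existing tensors. It is probably worth keeping in mind that each of these behaves slightly differently.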

More on Computational Graphs

This section describes how the computational graph behaves. It is mostly what I had guessed so far.

In a forward pass, autograd does two things simultaneously:

run the requested operation to compute a resulting tensor

maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

computes the gradients from each .grad_fn,

accumulates them in the respective tensor’s .grad attribute

using the chain rule, propagates all the way to the leaf tensors.

There is a note here.

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.

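The computational graph is rebuilt after every .backward() call, so implementations where, for example, the tensor shape differs from iteration to iteration are possible. Convenient.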
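A small sketch of what that makes possible: the number of operations recorded in the graph depends on an ordinary Python loop, and a fresh graph is built on each iteration. The helper power_by_loop is hypothetical and exists only for this example.

```python
import torch


def power_by_loop(x, n):
    """Hypothetical helper: computes x ** (n + 1) with a plain Python loop."""
    y = x
    for _ in range(n):
        y = y * x
    return y


grads = []
for n in (1, 2, 3):
    # A different number of multiplications is recorded on each iteration.
    x = torch.tensor(2.0, requires_grad=True)
    power_by_loop(x, n).backward()
    grads.append(x.grad.item())

# d/dx x**(n + 1) = (n + 1) * x**n, evaluated at x = 2
print(grads)
```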

Optional Reading: Tensor Gradients and Jacobian Products

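For a model $y=f(x)$ whose output is not a scalar, the Jacobian $J \in \mathbb{R} ^ {m \times n}$ no longer matches the shape of the parameters. So when $x \in \mathbb{R} ^ n$ and $y \in \mathbb{R} ^ m$, you pass a suitable vector $v \in \mathbb{R} ^ m$ to backward(), and the resulting Jacobian product $v ^ \mathrm{T} J$ is what gets computed as the gradient, according to the tutorial.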
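A minimal check of this with a linear map, where the Jacobian is simply the matrix itself. The matrix A and the vector v are arbitrary values chosen for illustration, and torch.autograd.functional.jacobian is used only to obtain the full Jacobian for comparison.

```python
import torch

A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # m = 2, n = 3
x = torch.ones(3, requires_grad=True)
y = A @ x                             # the Jacobian of y with respect to x is A

v = torch.tensor([1.0, -1.0])
y.backward(v)                         # leaves v^T J in x.grad

# Build the full Jacobian explicitly and compare against the product.
J = torch.autograd.functional.jacobian(lambda t: A @ t, x.detach())
vjp_matches = torch.allclose(x.grad, v @ J)
print(x.grad)
print(vjp_matches)
```

Calling y.backward() with no argument would raise an error here, because y is not a scalar.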

This seems likely to be used in cases with multiple objective functions, or in multimodal AI.

That's all for this time.