Running voxelpose-pytorch on Amazon EC2

Yosuke MIYAJIMA

Goal: train the model and then run inference, end to end.

Code:
https://github.com/microsoft/voxelpose-pytorch

Paper:
[2004.06239] VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment
https://arxiv.org/abs/2004.06239

Slides summarizing the paper:
[DL輪読会]VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wi…
https://www.slideshare.net/DeepLearningJP2016/dlvoxelpose-towards-multicamera-3d-human-pose-estimation-in-wild-environment

Demo video:
https://www.youtube.com/watch?v=-UyuQ7G74iE

Website of Chunyu Wang (王春雨), one of the paper's authors:
https://www.chunyuwang.org/publication/multi-person-pose/

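In the scrap linked below I have been trying to run voxelpose in SageMaker Studio, but I still don't know how to get around the /dev/shm size limit when running a Docker container on SageMaker (or whether that is possible at all). On top of that, one epoch on the smallest dataset took 45 minutes on a p3.2xlarge (and still 45 minutes on a p3.8xlarge). So here I check whether running the scripts directly on EC2 gets training done in less time.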
https://zenn.dev/mayosuke/scraps/2b941227d47720

Launched Deep Learning AMI (Ubuntu 18.04) Version 44.0 - ami-085443ce7e677f966 on a g4dn.xlarge.

On step 3 of the EC2 launch screen (Configure Instance Details), click "Add file system". The only EFS I had already created was the one for Studio, so the Studio EFS ID was filled in by default. "Automatically create and attach the required security groups" was checked, so I left it as is and went ahead with the launch.

Since the AMI is Ubuntu-based, log in as the ubuntu user:

ssh ubuntu@<ec2-ip-address> -i <pem>

Don't forget to chmod 600 the pem file.

It looks like the mount had not finished yet. After waiting a little, it showed up.

$ ls /mnt/efs/fs1/
200005
$ ls /mnt/efs/fs1/200005/
ls: cannot open directory '/mnt/efs/fs1/200005/': Permission denied

With sudo ls, the contents of the EFS became visible.

$ sudo ls /mnt/efs/fs1/200005/
CampusSeq1  CampusSeq1.tar.bz2  Shelf  Shelf.tar.bz2  docker  voxelpose-pytorch  voxelpose.ipynb  voxelpose2.ipynb
$ pwd
/home/ubuntu
$ sudo apt update && sudo apt upgrade
$ sudo apt install -y python3-venv
# Accessing the EFS requires root privileges, so work as root from here on
$ sudo su -
$ pwd
/root
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ 
(.venv) $ python3 -m pip install -U pip
(.venv) $ python3 -m pip freeze
pkg-resources==0.0.0

Running apt update/install and the like before the EFS mount had completed (that is, before it was visible with ls) resulted in errors.
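
To avoid starting setup too early, one option is to wait until the mount point actually exists. The following is only a minimal sketch: wait_for_mount is a hypothetical helper, not part of any AWS tooling, and it only polls os.path.ismount. It does not explain why apt failed, which the notes above do not establish.

```python
import os
import time


def wait_for_mount(path, timeout_s=300.0, interval_s=5.0, is_mounted=os.path.ismount):
    """Poll until `path` is a mount point. Return True if it appeared in time."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_mounted(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (mount point taken from the ls output above):
#   if wait_for_mount("/mnt/efs/fs1"):
#       ...  # now safe to continue with the setup steps
```

The is_mounted parameter exists only so the check can be swapped out for testing; by default the standard library check is used.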

Run epoch 3 with workers still set to 0 (g4dn.xlarge).

$ cd <voxelpose-pytorch-root>
$ python3 -m pip install -r requirements.txt
$ time python3 run/train_3d.py --cfg configs/campus/prn64_cpn80x80x20.yaml

It took about the same amount of time as running on ml.g4dn.xlarge in Studio.

real    103m5.927s
user    100m10.748s
sys     5m42.869s

Set workers to 4 and run epoch 4 (g4dn.xlarge).
No shm size error occurs.

$ time python3 run/train_3d.py --cfg configs/campus/prn64_cpn80x80x20.yaml

Even with 4 workers, the run time barely changed. Sigh.

real    100m15.802s
user    100m20.524s
sys     7m29.478s
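
On the shm remark above: PyTorch DataLoader worker processes hand batches to the main process through shared memory, which is why workers > 0 can fail when /dev/shm is small (as in a size-limited Docker container). A quick way to see how much room there is, as a rough sketch using only the standard library (fs_usage_gib is a hypothetical helper name):

```python
import os
import shutil


def fs_usage_gib(path):
    """Return (total, used, free) of the filesystem holding `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


# /dev/shm exists on typical Linux hosts; skip quietly elsewhere.
if os.path.isdir("/dev/shm"):
    total, used, free = fs_usage_gib("/dev/shm")
    print(f"/dev/shm: total={total:.2f} GiB, used={used:.2f} GiB, free={free:.2f} GiB")
```

Running this both on the EC2 host and inside the SageMaker container would show whether the two environments really differ in shm size.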

Check whether adding GPUs shortens the run. Try g4dn.12xlarge, which is supposed to have 4 GPUs.

The g4dn.12xlarge does indeed have 4 GPUs (Tesla T4).

# python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
4
>>> for i in range(4):
...     print(torch.cuda.get_device_name(i))
... 
Tesla T4
Tesla T4
Tesla T4
Tesla T4
>>> 

With workers=4 and GPU count=4, starting epoch 5 produced a warning. Training itself seems to be progressing, so I let it run and keep an eye on it.

=> Training...
Epoch: 5
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: [5][0/750]       Time: 12.206s (12.206s) Speed: 1.0 samples/s    Data: 0.676s (0.676s)   Loss: 16.870518 (16.870518)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000226 (0.0000226)    Loss_cord: 16.870495 (16.870495)        Memory 449552384.0

One epoch finished in about 35 minutes!

real    34m49.962s
user    83m52.213s
sys     3m39.516s

p3.2xlarge (workers=0, 1 GPU) costs $3/h and takes about 45 min/epoch.
g4dn.12xlarge costs $3.912/h and takes about 35 min/epoch.
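
To put the two on the same footing, here is a small sketch that turns the hourly prices and epoch times quoted above into a cost per epoch. cost_per_epoch is just an illustrative helper; the prices are the ones noted above, not looked up.

```python
def cost_per_epoch(usd_per_hour, minutes_per_epoch):
    """Instance cost in USD for one epoch at the given hourly price."""
    return usd_per_hour * minutes_per_epoch / 60.0


runs = {
    "p3.2xlarge (1 GPU, workers=0)": (3.0, 45),
    "g4dn.12xlarge (4 GPUs, workers=4)": (3.912, 35),
}
for name, (price, minutes) in runs.items():
    print(f"{name}: ${cost_per_epoch(price, minutes):.2f} per epoch")
# p3.2xlarge (1 GPU, workers=0): $2.25 per epoch
# g4dn.12xlarge (4 GPUs, workers=4): $2.28 per epoch
```

So with these figures the cost per epoch is nearly the same, and the g4dn.12xlarge simply gets there sooner.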

# time python3 test/evaluate.py --cfg configs/campus/prn64_cpn80x80x20.yaml
=> creating /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20
=> creating /mnt/efs/fs1/200005/voxelpose-pytorch/log/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x202021-06-03-15-57
=> Loading data ..
=> load /mnt/efs/fs1/200005/voxelpose-pytorch/data/CampusSeq1/pred_campus_maskrcnn_hrnet_coco.pkl
=> Constructing models ..
=> load models state /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/model_best.pth.tar
  0%|                                                                                                                                                                                                                                          | 0/14 [00:00<?, ?it/s]
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:29<00:00,  1.40s/it]
+------------+---------+---------+---------+---------+
| Bone Group | Actor 1 | Actor 2 | Actor 3 | Average |
+------------+---------+---------+---------+---------+
|    Head    |  100.0  |  100.0  |   98.6  |   99.5  |
|   Torso    |  100.0  |  100.0  |  100.0  |  100.0  |
| Upper arms |   94.9  |   99.5  |  100.0  |   98.1  |
| Lower arms |   85.7  |   66.9  |   93.1  |   81.9  |
| Upper legs |  100.0  |  100.0  |  100.0  |  100.0  |
| Lower legs |  100.0  |  100.0  |   99.6  |   99.9  |
|   Total    |   96.1  |   93.3  |   98.4  |   95.9  |
+------------+---------+---------+---------+---------+

real    0m34.817s
user    1m30.763s
sys     0m16.542s

Also try running epochs on the 4-GPU p3.8xlarge.

First, run epoch 6 with workers=4.

real    19m36.913s
user    38m57.910s
sys     11m21.744s

Ran epoch 7 with workers=0. It was only slightly slower than with workers=4.

real    22m8.304s
user    48m23.424s
sys     9m12.854s

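Run epochs 8 to 29 with workers=16.
I chose 16 simply because the original config assigns 4 workers when using 1 GPU, so with 4 times the GPUs I multiplied the worker count by 4 as well.
When the first logged step of an epoch (100/750) came up, about 2.5 minutes had passed, which is about the same as with workers=4.
It does not look like this will shorten the run, but it is not slower and there are no errors, so I let it keep running.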
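
The 2.5-minute figure can be cross-checked against the training log itself. The sketch below assumes that, in a line such as `Epoch: [26][100/750] Time: 1.430s (1.530s)`, the value in parentheses is the running average time per step; that is my reading of the log format, not something verified against the code. estimate_minutes is a hypothetical helper.

```python
import re

LOG_RE = re.compile(
    r"Epoch: \[(\d+)\]\[(\d+)/(\d+)\]\s+Time: ([\d.]+)s \(([\d.]+)s\)"
)


def estimate_minutes(log_line):
    """Return (elapsed_min, projected_epoch_min) from one training log line."""
    m = LOG_RE.search(log_line)
    if m is None:
        raise ValueError("not a training log line")
    step, total = int(m.group(2)), int(m.group(3))
    avg_s = float(m.group(5))  # assumed: running average seconds per step
    return step * avg_s / 60, total * avg_s / 60


line = "Epoch: [26][100/750]    Time: 1.430s (1.530s)   Speed: 8.4 samples/s"
elapsed, projected = estimate_minutes(line)
print(f"elapsed ~{elapsed:.2f} min, projected training time ~{projected:.2f} min")
```

With about 1.5 s per step this gives roughly 2.5 minutes at step 100 and roughly 19 minutes for all 750 steps, not counting the validation pass at the end of each epoch.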

An error occurred at epoch 27.

Epoch: 26
Epoch: [26][0/750]      Time: 4.218s (4.218s)   Speed: 2.8 samples/s    Data: 2.675s (2.675s)   Loss: 15.223214 (15.223214)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000330 (0.0000330)  Loss_cord: 15.223181 (15.223181)        Memory 445807616.0
Epoch: [26][100/750]    Time: 1.430s (1.530s)   Speed: 8.4 samples/s    Data: 0.000s (0.085s)   Loss: 18.009567 (14.135050)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000259 (0.0000292)  Loss_cord: 18.009541 (14.135021)        Memory 446594048.0
Epoch: [26][200/750]    Time: 1.650s (1.498s)   Speed: 7.3 samples/s    Data: 0.000s (0.066s)   Loss: 13.078530 (13.876710)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000378 (0.0000287)  Loss_cord: 13.078492 (13.876682)        Memory 447110144.0
Epoch: [26][300/750]    Time: 1.581s (1.499s)   Speed: 7.6 samples/s    Data: 0.000s (0.063s)   Loss: 14.929512 (13.772626)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000293 (0.0000296)  Loss_cord: 14.929482 (13.772597)        Memory 445448192.0
Epoch: [26][400/750]    Time: 1.590s (1.496s)   Speed: 7.5 samples/s    Data: 0.000s (0.061s)   Loss: 13.565663 (13.619408)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000333 (0.0000292)  Loss_cord: 13.565630 (13.619379)        Memory 446259200.0
Epoch: [26][500/750]    Time: 1.612s (1.495s)   Speed: 7.4 samples/s    Data: 0.000s (0.060s)   Loss: 12.003005 (13.656799)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000312 (0.0000291)  Loss_cord: 12.002974 (13.656770)        Memory 446160896.0
Epoch: [26][600/750]    Time: 1.541s (1.489s)   Speed: 7.8 samples/s    Data: 0.000s (0.059s)   Loss: 12.246838 (13.645142)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000240 (0.0000290)  Loss_cord: 12.246814 (13.645113)        Memory 445546496.0
Epoch: [26][700/750]    Time: 1.297s (1.488s)   Speed: 9.3 samples/s    Data: 0.000s (0.057s)   Loss: 11.776750 (13.628526)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000351 (0.0000292)  Loss_cord: 11.776714 (13.628496)        Memory 445767680.0
Test: [0/14]    Time: 4.910s (4.910s)   Speed: 9.8 samples/s    Data: 4.165s (4.165s)   Memory 211895296.0
Test: [13/14]   Time: 0.753s (2.044s)   Speed: 55.8 samples/s   Data: 0.000s (1.193s)   Memory 198838784.0
     | Actor 1 | Actor 2 | Actor 3 | Average | 
 PCP |  95.71  |  93.54  |  96.96  |  95.41  |   Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 27
Epoch: [27][0/750]      Time: 3.533s (3.533s)   Speed: 3.4 samples/s    Data: 2.021s (2.021s)   Loss: 11.083156 (11.083156)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000180 (0.0000180)  Loss_cord: 11.083138 (11.083138)        Memory 445807616.0
Epoch: [27][100/750]    Time: 1.568s (1.523s)   Speed: 7.7 samples/s    Data: 0.000s (0.064s)   Loss: 7.779622 (13.582578)      Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000189 (0.0000337)  Loss_cord: 7.779603 (13.582544) Memory 446987264.0
Epoch: [27][200/750]    Time: 1.706s (1.505s)   Speed: 7.0 samples/s    Data: 0.000s (0.054s)   Loss: 13.117733 (13.915689)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000785 (0.0000340)  Loss_cord: 13.117655 (13.915655)        Memory 445669376.0
Epoch: [27][300/750]    Time: 1.540s (1.504s)   Speed: 7.8 samples/s    Data: 0.000s (0.057s)   Loss: 13.763736 (13.975562)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000129 (0.0000335)  Loss_cord: 13.763723 (13.975528)        Memory 446160896.0
Epoch: [27][400/750]    Time: 1.421s (1.495s)   Speed: 8.4 samples/s    Data: 0.000s (0.055s)   Loss: 12.280865 (13.815508)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000121 (0.0000316)  Loss_cord: 12.280852 (13.815477)        Memory 446160896.0
Epoch: [27][500/750]    Time: 1.583s (1.486s)   Speed: 7.6 samples/s    Data: 0.000s (0.055s)   Loss: 12.128736 (13.758208)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000220 (0.0000303)  Loss_cord: 12.128715 (13.758177)        Memory 445448192.0
Epoch: [27][600/750]    Time: 1.294s (1.480s)   Speed: 9.3 samples/s    Data: 0.000s (0.054s)   Loss: 9.983364 (13.721062)      Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000264 (0.0000298)  Loss_cord: 9.983337 (13.721032) Memory 446062592.0
Epoch: [27][700/750]    Time: 1.712s (1.480s)   Speed: 7.0 samples/s    Data: 0.000s (0.054s)   Loss: 11.498207 (13.653339)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000618 (0.0000299)  Loss_cord: 11.498145 (13.653310)        Memory 446397440.0
Test: [0/14]    Time: 4.518s (4.518s)   Speed: 10.6 samples/s   Data: 3.832s (3.832s)   Memory 211895296.0
Test: [13/14]   Time: 0.774s (1.963s)   Speed: 54.2 samples/s   Data: 0.000s (1.172s)   Memory 198838784.0
Traceback (most recent call last):
  File "run/train_3d.py", line 160, in <module>
    main()
  File "run/train_3d.py", line 134, in main
    precision = validate_3d(config, model, test_loader, final_output_dir)
  File "/mnt/efs/fs1/200005/voxelpose-pytorch/run/../lib/core/function.py", line 168, in validate_3d
    actor_pcp, avg_pcp, _, recall = loader.dataset.evaluate(preds)
  File "/mnt/efs/fs1/200005/voxelpose-pytorch/run/../lib/dataset/campus.py", line 183, in evaluate
    pred = np.stack([self.coco2campus3D(p) for p in copy.deepcopy(pred_coco[:, :, :3])])
  File "/root/.venv/lib/python3.6/site-packages/numpy/core/shape_base.py", line 412, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

real    386m56.209s
user    787m54.397s
sys     231m52.142s

After restarting from epoch 27, this time it ran through to the final epoch 29 (epochs start at 0).

 'WORKERS': 16}
=> Loading data ..
=> load /mnt/efs/fs1/200005/voxelpose-pytorch/data/CampusSeq1/pred_campus_maskrcnn_hrnet_coco.pkl
=> Constructing models ..
=> load checkpoint /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/checkpoint.pth.tar (epoch 27)
=> Training...
Epoch: 27
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: [27][0/750]      Time: 17.269s (17.269s) Speed: 0.7 samples/s    Data: 3.420s (3.420s)   Loss: 20.028179 (20.028179)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0001006 (0.0001006)  Loss_cord: 20.028078 (20.028078)        Memory 449552384.0
Epoch: [27][100/750]    Time: 1.469s (1.654s)   Speed: 8.2 samples/s    Data: 0.000s (0.102s)   Loss: 16.410538 (13.451304)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000506 (0.0000363)  Loss_cord: 16.410486 (13.451267)        Memory 445546496.0
Epoch: [27][200/750]    Time: 0.975s (1.572s)   Speed: 12.3 samples/s   Data: 0.000s (0.080s)   Loss: 22.311125 (13.832909)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000095 (0.0000332)  Loss_cord: 22.311115 (13.832876)        Memory 444563456.0
Epoch: [27][300/750]    Time: 1.398s (1.536s)   Speed: 8.6 samples/s    Data: 0.000s (0.067s)   Loss: 11.109104 (13.800611)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000231 (0.0000306)  Loss_cord: 11.109081 (13.800580)        Memory 446357504.0
Epoch: [27][400/750]    Time: 1.381s (1.520s)   Speed: 8.7 samples/s    Data: 0.000s (0.062s)   Loss: 16.316496 (13.736643)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000402 (0.0000311)  Loss_cord: 16.316456 (13.736612)        Memory 445743104.0
Epoch: [27][500/750]    Time: 1.388s (1.505s)   Speed: 8.6 samples/s    Data: 0.000s (0.060s)   Loss: 16.382431 (13.711130)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000197 (0.0000308)  Loss_cord: 16.382412 (13.711099)        Memory 445743104.0
Epoch: [27][600/750]    Time: 1.595s (1.500s)   Speed: 7.5 samples/s    Data: 0.000s (0.058s)   Loss: 16.412279 (13.659989)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000368 (0.0000305)  Loss_cord: 16.412243 (13.659958)        Memory 444268544.0
Epoch: [27][700/750]    Time: 1.561s (1.494s)   Speed: 7.7 samples/s    Data: 0.000s (0.057s)   Loss: 10.678740 (13.656100)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000450 (0.0000306)  Loss_cord: 10.678695 (13.656070)        Memory 445079552.0
Test: [0/14]    Time: 7.586s (7.586s)   Speed: 6.3 samples/s    Data: 3.708s (3.708s)   Memory 212190208.0
Test: [13/14]   Time: 0.913s (2.320s)   Speed: 46.0 samples/s   Data: 0.000s (1.209s)   Memory 199133696.0
     | Actor 1 | Actor 2 | Actor 3 | Average | 
 PCP |  96.53  |  93.33  |  98.48  |  96.11  |   Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 28
Epoch: [28][0/750]      Time: 4.496s (4.496s)   Speed: 2.7 samples/s    Data: 2.400s (2.400s)   Loss: 12.797877 (12.797877)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000165 (0.0000165)  Loss_cord: 12.797861 (12.797861)        Memory 443678720.0
Epoch: [28][100/750]    Time: 1.095s (1.540s)   Speed: 11.0 samples/s   Data: 0.000s (0.076s)   Loss: 13.864036 (13.551936)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000358 (0.0000292)  Loss_cord: 13.863999 (13.551907)        Memory 445939712.0
Epoch: [28][200/750]    Time: 1.122s (1.496s)   Speed: 10.7 samples/s   Data: 0.000s (0.062s)   Loss: 12.567248 (13.712278)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000166 (0.0000282)  Loss_cord: 12.567232 (13.712250)        Memory 445153280.0
Epoch: [28][300/750]    Time: 1.573s (1.482s)   Speed: 7.6 samples/s    Data: 0.000s (0.058s)   Loss: 11.183677 (13.788406)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000146 (0.0000284)  Loss_cord: 11.183662 (13.788378)        Memory 445743104.0
Epoch: [28][400/750]    Time: 1.273s (1.483s)   Speed: 9.4 samples/s    Data: 0.000s (0.057s)   Loss: 11.101755 (13.717575)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000174 (0.0000296)  Loss_cord: 11.101738 (13.717545)        Memory 446160896.0
Epoch: [28][500/750]    Time: 1.382s (1.485s)   Speed: 8.7 samples/s    Data: 0.000s (0.056s)   Loss: 10.858953 (13.730741)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000143 (0.0000295)  Loss_cord: 10.858939 (13.730711)        Memory 445841408.0
Epoch: [28][600/750]    Time: 1.453s (1.481s)   Speed: 8.3 samples/s    Data: 0.000s (0.054s)   Loss: 13.933317 (13.686009)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000420 (0.0000295)  Loss_cord: 13.933275 (13.685980)        Memory 446652416.0
Epoch: [28][700/750]    Time: 1.479s (1.480s)   Speed: 8.1 samples/s    Data: 0.000s (0.055s)   Loss: 18.954720 (13.698181)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000430 (0.0000299)  Loss_cord: 18.954676 (13.698151)        Memory 446062592.0
Test: [0/14]    Time: 4.290s (4.290s)   Speed: 11.2 samples/s   Data: 3.470s (3.470s)   Memory 214696960.0
Test: [13/14]   Time: 0.775s (2.034s)   Speed: 54.2 samples/s   Data: 0.000s (1.171s)   Memory 199133696.0
     | Actor 1 | Actor 2 | Actor 3 | Average | 
 PCP |  96.94  |  93.17  |  95.22  |  95.11  |   Recall@500mm: 0.9947
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 29
Epoch: [29][0/750]      Time: 4.521s (4.521s)   Speed: 2.7 samples/s    Data: 2.754s (2.754s)   Loss: 10.656137 (10.656137)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000185 (0.0000185)  Loss_cord: 10.656118 (10.656118)        Memory 445276160.0
Epoch: [29][100/750]    Time: 1.650s (1.529s)   Speed: 7.3 samples/s    Data: 0.000s (0.076s)   Loss: 18.989956 (14.353836)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000837 (0.0000355)  Loss_cord: 18.989872 (14.353801)        Memory 445865984.0
Epoch: [29][200/750]    Time: 1.636s (1.504s)   Speed: 7.3 samples/s    Data: 0.000s (0.071s)   Loss: 11.814981 (14.153502)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000323 (0.0000331)  Loss_cord: 11.814949 (14.153469)        Memory 445743104.0
Epoch: [29][300/750]    Time: 1.573s (1.487s)   Speed: 7.6 samples/s    Data: 0.000s (0.065s)   Loss: 18.256166 (13.928165)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000364 (0.0000313)  Loss_cord: 18.256130 (13.928134)        Memory 446455808.0
Epoch: [29][400/750]    Time: 1.001s (1.470s)   Speed: 12.0 samples/s   Data: 0.000s (0.063s)   Loss: 12.627074 (13.716256)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000160 (0.0000296)  Loss_cord: 12.627058 (13.716227)        Memory 446259200.0
Epoch: [29][500/750]    Time: 1.256s (1.466s)   Speed: 9.6 samples/s    Data: 0.000s (0.059s)   Loss: 9.755458 (13.613090)      Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000401 (0.0000293)  Loss_cord: 9.755418 (13.613061) Memory 445644800.0
Epoch: [29][600/750]    Time: 1.428s (1.467s)   Speed: 8.4 samples/s    Data: 0.000s (0.056s)   Loss: 15.708157 (13.551201)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000241 (0.0000294)  Loss_cord: 15.708133 (13.551171)        Memory 445472768.0
Epoch: [29][700/750]    Time: 1.292s (1.466s)   Speed: 9.3 samples/s    Data: 0.000s (0.055s)   Loss: 15.418936 (13.466052)     Loss_2d: 0.0000000 (0.0000000)  Loss_3d: 0.0000234 (0.0000292)  Loss_cord: 15.418912 (13.466023)        Memory 446038016.0
Test: [0/14]    Time: 4.796s (4.796s)   Speed: 10.0 samples/s   Data: 4.193s (4.193s)   Memory 214696960.0
Test: [13/14]   Time: 0.750s (2.100s)   Speed: 56.0 samples/s   Data: 0.000s (1.288s)   Memory 199133696.0
     | Actor 1 | Actor 2 | Actor 3 | Average | 
 PCP |  96.94  |  93.28  |  98.19  |  96.14  |   Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
saving final model state to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/final_state.pth.tar

real    57m55.987s
user    116m3.626s
sys     36m16.613s

Running epochs 8 to 29 to completion took
about 387 min + about 58 min = 445 min
445 / (30-8) = a little over 20 min <- per epoch
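
The same arithmetic can be done straight from the two `real` values printed by time, as a small sketch (real_minutes is a hypothetical helper; the two strings are the values from the runs above):

```python
import re


def real_minutes(time_str):
    """Convert a `time` value such as '386m56.209s' to minutes."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", time_str.strip())
    if m is None:
        raise ValueError(f"unexpected format: {time_str!r}")
    return int(m.group(1)) + float(m.group(2)) / 60


runs = ["386m56.209s", "57m55.987s"]  # crashed run + resumed run
total_min = sum(real_minutes(t) for t in runs)
epochs = 30 - 8  # epochs 8 through 29
print(f"total: {total_min:.1f} min, per epoch: {total_min / epochs:.1f} min")
# total: 444.9 min, per epoch: 20.2 min
```

Note that epoch 27 was trained in both runs, since the first run failed only in the validation step after it. Counting 23 training passes instead of 22 would give about 19.3 min per pass.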

Whether workers is 0 or 1 and above, it doesn't seem to make much difference?
If so, running on SageMaker Studio might also be an option, as long as the slightly higher instance cost is acceptable.