Running voxelpose-pytorch on Amazon EC2
The goal is to train the model and then run inference.
Code:
Paper:
[2004.06239] VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment
Slides summarizing the paper:
[DL輪読会]VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wi…
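Demo video:
Website of Chunyu Wang (王春雨), one of the paper's authors
In a separate note I have been trying to run voxelpose on SageMaker Studio, but I still do not know how to get around the /dev/shm size limit when running a Docker container on SageMaker (or whether that is possible at all). In addition, one epoch on the smallest dataset took 45 minutes on p3.2xlarge (and still 45 minutes on p3.8xlarge). So here I check whether running the scripts directly on EC2 finishes training in less time.
Launched Deep Learning AMI (Ubuntu 18.04) Version 44.0 - ami-085443ce7e677f966 on a g4dn.xlarge.
I want to mount the Studio EFS on the EC2 instance.
In step 3 of the EC2 launch screen, "Configure Instance Details", click "Add file system". No EFS existed other than the one for Studio, so the Studio EFS ID was set by default. "Automatically create and attach the required security groups" was already checked, so I continued with instance creation as is.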
Because the AMI is Ubuntu-based, log in as the ubuntu user:
ssh ubuntu@<ec2-ip-address> -i <pem>
Do not forget to chmod 600 the pem file.
$ python3 --version
Python 3.6.9
I expected the EFS to be mounted at /mnt/efs/fs1, but it does not appear to be mounted.
$ ls /mnt/
$
It seems the mount simply had not finished yet. After waiting a little, it appeared.
$ ls /mnt/efs/fs1/
200005
$ ls /mnt/efs/fs1/200005/
ls: cannot open directory '/mnt/efs/fs1/200005/': Permission denied
Using sudo ls, I could see the contents of the EFS.
$ sudo ls /mnt/efs/fs1/200005/
CampusSeq1 CampusSeq1.tar.bz2 Shelf Shelf.tar.bz2 docker voxelpose-pytorch voxelpose.ipynb voxelpose2.ipynb
$ pwd
/home/ubuntu
$ sudo apt update && sudo apt upgrade
$ sudo apt install -y python3-venv
# Accessing the EFS requires root privileges, so work as root from here on
$ sudo su -
$ pwd
/root
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $
(.venv) $ python3 -m pip install -U pip
(.venv) $ python3 -m pip freeze
pkg-resources==0.0.0
Running apt update/install before the EFS mount had completed (that is, before it became visible with ls) resulted in errors.
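Since the mount took a while to appear, one option is to wait for it explicitly before running any setup commands. The following is a minimal sketch, not something used in the original session: `wait_for_mount` and its `is_mounted` parameter are hypothetical names, and the default check relies on the standard `os.path.ismount`.

```python
import os
import time


def wait_for_mount(path, timeout=300.0, interval=5.0, is_mounted=os.path.ismount):
    """Poll until `path` is a mount point or `timeout` seconds have passed.

    Returns True if the mount appeared in time, False otherwise.
    `is_mounted` is injectable so the loop can be exercised without a real mount.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_mounted(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On the instance, this could be called as `wait_for_mount("/mnt/efs/fs1")` before apt update or pip install; the timeout and interval values are arbitrary.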
Keeping worker=0, run epoch 3. (g4dn.xlarge)
$ cd <voxelpose-pytorch-root>
$ python3 -m pip install -r requirements.txt
$ time python3 run/train_3d.py --cfg configs/campus/prn64_cpn80x80x20.yaml
It took about the same time as running on ml.g4dn.xlarge in Studio.
real 103m5.927s
user 100m10.748s
sys 5m42.869s
Set worker=4 and run epoch 4. (g4dn.xlarge)
No error about insufficient shm size occurs.
$ time python3 run/train_3d.py --cfg configs/campus/prn64_cpn80x80x20.yaml
Even with 4 workers, the run time barely changed.
real 100m15.802s
user 100m20.524s
sys 7m29.478s
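Because the /dev/shm limit was the blocker on SageMaker, it may be useful to check how much shared memory a host offers before raising the worker count. The following is a small sketch using only the standard library; `shm_usage_gib` is a hypothetical helper name, and the path defaults to /dev/shm as on this Ubuntu instance.

```python
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (total, used, free) in GiB for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib
```

Calling `shm_usage_gib()` on the EC2 host and inside the container would show whether the two environments differ in available shared memory.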
Check whether increasing the number of GPUs shortens the time. Try g4dn.12xlarge, which is listed as having 4 GPUs.
g4dn.12xlarge does indeed have 4 GPUs (Tesla T4).
# python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
4
>>> for i in range(4):
...     print(torch.cuda.get_device_name(i))
...
Tesla T4
Tesla T4
Tesla T4
Tesla T4
>>>
With worker=4 and GPU count=4, starting epoch 5 produced a warning. Training itself appears to be progressing, so I let it run and keep watching.
=> Training...
Epoch: 5
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: [5][0/750] Time: 12.206s (12.206s) Speed: 1.0 samples/s Data: 0.676s (0.676s) Loss: 16.870518 (16.870518) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000226 (0.0000226) Loss_cord: 16.870495 (16.870495) Memory 449552384.0
One epoch finished in about 35 minutes!
real 34m49.962s
user 83m52.213s
sys 3m39.516s
p3.2xlarge with worker=0 and 1 GPU costs $3/h and takes about 45 min/epoch.
g4dn.12xlarge costs $3.912/h and takes about 35 min/epoch.
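To compare the two options on cost rather than time, the hourly price can be converted into a cost per epoch. The sketch below uses only the prices and epoch times quoted above; `cost_per_epoch` is a hypothetical helper name.

```python
def cost_per_epoch(price_per_hour, minutes_per_epoch):
    """Convert an hourly instance price into the cost of one training epoch."""
    return price_per_hour * minutes_per_epoch / 60.0


# (price in $/h, minutes per epoch), both taken from the notes above
runs = {
    "p3.2xlarge (1 GPU, worker=0)": (3.0, 45),
    "g4dn.12xlarge (4 GPU, worker=4)": (3.912, 35),
}

costs = {name: cost_per_epoch(price, minutes) for name, (price, minutes) in runs.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per epoch")
# p3.2xlarge (1 GPU, worker=0): $2.25 per epoch
# g4dn.12xlarge (4 GPU, worker=4): $2.28 per epoch
```

By these figures the cost per epoch is almost the same, so the g4dn.12xlarge mainly buys a shorter wall-clock time.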
# time python3 test/evaluate.py --cfg configs/campus/prn64_cpn80x80x20.yaml
=> creating /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20
=> creating /mnt/efs/fs1/200005/voxelpose-pytorch/log/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x202021-06-03-15-57
=> Loading data ..
=> load /mnt/efs/fs1/200005/voxelpose-pytorch/data/CampusSeq1/pred_campus_maskrcnn_hrnet_coco.pkl
=> Constructing models ..
=> load models state /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/model_best.pth.tar
0%| | 0/14 [00:00<?, ?it/s]
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:29<00:00, 1.40s/it]
+------------+---------+---------+---------+---------+
| Bone Group | Actor 1 | Actor 2 | Actor 3 | Average |
+------------+---------+---------+---------+---------+
| Head | 100.0 | 100.0 | 98.6 | 99.5 |
| Torso | 100.0 | 100.0 | 100.0 | 100.0 |
| Upper arms | 94.9 | 99.5 | 100.0 | 98.1 |
| Lower arms | 85.7 | 66.9 | 93.1 | 81.9 |
| Upper legs | 100.0 | 100.0 | 100.0 | 100.0 |
| Lower legs | 100.0 | 100.0 | 99.6 | 99.9 |
| Total | 96.1 | 93.3 | 98.4 | 95.9 |
+------------+---------+---------+---------+---------+
real 0m34.817s
user 1m30.763s
sys 0m16.542s
Also run epochs on p3.8xlarge, which has 4 GPUs.
First, run epoch 6 with worker=4.
real 19m36.913s
user 38m57.910s
sys 11m21.744s
Ran epoch 7 with worker=0. It was only slightly slower than with worker=4.
real 22m8.304s
user 48m23.424s
sys 9m12.854s
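The single-epoch timings collected so far can be put side by side. The sketch below parses the `real` values reported by `time` in this note and expresses each run as a speedup over the g4dn.xlarge worker=0 run. The helper name `real_seconds` is hypothetical, and each figure comes from a different epoch and includes startup and validation, so the ratios are only rough.

```python
import re


def real_seconds(text):
    """Parse a `time` value such as '103m5.927s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` values copied from the runs recorded above
real_times = {
    "g4dn.xlarge, worker=0": "103m5.927s",
    "g4dn.xlarge, worker=4": "100m15.802s",
    "g4dn.12xlarge, worker=4": "34m49.962s",
    "p3.8xlarge, worker=4": "19m36.913s",
    "p3.8xlarge, worker=0": "22m8.304s",
}

baseline = real_seconds(real_times["g4dn.xlarge, worker=0"])
speedups = {name: baseline / real_seconds(value) for name, value in real_times.items()}
for name, factor in speedups.items():
    print(f"{name}: {real_seconds(real_times[name]) / 60:.1f} min, x{factor:.2f}")
```

Read this way, the GPU count and GPU type dominate the run time, while the worker count changes it only marginally.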
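Set worker=16 and run epochs 8 to 29.
I chose 16 because the original configuration assigned 4 workers when using 1 GPU, so I simply multiplied the worker count by 4 to match the fourfold increase in GPUs.
By the first progress log of an epoch (100/750), about 2.5 minutes had elapsed, which is roughly the same as with worker=4.
It is unlikely to shorten the run time, but it is not slower and produces no errors, so I let it run as is.
An error occurred during validation at the end of epoch 27.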
Epoch: 26
Epoch: [26][0/750] Time: 4.218s (4.218s) Speed: 2.8 samples/s Data: 2.675s (2.675s) Loss: 15.223214 (15.223214) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000330 (0.0000330) Loss_cord: 15.223181 (15.223181) Memory 445807616.0
Epoch: [26][100/750] Time: 1.430s (1.530s) Speed: 8.4 samples/s Data: 0.000s (0.085s) Loss: 18.009567 (14.135050) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000259 (0.0000292) Loss_cord: 18.009541 (14.135021) Memory 446594048.0
Epoch: [26][200/750] Time: 1.650s (1.498s) Speed: 7.3 samples/s Data: 0.000s (0.066s) Loss: 13.078530 (13.876710) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000378 (0.0000287) Loss_cord: 13.078492 (13.876682) Memory 447110144.0
Epoch: [26][300/750] Time: 1.581s (1.499s) Speed: 7.6 samples/s Data: 0.000s (0.063s) Loss: 14.929512 (13.772626) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000293 (0.0000296) Loss_cord: 14.929482 (13.772597) Memory 445448192.0
Epoch: [26][400/750] Time: 1.590s (1.496s) Speed: 7.5 samples/s Data: 0.000s (0.061s) Loss: 13.565663 (13.619408) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000333 (0.0000292) Loss_cord: 13.565630 (13.619379) Memory 446259200.0
Epoch: [26][500/750] Time: 1.612s (1.495s) Speed: 7.4 samples/s Data: 0.000s (0.060s) Loss: 12.003005 (13.656799) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000312 (0.0000291) Loss_cord: 12.002974 (13.656770) Memory 446160896.0
Epoch: [26][600/750] Time: 1.541s (1.489s) Speed: 7.8 samples/s Data: 0.000s (0.059s) Loss: 12.246838 (13.645142) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000240 (0.0000290) Loss_cord: 12.246814 (13.645113) Memory 445546496.0
Epoch: [26][700/750] Time: 1.297s (1.488s) Speed: 9.3 samples/s Data: 0.000s (0.057s) Loss: 11.776750 (13.628526) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000351 (0.0000292) Loss_cord: 11.776714 (13.628496) Memory 445767680.0
Test: [0/14] Time: 4.910s (4.910s) Speed: 9.8 samples/s Data: 4.165s (4.165s) Memory 211895296.0
Test: [13/14] Time: 0.753s (2.044s) Speed: 55.8 samples/s Data: 0.000s (1.193s) Memory 198838784.0
| Actor 1 | Actor 2 | Actor 3 | Average |
PCP | 95.71 | 93.54 | 96.96 | 95.41 | Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 27
Epoch: [27][0/750] Time: 3.533s (3.533s) Speed: 3.4 samples/s Data: 2.021s (2.021s) Loss: 11.083156 (11.083156) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000180 (0.0000180) Loss_cord: 11.083138 (11.083138) Memory 445807616.0
Epoch: [27][100/750] Time: 1.568s (1.523s) Speed: 7.7 samples/s Data: 0.000s (0.064s) Loss: 7.779622 (13.582578) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000189 (0.0000337) Loss_cord: 7.779603 (13.582544) Memory 446987264.0
Epoch: [27][200/750] Time: 1.706s (1.505s) Speed: 7.0 samples/s Data: 0.000s (0.054s) Loss: 13.117733 (13.915689) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000785 (0.0000340) Loss_cord: 13.117655 (13.915655) Memory 445669376.0
Epoch: [27][300/750] Time: 1.540s (1.504s) Speed: 7.8 samples/s Data: 0.000s (0.057s) Loss: 13.763736 (13.975562) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000129 (0.0000335) Loss_cord: 13.763723 (13.975528) Memory 446160896.0
Epoch: [27][400/750] Time: 1.421s (1.495s) Speed: 8.4 samples/s Data: 0.000s (0.055s) Loss: 12.280865 (13.815508) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000121 (0.0000316) Loss_cord: 12.280852 (13.815477) Memory 446160896.0
Epoch: [27][500/750] Time: 1.583s (1.486s) Speed: 7.6 samples/s Data: 0.000s (0.055s) Loss: 12.128736 (13.758208) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000220 (0.0000303) Loss_cord: 12.128715 (13.758177) Memory 445448192.0
Epoch: [27][600/750] Time: 1.294s (1.480s) Speed: 9.3 samples/s Data: 0.000s (0.054s) Loss: 9.983364 (13.721062) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000264 (0.0000298) Loss_cord: 9.983337 (13.721032) Memory 446062592.0
Epoch: [27][700/750] Time: 1.712s (1.480s) Speed: 7.0 samples/s Data: 0.000s (0.054s) Loss: 11.498207 (13.653339) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000618 (0.0000299) Loss_cord: 11.498145 (13.653310) Memory 446397440.0
Test: [0/14] Time: 4.518s (4.518s) Speed: 10.6 samples/s Data: 3.832s (3.832s) Memory 211895296.0
Test: [13/14] Time: 0.774s (1.963s) Speed: 54.2 samples/s Data: 0.000s (1.172s) Memory 198838784.0
Traceback (most recent call last):
File "run/train_3d.py", line 160, in <module>
main()
File "run/train_3d.py", line 134, in main
precision = validate_3d(config, model, test_loader, final_output_dir)
File "/mnt/efs/fs1/200005/voxelpose-pytorch/run/../lib/core/function.py", line 168, in validate_3d
actor_pcp, avg_pcp, _, recall = loader.dataset.evaluate(preds)
File "/mnt/efs/fs1/200005/voxelpose-pytorch/run/../lib/dataset/campus.py", line 183, in evaluate
pred = np.stack([self.coco2campus3D(p) for p in copy.deepcopy(pred_coco[:, :, :3])])
File "/root/.venv/lib/python3.6/site-packages/numpy/core/shape_base.py", line 412, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
real 386m56.209s
user 787m54.397s
sys 231m52.142s
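The traceback ends in `np.stack` being handed an empty list. One plausible reading, not confirmed against the repository, is that a frame ended up with no predicted poses, so the list comprehension in campus.py produced nothing to stack. The sketch below reproduces the error and shows one possible guard; `stack_poses` and `pose_shape` are hypothetical names and are not part of voxelpose-pytorch.

```python
import numpy as np


def stack_poses(poses, pose_shape):
    """Stack per-person pose arrays, tolerating the case of zero poses.

    `pose_shape` is the shape of a single pose, for example (num_joints, 3).
    With no poses, an empty array of shape (0, *pose_shape) is returned
    instead of letting np.stack raise ValueError.
    """
    poses = list(poses)
    if not poses:
        return np.empty((0, *pose_shape))
    return np.stack(poses)


# Reproduce the failure mode seen in the traceback.
try:
    np.stack([])
except ValueError as error:
    print("np.stack([]) raised:", error)
```

Whether an empty result should count as zero detections in the evaluation is a separate decision that depends on how the PCP metric is meant to treat such frames.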
After restarting from epoch 27, this time it ran through the final epoch 29 (epochs start at 0).
'WORKERS': 16}
=> Loading data ..
=> load /mnt/efs/fs1/200005/voxelpose-pytorch/data/CampusSeq1/pred_campus_maskrcnn_hrnet_coco.pkl
=> Constructing models ..
=> load checkpoint /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/checkpoint.pth.tar (epoch 27)
=> Training...
Epoch: 27
/root/.venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: [27][0/750] Time: 17.269s (17.269s) Speed: 0.7 samples/s Data: 3.420s (3.420s) Loss: 20.028179 (20.028179) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0001006 (0.0001006) Loss_cord: 20.028078 (20.028078) Memory 449552384.0
Epoch: [27][100/750] Time: 1.469s (1.654s) Speed: 8.2 samples/s Data: 0.000s (0.102s) Loss: 16.410538 (13.451304) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000506 (0.0000363) Loss_cord: 16.410486 (13.451267) Memory 445546496.0
Epoch: [27][200/750] Time: 0.975s (1.572s) Speed: 12.3 samples/s Data: 0.000s (0.080s) Loss: 22.311125 (13.832909) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000095 (0.0000332) Loss_cord: 22.311115 (13.832876) Memory 444563456.0
Epoch: [27][300/750] Time: 1.398s (1.536s) Speed: 8.6 samples/s Data: 0.000s (0.067s) Loss: 11.109104 (13.800611) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000231 (0.0000306) Loss_cord: 11.109081 (13.800580) Memory 446357504.0
Epoch: [27][400/750] Time: 1.381s (1.520s) Speed: 8.7 samples/s Data: 0.000s (0.062s) Loss: 16.316496 (13.736643) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000402 (0.0000311) Loss_cord: 16.316456 (13.736612) Memory 445743104.0
Epoch: [27][500/750] Time: 1.388s (1.505s) Speed: 8.6 samples/s Data: 0.000s (0.060s) Loss: 16.382431 (13.711130) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000197 (0.0000308) Loss_cord: 16.382412 (13.711099) Memory 445743104.0
Epoch: [27][600/750] Time: 1.595s (1.500s) Speed: 7.5 samples/s Data: 0.000s (0.058s) Loss: 16.412279 (13.659989) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000368 (0.0000305) Loss_cord: 16.412243 (13.659958) Memory 444268544.0
Epoch: [27][700/750] Time: 1.561s (1.494s) Speed: 7.7 samples/s Data: 0.000s (0.057s) Loss: 10.678740 (13.656100) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000450 (0.0000306) Loss_cord: 10.678695 (13.656070) Memory 445079552.0
Test: [0/14] Time: 7.586s (7.586s) Speed: 6.3 samples/s Data: 3.708s (3.708s) Memory 212190208.0
Test: [13/14] Time: 0.913s (2.320s) Speed: 46.0 samples/s Data: 0.000s (1.209s) Memory 199133696.0
| Actor 1 | Actor 2 | Actor 3 | Average |
PCP | 96.53 | 93.33 | 98.48 | 96.11 | Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 28
Epoch: [28][0/750] Time: 4.496s (4.496s) Speed: 2.7 samples/s Data: 2.400s (2.400s) Loss: 12.797877 (12.797877) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000165 (0.0000165) Loss_cord: 12.797861 (12.797861) Memory 443678720.0
Epoch: [28][100/750] Time: 1.095s (1.540s) Speed: 11.0 samples/s Data: 0.000s (0.076s) Loss: 13.864036 (13.551936) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000358 (0.0000292) Loss_cord: 13.863999 (13.551907) Memory 445939712.0
Epoch: [28][200/750] Time: 1.122s (1.496s) Speed: 10.7 samples/s Data: 0.000s (0.062s) Loss: 12.567248 (13.712278) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000166 (0.0000282) Loss_cord: 12.567232 (13.712250) Memory 445153280.0
Epoch: [28][300/750] Time: 1.573s (1.482s) Speed: 7.6 samples/s Data: 0.000s (0.058s) Loss: 11.183677 (13.788406) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000146 (0.0000284) Loss_cord: 11.183662 (13.788378) Memory 445743104.0
Epoch: [28][400/750] Time: 1.273s (1.483s) Speed: 9.4 samples/s Data: 0.000s (0.057s) Loss: 11.101755 (13.717575) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000174 (0.0000296) Loss_cord: 11.101738 (13.717545) Memory 446160896.0
Epoch: [28][500/750] Time: 1.382s (1.485s) Speed: 8.7 samples/s Data: 0.000s (0.056s) Loss: 10.858953 (13.730741) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000143 (0.0000295) Loss_cord: 10.858939 (13.730711) Memory 445841408.0
Epoch: [28][600/750] Time: 1.453s (1.481s) Speed: 8.3 samples/s Data: 0.000s (0.054s) Loss: 13.933317 (13.686009) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000420 (0.0000295) Loss_cord: 13.933275 (13.685980) Memory 446652416.0
Epoch: [28][700/750] Time: 1.479s (1.480s) Speed: 8.1 samples/s Data: 0.000s (0.055s) Loss: 18.954720 (13.698181) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000430 (0.0000299) Loss_cord: 18.954676 (13.698151) Memory 446062592.0
Test: [0/14] Time: 4.290s (4.290s) Speed: 11.2 samples/s Data: 3.470s (3.470s) Memory 214696960.0
Test: [13/14] Time: 0.775s (2.034s) Speed: 54.2 samples/s Data: 0.000s (1.171s) Memory 199133696.0
| Actor 1 | Actor 2 | Actor 3 | Average |
PCP | 96.94 | 93.17 | 95.22 | 95.11 | Recall@500mm: 0.9947
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
Epoch: 29
Epoch: [29][0/750] Time: 4.521s (4.521s) Speed: 2.7 samples/s Data: 2.754s (2.754s) Loss: 10.656137 (10.656137) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000185 (0.0000185) Loss_cord: 10.656118 (10.656118) Memory 445276160.0
Epoch: [29][100/750] Time: 1.650s (1.529s) Speed: 7.3 samples/s Data: 0.000s (0.076s) Loss: 18.989956 (14.353836) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000837 (0.0000355) Loss_cord: 18.989872 (14.353801) Memory 445865984.0
Epoch: [29][200/750] Time: 1.636s (1.504s) Speed: 7.3 samples/s Data: 0.000s (0.071s) Loss: 11.814981 (14.153502) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000323 (0.0000331) Loss_cord: 11.814949 (14.153469) Memory 445743104.0
Epoch: [29][300/750] Time: 1.573s (1.487s) Speed: 7.6 samples/s Data: 0.000s (0.065s) Loss: 18.256166 (13.928165) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000364 (0.0000313) Loss_cord: 18.256130 (13.928134) Memory 446455808.0
Epoch: [29][400/750] Time: 1.001s (1.470s) Speed: 12.0 samples/s Data: 0.000s (0.063s) Loss: 12.627074 (13.716256) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000160 (0.0000296) Loss_cord: 12.627058 (13.716227) Memory 446259200.0
Epoch: [29][500/750] Time: 1.256s (1.466s) Speed: 9.6 samples/s Data: 0.000s (0.059s) Loss: 9.755458 (13.613090) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000401 (0.0000293) Loss_cord: 9.755418 (13.613061) Memory 445644800.0
Epoch: [29][600/750] Time: 1.428s (1.467s) Speed: 8.4 samples/s Data: 0.000s (0.056s) Loss: 15.708157 (13.551201) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000241 (0.0000294) Loss_cord: 15.708133 (13.551171) Memory 445472768.0
Epoch: [29][700/750] Time: 1.292s (1.466s) Speed: 9.3 samples/s Data: 0.000s (0.055s) Loss: 15.418936 (13.466052) Loss_2d: 0.0000000 (0.0000000) Loss_3d: 0.0000234 (0.0000292) Loss_cord: 15.418912 (13.466023) Memory 446038016.0
Test: [0/14] Time: 4.796s (4.796s) Speed: 10.0 samples/s Data: 4.193s (4.193s) Memory 214696960.0
Test: [13/14] Time: 0.750s (2.100s) Speed: 56.0 samples/s Data: 0.000s (1.288s) Memory 199133696.0
| Actor 1 | Actor 2 | Actor 3 | Average |
PCP | 96.94 | 93.28 | 98.19 | 96.14 | Recall@500mm: 1.0000
=> saving checkpoint to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20 (Best: False)
saving final model state to /mnt/efs/fs1/200005/voxelpose-pytorch/output/campus_synthetic/multi_person_posenet_50/prn64_cpn80x80x20/final_state.pth.tar
real 57m55.987s
user 116m3.626s
sys 36m16.613s
Running epochs 8 to 29 to completion took
about 387 min + about 58 min = 445 min
445 / (30-8) = a little over 20 min <- per epoch
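The arithmetic above can be written out as a quick check. The sketch also shows an alternative divisor: judging from the logs, epoch 27 was trained twice (once before the error and once after the restart), so the 445 minutes cover 23 training passes rather than 22. The variable names are illustrative only.

```python
first_run_min = 387    # epochs 8 to 27, stopped by the error while validating epoch 27
second_run_min = 58    # resumed from the checkpoint: epochs 27 to 29
total_min = first_run_min + second_run_min

# Estimate used in the note: 22 distinct epochs (8 through 29)
per_epoch_author = total_min / (30 - 8)

# Alternative: count the repeated epoch 27, i.e. 20 + 3 training passes
per_pass = total_min / (20 + 3)

print(f"total: {total_min} min")
print(f"per epoch (22 epochs): {per_epoch_author:.2f} min")
print(f"per pass (23 passes): {per_pass:.2f} min")
```

Either way the result is about 20 minutes per epoch, in line with the single-epoch p3.8xlarge runs.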
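Whether the worker count is 0 or at least 1 does not seem to make much difference.
If so, and if a somewhat higher instance cost is acceptable, running on SageMaker Studio may also be a reasonable option.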