MMDetection Custom Data: Interim Operation and Troubleshooting
Workarounds and tips for when MMDetection will not train on custom data
Example training command (Faster R-CNN in this case)
python tools/train.py configs/your_custom_config.py
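Situations where training fails
- A path cannot be resolved
- The dataset is reported as not existing
- It will not run from tmp
- Assorted other problems; it just will not run!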
Failure examples
"__init__.py" in mmdet/datasets/
- The error comes from a misconfigured custom dataset.
- Do not leave a broken dataset definition in this file.
# <error1>
09/07 13:34:53 - mmengine - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
09/07 13:34:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/07 13:34:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/07 13:34:53 - mmengine - INFO - Checkpoints will be saved to /home/askengi/mmdetection/work_d.
Traceback (most recent call last):
File "tools/train.py", line 133, in <module>
main()
File "tools/train.py", line 129, in main
runner.train()
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1721, in train
model = self.train_loop.run() # type: ignore
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
losses = self._run_forward(data, mode='loss') # type: ignore
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 340, in _run_forward
results = self(**data, mode=mode)
File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/askengi/mmdetection/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/home/askengi/mmdetection/mmdet/models/detectors/two_stage.py", line 190, in loss
roi_losses = self.roi_head.loss(x, rpn_results_list,
File "/home/askengi/mmdetection/mmdet/models/roi_heads/standard_roi_head.py", line 135, in loss
bbox_results = self.bbox_loss(x, sampling_results)
File "/home/askengi/mmdetection/mmdet/models/roi_heads/standard_roi_head.py", line 193, in bbox_loss
bbox_loss_and_target = self.bbox_head.loss_and_target(
File "/home/askengi/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 323, in loss_and_target
losses = self.loss(
File "/home/askengi/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 392, in loss
losses['acc'] = accuracy(cls_score, labels)
File "/home/askengi/mmdetection/mmdet/models/losses/accuracy.py", line 47, in accuracy
correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [1,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [512,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [513,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [514,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [515,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [516,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
(mmlab) askengi@askengi-desktop:~/mmdetection$
:::details
Solution
- Write a custom config file that uses CocoDataset.
- num_classes must match the number of labels defined in coco.py.
# Base config file in the faster_rcnn directory
model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=80  # must match the number of classes in coco.py
        )
    )
)
# Dataset settings
dataset_type = 'CocoDataset'  # existing dataset class
data_root = 'data/cup/'  # dataset root path
# Class names
classes = ('Furyo', 'cup', 'Ibutsu')
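The assertion in <error1>, `cur_target >= 0 && cur_target < n_classes`, says that some target label fell outside the range the classification head was built for. The following is a minimal sketch for spotting such labels before training. It assumes a COCO-format annotation dict and uses a simplified stand-in for the dataset's own mapping (a category's label is the position of its name in `classes`); the function name and the sample data are made up for illustration and are not part of MMDetection.

```python
def find_out_of_range_labels(coco, classes, num_classes):
    """List annotations whose label index would fall outside [0, num_classes).

    Simplified stand-in for the dataset's own mapping: a category's label is
    taken to be the position of its name in `classes`.
    """
    name_to_label = {name: i for i, name in enumerate(classes)}
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    bad = []
    for ann in coco["annotations"]:
        name = id_to_name.get(ann["category_id"])
        label = name_to_label.get(name)  # None if the name is not in `classes`
        if label is None or not 0 <= label < num_classes:
            bad.append((ann["id"], name, label))
    return bad


# Tiny in-memory stand-in for a COCO-format annotation file (made-up data).
sample = {
    "categories": [
        {"id": 1, "name": "Furyo"},
        {"id": 2, "name": "cup"},
        {"id": 3, "name": "Ibutsu"},
    ],
    "annotations": [
        {"id": 10, "category_id": 1},
        {"id": 11, "category_id": 2},
        {"id": 12, "category_id": 3},
    ],
}
classes = ("Furyo", "cup", "Ibutsu")

# Consistent setup: nothing is reported.
print(find_out_of_range_labels(sample, classes, num_classes=3))
# Head too small for the labels: the third class is reported.
print(find_out_of_range_labels(sample, classes, num_classes=2))
```

For a real dataset, the dict would come from `json.load` on the annotation file, and `classes` should be whatever tuple the dataset class actually ends up using (in the setup described here, the one in coco.py).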
Training with a temporary workaround
When nothing else works
When an incorrect setting has been left in __init__.py
# <error1> (same traceback as above)
What to do after replacing the labels in coco.py with your own
# Model settings
model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=80  # number of classes defined in mmdet/datasets/coco.py
        )
    )
)
# Dataset settings
dataset_type = 'CocoDataset'  # existing dataset class
data_root = 'data/cup/'  # dataset root path
# Class names
classes = ('Furyo', 'cup', 'Ibutsu')
- Setting num_classes=3, the number of classes listed in the config, results in an error...
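One thing that might explain the num_classes=3 failure: in the config above, `classes` is only a top-level variable and is never handed to the dataset, so CocoDataset may still be using its built-in class list. The sketch below shows how MMDetection 3.x configs usually pass class names through `metainfo`. It is an assumption that this applies to the setup here, and the annotation file names and image folders are hypothetical placeholders.

```python
# Sketch of a custom config for MMDetection 3.x (not the author's actual file).
# It would normally start with `_base_ = '<path to the Faster R-CNN base config>'`.

classes = ('Furyo', 'cup', 'Ibutsu')
metainfo = dict(classes=classes)  # class names handed to the dataset

model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=len(classes)  # kept in sync with `classes`
        )
    )
)

dataset_type = 'CocoDataset'
data_root = 'data/cup/'

train_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        metainfo=metainfo,
        ann_file='annotations/train.json',  # hypothetical file name
        data_prefix=dict(img='train/'),     # hypothetical image folder
    )
)
val_dataloader = dict(
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        metainfo=metainfo,
        ann_file='annotations/val.json',    # hypothetical file name
        data_prefix=dict(img='val/'),       # hypothetical image folder
    )
)
test_dataloader = val_dataloader

# The COCO evaluator also needs the path to the validation annotations.
val_evaluator = dict(ann_file=data_root + 'annotations/val.json')
test_evaluator = val_evaluator
```

With this layout coco.py itself stays untouched, so nothing has to be reverted in mmdet/datasets/ afterwards.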
Successful run... is this really going to take 60 days?
path: torchvision://resnet50
09/07 13:36:47 - mmengine - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
09/07 13:36:47 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/07 13:36:47 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/07 13:36:47 - mmengine - INFO - Checkpoints will be saved to /home/askengi/mmdetection/work_d.
09/07 13:37:00 - mmengine - INFO - Epoch(train) [1][ 50/58252] lr: 1.9820e-03 eta: 70 days, 9:17:32 time: 0.2610 data_time: 0.0086 memory: 3384 loss: 1.5163 loss_rpn_cls: 0.4873 loss_rpn_bbox: 0.0711 loss_cls: 0.9302 acc: 98.3398 loss_bbox: 0.0278
09/07 13:37:10 - mmengine - INFO - Epoch(train) [1][ 100/58252] lr: 3.9840e-03 eta: 61 days, 3:22:05 time: 0.1924 data_time: 0.0037 memory: 3385 loss: 0.7108 loss_rpn_cls: 0.2134 loss_rpn_bbox: 0.0740 loss_cls: 0.3196 acc: 95.9961 loss_bbox: 0.1038
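The ETA in the log can be sanity-checked with simple arithmetic: it is roughly the number of remaining iterations times the average time per iteration. The sketch below uses the 58252 iterations per epoch and the two `time` values from the log. The epoch count does not appear in the log, so both values tried here are assumptions: 12 is the usual "1x" schedule, and 400 is inferred only because it reproduces the logged figure.

```python
SECONDS_PER_DAY = 86400.0


def eta_days(iters_per_epoch, max_epochs, iters_done, sec_per_iter):
    """Rough ETA in days: remaining iterations times seconds per iteration."""
    remaining = iters_per_epoch * max_epochs - iters_done
    return remaining * sec_per_iter / SECONDS_PER_DAY


ITERS_PER_EPOCH = 58252            # from "[1][ 100/58252]" in the log
avg_time = (0.2610 + 0.1924) / 2   # mean of the two logged `time` values

for epochs in (12, 400):           # assumed epoch counts, not taken from the log
    days = eta_days(ITERS_PER_EPOCH, epochs, 100, avg_time)
    print(f"max_epochs={epochs}: about {days:.1f} days")
```

With 400 epochs the estimate lands near the logged 61 days, while 12 epochs would be under two days. If that inference is right, the long ETA comes mostly from the epoch count and the dataset size rather than from slow iterations.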
Ideally, training would run with this configuration...
:::details
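The source images may also be part of the problem, so I am going to rework the preprocessing tool.
- With it, images can be prepared for ingestion (resizing, augmentation).
- Images are optimized for PNG format.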
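One thing to watch when resizing or augmenting detection data is that the bounding boxes have to be transformed together with the pixels. The preprocessing tool itself is not shown here, so the helpers below are hypothetical and only sketch the bookkeeping for COCO-style `[x, y, width, height]` boxes; the image resampling would be done by whatever image library the tool uses.

```python
def fit_within(src_size, max_side):
    """Return (width, height) scaled so the longer side equals max_side."""
    w, h = src_size
    scale = max_side / max(w, h)
    return (round(w * scale), round(h * scale))


def resize_bbox(bbox, src_size, dst_size):
    """Scale a COCO [x, y, w, h] box from the source to the resized image."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x, y, w, h = bbox
    return [x * sx, y * sy, w * sx, h * sy]


def hflip_bbox(bbox, img_width):
    """Mirror a COCO [x, y, w, h] box for a horizontally flipped image."""
    x, y, w, h = bbox
    return [img_width - x - w, y, w, h]


src = (1920, 1080)
dst = fit_within(src, 640)              # target size with aspect ratio kept
box = resize_bbox([960, 540, 192, 108], src, dst)
print(dst, box, hflip_bbox(box, dst[0]))
```

If the annotation file also stores `area` or segmentation polygons, those would need the same scaling.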