
MMDetection: Interim Use of Custom Data and Troubleshooting

Published on 2023/09/07

Ubuntu

Workaround tips for when custom data does not work well in MMDetection

Example training command (Faster R-CNN in this case)

python tools/train.py configs/your_custom_config.py

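Situations where training fails

The path is not set
The dataset is reported as not existing
It cannot be run from tmp
Anything else that simply keeps it from running!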

Failure example

__init__.py in mmdet/datasets/

The error is caused by an incorrect custom dataset setting.
Do not leave a defective dataset definition in this file.
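One way to avoid touching mmdet/datasets/__init__.py at all is to let the config import the custom module. The fragment below is a minimal sketch, assuming MMDetection 3.x with mmengine; the module path my_datasets.cup_dataset and the class name CupDataset are hypothetical names used only for illustration.

```python
# Config fragment (a sketch, not the setup used in this article).
# Assumption: my_datasets/cup_dataset.py is importable from the directory
# where tools/train.py is launched and registers a dataset class itself.
custom_imports = dict(
    imports=['my_datasets.cup_dataset'],  # hypothetical module path
    allow_failed_imports=False,           # fail loudly if the import breaks
)

# Hypothetical class name registered inside that module.
dataset_type = 'CupDataset'
```

With this approach a broken dataset definition stays in its own file, so removing the custom_imports entry is enough to get back to a clean state.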

# <error1>

09/07 13:34:53 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

09/07 13:34:53 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/07 13:34:53 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/07 13:34:53 - mmengine - INFO - Checkpoints will be saved to /home/askengi/mmdetection/work_d.
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1721, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 340, in _run_forward
    results = self(**data, mode=mode)
  File "/home/askengi/anaconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/askengi/mmdetection/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/home/askengi/mmdetection/mmdet/models/detectors/two_stage.py", line 190, in loss
    roi_losses = self.roi_head.loss(x, rpn_results_list,
  File "/home/askengi/mmdetection/mmdet/models/roi_heads/standard_roi_head.py", line 135, in loss
    bbox_results = self.bbox_loss(x, sampling_results)
  File "/home/askengi/mmdetection/mmdet/models/roi_heads/standard_roi_head.py", line 193, in bbox_loss
    bbox_loss_and_target = self.bbox_head.loss_and_target(
  File "/home/askengi/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 323, in loss_and_target
    losses = self.loss(
  File "/home/askengi/mmdetection/mmdet/models/roi_heads/bbox_heads/bbox_head.py", line 392, in loss
    losses['acc'] = accuracy(cls_score, labels)
  File "/home/askengi/mmdetection/mmdet/models/losses/accuracy.py", line 47, in accuracy
    correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [1,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [512,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [513,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [514,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [515,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [0,0,0], thread: [516,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
(mmlab) askengi@askengi-desktop:~/mmdetection$
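The assertion at the end of the log, cur_target >= 0 && cur_target < n_classes, says that some target label is outside the range the classification head can score. The same condition can be checked on the CPU before training. This is a minimal sketch in plain Python; the label values are made up for illustration and the function name is hypothetical.

```python
def out_of_range_labels(labels, num_classes):
    """Return (index, label) pairs that violate 0 <= label < num_classes."""
    return [(i, lab) for i, lab in enumerate(labels)
            if not 0 <= lab < num_classes]


# Hypothetical ground-truth labels: 41 fits an 80-class head
# but not a 3-class head.
labels = [0, 2, 41, 1]

print(out_of_range_labels(labels, num_classes=3))   # [(2, 41)]
print(out_of_range_labels(labels, num_classes=80))  # []
```

This also shows why num_classes=80 makes the error disappear: the range becomes wide enough for every label, even though the head is then larger than the custom data needs.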

:::details

solution

Custom config file for CocoDataset
As a workaround, the num_classes value should match the number of labels defined in coco.py.

# base config file in faster_rcnn directory
model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=80  # must match the number of classes in coco.py
        )
    )
)

# Dataset settings
dataset_type = 'CocoDataset'  # existing dataset class (COCO format)
data_root = 'data/cup/'  # root path of the data

# Class information
classes = ('Furyo', 'cup', 'Ibutsu')
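As far as I know, CocoDataset looks categories up by name, so the names in classes need to match the categories entries of the annotation JSON exactly, including case. A small sketch of that check follows; the annotation content is a hypothetical in-memory example, and a real file would be read with json.load().

```python
def missing_categories(coco, classes):
    """Return names in `classes` that the annotation data does not define."""
    names = {cat['name'] for cat in coco.get('categories', [])}
    return [name for name in classes if name not in names]


# Hypothetical annotation content for illustration only.
coco = {
    'categories': [
        {'id': 1, 'name': 'Furyo'},
        {'id': 2, 'name': 'cup'},
        {'id': 3, 'name': 'ibutsu'},  # lowercase: does not match 'Ibutsu'
    ]
}
classes = ('Furyo', 'cup', 'Ibutsu')

print(missing_categories(coco, classes))  # ['Ibutsu']
```

An empty list means every class in the config has a matching category in the annotation file.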

Training with a temporary workaround

When nothing else works

When an incorrect setting is left in __init__.py

The log is the same <error1> output shown above.

What to do after replacing the labels in coco.py with your own

# Model settings
model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=80  # number of classes defined in mmdet/datasets/coco.py
        )
    )
)

# Dataset settings
dataset_type = 'CocoDataset'  # existing dataset class
data_root = 'data/cup/'  # root path of the data

# Class information
classes = ('Furyo', 'cup', 'Ibutsu')

Setting num_classes=3, the number of classes listed in the config, results in the error...
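One possible cause, assuming MMDetection 3.x: a top-level classes variable is not read by the dataset on its own. Unless it is passed to the dataset through metainfo, CocoDataset keeps whatever class list coco.py defines, and its label indices can exceed a 3-class head. The fragment below is a sketch of a config that passes the classes explicitly; ann_file and data_prefix are assumed paths, not values from this article.

```python
# Config sketch (assumptions: MMDetection 3.x, COCO-format annotations).
classes = ('Furyo', 'cup', 'Ibutsu')
data_root = 'data/cup/'
metainfo = dict(classes=classes)

model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=len(classes)  # 3, derived from the tuple above
        )
    )
)

train_dataloader = dict(
    dataset=dict(
        type='CocoDataset',
        data_root=data_root,
        metainfo=metainfo,                   # hands the class names to the dataset
        ann_file='annotations/train.json',   # assumed path
        data_prefix=dict(img='train/'),      # assumed path
    )
)
```

The validation and test dataloaders would need the same metainfo entry. Deriving num_classes from len(classes) keeps the head and the class list from drifting apart.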

Successful run... so it is going to take 60 days?


path: torchvision://resnet50
09/07 13:36:47 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

09/07 13:36:47 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
09/07 13:36:47 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
09/07 13:36:47 - mmengine - INFO - Checkpoints will be saved to /home/askengi/mmdetection/work_d.
09/07 13:37:00 - mmengine - INFO - Epoch(train)   [1][   50/58252]  lr: 1.9820e-03  eta: 70 days, 9:17:32  time: 0.2610  data_time: 0.0086  memory: 3384  loss: 1.5163  loss_rpn_cls: 0.4873  loss_rpn_bbox: 0.0711  loss_cls: 0.9302  acc: 98.3398  loss_bbox: 0.0278
09/07 13:37:10 - mmengine - INFO - Epoch(train)   [1][  100/58252]  lr: 3.9840e-03  eta: 61 days, 3:22:05  time: 0.1924  data_time: 0.0037  memory: 3385  loss: 0.7108  loss_rpn_cls: 0.2134  loss_rpn_bbox: 0.0740  loss_cls: 0.3196  acc: 95.9961  loss_bbox: 0.1038
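The ETA of 60-plus days can be roughly reproduced from the numbers in the log. This is a sketch under the assumption that the ETA is the remaining iterations multiplied by the average time per iteration. With 58252 iterations per epoch and about 0.23 s per iteration, 61 days corresponds to roughly 400 epochs. That epoch count is inferred from the arithmetic and is not stated in the log, and the 12-epoch comparison is a hypothetical value.

```python
ITERS_PER_EPOCH = 58252  # from the log line "[1][  100/58252]"


def eta_seconds(avg_iter_time, max_epochs, iters_done=0):
    """Estimate remaining seconds: remaining iterations x time per iteration."""
    remaining = ITERS_PER_EPOCH * max_epochs - iters_done
    return remaining * avg_iter_time


def format_days(seconds):
    """Format seconds as 'D days, H:MM:SS'."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days} days, {hours}:{minutes:02d}:{secs:02d}"


# Average of the two logged "time" values (0.2610 s and 0.1924 s).
avg_time = (0.2610 + 0.1924) / 2

# Inferred schedule length (assumption) versus a hypothetical 12-epoch run.
print(format_days(eta_seconds(avg_time, max_epochs=400, iters_done=100)))
print(format_days(eta_seconds(avg_time, max_epochs=12, iters_done=100)))
```

Under these assumptions, most of the 60 days comes from the number of epochs and the size of the dataset rather than from a slow iteration time.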

Ideally it would run with this setting...

:::details

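The source images may also be part of the problem, so I will rework the preprocessing tool.

With this, images can be processed for ingestion (resizing and augmentation).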
Image optimized for png format
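The tool itself is not shown here, so the following is only a sketch of the bookkeeping that resizing and augmentation involve: when an image is resized or flipped, its COCO-style [x, y, width, height] boxes need the same transform. All function names are hypothetical.

```python
def resized_size(width, height, max_side):
    """Return (new_width, new_height, scale) with the longer side = max_side."""
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale), scale


def scale_bbox(bbox, scale):
    """Scale a COCO-style [x, y, w, h] box by the same factor as the image."""
    x, y, w, h = bbox
    return [x * scale, y * scale, w * scale, h * scale]


def hflip_bbox(bbox, image_width):
    """Mirror a [x, y, w, h] box horizontally inside an image of given width."""
    x, y, w, h = bbox
    return [image_width - x - w, y, w, h]


# Hypothetical 1920x1080 image with one box, resized to a 640-pixel long side.
new_w, new_h, s = resized_size(1920, 1080, max_side=640)
box = scale_bbox([960, 540, 192, 108], s)

print(new_w, new_h)
print(box)
print(hflip_bbox(box, new_w))  # box position after a horizontal flip
```

A flipped copy with its transformed boxes can be added to the annotation file as an extra sample, which is one simple form of augmentation.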
