深度学习|mmsegmentation 报错

【深度学习|mmsegmentation 报错】Describe the bug
2022-03-01 07:52:37,584 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
2022-03-01 07:52:37,585 - mmseg - INFO - Checkpoints will be saved to /data/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug by HardDiskBackend.
/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "tools/train.py", line 234, in
main()
File "tools/train.py", line 223, in main
train_segmentor(
File "/data/mmsegmentation/mmseg/apis/train.py", line 174, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/data/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/data/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/data/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
return old_func(*args, *kwargs)
File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
loss['acc_seg'] = accuracy(
File "/data/mmsegmentation/mmseg/models/losses/accuracy.py", line 47, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57296eea22 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10983 (0x7f572994f983 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void
) + 0x1a7 (0x7f5729951027 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f57296d85a4 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa27e1a (0x7f56d4a16e1a in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa27eb1 (0x7f56d4a16eb1 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #23: __libc_start_main + 0xe7 (0x7f573a7ffb97 in /lib/x86_64-linux-gnu/libc.so.6)

使用mmseg工程进行VOC数据集训练的时候,发现报了上述错误
RuntimeError: CUDA error: an illegal memory access was encountered

correct = correct[:, target != ignore_index]

因此,我在外部使用python进行的算法上的测试,重新生成了true和false的tensor,然后添加判断条件,发现并没有什么问题,说明不是cpu和gpu方面的问题,之后回到github的链接下,发现有人出现了同样的问题,是由于num_class没有修改,因此需要修改num_class的数值,可以正常开始算法模型的训练。

    推荐阅读