RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead
#15
Open
unknowone opened this issue on Feb 12, 2023 · 2 comments
Hi, thanks for sharing such a good project!
I ran into a problem when trying to train with bash tools/scripts/dist_train.sh 2 --cfg_file /public/chenrunze/xyy/VFF-main/tools/cfgs/kitti_models/VFF_PVRCNN.yaml
Here is the error:
Traceback (most recent call last):
File "tools/train.py", line 205, in <module>
main()
File "tools/train.py", line 160, in main
train_model(
File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 88, in train_model
accumulated_iter = train_one_epoch(
File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 41, in train_one_epoch
loss.backward()
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
[torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the
backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or
anywhere later. Good luck!
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 100876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 100877) of binary: /public/chenrunze/miniconda3/envs/bevfusion/bin/python3
Traceback (most recent call last):
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-02-12_16:28:26
host : 8265f0d3bcdf
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 100877)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Can you give some advice?
Thanks a lot!
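For reference, here is a minimal standalone sketch (not the VFF code; the shape is only chosen to mirror the [16000, 16] tensor in the message) of how this class of error arises, plus torch.autograd.set_detect_anomaly to get a more useful traceback for the op whose saved output was modified:

import torch

# Minimal sketch of the same error class (not the VFF code): relu's output is
# saved for its backward pass, so modifying it in place afterwards bumps the
# tensor's version counter and backward() refuses to run.
torch.autograd.set_detect_anomaly(True)  # debugging only, it slows training down

x = torch.randn(16000, 16, requires_grad=True)
y = torch.relu(x)    # ReluBackward0 saves y ("version 0")
y += 1.0             # some later in-place op: y is now at version 1
y.sum().backward()   # RuntimeError: ... expected version 0 instead

With anomaly detection enabled, PyTorch also prints the forward traceback of the relu call that failed, which is usually enough to find which layer's output is being modified in place.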
I have also encountered this problem. Have you solved it?
I replaced ReLU with LeakyReLU to work around this problem, but that may affect the model's performance. If you have a better solution, please let me know, thanks!
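If it helps, here is a sketch of that workaround as a hypothetical helper (replace_relu_with_leakyrelu and model are illustrative names, not from this repo): it walks the model and swaps every nn.ReLU child for a non-in-place nn.LeakyReLU. If the original ReLUs are built with inplace=True, dropping the in-place behaviour may be what actually avoids the version-counter error, so nn.ReLU(inplace=False) might also work without changing the activation.

import torch.nn as nn

# Hypothetical helper (not from the VFF repo): recursively replace every
# nn.ReLU with a non-in-place nn.LeakyReLU, as described above.
def replace_relu_with_leakyrelu(module: nn.Module, negative_slope: float = 0.01) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope, inplace=False))
        else:
            replace_relu_with_leakyrelu(child, negative_slope)

# usage, before training starts (model is illustrative):
# replace_relu_with_leakyrelu(model)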