RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead #15

unknowone opened this issue Feb 12, 2023 · 2 comments

Hi, thanks for sharing such a good project!
I ran into a problem when trying to train with
bash tools/scripts/dist_train.sh 2 --cfg_file /public/chenrunze/xyy/VFF-main/tools/cfgs/kitti_models/VFF_PVRCNN.yaml
Here is the error:

Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 160, in main
    train_model(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 88, in train_model
    accumulated_iter = train_one_epoch(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 41, in train_one_epoch
    loss.backward()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the 
backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or 
anywhere later. Good luck!
                                                                                                                                
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 100876 closing signal SIGTERM                               
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 100877) of binary: /public/chenrunze/miniconda3/envs/bevfusion/bin/python3
Traceback (most recent call last):
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-12_16:28:26
  host      : 8265f0d3bcdf
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 100877)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you give some advice?
Thanks a lot!
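For context, the error means that some later operation wrote in place into the tensor that ReLU's backward pass had saved (its own output), bumping its version from 0 to 1. The same failure mode can be reproduced in isolation, and enabling torch.autograd.set_detect_anomaly(True) makes backward() also point at the forward-pass op whose output was modified. A minimal sketch with placeholder shapes (not the VFF model):

```python
# Minimal sketch of the failure mode (placeholder shapes, not the VFF model):
# an in-place write into a tensor that ReLU's backward still needs.
import torch
import torch.nn as nn

# Anomaly detection makes backward() also print the forward-pass op whose
# saved output was later modified, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)

lin = nn.Linear(16, 16)
x = torch.randn(4, 16)

h = torch.relu(lin(x))   # ReluBackward0 saves this output for the backward pass
h.add_(1.0)              # in-place write bumps the tensor's version: 0 -> 1

h.sum().backward()       # RuntimeError: ... output 0 of ReluBackward0, is at version 1
```

Running something like this once before debugging the real model helps confirm which in-place write in the forward pass is the culprit.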

0neDawn commented Mar 26, 2023

I have also encountered this problem. Have you solved it?

@liulin813

> I have also encountered this problem. Have you solved it?

I worked around this problem by replacing ReLU with LeakyReLU, but that may affect the model's performance. If you have a better solution, please let me know. Thanks!
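A less invasive alternative (sketched below with hypothetical helper names, since I don't know which VFF layer does the in-place write) is to keep ReLU and instead remove the in-place modification of its output, either by switching to an out-of-place op or by cloning before the write:

```python
# Sketch of a more targeted fix (hypothetical code, not the actual VFF source):
# once anomaly detection points at the layer that writes into the ReLU output,
# replace the in-place write with an out-of-place one, or clone first.
import torch

def fuse_features(relu_out: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # Before (breaks backward): relu_out.add_(other)  # modifies the saved ReLU output
    # After (safe): allocate a new tensor instead of writing in place.
    return relu_out + other

def fuse_features_inplace(relu_out: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # If an in-place write is unavoidable, work on a copy so the tensor saved
    # for ReluBackward0 keeps its original version.
    out = relu_out.clone()
    out.add_(other)
    return out
```

Either variant leaves the ReLU output at version 0, so loss.backward() no longer trips the version check, and the activation itself stays unchanged.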
