RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead #15

unknowone opened this issue Feb 12, 2023 · 2 comments

Hi, thanks for sharing such a good project!
I ran into a problem when trying to train with
bash tools/scripts/dist_train.sh 2 --cfg_file /public/chenrunze/xyy/VFF-main/tools/cfgs/kitti_models/VFF_PVRCNN.yaml
Here is the error:

Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 160, in main
    train_model(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 88, in train_model
    accumulated_iter = train_one_epoch(
  File "/public/chenrunze/xyy/VFF-main/tools/train_utils/train_utils.py", line 41, in train_one_epoch
    loss.backward()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.cuda.FloatTensor [16000, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the 
backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or 
anywhere later. Good luck!
                                                                                                                                
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 100876 closing signal SIGTERM                               
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 100877) of binary: /public/chenrunze/miniconda3/envs/bevfusion/bin/python3
Traceback (most recent call last):
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/chenrunze/miniconda3/envs/bevfusion/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-12_16:28:26
  host      : 8265f0d3bcdf
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 100877)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you give some advice?
Thanks a lot!
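For context, the error means that some later operation wrote in place into the tensor that ReLU's backward pass had saved (its own output), bumping its version from 0 to 1. The same failure mode can be reproduced in isolation, and enabling torch.autograd.set_detect_anomaly(True) makes backward() also point at the forward-pass op whose output was modified. A minimal sketch with placeholder shapes (not the VFF model):

```python
# Minimal sketch of the failure mode (placeholder shapes, not the VFF model):
# an in-place write into a tensor that ReLU's backward still needs.
import torch
import torch.nn as nn

# Anomaly detection makes backward() also print the forward-pass op whose
# saved output was later modified, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)

lin = nn.Linear(16, 16)
x = torch.randn(4, 16)

h = torch.relu(lin(x))   # ReluBackward0 saves this output for the backward pass
h.add_(1.0)              # in-place write bumps the tensor's version: 0 -> 1

h.sum().backward()       # RuntimeError: ... output 0 of ReluBackward0, is at version 1
```

Running something like this once before debugging the real model helps confirm which in-place write in the forward pass is the culprit.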

0neDawn commented Mar 26, 2023

I have also encountered this problem. Have you solved it?

@liulin813

> I have also encountered this problem. Have you solved it?

I worked around this problem by replacing ReLU with LeakyReLU, but that may affect the model's performance. If you have a better solution, please let me know. Thanks!
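A less invasive alternative (sketched below with hypothetical helper names, since I don't know which VFF layer does the in-place write) is to keep ReLU and instead remove the in-place modification of its output, either by switching to an out-of-place op or by cloning before the write:

```python
# Sketch of a more targeted fix (hypothetical code, not the actual VFF source):
# once anomaly detection points at the layer that writes into the ReLU output,
# replace the in-place write with an out-of-place one, or clone first.
import torch

def fuse_features(relu_out: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # Before (breaks backward): relu_out.add_(other)  # modifies the saved ReLU output
    # After (safe): allocate a new tensor instead of writing in place.
    return relu_out + other

def fuse_features_inplace(relu_out: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # If an in-place write is unavoidable, work on a copy so the tensor saved
    # for ReluBackward0 keeps its original version.
    out = relu_out.clone()
    out.add_(other)
    return out
```

Either variant leaves the ReLU output at version 0, so loss.backward() no longer trips the version check, and the activation itself stays unchanged.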
