
as_strided() error #41

Open
logan-markewich opened this issue Apr 2, 2021 · 10 comments
Comments

@logan-markewich

I've noticed a few people encountering issues like this:

Traceback (most recent call last):
  File "train.py", line 121, in <module>
    con.train(model[args.model_name], args.save_name)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\config\ConfigBert.py", line 740, in train
    predict_re = model(context_idxs, context_pos, context_ner,
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\code_bert\lsr_bert.py", line 163, in forward
    output = self.reasoner[i](output)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 186, in forward
    _, att = self.struc_att(input)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 66, in forward
    res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)
RuntimeError: setStorage: sizes [88, 88], strides [7744, 89], storage offset 0, and itemsize 4 requiring a storage size of 2725888 are out of bounds for storage of size 30976

Others have mentioned that changing the seed or batch size can fix this, but the memory requirement for training is quite insane lol. I can only train with a batch size of 1 or 2, and every seed I've tried hits a similar error.

I'm pretty unfamiliar with what exactly is causing this error, but if you can suggest a fix for the code I can try it out! Otherwise, I can't train.
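
For reference, here is a throwaway repro of just what that message means (toy tensors, nothing from this repo): as_strided() only reinterprets a tensor's existing storage, so if the requested sizes/strides would reach past the end of that storage, PyTorch raises this same "out of bounds for storage" RuntimeError.

```python
import torch

# Throwaway example: a 4x4 tensor has storage for only 16 elements.
x = torch.zeros(4, 4)

# Asking as_strided() for a view whose sizes/strides reach past those 16
# elements (here the last element would sit at offset 3*16 + 3*5 = 63)
# raises the same "setStorage: ... out of bounds for storage" RuntimeError.
try:
    x.as_strided((4, 4), (16, 5))
except RuntimeError as e:
    print(e)
```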

@logan-markewich
Author

logan-markewich commented Apr 21, 2021

After some debugging, this happens when a batch isn't full.

tmp.size(0) != res.size(0), which is causing the error in as_strided

This could maybe be fixed by padding each batch to reach the batch size?
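
Here is a rough sketch of what I think is happening at reasoner.py line 66, using toy shapes and my own guess at the intent rather than the repo's actual tensors. The as_strided() call steps along the diagonal of each example's block in res, which only stays inside res's storage while res.size(0) == tmp.size(0); with a short final batch the view reaches past the storage and you get the error above. Making the two batch dims agree avoids it (padding the batch as suggested, or slicing, shown below purely for illustration):

```python
import torch

# Toy shapes only; my guess at the intent of reasoner.py line 66,
# not the repo's actual tensors.
batch, n = 4, 88
res = torch.zeros(batch, n, n)   # one n x n matrix per example
tmp = torch.ones(batch, n)       # per-example values meant for the diagonals

# Same call as in the traceback: strides [res.stride(0), n + 1] step along the
# diagonal of each example's block. Fine while the batch dims match.
res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)

# With a short final batch, res holds fewer examples than tmp expects, the view
# needs more storage than res has, and PyTorch raises the error above.
res_short = torch.zeros(batch - 2, n, n)
try:
    res_short.as_strided(tmp.size(),
                         [res_short.stride(0), res_short.size(2) + 1]).copy_(tmp)
except RuntimeError as e:
    print(e)

# Purely illustrative fix: make the batch dims agree first (padding the inputs
# up to the full batch size, as suggested above, would be the other direction).
tmp_matched = tmp[:res_short.size(0)]
res_short.as_strided(tmp_matched.size(),
                     [res_short.stride(0), res_short.size(2) + 1]).copy_(tmp_matched)
```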

@ThinkNaive

> After some debugging, this happens when a batch isn't full.
>
> tmp.size(0) != res.size(0), which is causing the error in as_strided
>
> This could maybe be fixed by padding each batch to reach the batch size?

Have you ever solved this problem? I hit it when running LSR (BERT version) with BATCH_SIZE set anywhere from 1 to 12, and I don't know how to address it.

@logan-markewich
Author

Nope, I just moved on to a different relation extraction model.

If you are curious, ATLOP currently has state-of-the-art results on DocRED (63% F1). It is a much simpler model as well, and the authors provide the trained model checkpoints from their paper.

@ThinkNaive

You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still hasn't finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

@nanguoshun
Owner

Hi @ThinkNaive, thanks for your attention. For the BERT-based model, we empirically use a large batch size (> 16) for better convergence.

@ThinkNaive

@nanguoshun Thank you for the advice. Maybe I should use a GPU with more memory to allow a larger batch size; I currently work with 12GB of memory.
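
Before moving to a bigger GPU I might try gradient accumulation to approximate an effective batch size above 16 on 12GB. A generic sketch with toy stand-ins, not this repo's training loop, and note it wouldn't change the per-step batch that reaches the as_strided call:

```python
import torch
from torch import nn

# Toy stand-ins; the real loop would use this repo's model, data and optimizer.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(16)]

accum_steps = 8                      # effective batch = 2 * 8 = 16
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                  # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one update per effective batch
        optimizer.zero_grad()
```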

@logan-markewich
Author

Yea, 12GB may not be enough. The curse of deep learning 😆

@nguyenvanhoang7398

I'm encountering this error even on a machine with four 16GB GPUs. When it happened I checked GPU consumption and it was very low, so it can't be that the machine is running out of GPU memory. I even reduced the batch size to 8 and the hidden dim to 64, but that didn't fix it. Would it be possible for someone to examine this? Thank you.

@IKeepMoving

> Nope, I just moved on to a different relation extraction model.
>
> If you are curious, ATLOP currently has state-of-the-art results on DocRED (63% F1). It is a much simpler model as well, and the authors provide the trained model checkpoints from their paper.

> You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still hasn't finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

Have you changed its parameters? I ran ATLOP-bert-base-cased and only got re_f1: 59%.

@logan-markewich
Author

@IKeepMoving I used the weights they provide from their GitHub page (see the releases pane on the right side). No need to re-train unless you really want to :)
