
as_strided() error #41

Open
logan-markewich opened this issue Apr 2, 2021 · 10 comments
Comments

@logan-markewich

I've noticed a few people encountering issues like this:

Traceback (most recent call last):
  File "train.py", line 121, in <module>
    con.train(model[args.model_name], args.save_name)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\config\ConfigBert.py", line 740, in train
    predict_re = model(context_idxs, context_pos, context_ner,
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\code_bert\lsr_bert.py", line 163, in forward
    output = self.reasoner[i](output)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 186, in forward
    _, att = self.struc_att(input)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 66, in forward
    res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)
RuntimeError: setStorage: sizes [88, 88], strides [7744, 89], storage offset 0, and itemsize 4 requiring a storage size of 2725888 are out of bounds for storage of size 30976

Others have mentioned that changing the seed or batch size can fix this, but the memory requirement for training is quite insane lol. I can only train with a batch size of 1 or 2, and every seed I've tried hits a similar error.

I'm pretty unfamiliar with what exactly is causing this error, but if you can suggest a fix for the code I can try it out! Otherwise, I can't train.
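
For reference, here is a throwaway repro of just what that message means (toy tensors, nothing from this repo): as_strided() only reinterprets a tensor's existing storage, so if the requested sizes/strides would reach past the end of that storage, PyTorch raises this same "out of bounds for storage" RuntimeError.

```python
import torch

# Throwaway example: a 4x4 tensor has storage for only 16 elements.
x = torch.zeros(4, 4)

# Asking as_strided() for a view whose sizes/strides reach past those 16
# elements (here the last element would sit at offset 3*16 + 3*5 = 63)
# raises the same "setStorage: ... out of bounds for storage" RuntimeError.
try:
    x.as_strided((4, 4), (16, 5))
except RuntimeError as e:
    print(e)
```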

@logan-markewich
Author

logan-markewich commented Apr 21, 2021

After some debugging, this happens when a batch isn't full.

tmp.size(0) != res.size(0), which is causing the error in as_strided

This could maybe be fixed by padding each batch to reach the batch size?
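
Here is a rough sketch of what I think is happening at reasoner.py line 66, using toy shapes and my own guess at the intent rather than the repo's actual tensors. The as_strided() call steps along the diagonal of each example's block in res, which only stays inside res's storage while res.size(0) == tmp.size(0); with a short final batch the view reaches past the storage and you get the error above. Making the two batch dims agree avoids it (padding the batch as suggested, or slicing, shown below purely for illustration):

```python
import torch

# Toy shapes only; my guess at the intent of reasoner.py line 66,
# not the repo's actual tensors.
batch, n = 4, 88
res = torch.zeros(batch, n, n)   # one n x n matrix per example
tmp = torch.ones(batch, n)       # per-example values meant for the diagonals

# Same call as in the traceback: strides [res.stride(0), n + 1] step along the
# diagonal of each example's block. Fine while the batch dims match.
res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)

# With a short final batch, res holds fewer examples than tmp expects, the view
# needs more storage than res has, and PyTorch raises the error above.
res_short = torch.zeros(batch - 2, n, n)
try:
    res_short.as_strided(tmp.size(),
                         [res_short.stride(0), res_short.size(2) + 1]).copy_(tmp)
except RuntimeError as e:
    print(e)

# Purely illustrative fix: make the batch dims agree first (padding the inputs
# up to the full batch size, as suggested above, would be the other direction).
tmp_matched = tmp[:res_short.size(0)]
res_short.as_strided(tmp_matched.size(),
                     [res_short.stride(0), res_short.size(2) + 1]).copy_(tmp_matched)
```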

@ThinkNaive

> After some debugging, this happens when a batch isn't full.
>
> tmp.size(0) != res.size(0), which is causing the error in as_strided
>
> This could maybe be fixed by padding each batch to reach the batch size?

Have you ever solved this problem? I hit it when running LSR (BERT version) with BATCH_SIZE set anywhere from 1 to 12, and I don't know how to address it.

@logan-markewich
Author

Nope, I just moved on to a different relation extraction model.

If you are curious, ATLOP currently has state-of-the-art results on DocRED (63% F1). It is a much simpler model as well, and the authors provide the trained model checkpoints from their paper.

@ThinkNaive

You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still hasn't finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

@nanguoshun
Owner

Hi @ThinkNaive, thanks for your attention. For the BERT-based model, we empirically use a large batch size (> 16) for better convergence.

@ThinkNaive

@nanguoshun Thank you for the advice. Maybe I should use a GPU with more memory to allow a larger batch size; I currently work with 12GB of memory.
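
Before moving to a bigger GPU I might try gradient accumulation to approximate an effective batch size above 16 on 12GB. A generic sketch with toy stand-ins, not this repo's training loop, and note it wouldn't change the per-step batch that reaches the as_strided call:

```python
import torch
from torch import nn

# Toy stand-ins; the real loop would use this repo's model, data and optimizer.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(16)]

accum_steps = 8                      # effective batch = 2 * 8 = 16
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                  # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one update per effective batch
        optimizer.zero_grad()
```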

@logan-markewich
Author

Yea, 12GB may not be enough. The curse of deep learning 😆

@nguyenvanhoang7398

I'm encountering this error even on a machine with four 16GB GPUs. When it happened I checked GPU consumption and it was very low, so it can't be that the machine is running out of GPU memory. I even reduced the batch size to 8 and the hidden dim to 64, but that didn't fix it. Would it be possible for someone to examine this? Thank you.

@IKeepMoving

> Nope, I just moved on to a different relation extraction model.
>
> If you are curious, ATLOP currently has state-of-the-art results on DocRED (63% F1). It is a much simpler model as well, and the authors provide the trained model checkpoints from their paper.

> You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still hasn't finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

Have you changed its parameters? I ran ATLOP-bert-base-cased and only got re_f1: 59%.

@logan-markewich
Author

@IKeepMoving I used the weights they provide from their GitHub page (see the releases pane on the right side). No need to re-train unless you really want to :)
