error with multiprocess dataloader #328
Training fails with `received 0 items of ancdata`:
File ".../lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
raise RuntimeError('received %d items of ancdata' %
File ".../lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File ".../lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File ".../lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File ".../lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File ".../lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 278, in _fetch_next_batch
batch = next(iterator)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 264, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File ".../lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 185, in __next__
return self.fetching_function()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
batch = next(data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
self.val_loop.run()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
self._run_validation()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File ".../lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File ".../lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File ".../train.py", line 123, in _run_experiment
trainer.fit(lit_module, data_module)
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 155, in _run_main
output = self.definition.func(
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
raise failure_exc
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 108, in _run
raise failure_exc
File ".../lib/python3.8/site-packages/layer/executables/entrypoint/common.py", line 71, in __call__
return self._run()
File ".../lib/python3.8/site-packages/layer/decorators/layer_wrapper.py", line 82, in __call__
return runner()
File ".../train.py", line 161, in train
layer.model(layer_model_name)(_run_experiment)(**kwargs)
File ".../experiment_scripts/mlp_experiments2.py", line 161, in <module>
train(**param)
File ".../lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File ".../lib/python3.8/runpy.py", line 194, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
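For context, `received 0 items of ancdata` during multi-worker data loading is usually a symptom of the process hitting its open-file-descriptor limit while DataLoader workers pass tensors between processes. A commonly suggested mitigation, shown here only as a hedged sketch and not confirmed to resolve this particular report, is to switch PyTorch's tensor sharing strategy or raise the descriptor limit before training starts:

```python
import resource

import torch.multiprocessing

# Share tensors through the file system instead of passing file descriptors,
# which avoids exhausting the per-process descriptor limit.
torch.multiprocessing.set_sharing_strategy("file_system")

# Alternatively, raise the soft open-file limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```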
Hi Fatih, thanks for reporting this issue, we're looking into it.
I am having it only with the Layer decorator. We have performed more than 100 runs without any error when the Layer decorator is not used.
Thanks for the information. Are you also able to share the latest Layer SDK version that did not produce such an error?
Did not have the time to try with newer Layer versions.
summary
We are having this error when run_experiment is wrapped with the Layer decorator, but we do not see it when training runs without the Layer decorator. This is not 100% tested and confirmed; we will conduct more experiments to see whether it is a Layer issue or a PyTorch Lightning issue.
update
In the latest runs with a more recent Layer version, I did not have this error but had a different error: #333
scenario
I can't provide the full code because of privacy, but the overall structure is like this:
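A minimal sketch of that structure, reconstructed from the call stack above: the module names (`MyDataModule`, `MyLitModule`), the network, and the hyperparameters are placeholders, while `_run_experiment`, `train`, `trainer.fit(lit_module, data_module)`, and the `layer.model(...)` wrapping follow the frames in the traceback.

```python
import layer
import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class MyDataModule(pl.LightningDataModule):
    """Placeholder data module; the real dataset is private."""

    def setup(self, stage=None):
        x = torch.randn(1024, 16)
        y = torch.randint(0, 2, (1024,))
        self.train_set = TensorDataset(x, y)
        self.val_set = TensorDataset(x[:128], y[:128])

    def train_dataloader(self):
        # multiple workers: the failure happens while reading from the worker result queue
        return DataLoader(self.train_set, batch_size=32, num_workers=8)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=32, num_workers=8)


class MyLitModule(pl.LightningModule):
    """Placeholder MLP module standing in for the private model."""

    def __init__(self, hidden=64, lr=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self.net(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def _run_experiment(**kwargs):
    data_module = MyDataModule()
    lit_module = MyLitModule(**kwargs)
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(lit_module, data_module)  # the traceback shows the error raised inside the validation loop
    return lit_module


def train(layer_model_name, **kwargs):
    # the error only appears when _run_experiment is wrapped with the Layer decorator
    layer.model(layer_model_name)(_run_experiment)(**kwargs)
```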
layer==0.10.2861256067
layer-api==0.9.377751
Python 3.8.5
Ubuntu 18.04
error trace
The full traceback is quoted at the top of this issue.