I am trying to reproduce the results of your publication for my course project. However, there seems to be an issue with the "Pascal: JPEGImages | SegmentationClass" dataset: training keeps failing with a "File not found" error. The complete output is provided below:
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2024-04-13 19:40:45,183][INFO] - {'criterion': {'kwargs': {'use_weight': False}, 'type': 'CELoss'},
'dataset': {'ignore_label': 255,
'mean': [0.485, 0.456, 0.406],
'n_sup': 662,
'std': [0.229, 0.224, 0.225],
'train': {'batch_size': 8,
'crop': {'size': [513, 513], 'type': 'rand'},
'data_list': './data/splitsall/pascal_u2pl/662/labeled.txt',
'data_root': './data/VOC2012',
'flip': True,
'rand_resize': [0.5, 2.0],
'resize_base_size': 500,
'strong_aug': {'flag_use_random_num_sampling': True,
'num_augs': 3}},
'type': 'pascal_semi',
'val': {'batch_size': 1,
'data_list': './data/splitsall/pascal_u2pl/val.txt',
'data_root': './data/VOC2012'},
'workers': 4},
'exp_path': './exps/zrun_vocs_u2pl/voc_semi662',
'log_path': './exps/zrun_vocs_u2pl/voc_semi662/log',
'net': {'decoder': {'kwargs': {'dilations': [6, 12, 18],
'inner_planes': 256,
'low_conv_planes': 48},
'type': 'augseg.models.decoder.dec_deeplabv3_plus'},
'ema_decay': 0.999,
'encoder': {'kwargs': {'multi_grid': True,
'replace_stride_with_dilation': [False,
False,
True],
'zero_init_residual': True},
'pretrain': './pretrained/resnet101.pth',
'type': 'augseg.models.resnet.resnet101'},
'num_classes': 21,
'sync_bn': True},
'save_path': './exps/zrun_vocs_u2pl/voc_semi662/checkpoints',
'saver': {'pretrain': '', 'snapshot_dir': 'checkpoints', 'use_tb': False},
'trainer': {'epochs': 80,
'evaluate_student': True,
'lr_scheduler': {'kwargs': {'power': 0.9}, 'mode': 'poly'},
'optimizer': {'kwargs': {'lr': 0.001,
'momentum': 0.9,
'weight_decay': 0.0001},
'type': 'SGD'},
'sup_only_epoch': 0,
'unsupervised': {'flag_extra_weak': False,
'loss_weight': 1.0,
'threshold': 0.95,
'use_cutmix': True,
'use_cutmix_adaptive': True,
'use_cutmix_trigger_prob': 1.0}}}
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth'
missing_keys: []
unexpected_keys: ['fc.weight', 'fc.bias']
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth'
missing_keys: []
unexpected_keys: ['fc.weight', 'fc.bias']
[2024-04-13 19:40:55,377][INFO] - # samples: 662
[2024-04-13 19:40:55,390][INFO] - # samples: 9920
[2024-04-13 19:40:55,396][INFO] - # samples: 1449
[2024-04-13 19:40:55,396][INFO] - Get loader Done...
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth'
missing_keys: []
unexpected_keys: ['fc.weight', 'fc.bias']
[Info] Load ImageNet pretrain from './pretrained/resnet101.pth'
missing_keys: []
unexpected_keys: ['fc.weight', 'fc.bias']
[2024-04-13 19:40:58,584][INFO] - -------------------------- start training --------------------------
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 172, in main
res_loss_sup, res_loss_unsup = train(
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 301, in train
_, image_u_weak, image_u_aug, _ = loader_u_iter.next()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/DATA2/dse316/grp_007/augseg/augseg/dataset/pascal_voc.py", line 63, in __getitem__
label = self.img_loader(label_path, "L")
File "/DATA2/dse316/grp_007/augseg/augseg/dataset/base.py", line 44, in img_loader
with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/VOC2012/SegmentationClassAug/2008_006330.png'
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 172, in main
res_loss_sup, res_loss_unsup = train(
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 301, in train
_, image_u_weak, image_u_aug, _ = loader_u_iter.next()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/DATA2/dse316/grp_007/augseg/augseg/dataset/pascal_voc.py", line 63, in __getitem__
label = self.img_loader(label_path, "L")
File "/DATA2/dse316/grp_007/augseg/augseg/dataset/base.py", line 44, in img_loader
with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/VOC2012/SegmentationClassAug/2008_000085.png'
Exception in thread Thread-1 (_pin_memory_loop):
Traceback (most recent call last):
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
fd = df.detach()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 686256) of binary: /home/dse316/miniconda3/envs/grp_007/bin/python
Traceback (most recent call last):
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-13_19:41:03
host : pragyan
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 686257)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-13_19:41:03
host : pragyan
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 686256)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
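The missing files are label images under './data/VOC2012/SegmentationClassAug/' (for example, 2008_006330.png and 2008_000085.png). Could you please check that the data source contains all the files and that the link in the GitHub repository is correct? Looking forward to your response.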
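In case it helps with debugging, here is a minimal sketch of how the missing files could be listed on my side. It assumes each line of a split file holds whitespace-separated paths relative to data_root (an image path followed by a label path); that layout is my assumption and not something I have confirmed from the repository. The helper name find_missing_files is hypothetical, and the two paths in the demo are copied from the config printed in the log.

```python
import os


def find_missing_files(data_root, data_list):
    """Return the relative paths named in data_list that do not exist under data_root.

    Assumption: every whitespace-separated token on a line is a path
    relative to data_root (e.g. an image path followed by a label path).
    """
    missing = []
    with open(data_list) as f:
        for line in f:
            for rel_path in line.split():
                if not os.path.isfile(os.path.join(data_root, rel_path)):
                    missing.append(rel_path)
    return missing


if __name__ == "__main__":
    # Paths taken from the config dump in the log above.
    data_root = "./data/VOC2012"
    data_list = "./data/splitsall/pascal_u2pl/662/labeled.txt"
    if os.path.isfile(data_list):
        missing = find_missing_files(data_root, data_list)
        print(f"{len(missing)} missing file(s)")
        for rel_path in missing[:20]:  # show only the first few
            print("  ", rel_path)
    else:
        print(f"Split file not found: {data_list}")
```

Since the traceback comes from the unlabeled loader, the same check can be pointed at whichever split file that loader reads. If every reported path sits under SegmentationClassAug, the augmented label folder is probably absent from the download rather than a few individual files.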