I tried to run torchrun --nproc_per_node=4 train/train_vace_lora.py --config train/config/pickstyle-1.3b.yaml --wandb-name pickstyle_vace1.3b_lora on 4x 5090 GPUs and got the following error on every rank:
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank1]: main()
[rank1]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank1]: dataset = create_multi_style_dataset(args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank1]: datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank1]: with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank0]: main()
[rank0]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank0]: dataset = create_multi_style_dataset(args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank0]: datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank0]: with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank3]: main()
[rank3]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank3]: dataset = create_multi_style_dataset(args)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank3]: datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank3]: with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank2]: Traceback (most recent call last):
[rank2]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank2]: main()
[rank2]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank2]: dataset = create_multi_style_dataset(args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank2]: datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank2]: with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank0]:[W121 05:16:11.291023460 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0121 05:16:12.963000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 441 closing signal SIGTERM
W0121 05:16:12.964000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 442 closing signal SIGTERM
W0121 05:16:12.966000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 443 closing signal SIGTERM
E0121 05:16:13.332000 376 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 3 (pid: 444) of binary: /data/jiangyi/jiangyi/pickstyle/.venv/bin/python3
Traceback (most recent call last):
File "/data/jiangyi/jiangyi/pickstyle/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train/train_vace_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-01-21_05:16:12
host : 4689125fedca
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 444)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The dataset class that raises the error is defined as follows:
class VideoStyleDataset(Dataset):
    def __init__(
        self, style, styles_list, basedir, video_resolution=(81, 480, 832), version=1
    ):
        assert style in styles_list, f"Style must be one of: {styles_list}"
        assert version in [1, 2], "Version must be either 1 or 2"
        if version == 1:  # Unity and Style pairs
            self.style_paths = sorted(glob(os.path.join(basedir, style, "*.png")))
            self.src_paths = sorted(glob(os.path.join(basedir, 'unity', "*.png")))
            with open(os.path.join(basedir, 'caption.json'), 'r') as f:
                self.annotations = json.load(f)
However, the v1 dataset on Hugging Face doesn't include caption.json.
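To see exactly what is missing without launching all four ranks, a check along these lines can be run from the repo root first. This is only a minimal sketch: the helper name `check_v1_layout` is my own, the expected layout (one folder per style, a `unity` folder, and `caption.json`) is inferred from the `version == 1` branch of the constructor above, and the style name in the usage part is a placeholder, not a real style from the config.

```python
import os
from glob import glob


def check_v1_layout(basedir, styles):
    """Return a list of problems found in a version-1 dataset directory.

    Hypothetical helper: it mirrors the paths that the version == 1 branch
    of VideoStyleDataset.__init__ reads, and nothing more.
    """
    problems = []

    # The constructor opens <basedir>/caption.json unconditionally.
    caption_path = os.path.join(basedir, "caption.json")
    if not os.path.isfile(caption_path):
        problems.append(f"missing file: {caption_path}")

    # The constructor globs *.png under <basedir>/unity and <basedir>/<style>.
    for sub in ["unity", *styles]:
        folder = os.path.join(basedir, sub)
        if not glob(os.path.join(folder, "*.png")):
            problems.append(f"no PNG files in: {folder}")

    return problems


if __name__ == "__main__":
    # basedir is the path from the traceback; "some_style" is a placeholder.
    for problem in check_v1_layout("datasets/image_dataset_v1", ["some_style"]):
        print(problem)
```

With the dataset as currently downloaded, I would expect this to report at least the missing `caption.json`, which matches the FileNotFoundError above.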