[BUG] Hugging Face dataset v1 lacks caption.json #2

@j-yi-11

Description

I tried to run torchrun --nproc_per_node=4 train/train_vace_lora.py --config train/config/pickstyle-1.3b.yaml --wandb-name pickstyle_vace1.3b_lora on 4x 5090 GPUs and got the following error on every rank:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank1]:     main()
[rank1]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank1]:     dataset = create_multi_style_dataset(args)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank1]:     datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank1]:     with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank0]:     main()
[rank0]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank0]:     dataset = create_multi_style_dataset(args)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank0]:     datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank0]:     with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank3]:     main()
[rank3]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank3]:     dataset = create_multi_style_dataset(args)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank3]:     datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank3]:     with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank3]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 287, in <module>
[rank2]:     main()
[rank2]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 194, in main
[rank2]:     dataset = create_multi_style_dataset(args)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/jiangyi/jiangyi/pickstyle/train/train_vace_lora.py", line 180, in create_multi_style_dataset
[rank2]:     datasets = [VideoStyleDataset(style=style, **args.image_dataset_v1) for style in styles]
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/data/jiangyi/jiangyi/pickstyle/train/dataset_style.py", line 178, in __init__
[rank2]:     with open(os.path.join(basedir, 'caption.json'), 'r') as f:
[rank2]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/image_dataset_v1/caption.json'
[rank0]:[W121 05:16:11.291023460 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0121 05:16:12.963000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 441 closing signal SIGTERM
W0121 05:16:12.964000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 442 closing signal SIGTERM
W0121 05:16:12.966000 376 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 443 closing signal SIGTERM
E0121 05:16:13.332000 376 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 3 (pid: 444) of binary: /data/jiangyi/jiangyi/pickstyle/.venv/bin/python3
Traceback (most recent call last):
  File "/data/jiangyi/jiangyi/pickstyle/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/jiangyi/jiangyi/pickstyle/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train/train_vace_lora.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-21_05:16:12
  host      : 4689125fedca
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 444)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The dataset class opens caption.json unconditionally when version == 1:

class VideoStyleDataset(Dataset):
    def __init__(
        self, style, styles_list, basedir, video_resolution=(81, 480, 832), version=1
    ):
        assert style in styles_list, f"Style must be one of: {styles_list}"
        assert version in [1, 2], "Version must be either 1 or 2"
        if version == 1: # Unity and Style pairs
            self.style_paths = sorted(glob(os.path.join(basedir, style, "*.png")))
            self.src_paths = sorted(glob(os.path.join(basedir, 'unity', "*.png")))
            with open(os.path.join(basedir, 'caption.json'), 'r') as f:
                self.annotations = json.load(f)

But the v1 dataset on Hugging Face doesn't include caption.json.
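For anyone hitting the same thing, a small pre-flight check can confirm what is missing before launching torchrun, so the problem shows up once instead of as four identical tracebacks. This is only a sketch based on the paths visible in the snippet above (caption.json, the style folder, and the unity folder under basedir). The function name check_v1_layout and the style name "anime" are placeholders of mine, not names from the repo.

```python
import os
from glob import glob


def check_v1_layout(basedir, style):
    """Return a list of problems found in a version-1 dataset directory.

    Mirrors the three paths that the version == 1 branch of
    VideoStyleDataset.__init__ touches in the snippet above.
    """
    problems = []

    # caption.json is opened unconditionally for version 1.
    caption_path = os.path.join(basedir, "caption.json")
    if not os.path.isfile(caption_path):
        problems.append(f"missing file: {caption_path}")

    # The style images and the 'unity' source images are globbed as *.png.
    for subdir in (style, "unity"):
        pattern = os.path.join(basedir, subdir, "*.png")
        if not glob(pattern):
            problems.append(f"no PNG files match: {pattern}")

    return problems


if __name__ == "__main__":
    # "anime" is a placeholder style name, not taken from the repo config.
    for problem in check_v1_layout("datasets/image_dataset_v1", "anime"):
        print(problem)
```

This does not work around the missing file; it only reports it. The expected contents of caption.json are not visible in the snippet, so I have not tried to generate a replacement.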
