# Data Recipe Gallery

- The recipe folder contains a rich set of sample configuration files for Data-Juicer data recipes, helping users understand, reuse, and extend configurations across various functional scenarios.
- 📣📣📣 Community contributors can submit PRs to add customized data recipes, promoting dissemination, reuse, and the evolution of related techniques. We welcome co-construction and will gladly highlight acknowledgements!

## Table of Contents

1. Data-Juicer Minimal Example Recipe
2. Reproduce Open Source Text Datasets
3. Improved Open Source Pre-training Text Datasets
4. Improved Open Source Post-tuning Text Dataset
5. Synthetic Contrastive Learning Image-text Datasets
6. Improved Open Source Image-text Datasets
7. Basic Example Recipes for Video Data
8. Synthesize Human-centric Video Benchmarks
9. Improve Existing Open Source Video Datasets

## 1. Data-Juicer Minimal Example Recipe

Some basic configuration files are placed in the Demo folder to help users quickly get familiar with the basic functions of Data-Juicer. Please refer to that folder for detailed descriptions.

## 2. Reproduce Open Source Text Datasets

- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the reproduced_redpajama folder for a detailed description.
- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the reproduced_bloom folder for a detailed description.

## 3. Improved Open Source Pre-training Text Datasets

We found that some "bad" data samples still remain in existing processed datasets (such as Redpajama and The Pile). So we used Data-Juicer to refine these datasets and fed the refined data to LLMs, aiming for better performance.

We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.
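As a rough sketch of what the 3-σ rule means in practice: for a per-sample statistic computed by an operator (text length is used here purely as an example), set the filter thresholds to the mean plus/minus three standard deviations, so that extreme values are treated as outliers. This is an illustrative helper, not Data-Juicer's actual implementation:

```python
import statistics

def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) for a list of per-sample
    statistics (e.g. text lengths); values outside this interval are
    treated as outliers under the 3-sigma rule."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# 99 ordinary documents (100-139 chars) plus one extreme outlier
lengths = [100 + i % 40 for i in range(99)] + [5000]
low, high = three_sigma_bounds(lengths)
kept = [n for n in lengths if low <= n <= high]  # the 5000-char outlier is dropped
```

A real recipe would write the resulting bounds into the corresponding operator's min/max hyperparameters.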

| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
|---|---|---|---|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama |
| Github Code | 73,208,524<br>+ 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml<br>stack-code-refine.yaml<br>redpajama-stack-code-deduplicate.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama<br>The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | Redpajama<br>The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
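The sample retention rate is simply the ratio of the after-refinement and before-refinement sample counts. A quick sanity check against the arXiv row (the helper name is ours, for illustration only):

```python
def retention_rate(before, after):
    """Percentage of samples kept after refinement."""
    return 100.0 * after / before

# arXiv: 1,724,497 samples before refinement, 1,655,259 after
print(f"{retention_rate(1_724_497, 1_655_259):.2f}%")  # 95.99%
```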

## 4. Improved Open Source Post-tuning Text Dataset

Take the Alpaca-CoT dataset as an example:

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 39 subsets from Alpaca-CoT |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 28 subsets from Alpaca-CoT |

## 5. Synthetic Contrastive Learning Image-text Datasets

Data-Juicer provides rich built-in operators that support image multimodal data synthesis, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff paper; the corresponding recipe implementation is available in ImgDiff-Dev.

## 6. Improved Open Source Image-text Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | LLaVA-1.5 |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun<br>ModelScope<br>HuggingFace | InternVid (606k)<br>Panda-70M (605k)<br>MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun<br>ModelScope | InternVid (606k)<br>Panda-70M (2,599k)<br>Pexels (198k)<br>MSR-VTT (6k) |

### 6.1. Evaluation and Verification

- **LLaVA pretrain (LCS-558k)**: the model pre-trained with the improved pre-training dataset and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.

| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (Baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (Rectified Pretraining Dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
- **Data-Juicer (T2V, 147k)** and **Data-Juicer (DJ, 228k)**: trained on the refined datasets, both models outperform the baseline T2V-Turbo on VBench. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the Sandbox Laboratory.

| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |

## 7. Basic Example Recipes for Video Data

We provide a sample video dataset processing recipe to help users make better use of the video-related operators: general-video-refine-example.yaml. It applies three types of operators:

- Text-only: improve dataset quality based on the video descriptions
- Video-only: improve dataset quality based on video properties
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their own video dataset processing workflows from this recipe.
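The three operator types can be pictured as a sequential filter pipeline: each stage drops samples based on a different modality. The sketch below is purely illustrative; the predicate functions and thresholds are hypothetical stand-ins, not actual Data-Juicer operators:

```python
def text_filter(sample):
    # Text-only: keep samples with a non-trivial video description
    return len(sample["text"].split()) >= 3

def video_filter(sample):
    # Video-only: keep clips whose duration (in seconds) is in range
    return 2.0 <= sample["duration"] <= 60.0

def alignment_filter(sample):
    # Text-video: keep samples whose precomputed text-video
    # similarity score suggests the caption matches the clip
    return sample["similarity"] >= 0.25

def refine(samples):
    """Apply the three stages in order, dropping failing samples."""
    for op in (text_filter, video_filter, alignment_filter):
        samples = [s for s in samples if op(s)]
    return samples

clips = [
    {"text": "a dog runs on the beach", "duration": 8.0, "similarity": 0.41},
    {"text": "video", "duration": 8.0, "similarity": 0.41},                    # fails text-only
    {"text": "city traffic at night", "duration": 400.0, "similarity": 0.41},  # fails video-only
    {"text": "a cat sleeping", "duration": 5.0, "similarity": 0.05},           # fails text-video
]
kept = refine(clips)  # only the first clip passes all three stages
```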

## 8. Synthesize Human-centric Video Benchmarks

Data-Juicer also supports video benchmark synthesis, such as HumanVBench, which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in HumanVBench-dev.

## 9. Improve Existing Open Source Video Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun<br>ModelScope<br>HuggingFace | InternVid (606k)<br>Panda-70M (605k)<br>MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun<br>ModelScope | InternVid (606k)<br>Panda-70M (2,599k)<br>Pexels (198k)<br>MSR-VTT (6k) |

### 9.1. Evaluation and Verification

- **Data-Juicer (T2V, 147k)** and **Data-Juicer (DJ, 228k)**: trained on the refined datasets, both models surpass the baseline T2V-Turbo on VBench. Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the Sandbox Laboratory.

| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |