- The recipe folder contains a rich set of sample configuration files for Data-Juicer data recipes, helping users easily understand, reuse, and extend the configurations for various functional scenarios.
- 📣📣📣 Community contributors can submit PRs to add customized data recipes, promoting their dissemination, reuse, and the evolution of related techniques. Co-construction is welcome, and contributions will be gratefully acknowledged!
Table of Contents
- 1. Data-Juicer Minimal Example Recipe
- 2. Reproduce Open Source Text Datasets
- 3. Improved Open Source Pre-training Text Datasets
- 4. Improved Open Source Post-tuning Text Datasets
- 5. Synthetic Contrastive Learning Image-Text Datasets
- 6. Improved Open Source Image-Text Datasets
- 7. Basic Example Recipes for Video Data
- 8. Synthesize Human-centric Video Benchmarks
- 9. Improve Existing Open Source Video Datasets
Some basic configuration files are placed in the demo folder to help users quickly familiarize themselves with the basic functions of Data-Juicer. Please refer to that folder for a detailed description.
- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the reproduced_redpajama folder for a detailed description.
- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the reproduced_bloom folder for a detailed description.
We found that some "bad" samples still remain in existing processed datasets (such as Redpajama, The Pile, etc.), so we used Data-Juicer to refine these datasets, aiming for better LLM performance when they are used for training.
We use a simple 3-σ rule to set the hyperparameters of the operators in each data-processing recipe: for each statistic computed over a dataset, samples whose value falls outside the range mean ± 3 × standard deviation are treated as outliers and filtered out.
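As a rough illustration of the idea (a minimal sketch, not the exact Data-Juicer implementation; the sample metric values below are made up), the 3-σ rule derives an operator's keep-range from the empirical distribution of a per-sample statistic:

```python
import statistics

def three_sigma_bounds(values):
    """Return the (lower, upper) keep-range under the 3-sigma rule:
    samples whose metric falls outside mean ± 3 * stddev are outliers."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean - 3 * std, mean + 3 * std

# Hypothetical per-sample metric, e.g. perplexity scores of text samples;
# one extreme value (120.0) plays the role of a "bad" sample.
scores = [10.0] * 96 + [11.0, 9.0, 10.5, 120.0]
lo, hi = three_sigma_bounds(scores)
kept = [s for s in scores if lo <= s <= hi]  # the outlier 120.0 is dropped
```

The resulting bounds would then be written into a recipe as, e.g., an operator's `min_*`/`max_*` hyperparameters.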
Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
---|---|---|---|---|---|---|
arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml, stack-code-refine.yaml, redpajama-stack-code-deduplicate.yaml | Aliyun ModelScope HuggingFace | Redpajama, The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun ModelScope HuggingFace | Redpajama, The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun ModelScope HuggingFace | The Pile |
Take the Alpaca-CoT dataset as an example:
Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
---|---|---|---|---|---|---|
Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun ModelScope HuggingFace | 39 subsets from Alpaca-CoT |
Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun ModelScope HuggingFace | 28 subsets from Alpaca-CoT |
Data-Juicer provides rich built-in operators to support image-text multimodal data synthesis, e.g. the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff paper; the corresponding recipe implementation is available in ImgDiff-Dev.
Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
---|---|---|---|---|---|---|
LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun ModelScope HuggingFace | LLaVA-1.5 |
- LLaVA pretrain (LCS-558k): The model pre-trained on the improved pre-training dataset and fine-tuned on the original instruction dataset outperforms the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-1.5-13B (Baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
LLaVA-1.5-13B (Rectified Pretraining Dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
We provide users with a sample video dataset processing recipe to help them make better use of video-related operators: general-video-refine-example.yaml. It applies three types of operators:
- Text-only: improve dataset quality based on the video descriptions
- Video-only: improve dataset quality based on intrinsic video properties
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their own video dataset processing workflow from this recipe.
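To sketch what chaining these three operator types looks like in a recipe (the operator choices, field names, and threshold values below are illustrative assumptions, not the exact contents of general-video-refine-example.yaml), a Data-Juicer config lists operators in order under `process`:

```yaml
# Illustrative sketch of a video-refinement recipe; paths and
# thresholds are placeholders, not recommended values.
dataset_path: ./videos/input.jsonl    # samples with a caption and a video path
export_path: ./videos/refined.jsonl

process:
  # Text-only: filter on the quality of the video description
  - language_id_score_filter:
      lang: en
      min_score: 0.8
  # Video-only: filter on intrinsic video properties
  - video_duration_filter:
      min_duration: 2      # seconds
      max_duration: 60
  # Text-video: filter on the alignment between caption and video frames
  - video_frames_text_similarity_filter:
      min_score: 0.25
```

Per the 3-σ practice described above, thresholds like `min_score` would typically be derived from the metric distribution of the dataset rather than hand-picked.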
Data-Juicer can also support video benchmark synthesis, such as HumanVBench, which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in HumanVBench-dev.
Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
---|---|---|---|---|---|---|
Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun ModelScope HuggingFace | InternVid (606k), Panda-70M (605k), MSR-VTT (6k) |
Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun ModelScope | InternVid (606k), Panda-70M (2,599k), Pexels (198k), MSR-VTT (6k) |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained with the refined datasets, these models surpass the baseline model T2V-Turbo in overall VBench scores. Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the Sandbox Laboratory.
Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
---|---|---|---|---|---|---|---|---|---|
T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |
Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
---|---|---|---|---|---|---|---|---|---|---|
T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |