Building Wiki Corpus

You can create your own wiki corpus by following these steps.

Step 1: Download the Wiki dump

Download the Wikipedia dump you require in XML format. For instance:

wget https://archive.org/download/enwiki-20181220/enwiki-20181220-pages-articles.xml.bz2

You can access other dumps from this website.
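
Dumps are large, so it can be worth confirming that the download is a readable bz2 archive before processing it. The following is a minimal sketch using only Python's standard library; the helper name `peek_dump` is hypothetical and not part of the repository's scripts.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines decoded lines of a bz2-compressed XML dump.

    Hypothetical helper for a quick sanity check; not part of the repository.
    """
    # bz2.open in text mode decompresses lazily, so only the beginning of the
    # archive is read rather than the whole file.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

For example, `peek_dump("enwiki-20181220-pages-articles.xml.bz2")` should return the opening lines of the XML; a truncated or corrupted download will typically raise an error instead.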

Step 2: Run the processing script

Run the provided script to convert the wiki dump into JSONL format. Set --dump_path to the dump file you downloaded, and adjust the chunking parameters (--chunk_by, --chunk_size) and --num_workers as needed:

cd scripts
python preprocess_wiki.py --dump_path ../enwikinews-20240420-pages-articles.xml.bz2 \
                        --save_path ../test_sample.jsonl \
                        --chunk_by sentence \
                        --chunk_size 512 \
                        --num_workers 1
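
Once the script finishes, a quick look at the output can confirm that it is valid JSONL. The sketch below makes no assumption about which fields the script writes: it only counts the records and reports the keys it finds, and it assumes each line holds one JSON object. The helper name `summarize_jsonl` is hypothetical.

```python
import json


def summarize_jsonl(path, max_records=None):
    """Count records in a JSONL file and collect the keys they contain.

    Hypothetical helper; assumes every non-blank line is a JSON object.
    """
    count = 0
    keys = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            if max_records is not None and count >= max_records:
                break  # stop early on large corpora
            record = json.loads(line)  # raises ValueError on a malformed line
            keys.update(record)  # adds the object's keys to the set
            count += 1
    return count, sorted(keys)
```

Calling `summarize_jsonl("../test_sample.jsonl", max_records=1000)` returns the number of records read and the sorted field names, which is usually enough to spot an empty or malformed output file.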

We also provide the processed corpus that we used in our experiments. It can be downloaded from: https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus