You can create your own wiki corpus by following these steps.
Download the Wikipedia dump you require in XML format. For instance:
wget https://archive.org/download/enwiki-20181220/enwiki-20181220-pages-articles.xml.bz2
You can access other dumps from this website.
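Before running the preprocessing step, it can help to confirm that the download finished and the archive decompresses cleanly. The following is a minimal sketch, not part of the repository: the helper name `peek_dump` is hypothetical, and it uses only Python's standard `bz2` module to stream the first few lines without unpacking the whole file.

```python
import bz2


def peek_dump(path, max_lines=5):
    """Return the first few decoded lines of a bz2-compressed XML dump.

    Hypothetical helper for a quick sanity check; not part of the repository.
    """
    lines = []
    # Text mode decompresses on the fly, so the full dump is never held in memory.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    return lines
```

Calling `peek_dump("enwiki-20181220-pages-articles.xml.bz2")` on an intact dump should return the opening XML lines; a truncated or corrupt download typically raises an error during decompression instead.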
Run the provided script to convert the wiki dump to JSONL format. Point --dump_path at the dump you downloaded, and adjust the chunking parameters (--chunk_by, --chunk_size) and --num_workers as needed:
cd scripts
python preprocess_wiki.py --dump_path ../enwiki-20181220-pages-articles.xml.bz2 \
    --save_path ../test_sample.jsonl \
    --chunk_by sentence \
    --chunk_size 512 \
    --num_workers 1
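Once the script finishes, a quick look at the output confirms that each line is valid JSON. The sketch below is an illustration under assumptions: the helper names `preview_jsonl` and `list_keys` are hypothetical, and because the record fields are not specified in this section, the code only reports whatever keys it finds rather than expecting particular ones.

```python
import json


def preview_jsonl(path, n=3):
    """Parse the first n non-empty lines of a JSONL file into Python objects."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # json.loads raises json.JSONDecodeError if a line is malformed.
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records


def list_keys(records):
    """Return the sorted set of keys seen across records (assumes JSON objects)."""
    keys = set()
    for rec in records:
        keys.update(rec.keys())
    return sorted(keys)
```

For the command above, `list_keys(preview_jsonl("../test_sample.jsonl"))` run from the scripts directory would show which fields the corpus records carry.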
We also provide the preprocessed corpus that we used in our experiments. Download link: https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus
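The corpus can also be fetched programmatically. The following is a sketch that assumes the `huggingface_hub` package is installed (it is not mentioned in this section); the repository ID and folder name are taken from the link above, while the function name and the default `local_dir` are hypothetical.

```python
# Values read off the download link above.
REPO_ID = "RUC-NLPIR/FlashRAG_datasets"
SUBFOLDER = "retrieval-corpus"


def download_retrieval_corpus(local_dir="flashrag_corpus"):
    """Download only the retrieval-corpus folder of the dataset repository.

    Assumes `pip install huggingface_hub`; the import is deferred so this
    module can be loaded even when the package is absent.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",                 # the link points at a dataset repo
        allow_patterns=[f"{SUBFOLDER}/*"],   # skip files outside the corpus folder
        local_dir=local_dir,
    )
```

`download_retrieval_corpus()` returns the local path of the downloaded snapshot. The files may be large, so check available disk space first.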