Change sample data to general knowledge
ignorejjj committed Jun 14, 2024
1 parent be9453d commit 3d20a75
Showing 7 changed files with 15,019 additions and 1,019 deletions.
16 changes: 8 additions & 8 deletions docs/introduction_for_beginners_en.md
@@ -16,8 +16,8 @@ To smoothly run the entire RAG process, you need to complete the following five
1. Install the project and its dependencies.
2. Download the required models.
3. Download the necessary datasets (a [toy dataset](../examples/quick_start/dataset/nq) is provided).
-4. Download the document collection for retrieval (a [toy corpus](../examples/quick_start/indexes/sample_data.jsonl) is provided).
-5. Build the index for retrieval (a [toy index](../examples/quick_start/indexes/e5_flat_sample.index) is provided).
+4. Download the document collection for retrieval (a [toy corpus](../examples/quick_start/indexes/general_knowledge.jsonl) is provided).
+5. Build the index for retrieval (a [toy index](../examples/quick_start/indexes/e5_Flat.index) is provided).

To save time in getting started, we provide toy datasets, document collections, and corresponding indices. Therefore, you only need to complete the first two steps to successfully run the entire process.

@@ -52,7 +52,7 @@ Our repository also provides a large number of processed benchmark datasets. You

### 2.4 Downloading the Document Collection

-The document collection contains a large number of segmented paragraphs, serving as the external knowledge source for the RAG system. Since commonly used document collections are often very large (~5G or more), we have extracted 10,000 texts from the Wikipedia document collection as a toy collection, located at [examples/quick_start/indexes/sample_data.jsonl](../examples/quick_start/indexes/sample_data.jsonl)
+The document collection contains a large number of segmented paragraphs, serving as the external knowledge source for the RAG system. Since commonly used document collections are often very large (~5G or more), we use a [general knowledge dataset](https://huggingface.co/datasets/MuskumPillerum/General-Knowledge) as a toy collection, located at [examples/quick_start/indexes/general_knowledge.jsonl](../examples/quick_start/indexes/general_knowledge.jsonl)

> Due to the small number of documents, many queries may not find relevant texts, which could affect the final retrieval results.
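
As an aside for readers unfamiliar with the format: a `.jsonl` corpus is simply one JSON object per line. The sketch below illustrates writing and reading such a file; the `id`/`contents` field names and the sample texts are assumptions for illustration, not taken from this commit — check the toy file itself for the actual schema.

```python
import json

# Hypothetical two-document corpus in the style of a JSONL retrieval collection.
# The "id"/"contents" field names are an assumption, not confirmed by this commit.
docs = [
    {"id": "0", "contents": "The Eiffel Tower is located in Paris, France."},
    {"id": "1", "contents": "Water boils at 100 degrees Celsius at sea level."},
]

# Serialize to JSONL: one JSON object per line, newline-separated.
jsonl_text = "\n".join(json.dumps(d) for d in docs)

# Reading it back: parse each line independently.
corpus = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(corpus))  # 2
```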
@@ -64,7 +64,7 @@ If you need to obtain the full document collection, you can visit our [huggingfa

To improve retrieval efficiency, we often need to build the retrieval index in advance. For the BM25 method, the index is usually an inverted index (a directory in our project). For various embedding methods, the index is a Faiss database containing the embeddings of all texts in the document collection (an .index file). **Each index corresponds to a corpus and a retrieval method**, meaning that every time you want to use a new embedding model, you need to rebuild the index.

-Here, we provide a [toy index](../examples/quick_start/indexes/e5_flat_sample.index), built using E5-base-v2 and the aforementioned toy corpus.
+Here, we provide a [toy index](../examples/quick_start/indexes/e5_Flat.index), built using E5-base-v2 and the aforementioned toy corpus.

If you want to use your own retrieval model and documents, you can refer to our [index building document](./building-index.md) to build your index.
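
To make the idea of a flat index concrete, here is a minimal pure-Python sketch of what it does: store one embedding per document and answer a query by exhaustive inner-product search over all of them. The project itself uses Faiss with E5-base-v2 embeddings; the two-dimensional toy vectors below are made up purely for illustration.

```python
# Minimal sketch of flat (exhaustive) inner-product search — the same idea
# behind a Faiss flat index. Vectors are toy values, not real E5 embeddings.
def build_index(doc_embeddings):
    # A "flat" index is just the raw list of document vectors.
    return list(doc_embeddings)

def search(index, query, top_k=2):
    # Score every document by inner product with the query vector.
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in index]
    # Return the top_k document ids, highest score first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

index = build_index([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
print(search(index, [1.0, 0.0], top_k=2))  # → [0, 1]
```

Because every query scans every document, a flat index trades speed for exact results — which is also why each index is tied to one corpus and one embedding model.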

@@ -88,8 +88,8 @@ from flashrag.config import Config

config_dict = {
'data_dir': 'dataset/',
-    'index_path': 'indexes/e5_flat_sample.index',
-    'corpus_path': 'indexes/sample_data.jsonl',
+    'index_path': 'indexes/e5_Flat.index',
+    'corpus_path': 'indexes/general_knowledge.jsonl',
'model2path': {'e5': <retriever_path>, 'llama2-7B-chat': <generator_path>},
'generator_model': 'llama2-7B-chat',
'retrieval_method': 'e5',
@@ -137,8 +137,8 @@ from flashrag.pipeline import SequentialPipeline

config_dict = {
'data_dir': 'dataset/',
-    'index_path': 'indexes/e5_flat_sample.index',
-    'corpus_path': 'indexes/sample_data.jsonl',
+    'index_path': 'indexes/e5_Flat.index',
+    'corpus_path': 'indexes/general_knowledge.jsonl',
'model2path': {'e5': <retriever_path>, 'llama2-7B-chat': <generator_path>},
'generator_model': 'llama2-7B-chat',
'retrieval_method': 'e5',
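
As context for the `config_dict` entries above: `Config` presumably layers such a user dictionary over file-based defaults (an assumption about FlashRAG's internals, not shown in this diff). The merge itself is ordinary dictionary precedence, sketched here without the FlashRAG API:

```python
# Sketch of dict-precedence config merging; not FlashRAG's actual code.
defaults = {
    "retrieval_method": "bm25",
    "retrieval_topk": 5,
    "generator_model": "llama2-7B-chat",
}
overrides = {
    "retrieval_method": "e5",
    "index_path": "indexes/e5_Flat.index",
}
# The later dict wins on key conflicts, so user overrides replace defaults
# while untouched keys (e.g. retrieval_topk) keep their default values.
merged = {**defaults, **overrides}
print(merged["retrieval_method"], merged["retrieval_topk"])  # → e5 5
```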
18 changes: 9 additions & 9 deletions docs/introduction_for_beginners_zh.md
@@ -17,8 +17,8 @@ The Standard RAG pipeline consists of the following steps:
1. Install this project and its dependencies
2. Download the required models
3. Download the required datasets (a [toy dataset](../examples/quick_start/dataset/nq) is provided)
-4. Download the document collection for retrieval (a [toy corpus](../examples/quick_start/indexes/sample_data.jsonl) is provided)
-5. Build the index for retrieval (a [toy index](../examples/quick_start/indexes/e5_flat_sample.index) is provided)
+4. Download the document collection for retrieval (a [toy corpus](../examples/quick_start/indexes/general_knowledge.jsonl) is provided)
+5. Build the index for retrieval (a [toy index](../examples/quick_start/indexes/e5_Flat.index) is provided)


To save time getting started, we provide a toy dataset, a document collection, and the corresponding index. In practice, you only need to complete the first two steps to run the entire pipeline.
@@ -54,8 +54,8 @@ pip install -e .

### 2.4 Downloading the Document Collection

-The document collection contains a large number of pre-segmented passages and serves as the external knowledge source for the RAG system. Since commonly used document collections are often very large (~5G or more), we extracted 10,000 texts from the Wikipedia document collection as a toy collection, located at [examples/quick_start/indexes/sample_data.jsonl](../examples/quick_start/indexes/sample_data.jsonl)
-> Since the number of documents is very small, many queries may not find relevant texts, which may affect the final retrieval results.
+The document collection contains a large number of pre-segmented passages and serves as the external knowledge source for the RAG system. Since commonly used document collections are often very large (~5G or more), we use a general knowledge dataset as the retrieval corpus, located at [examples/quick_start/indexes/general_knowledge.jsonl](../examples/quick_start/indexes/general_knowledge.jsonl)
+> Since the number of documents is small, many queries may not find relevant texts, which may affect the final retrieval results.

If you need the full document collection, you can download it from our [dataset on Hugging Face](https://huggingface.co/datasets/ignore/FlashRAG_datasets).
@@ -65,7 +65,7 @@ pip install -e .

To improve retrieval efficiency, we usually need to build the retrieval index in advance. For BM25, the index is an inverted index (a directory in our project). For the various embedding methods, the index is a Faiss database containing the embeddings of all texts in the document collection (an .index file). **Each index corresponds to one corpus and one retrieval method**, so whenever you want to use a new embedding model, you must rebuild the index.

-Here we provide a [toy index](../examples/quick_start/indexes/e5_flat_sample.index), built using E5-base-v2 and the aforementioned toy corpus.
+Here we provide a [toy index](../examples/quick_start/indexes/e5_Flat.index), built using E5-base-v2 and the aforementioned toy corpus.

If you want to use your own retrieval model and documents, you can refer to our [index building document](./building-index.md) to build your index.

Expand All @@ -88,8 +88,8 @@ from flashrag.config import Config

config_dict = {
'data_dir': 'dataset/',
-    'index_path': 'indexes/e5_flat_sample.index',
-    'corpus_path': 'indexes/sample_data.jsonl',
+    'index_path': 'indexes/e5_Flat.index',
+    'corpus_path': 'indexes/general_knowledge.jsonl',
'model2path': {'e5': <retriever_path>, 'llama2-7B-chat': <generator_path>},
'generator_model': 'llama2-7B-chat',
'retrieval_method': 'e5',
@@ -136,8 +136,8 @@ from flashrag.pipeline import SequentialPipeline

config_dict = {
'data_dir': 'dataset/',
-    'index_path': 'indexes/e5_flat_sample.index',
-    'corpus_path': 'indexes/sample_data.jsonl',
+    'index_path': 'indexes/e5_Flat.index',
+    'corpus_path': 'indexes/general_knowledge.jsonl',
'model2path': {'e5': <retriever_path>, 'llama2-7B-chat': <generator_path>},
'generator_model': 'llama2-7B-chat',
'retrieval_method': 'e5',
Binary file added examples/quick_start/indexes/e5_Flat.index
Binary file removed examples/quick_start/indexes/e5_flat_sample.index