Merge branch 'main' into dev

RUC-NLPIR · Jan 7, 2025 · 3e703ce
2 parents ae29073 + 309c7b2
Showing 26 changed files with 727 additions and 377 deletions.
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
@@ -22,7 +22,7 @@ jobs:
 
     environment:
       name: pypi
-      url: https://pypi.org/p/flashrag-dev
+      url: https://pypi.org/p/flashrag_dev
 
     steps:
     - uses: actions/checkout@v3

diff --git a/README.md b/README.md
@@ -14,10 +14,10 @@
 <p>
 <a href="#wrench-installation">Installation</a> |
 <a href="#sparkles-features">Features</a> |
-<a href="#running-quick-start">Quick-Start</a> |
+<a href="#rocket-quick-start">Quick-Start</a> |
 <a href="#gear-components"> Components</a> |
 <a href="#robot-supporting-methods"> Supporting Methods</a> |
-<a href="#notebook-supporting-datasets"> Supporting Datasets</a> |
+<a href="#notebook-supporting-datasets--document-corpus"> Supporting Datasets</a> |
 <a href="#raised_hands-additional-faqs"> FAQs</a>
 </p>
 
@@ -59,6 +59,9 @@ FlashRAG is still under development and there are many issues and room for impro
 
 
 ## :page_with_curl: Changelog
+
+[25/01/07] We have integrated a flexible and lightweight corpus chunking library, [**Chonkie**](https://github.com/chonkie-ai/chonkie?tab=readme-ov-file#usage), which supports various custom chunking methods (tokens, sentences, semantic, etc.). Use it to [<u>chunk your document corpus</u>](docs/chunk-doc-corpus.md).
+
 [24/10/21] We have released a version based on the Paddle framework that supports Chinese hardware platforms. Please refer to [FlashRAG Paddle](https://github.com/RUC-NLPIR/FlashRAG-Paddle) for details.
 
 [24/10/13] A new in-domain dataset and corpus - [DomainRAG](https://arxiv.org/pdf/2406.05654) have been added to the dataset. The dataset is based on the internal enrollment data of Renmin University of China, covering seven types of tasks, which can be used for conducting domain-specific RAG testing.
@@ -67,16 +70,18 @@ FlashRAG is still under development and there are many issues and room for impro
 
 [24/09/18] Due to the complexity and limitations of installing Pyserini in certain environments, we have introduced a lightweight `BM25s` package as an alternative (faster and easier to use). The retriever based on Pyserini will be deprecated in future versions. To use retriever with `bm25s`, just set `bm25_backend` to `bm25s` in config.
 
+
 [24/09/09] We add support for a new method [<u>Adaptive-RAG</u>](https://aclanthology.org/2024.naacl-long.389.pdf), which can automatically select the RAG process to execute based on the type of query. See it result in [<u>result table</u>](#robot-supporting-methods).
 
 [24/08/02] We add support for a new method [<u>Spring</u>](https://arxiv.org/abs/2405.19670), significantly improve the performance of LLM by adding only a few token embeddings. See it result in [<u>result table</u>](#robot-supporting-methods).
 
+<details>
+<summary>Show more</summary>
+
 [24/07/17] Due to some unknown issues with HuggingFace, our original dataset link has been invalid. We have updated it. Please check the [new link](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/) if you encounter any problems.
 
 [24/07/06] We add support for a new method: [<u>Trace</u>](https://arxiv.org/abs/2406.11460), which refine text by constructing a knowledge graph. See it [<u>results</u>](#robot-supporting-methods) and [<u>details</u>](./docs/baseline_details.md).
 
-<details>
-<summary>Show more</summary>
 
 [24/06/19] We add support for a new method: [<u>IRCoT</u>](https://arxiv.org/abs/2212.10509), and update the [<u>result table</u>](#robot-supporting-methods).
 
@@ -537,6 +542,7 @@ The index was created using the e5-base-v2 retriever on our uploaded wiki18_100w
 
 FlashRAG is licensed under the [<u>MIT License</u>](./LICENSE).
 
+
 ## :star2: Citation
 Please kindly cite our paper if helps your research:
 ```BibTex

diff --git a/docs/chunk-doc-corpus.md b/docs/chunk-doc-corpus.md
@@ -0,0 +1,40 @@
+# Chunking Document Corpus
+
+You can split your document corpus into smaller chunks by following these steps. This is useful when building a RAG index over a large corpus of long documents, or when each document must stay within the model's length limit.
+
+The input is a document corpus JSONL file in the following format, where the `contents` field holds `"{title}\n{text}"`:
+
+```jsonl
+{ "id": 0, "contents": "..." }
+{ "id": 1, "contents": "..." }
+{ "id": 2, "contents": "..." }
+...
+```
+
+You can run the following command:
+
+```bash
+cd scripts
+python chunk_doc_corpus.py --input_path input.jsonl \
+                          --output_path output.jsonl \
+                          --chunk_by sentence \
+                          --chunk_size 512
+```
+
+You will get a JSONL file with the following format:
+
+```jsonl
+{ "id": 0, "doc_id": 0, "title": ..., "contents": ... }
+{ "id": 1, "doc_id": 0, "title": ..., "contents": ... }
+{ "id": 2, "doc_id": 0, "title": ..., "contents": ... }
+...
+```
+
+**NOTE:** In the output JSONL, `doc_id` is the id of the original document and `contents` is the text of the chunk.
+
+## Parameters
+
+- `input_path`: Path to the input JSONL file.
+- `output_path`: Path to the output JSONL file.
+- `chunk_by`: Chunking method to use. Can be `token`, `word`, `sentence`, or `recursive`.
+- `chunk_size`: Size of chunks.
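The input and output formats described in `docs/chunk-doc-corpus.md` can be illustrated with a minimal standard-library sketch of sentence-based chunking. This is not the implementation in `scripts/chunk_doc_corpus.py`, which relies on Chonkie. The helper names `chunk_document` and `chunk_corpus` are hypothetical, the sentence splitter is a naive regex, and `chunk_size` is assumed to be counted in whitespace-separated words (the real script may count tokens). Whether the real output `contents` repeats the title is also not stated in the doc; this sketch keeps only the chunk text.

```python
import json
import re


def chunk_document(doc, chunk_size=512):
    """Split one corpus record into chunks made of whole sentences.

    Assumption: chunk_size is measured in whitespace-separated words.
    """
    # `contents` is "{title}\n{text}", so split on the first newline only.
    title, _, text = doc["contents"].partition("\n")
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, length = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and length + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n_words
    if current:
        chunks.append(" ".join(current))
    return title, chunks


def chunk_corpus(docs, chunk_size=512):
    """Yield output records with a running chunk id and the source doc_id."""
    next_id = 0
    for doc in docs:
        title, chunks = chunk_document(doc, chunk_size)
        for chunk in chunks:
            yield {"id": next_id, "doc_id": doc["id"], "title": title, "contents": chunk}
            next_id += 1


if __name__ == "__main__":
    corpus = [{"id": 0, "contents": "Title A\nFirst sentence here. Second one follows. Third."}]
    for record in chunk_corpus(corpus, chunk_size=5):
        print(json.dumps(record))
```

A sentence longer than `chunk_size` is kept whole in this sketch rather than split, so chunks can exceed the budget in that case.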
diff --git a/docs/process-wiki.md b/docs/process-wiki.md
@@ -1,23 +1,8 @@
-## Building Wiki Corpus
+# Building Wiki Corpus
 
 You can create your own wiki corpus by following these steps.
 
-### Step1: Install necessary tools
-
-First, install the required tools:
-```bash
-pip install wikiextractor==0.1
-python -m spacy download en_core_web_lg
-```
-
-If you encounter issues downloading en_core_web_lg, you can manually download it:
-```bash
-wget https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl
-pip install en_core_web_lg-3.7.1-py3-none-any.whl
-```
-
-
-### Step2: Download Wiki dump
+## Step1: Download Wiki dump
 
 Download the Wikipedia dump you require in XML format. For instance: 
 
@@ -27,16 +12,17 @@ wget https://archive.org/download/enwiki-20181220/enwiki-20181220-pages-articles
 
 You can access other dumps from this [<u>website</u>](https://archive.org/search?query=Wikimedia+database+dump&sort=-downloads).
 
-
-### Step3: Run process script
+## Step2: Run process script
 
 Execute the provided script to process the wiki dump into JSONL format. Adjust the corpus partitioning parameters as needed:
 
 ```bash
 cd scripts
 python preprocess_wiki.py --dump_path ../enwikinews-20240420-pages-articles.xml.bz2  \
                         --save_path ../test_sample.jsonl \
-                        --chunk_by 100w
+                        --chunk_by sentence \
+                        --chunk_size 512 \
+                        --num_workers 1
 ```
 
 We also provide the version we used for experiments. Download link: https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus
diff --git a/examples/methods/my_config.yaml b/examples/methods/my_config.yaml
@@ -56,7 +56,7 @@ retrieval_topk: 5 # number of retrieved documents
 retrieval_batch_size: 256  # batch size for retrieval
 retrieval_use_fp16: True  # whether to use fp16 for retrieval model
 retrieval_query_max_length: 128  # max length of the query
-save_retrieval_cache: True # whether to save the retrieval cache
+save_retrieval_cache: False # whether to save the retrieval cache
 use_retrieval_cache: False # whether to use the retrieval cache
 retrieval_cache_path: ~ # path to the retrieval cache
 retrieval_pooling_method: ~ # set automatically if not provided

diff --git a/examples/methods/run_exp.py b/examples/methods/run_exp.py
@@ -378,7 +378,7 @@ def selfrag(args):
         ignore_cont=True,
         mode="adaptive_retrieval",
     )
-    result = pipeline.run(test_data, batch_size=256)
+    result = pipeline.run(test_data, long_form=False)
 
 
 def flare(args):
@@ -549,6 +549,31 @@ def adaptive(args):
     pipeline = AdaptivePipeline(config)
     result = pipeline.run(test_data)
 
+def rqrag(args):
+    """
+    Function to run the RQRAGPipeline.
+    """
+    from flashrag.pipeline import RQRAGPipeline
+
+    save_note = "rqrag"
+    max_depth = 3
+    config_dict = {
+        "save_note": save_note,
+        "gpu_id": args.gpu_id,
+        'framework': 'vllm',
+        "dataset_name": args.dataset_name,
+        "split": args.split,
+        "max_depth": max_depth
+    }
+
+    config = Config("my_config.yaml", config_dict)
+
+    all_split = get_dataset(config)
+    test_data = all_split[args.split]
+
+    pipeline = RQRAGPipeline(config, max_depth = max_depth)
+    result = pipeline.run(test_data)
+
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Running exp")
@@ -557,6 +582,7 @@ def adaptive(args):
     parser.add_argument("--dataset_name", type=str)
     parser.add_argument("--gpu_id", type=str)
 
+
     func_dict = {
         "AAR-contriever": aar,
         "AAR-ANCE": aar,
@@ -575,6 +601,7 @@ def adaptive(args):
         "ircot": ircot,
         "trace": trace,
         "adaptive": adaptive,
+        "rqrag": rqrag,
     }
 
     args = parser.parse_args()

diff --git a/flashrag/config/basic_config.yaml b/flashrag/config/basic_config.yaml
@@ -15,7 +15,7 @@ model2pooling:
   bge: "cls"
   contriever: "mean"
   jina: "mean"
-  dpr: cls
+  dpr: "pooler"
 
 # Indexes path for retrieval models
 method2index:

diff --git a/flashrag/config/config.py b/flashrag/config/config.py
@@ -1,5 +1,6 @@
 import re
 import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
 import yaml
 import random
 import datetime
@@ -103,13 +104,19 @@ def _init_device(self):
         gpu_id = self.final_config["gpu_id"]
         if gpu_id is not None:
             os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+        try:
+            # import pynvml 
+            # pynvml.nvmlInit()
+            # gpu_num = pynvml.nvmlDeviceGetCount()
             import torch
-
-            self.final_config["device"] = torch.device("cuda")
+            gpu_num = torch.cuda.device_count()
+        except:
+            gpu_num = 0
+        self.final_config['gpu_num'] = gpu_num
+        if gpu_num > 0:
+            self.final_config["device"] = "cuda"
         else:
-            import torch
-
-            self.final_config["device"] = torch.device("cpu")
+            self.final_config['device'] = 'cpu'
 
     def _set_additional_key(self):
         def set_pooling_method(method, model2pooling):
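The device-selection logic introduced in `_init_device` can be read in isolation as the following sketch. The function name `detect_device` is hypothetical and the code is a standalone restatement, not the method from `config.py`. One deliberate difference: it catches `Exception` rather than using a bare `except`, so `KeyboardInterrupt` and `SystemExit` are not swallowed.

```python
def detect_device():
    """Return (device, gpu_num), falling back to CPU when no GPU is usable."""
    try:
        import torch  # treated as optional here; missing torch means CPU

        gpu_num = torch.cuda.device_count()
    except Exception:
        # Covers a missing torch install as well as a broken CUDA setup.
        gpu_num = 0
    device = "cuda" if gpu_num > 0 else "cpu"
    return device, gpu_num


if __name__ == "__main__":
    device, gpu_num = detect_device()
    print(f"device={device} gpu_num={gpu_num}")
```

Note that the diff now stores the device as a plain string (`"cuda"` or `"cpu"`) instead of a `torch.device` object, so downstream code that expects a `torch.device` may need to convert it.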

diff --git a/flashrag/dataset/dataset.py b/flashrag/dataset/dataset.py
@@ -186,10 +186,10 @@ def save(self, save_path: str) -> None:
         """Save the dataset into the original format."""
 
         save_data = [item.to_dict() for item in self.data]
-
         with open(save_path, "w", encoding="utf-8") as f:
             json.dump(save_data, f, indent=4, ensure_ascii=False)
 
+
     def __str__(self) -> str:
         """Return a string representation of the dataset with a summary of items."""
         return f"Dataset '{self.dataset_name}' with {len(self)} items"
diff --git a/flashrag/dataset/utils.py b/flashrag/dataset/utils.py
@@ -21,7 +21,6 @@ def convert_numpy(data: Any) -> Any:
     else:
         return data
 
-
 def filter_dataset(dataset: Dataset, filter_func=None):
     if filter_func is None:
         return dataset

diff --git a/flashrag/evaluator/metrics.py b/flashrag/evaluator/metrics.py
@@ -53,7 +53,7 @@ class F1_Score(BaseMetric):
     def __init__(self, config):
         super().__init__(config)
 
-    def token_level_scores(self, prediction: str, ground_truths: str):
+    def token_level_scores(self, prediction: str, ground_truths: list):
         final_metric = {"f1": 0, "precision": 0, "recall": 0}
         if isinstance(ground_truths, str):
             ground_truths = [ground_truths]
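The annotation change above reflects that `ground_truths` is normally a list of reference answers, with a single string accepted and wrapped. The rest of `token_level_scores` is not shown in this hunk, so the following is only a sketch of the common token-overlap F1, taking the best-scoring reference. The function name `token_f1` is hypothetical, and normalization here is just lowercasing and whitespace splitting, which may differ from what the repository does.

```python
from collections import Counter


def token_f1(prediction, ground_truths):
    """Token-overlap F1 of a prediction against the best-matching reference."""
    if isinstance(ground_truths, str):
        ground_truths = [ground_truths]

    best = {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    pred_tokens = prediction.lower().split()
    for truth in ground_truths:
        truth_tokens = truth.lower().split()
        # Multiset intersection counts each shared token at most as often
        # as it appears in both sequences.
        overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(truth_tokens)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best["f1"]:
            best = {"f1": f1, "precision": precision, "recall": recall}
    return best


if __name__ == "__main__":
    print(token_f1("the cat sat", ["the cat", "a dog"]))
```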
@@ -282,14 +282,17 @@ def calculate_metric(self, data):
 
 class Rouge_Score(BaseMetric):
     metric_name = "rouge_score"
-
+    cached_scores = {}
+
     def __init__(self, config):
         super().__init__(config)
         from rouge import Rouge
 
         self.scorer = Rouge()
 
     def calculate_rouge(self, pred, golden_answers):
+        if (pred, tuple(golden_answers)) in self.cached_scores:
+            return self.cached_scores[(pred, tuple(golden_answers))]
         output = {}
         for answer in golden_answers:
             scores = self.scorer.get_scores(pred, answer)
@@ -300,6 +303,7 @@ def calculate_rouge(self, pred, golden_answers):
         for k, v in output.items():
             output[k] = max(v)
 
+        self.cached_scores[(pred, tuple(golden_answers))] = output
         return output
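The caching pattern added to `Rouge_Score` is a class-level dictionary keyed by `(pred, tuple(golden_answers))`, so every instance of the class shares the same entries. The sketch below shows that pattern on its own. `CachedMaxScorer` and `word_overlap` are hypothetical names, and `word_overlap` is a toy stand-in for a real ROUGE scorer.

```python
class CachedMaxScorer:
    # Class-level cache: shared by every instance, as in the diff above.
    cached_scores = {}

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.misses = 0  # number of times the score had to be computed

    def best_score(self, pred, golden_answers):
        # Lists are unhashable, so the answers are converted to a tuple.
        key = (pred, tuple(golden_answers))
        if key in self.cached_scores:
            return self.cached_scores[key]
        self.misses += 1
        result = max(self.score_fn(pred, answer) for answer in golden_answers)
        self.cached_scores[key] = result
        return result


def word_overlap(pred, answer):
    """Toy scorer: fraction of answer words that also occur in pred."""
    answer_words = answer.split()
    pred_words = set(pred.split())
    return sum(w in pred_words for w in answer_words) / max(len(answer_words), 1)


if __name__ == "__main__":
    scorer = CachedMaxScorer(word_overlap)
    print(scorer.best_score("a b c", ["a b", "x y"]))  # computed
    print(scorer.best_score("a b c", ["a b", "x y"]))  # served from the cache
    print(scorer.misses)
```

Two properties of this design are worth keeping in mind when reading the diff: the cache is never evicted, so it grows with the number of distinct prediction/answer pairs, and in `calculate_rouge` the cached value is a dict, so a caller that mutates the returned dict also changes the cached entry.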
 
 
@@ -394,8 +398,8 @@ def calculate_metric(self, data):
             pred = [pred]
             golden_answers = [golden_answers]
             score = compute_bleu(
-                reference_corpus=golden_answers_list,
-                translation_corpus=pred_list,
+                reference_corpus=golden_answers,
+                translation_corpus=pred,
                 max_order=self.max_order,
                 smooth=self.smooth,
             )