Merge branch 'main' into dev
ignorejjj authored Jan 7, 2025
2 parents ae29073 + 309c7b2 commit 3e703ce
Showing 26 changed files with 727 additions and 377 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-publish.yml
@@ -22,7 +22,7 @@ jobs:

environment:
name: pypi
url: https://pypi.org/p/flashrag-dev
url: https://pypi.org/p/flashrag_dev

steps:
- uses: actions/checkout@v3
14 changes: 10 additions & 4 deletions README.md
@@ -14,10 +14,10 @@
<p>
<a href="#wrench-installation">Installation</a> |
<a href="#sparkles-features">Features</a> |
<a href="#running-quick-start">Quick-Start</a> |
<a href="#rocket-quick-start">Quick-Start</a> |
<a href="#gear-components"> Components</a> |
<a href="#robot-supporting-methods"> Supporting Methods</a> |
<a href="#notebook-supporting-datasets"> Supporting Datasets</a> |
<a href="#notebook-supporting-datasets--document-corpus"> Supporting Datasets</a> |
<a href="#raised_hands-additional-faqs"> FAQs</a>
</p>

@@ -59,6 +59,9 @@ FlashRAG is still under development and there are many issues and room for impro


## :page_with_curl: Changelog

[25/01/07] We have integrated a flexible and lightweight corpus chunking library, [**Chonkie**](https://github.com/chonkie-ai/chonkie?tab=readme-ov-file#usage), which supports various custom chunking methods (tokens, sentences, semantic, etc.). Use it in [<u>chunking doc corpus</u>](docs/chunk-doc-corpus.md).

[24/10/21] We have released a version based on the Paddle framework that supports Chinese hardware platforms. Please refer to [FlashRAG Paddle](https://github.com/RUC-NLPIR/FlashRAG-Paddle) for details.

[24/10/13] A new in-domain dataset and corpus, [DomainRAG](https://arxiv.org/pdf/2406.05654), has been added to the dataset. It is based on the internal enrollment data of Renmin University of China and covers seven types of tasks, which can be used for domain-specific RAG testing.
@@ -67,16 +70,18 @@ FlashRAG is still under development and there are many issues and room for impro

[24/09/18] Due to the complexity and limitations of installing Pyserini in certain environments, we have introduced a lightweight `BM25s` package as an alternative (faster and easier to use). The retriever based on Pyserini will be deprecated in future versions. To use retriever with `bm25s`, just set `bm25_backend` to `bm25s` in config.


[24/09/09] We add support for a new method, [<u>Adaptive-RAG</u>](https://aclanthology.org/2024.naacl-long.389.pdf), which can automatically select the RAG process to execute based on the type of query. See its results in the [<u>result table</u>](#robot-supporting-methods).

[24/08/02] We add support for a new method, [<u>Spring</u>](https://arxiv.org/abs/2405.19670), which significantly improves the performance of LLMs by adding only a few token embeddings. See its results in the [<u>result table</u>](#robot-supporting-methods).

<details>
<summary>Show more</summary>

[24/07/17] Due to some unknown issues with HuggingFace, our original dataset link became invalid. We have updated it. Please check the [new link](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/) if you encounter any problems.

[24/07/06] We add support for a new method, [<u>Trace</u>](https://arxiv.org/abs/2406.11460), which refines text by constructing a knowledge graph. See its [<u>results</u>](#robot-supporting-methods) and [<u>details</u>](./docs/baseline_details.md).

<details>
<summary>Show more</summary>

[24/06/19] We add support for a new method: [<u>IRCoT</u>](https://arxiv.org/abs/2212.10509), and update the [<u>result table</u>](#robot-supporting-methods).

@@ -537,6 +542,7 @@ The index was created using the e5-base-v2 retriever on our uploaded wiki18_100w

FlashRAG is licensed under the [<u>MIT License</u>](./LICENSE).


## :star2: Citation
Please kindly cite our paper if it helps your research:
```BibTex
40 changes: 40 additions & 0 deletions docs/chunk-doc-corpus.md
@@ -0,0 +1,40 @@
# Chunking Document Corpus

You can chunk your document corpus into smaller chunks by following these steps. This is useful in building an index over a large corpus of long documents for RAG, or if you want to make sure that the document length is not too long for the model.

Start from a document corpus JSONL file in the following format, where the `contents` field of each record follows the `"{title}\n{text}"` pattern:

```jsonl
{ "id": 0, "contents": "..." }
{ "id": 1, "contents": "..." }
{ "id": 2, "contents": "..." }
...
```
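As a small illustration of the `"{title}\n{text}"` convention, the sketch below splits one record's `contents` into its title and body. The helper name `split_contents` and the sample record are hypothetical and are not part of the FlashRAG scripts.

```python
import json


def split_contents(contents: str) -> tuple[str, str]:
    """Split a contents string at its first newline into (title, text)."""
    # partition() returns the part before the first newline, the newline
    # itself, and the remainder; text is empty if there is no newline.
    title, _, text = contents.partition("\n")
    return title, text


# Hypothetical sample line in the input JSONL format described above.
line = '{"id": 0, "contents": "Alan Turing\\nAlan Turing was a mathematician."}'
record = json.loads(line)
title, text = split_contents(record["contents"])
print(title)
print(text)
```

Splitting only at the first newline keeps any later newlines inside the document body.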

You can run the following command:

```bash
cd scripts
python chunk_doc_corpus.py --input_path input.jsonl \
--output_path output.jsonl \
--chunk_by sentence \
--chunk_size 512
```

You will get a JSONL file with the following format:

```jsonl
{ "id": 0, "doc_id": 0, "title": ..., "contents": ... }
{ "id": 1, "doc_id": 0, "title": ..., "contents": ... }
{ "id": 2, "doc_id": 0, "title": ..., "contents": ... }
...
```

**NOTE:** In the new JSONL output, `doc_id` is the id of the original document, and `contents` holds the chunked document content.

## Parameters

- `input_path`: Path to the input JSONL file.
- `output_path`: Path to the output JSONL file.
- `chunk_by`: Chunking method to use. Can be `token`, `word`, `sentence`, or `recursive`.
- `chunk_size`: Size of chunks.
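
To make the effect of `chunk_by sentence` and `chunk_size` concrete, here is a minimal sketch of sentence-based chunking. It is not the implementation in `chunk_doc_corpus.py`: the function names are hypothetical, sentences are split with a simple regular expression, and `chunk_size` is counted in whitespace-separated words rather than whatever unit the script actually uses. Whether the real output repeats the title inside `contents` is also an assumption left out here.

```python
import re


def chunk_by_sentence(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    A single sentence longer than chunk_size becomes its own chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_corpus(docs: list[dict], chunk_size: int) -> list[dict]:
    """Turn input records into chunk records shaped like the output format above."""
    out = []
    for doc in docs:
        title, _, text = doc["contents"].partition("\n")
        for chunk in chunk_by_sentence(text, chunk_size):
            # id is a running chunk index; doc_id points back to the source document.
            out.append({"id": len(out), "doc_id": doc["id"], "title": title, "contents": chunk})
    return out


docs = [{"id": 0, "contents": "Title\nOne two three. Four five six. Seven eight."}]
for row in chunk_corpus(docs, chunk_size=6):
    print(row)
```

Packing whole sentences keeps each chunk readable on its own, at the cost of chunk lengths that vary below the limit.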
26 changes: 6 additions & 20 deletions docs/process-wiki.md
@@ -1,23 +1,8 @@
## Building Wiki Corpus
# Building Wiki Corpus

You can create your own wiki corpus by following these steps.

### Step1: Install necessary tools

First, install the required tools:
```bash
pip install wikiextractor==0.1
python -m spacy download en_core_web_lg
```

If you encounter issues downloading en_core_web_lg, you can manually download it:
```bash
wget https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl
pip install en_core_web_lg-3.7.1-py3-none-any.whl
```


### Step2: Download Wiki dump
## Step1: Download Wiki dump

Download the Wikipedia dump you require in XML format. For instance:

@@ -27,16 +12,17 @@ wget https://archive.org/download/enwiki-20181220/enwiki-20181220-pages-articles

You can access other dumps from this [<u>website</u>](https://archive.org/search?query=Wikimedia+database+dump&sort=-downloads).


### Step3: Run process script
## Step2: Run process script

Execute the provided script to process the wiki dump into JSONL format. Adjust the corpus partitioning parameters as needed:

```bash
cd scripts
python preprocess_wiki.py --dump_path ../enwikinews-20240420-pages-articles.xml.bz2 \
--save_path ../test_sample.jsonl \
--chunk_by 100w
--chunk_by sentence \
--chunk_size 512 \
--num_workers 1
```

We also provide the version we used for experiments. Download link: https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus
2 changes: 1 addition & 1 deletion examples/methods/my_config.yaml
@@ -56,7 +56,7 @@ retrieval_topk: 5 # number of retrieved documents
retrieval_batch_size: 256 # batch size for retrieval
retrieval_use_fp16: True # whether to use fp16 for retrieval model
retrieval_query_max_length: 128 # max length of the query
save_retrieval_cache: True # whether to save the retrieval cache
save_retrieval_cache: False # whether to save the retrieval cache
use_retrieval_cache: False # whether to use the retrieval cache
retrieval_cache_path: ~ # path to the retrieval cache
retrieval_pooling_method: ~ # set automatically if not provided
29 changes: 28 additions & 1 deletion examples/methods/run_exp.py
@@ -378,7 +378,7 @@ def selfrag(args):
ignore_cont=True,
mode="adaptive_retrieval",
)
result = pipeline.run(test_data, batch_size=256)
result = pipeline.run(test_data, long_form=False)


def flare(args):
@@ -549,6 +549,31 @@ def adaptive(args):
pipeline = AdaptivePipeline(config)
result = pipeline.run(test_data)

def rqrag(args):
"""
Function to run the RQRAGPipeline.
"""
from flashrag.pipeline import RQRAGPipeline

save_note = "rqrag"
max_depth = 3
config_dict = {
"save_note": save_note,
"gpu_id": args.gpu_id,
'framework': 'vllm',
"dataset_name": args.dataset_name,
"split": args.split,
"max_depth": max_depth
}

config = Config("my_config.yaml", config_dict)

all_split = get_dataset(config)
test_data = all_split[args.split]

pipeline = RQRAGPipeline(config, max_depth = max_depth)
result = pipeline.run(test_data)


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Running exp")
@@ -557,6 +582,7 @@ def adaptive(args):
parser.add_argument("--dataset_name", type=str)
parser.add_argument("--gpu_id", type=str)


func_dict = {
"AAR-contriever": aar,
"AAR-ANCE": aar,
@@ -575,6 +601,7 @@ def adaptive(args):
"ircot": ircot,
"trace": trace,
"adaptive": adaptive,
"rqrag": rqrag,
}

args = parser.parse_args()
2 changes: 1 addition & 1 deletion flashrag/config/basic_config.yaml
@@ -15,7 +15,7 @@ model2pooling:
bge: "cls"
contriever: "mean"
jina: "mean"
dpr: cls
dpr: "pooler"

# Indexes path for retrieval models
method2index:
17 changes: 12 additions & 5 deletions flashrag/config/config.py
@@ -1,5 +1,6 @@
import re
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import yaml
import random
import datetime
@@ -103,13 +104,19 @@ def _init_device(self):
gpu_id = self.final_config["gpu_id"]
if gpu_id is not None:
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
try:
# import pynvml
# pynvml.nvmlInit()
# gpu_num = pynvml.nvmlDeviceGetCount()
import torch

self.final_config["device"] = torch.device("cuda")
gpu_num = torch.cuda.device_count()
except:
gpu_num = 0
self.final_config['gpu_num'] = gpu_num
if gpu_num > 0:
self.final_config["device"] = "cuda"
else:
import torch

self.final_config["device"] = torch.device("cpu")
self.final_config['device'] = 'cpu'

def _set_additional_key(self):
def set_pooling_method(method, model2pooling):
2 changes: 1 addition & 1 deletion flashrag/dataset/dataset.py
@@ -186,10 +186,10 @@ def save(self, save_path: str) -> None:
"""Save the dataset into the original format."""

save_data = [item.to_dict() for item in self.data]

with open(save_path, "w", encoding="utf-8") as f:
json.dump(save_data, f, indent=4, ensure_ascii=False)


def __str__(self) -> str:
"""Return a string representation of the dataset with a summary of items."""
return f"Dataset '{self.dataset_name}' with {len(self)} items"
1 change: 0 additions & 1 deletion flashrag/dataset/utils.py
@@ -21,7 +21,6 @@ def convert_numpy(data: Any) -> Any:
else:
return data


def filter_dataset(dataset: Dataset, filter_func=None):
if filter_func is None:
return dataset
12 changes: 8 additions & 4 deletions flashrag/evaluator/metrics.py
@@ -53,7 +53,7 @@ class F1_Score(BaseMetric):
def __init__(self, config):
super().__init__(config)

def token_level_scores(self, prediction: str, ground_truths: str):
def token_level_scores(self, prediction: str, ground_truths: list):
final_metric = {"f1": 0, "precision": 0, "recall": 0}
if isinstance(ground_truths, str):
ground_truths = [ground_truths]
@@ -282,14 +282,17 @@ def calculate_metric(self, data):

class Rouge_Score(BaseMetric):
metric_name = "rouge_score"

cached_scores = {}

def __init__(self, config):
super().__init__(config)
from rouge import Rouge

self.scorer = Rouge()

def calculate_rouge(self, pred, golden_answers):
if (pred, tuple(golden_answers)) in self.cached_scores:
return self.cached_scores[(pred, tuple(golden_answers))]
output = {}
for answer in golden_answers:
scores = self.scorer.get_scores(pred, answer)
@@ -300,6 +303,7 @@ def calculate_rouge(self, pred, golden_answers):
for k, v in output.items():
output[k] = max(v)

self.cached_scores[(pred, tuple(golden_answers))] = output
return output
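
The caching pattern introduced in `calculate_rouge` can be shown in isolation. The sketch below is a simplified stand-in, not the `Rouge_Score` class: `CachedScorer`, `score_fn`, and the `calls` counter are hypothetical names used only for illustration. It shows two properties of the pattern: the list of golden answers is converted to a tuple so it can be part of a dictionary key, and a dictionary defined at class level is shared by every instance.

```python
class CachedScorer:
    # Class-level dict: one cache shared by all instances of the class.
    cached_scores = {}

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.calls = 0  # counts how often the scores are actually computed

    def best_score(self, pred, golden_answers):
        # Lists are unhashable, so the answers are converted to a tuple.
        key = (pred, tuple(golden_answers))
        if key in self.cached_scores:
            return self.cached_scores[key]
        self.calls += 1
        result = max(self.score_fn(pred, answer) for answer in golden_answers)
        self.cached_scores[key] = result
        return result


def exact_match(pred, answer):
    """Toy scoring function standing in for an expensive metric."""
    return float(pred == answer)


first = CachedScorer(exact_match)
second = CachedScorer(exact_match)
print(first.best_score("paris", ["paris", "Paris"]))
print(second.best_score("paris", ["paris", "Paris"]))
print(first.calls, second.calls)
```

Because the cache lives on the class, the second instance reuses the entry stored by the first; an instance attribute would be needed if per-instance isolation were wanted.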


@@ -394,8 +398,8 @@ def calculate_metric(self, data):
pred = [pred]
golden_answers = [golden_answers]
score = compute_bleu(
reference_corpus=golden_answers_list,
translation_corpus=pred_list,
reference_corpus=golden_answers,
translation_corpus=pred,
max_order=self.max_order,
smooth=self.smooth,
)