This repository provides the implementation of Tree-of-Thought Retrieval (ToTR) and Self-Consistency Retrieval (SCR), two novel frameworks designed to enhance retrieval-augmented generation for knowledge-intensive multi-hop question answering tasks. By exploring diverse reasoning paths, these methods aim to improve retrieval coverage and robustness, addressing limitations in existing approaches.
- Instructions for Reproducing the Paper Results
- Installation
- Preparing Database for RAG
- Project Structure
- Utility scripts
- Running LLM server on Slurm cluster
- Running benchmarks
Instructions for Reproducing the Paper Results
This section describes in detail how to reproduce the results presented in the paper.
Important: Reproducing the results from scratch involves downloading the corpora used by the datasets, setting up a vector database (Elasticsearch), ingesting the corpora into it, and running inference on relatively large LMs, which may take up to a few days in total. You also need to download and run the LLMs yourself, so make sure you have sufficient disk storage and GPU memory for the models. If these constraints prevent you from running the code, we encourage you to download the raw predictions and results and inspect them instead. After downloading them, skip to the last step below (step 7, which runs notebooks/results.ipynb). You may need to install the necessary dependencies first (see Installation). The download command is:
bash scripts/download_results.sh
To reproduce the paper results, follow these steps:
- Install the dependencies following the Installation section. You also need to install the vllm dependency group, as our config files assume that you use vLLM's OpenAI-compatible server.
- Set up a vector database (Elasticsearch) following the Preparing Database for RAG section. We recommend building Elasticsearch indices only for the datasets to be used, namely hotpotqa, multihoprag, and musique.
- Run Elasticsearch in the background. (This should already be done when preparing the database.)
- Run vLLM's OpenAI-compatible server using the following command:
bash scripts/serve_vllm.sh
- Run the benchmark code following the Running benchmarks section. Note that in order to use a different LM, you have to modify the scripts/serve_vllm.sh file to use the desired LM and change the model name in benchmark/bench.py. The prediction and performance results will be saved to the results directory.
- You may also run the code for the ablation study using the following command:
python benchmark/ablation.py --verbose
Note that you also need to manually modify the model name in benchmark/ablation.py for each model to be evaluated.
- Run the Python notebook, notebooks/results.ipynb, to generate plots for the paper results. Note that since ToTR, SCR, and ReAct employ temperature sampling and asynchronous execution, it is difficult to obtain deterministic results. Therefore, your reproduced results may differ slightly from the ones in the paper.
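Before launching the benchmarks, it can help to confirm that both background services from the steps above respond. The following is a minimal sketch and not part of the repository: `is_reachable` is a hypothetical helper, Elasticsearch is assumed to listen on its default port 9200, and the vLLM server is assumed to be reachable on port 8010 (the port used in the SSH tunneling example; check scripts/serve_vllm.sh for the actual value).

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


# Assumed endpoints; adjust hosts and ports to your setup.
SERVICES = {
    # 9200 is the default Elasticsearch HTTP port.
    "Elasticsearch": "http://localhost:9200",
    # /v1/models is served by OpenAI-compatible servers; port 8010 is an assumption.
    "vLLM server": "http://localhost:8010/v1/models",
}

if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_reachable(url) else "NOT reachable"
        print(f"{name} ({url}): {status}")
```

If either service is reported as not reachable, start it before running the benchmark scripts.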
Installation
- Clone this repository.
git clone https://github.com/kaiitunnz/totr.git
cd totr
- Create and activate a Conda environment.
conda create -n totr python=3.11 -y
conda activate totr
- Install Poetry for dependency management.
pip install poetry
- Run the following command to install all the dependencies.
poetry install --with lint,vllm
Note that lint is optional but strongly recommended for code linting and formatting. You can remove vllm if you do not plan to use vLLM as the LLM server.
- Install the spaCy model en_core_web_sm.
python -m spacy download en_core_web_sm
- Set up a RAG database following the Preparing Database for RAG section.
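After installation, a quick import check can catch a missing package before a long run. This is a minimal sketch under stated assumptions, not a script from the repository: `missing_modules` is a hypothetical helper, and the import names `spacy`, `en_core_web_sm`, and `vllm` are assumed to match what the commands above install.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; `vllm` is only needed if you installed the vllm group.
REQUIRED = ["spacy", "en_core_web_sm"]
OPTIONAL = ["vllm"]

if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```

An empty list for the required modules suggests the environment is ready; it does not verify package versions.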
Preparing Database for RAG
- Install Elasticsearch 7.10 (source: IRCoT). See the following options:
MacOS (Homebrew)
# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/brew.html
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full
# if it doesn't work, try 'brew untap elastic/tap' first: untap > tap > install.
To run the server,
brew services start elastic/tap/elasticsearch-full # to start the server
brew services stop elastic/tap/elasticsearch-full # to stop the server
MacOS (wget)
# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz
rm elasticsearch-7.10.2-darwin-x86_64.tar.gz elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
To run the server,
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server
Linux (wget)
# source: https://www.elastic.co/guide/en/elasticsearch/reference/8.1/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
rm elasticsearch-7.10.2-linux-x86_64.tar.gz elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
To run the server,
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server
- Download the datasets using the following command. The downloaded files will be stored in the raw_data directory.
bash scripts/download_raw_data.sh
- Build indices for the downloaded datasets in Elasticsearch. First, ensure that Elasticsearch is running in the background. Then, run the following command:
python -m totr.retriever.build_es_index --all
You can also build indices for specific datasets only, for example:
python -m totr.retriever.build_es_index --datasets hotpotqa multihoprag musique
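Once indexing finishes, you may want to confirm that the expected indices exist. The sketch below uses Elasticsearch's `_cat/indices` API with `format=json`. It is illustrative only: the helper names are hypothetical, the default port 9200 is assumed, and it assumes that each index is named after its dataset, which you should verify against the actual output.

```python
import json
import urllib.request


def index_names(cat_indices_payload: str) -> set:
    """Extract index names from the JSON body of GET /_cat/indices?format=json."""
    return {entry["index"] for entry in json.loads(cat_indices_payload)}


def missing_indices(expected, cat_indices_payload: str) -> list:
    """Return the expected index names that are absent, in their given order."""
    present = index_names(cat_indices_payload)
    return [name for name in expected if name not in present]


def fetch_cat_indices(base_url: str = "http://localhost:9200") -> str:
    """Fetch the raw JSON listing of indices (requires a running Elasticsearch)."""
    url = f"{base_url}/_cat/indices?format=json"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

With Elasticsearch running, `missing_indices(["hotpotqa", "multihoprag", "musique"], fetch_cat_indices())` returns the names that still need to be built, under the naming assumption above.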
Project Structure
- benchmark: Benchmarking tasks, utility code, and baselines' implementation.
- configs: Configuration files for different models, datasets, and systems.
- datasets: Dataset-related files.
- notebooks: Useful notebooks, for example, results.ipynb, which generates plots for the paper results.
- prompts: Collection of prompts used in the experiments.
- results: Benchmark results.
- scripts: Utility scripts for developing and running experiments.
- src: ToTR's implementation.
- tests: Various test cases. Currently contains only basic tests without unit testing. It can serve as a reference for how different functions and classes are used.
Utility scripts
- scripts/format.sh: Script for code linting, formatting, type-checking, and spell-checking.
Usage:
bash scripts/format.sh --all
- scripts/serve_tgi.sh: Script for starting the HuggingFace TGI server.
Usage:
bash scripts/serve_tgi.sh
- scripts/serve_vllm.sh: Script for starting the vLLM server.
Usage:
bash scripts/serve_vllm.sh
- scripts/sbatch_vllm.sh: Script for serving the vLLM server on a Slurm cluster. See the Running LLM server on Slurm cluster section for example usage.
Running LLM server on Slurm cluster
- Log in to your Slurm login node.
- Clone this repository and set up the environment with the following commands:
# Clone this repository
git clone https://github.com/kaiitunnz/totr.git
cd totr
# Create and activate a Conda environment
conda create -n totr python=3.11 -y
conda activate totr
# Install the dependencies for running an LLM server
pip install poetry
poetry install --only vllm
- Submit a batch job using the following command. You may need to set appropriate arguments for sbatch in scripts/sbatch_vllm.sh.
sbatch scripts/sbatch_vllm.sh
- Check your allocated node with the following command:
squeue -u $USER
- Log out from the Slurm login node and start SSH tunneling with the following command:
ssh -L 8010:<gpu-node>:8010 <user>@<login-node-address>
or in the background with the following command (in this case, you need to kill the process yourself):
ssh -fN -L 8010:<gpu-node>:8010 <user>@<login-node-address>
- Now you can run benchmarking scripts or connect to the LLM server from your local host at the following address: http://localhost:8010.
- To stop the LLM server before it times out, run the following command with the appropriate job-id obtained from the squeue command.
scancel <job-id>
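The node lookup and tunnel steps above can be sketched in code. This is a rough illustration, not part of the repository: `running_nodes` and `tunnel_command` are hypothetical helpers, the parser assumes the default squeue column layout (JOBID, PARTITION, NAME, USER, ST, TIME, NODES, NODELIST(REASON)) and job names without spaces, and port 8010 is taken from the tunneling commands above.

```python
def running_nodes(squeue_output: str) -> dict:
    """Map job ID to node list for jobs in the running (R) state."""
    nodes = {}
    for line in squeue_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Default layout: field 4 is the job state, the last field is the node list.
        if len(fields) >= 8 and fields[4] == "R":
            nodes[fields[0]] = fields[-1]
    return nodes


def tunnel_command(gpu_node, user, login_node, port=8010, background=False):
    """Build the ssh port-forwarding command as an argument list."""
    flags = ["-fN"] if background else []
    return ["ssh", *flags, "-L", f"{port}:{gpu_node}:{port}", f"{user}@{login_node}"]


if __name__ == "__main__":
    # Made-up squeue output for illustration only.
    sample = (
        "JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n"
        "123 gpu vllm alice R 5:02 1 gpu07\n"
    )
    for job_id, node in running_nodes(sample).items():
        print(job_id, " ".join(tunnel_command(node, "alice", "login.example.org")))
```

Pending jobs are skipped because their last column holds a reason such as (Priority) instead of a node name.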
Running benchmarks
Run the following command:
python benchmark/bench.py --verbose --test
The benchmark results will be saved to the results directory. You may omit the --test flag if you want to perform validation instead.