Code for the IJCAI 2024 paper "MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge"
google-ai-generativelanguage==0.4.0
google-api-core==2.15.0
google-generativeai==0.3.1
googleapis-common-protos==1.62.0
huggingface-hub==0.20.2
numpy==1.25.1
openai==0.27.8
scikit-learn==1.3.0
scipy==1.11.1
tokenizers==0.15.0
torch==2.0.1
tqdm==4.65.0
transformers==4.36.2
Please follow the instructions below to reproduce our experiments. Results can be found in the results/medqa directory.
Please fill in your API key (and API base) in the corresponding code first; the sketch after the commands below shows the relevant lines. Note that running all GPT-3.5-turbo experiments under the CoT+SC setting is very expensive (~$2,400).
python evaluate_gpt_medqa_ao.py
python evaluate_gpt_medqa_cotsc.py
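For reference, the pinned openai==0.27.8 sets credentials module-wide, and self-consistency is typically implemented by sampling several reasoning chains and majority-voting the answers, which is why the CoT+SC setting multiplies cost. A minimal sketch, assuming placeholder key, prompt, and sample count (not the exact values used by our scripts):

```python
import openai

openai.api_key = "YOUR_API_KEY"      # fill in your key here
# openai.api_base = "https://..."    # only needed if you use a custom API base

# CoT+SC sketch: sample several chains of thought at nonzero temperature,
# then majority-vote the extracted final answers.
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "<question> Let's think step by step."}],
    n=5,
    temperature=0.7,
)
answers = [choice["message"]["content"] for choice in resp["choices"]]
```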
Please fill in your API key in the corresponding code first. The usage of Gemini-Pro can be found here.
python evaluate_gemini_medqa_ao.py
python evaluate_gemini_medqa_cotsc.py
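With the pinned google-generativeai==0.3.1, configuring and querying Gemini-Pro looks roughly like the following; the key and prompt are placeholders, and the scripts' actual prompting may differ:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # fill in your key here
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("<question>")
print(response.text)
```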
You may first download the corresponding LLMs here. Some LLMs (Llama 2, Med42, etc.) are gated and require authorization; please make sure you have been granted the corresponding access.
CUDA_VISIBLE_DEVICES=X python evaluate_hf_medqa_ao.py --model [model path] --model_name [model name]
CUDA_VISIBLE_DEVICES=X python evaluate_hf_medqa_cotsc.py --model [model path] --model_name [model name]
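For orientation, a minimal sketch of loading a local model with the pinned transformers==4.36.2; the model path is an illustrative example (pass yours via --model), and the scripts' exact loading code may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"   # example; pass yours via --model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # halves memory vs. fp32
    device_map="auto",          # shards across GPUs in CUDA_VISIBLE_DEVICES (needs accelerate)
)
```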
For the Answer-only setting, run
bash eval_ao.sh
For the CoT+SC setting, run the following command instead:
bash eval_cot.sh
You may also generate the MultiMedQA dataset from the original MedQA dataset. To do so, first download the MedQA dataset and place it in this repo, then run
python recognize_and_rewrite_medqa.py
to parse the questions. After that, run
python gen_medqa_questions.py
to generate the questions. Make sure that you have downloaded the MedCAT toolkit and replaced the placeholder in the script with the correct path; the MedCAT documentation can be found here. You also need to prepare a JSON file named "concept_attributes.json" that contains a dictionary mapping each UMLS CUI to a list of medical synonyms. Due to copyright restrictions, we do not provide this file; please build it yourself after obtaining UMLS access rights.
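To illustrate the expected format of that file, a minimal sketch that writes it; the CUIs and synonym lists below are illustrative examples only, not shipped data:

```python
import json

# Illustrative entries only -- populate these from your licensed UMLS
# installation (e.g., the strings listed for each CUI in MRCONSO.RRF).
concept_attributes = {
    "C0011849": ["diabetes mellitus", "DM"],
    "C0020538": ["hypertensive disease", "high blood pressure", "hypertension"],
}

with open("concept_attributes.json", "w") as f:
    json.dump(concept_attributes, f, indent=2)
```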
Approximate GPU requirements for inference:
7B inference: 1× RTX 4090 (24GB)
13B inference: 2× RTX 4090 (24GB) or 1× A800 (80GB)
70B inference: 2× A800 (80GB)