Code for the IJCAI 2024 paper "MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge"
google-ai-generativelanguage==0.4.0
google-api-core==2.15.0
google-generativeai==0.3.1
googleapis-common-protos==1.62.0
huggingface-hub==0.20.2
numpy==1.25.1
openai==0.27.8
scikit-learn==1.3.0
scipy==1.11.1
tokenizers==0.15.0
torch==2.0.1
tqdm==4.65.0
transformers==4.36.2
Please follow the instructions below to reproduce our experiments. Results can be found in the results/medqa directory.
Please fill in your API key (and API base) in the corresponding code first; the sketch after the commands below shows the relevant lines. Note that running all GPT-3.5-turbo experiments under the CoT+SC setting is very expensive (~$2,400).
python evaluate_gpt_medqa_ao.py
python evaluate_gpt_medqa_cotsc.py
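For reference, the pinned openai==0.27.8 sets credentials module-wide, and self-consistency is typically implemented by sampling several reasoning chains and majority-voting the answers, which is why the CoT+SC setting multiplies cost. A minimal sketch, assuming placeholder key, prompt, and sample count (not the exact values used by our scripts):

```python
import openai

openai.api_key = "YOUR_API_KEY"      # fill in your key here
# openai.api_base = "https://..."    # only needed if you use a custom API base

# CoT+SC sketch: sample several chains of thought at nonzero temperature,
# then majority-vote the extracted final answers.
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "<question> Let's think step by step."}],
    n=5,
    temperature=0.7,
)
answers = [choice["message"]["content"] for choice in resp["choices"]]
```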
Please fill in your API key in the corresponding code first. The usage of Gemini-Pro can be found here.
python evaluate_gemini_medqa_ao.py
python evaluate_gemini_medqa_cotsc.py
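With the pinned google-generativeai==0.3.1, configuring and querying Gemini-Pro looks roughly like the following; the key and prompt are placeholders, and the scripts' actual prompting may differ:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # fill in your key here
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("<question>")
print(response.text)
```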
You may first download the corresponding LLMs here. Some LLMs (Llama 2, Med42, etc.) are gated and require authorization; please make sure you have been granted the corresponding access.
CUDA_VISIBLE_DEVICES=X python evaluate_hf_medqa_ao.py --model [model path] --model_name [model name]
CUDA_VISIBLE_DEVICES=X python evaluate_hf_medqa_cotsc.py --model [model path] --model_name [model name]
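For orientation, a minimal sketch of loading a local model with the pinned transformers==4.36.2; the model path is an illustrative example (pass yours via --model), and the scripts' exact loading code may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"   # example; pass yours via --model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # halves memory vs. fp32
    device_map="auto",          # shards across GPUs in CUDA_VISIBLE_DEVICES (needs accelerate)
)
```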
For the Answer-only setting, run
bash eval_ao.sh
For the CoT+SC setting, run the following command instead:
bash eval_cot.sh
You may also generate the MultiMedQA dataset from the original MedQA dataset. To do so, first download the MedQA dataset and place it in this repo, then run
python recognize_and_rewrite_medqa.py
to parse the questions. After that, run
python gen_medqa_questions.py
to generate the questions. Make sure that you have downloaded the MedCAT toolkit and replaced the placeholder in the script with the correct path; the MedCAT documentation can be found here. You also need to prepare a JSON file named "concept_attributes.json" that contains a dictionary mapping each UMLS CUI to a list of medical synonyms. Due to copyright restrictions, we do not provide this file; please build it yourself after obtaining UMLS access rights.
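To illustrate the expected format of that file, a minimal sketch that writes it; the CUIs and synonym lists below are illustrative examples only, not shipped data:

```python
import json

# Illustrative entries only -- populate these from your licensed UMLS
# installation (e.g., the strings listed for each CUI in MRCONSO.RRF).
concept_attributes = {
    "C0011849": ["diabetes mellitus", "DM"],
    "C0020538": ["hypertensive disease", "high blood pressure", "hypertension"],
}

with open("concept_attributes.json", "w") as f:
    json.dump(concept_attributes, f, indent=2)
```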
Approximate GPU requirements for inference:
7B inference: 1× RTX 4090 (24GB)
13B inference: 2× RTX 4090 (24GB) or 1× A800 (80GB)
70B inference: 2× A800 (80GB)