Commit 68a5d31

hkiyomaru and ytivy authored
Add scripts for g-leaderboard (GENIAC official evaluation) (#31)
Co-authored-by: YumaTsuta <[email protected]>
1 parent 861072f commit 68a5d31

7 files changed: +356 −0 lines changed
evaluation/installers/g-leaderboard/README.md (86 additions, 0 deletions)

# LLM Evaluation using g-leaderboard (GENIAC Official Evaluation)

This repository contains scripts for evaluating LLMs using [g-leaderboard](https://github.com/wandb/llm-leaderboard/tree/g-leaderboard).

## Usage

### Build

Clone this repository and move to the installation directory.

```bash
git clone https://github.com/llm-jp/scripts
cd scripts/evaluation/installers/g-leaderboard
```

Then, run the installation script.
The following command creates an installation directory under the specified path (here, `~/g-leaderboard`).

```bash
# NOTE: Using a CPU node is recommended, as the installation process doesn't require GPUs.

# For a cluster with SLURM
sbatch --partition {partition} install.sh ~/g-leaderboard

# For a cluster without SLURM
bash install.sh ~/g-leaderboard > logs/install.out 2> logs/install.err
```

After the installation is complete, log in to your wandb and Hugging Face accounts.

```shell
cd ~/g-leaderboard
source environment/venv/bin/activate
wandb login
huggingface-cli login
```

### Contents of the installed directory (~/g-leaderboard)

The following directory structure is created after installation.

```
~/g-leaderboard/
    run_g-leaderboard.sh      Script for running g-leaderboard
    logs/                     Log files written by SLURM jobs
    resources/
        config_base.yaml      Configuration file template
    environment/
        installer_envvar.log  List of environment variables recorded during installation
        install.sh            Installation script
        python/               Python built from source
        scripts/              Scripts for environment settings
        src/                  Downloaded libraries
        venv/                 Python virtual environment (linked to python/)
```

### Evaluation

The evaluation script takes the model path and the wandb run name as arguments.
For all other settings, edit the configuration file `resources/config_base.yaml` and/or `run_g-leaderboard.sh`.

- To change the tokenizer, wandb entity, and/or wandb project: edit `run_g-leaderboard.sh`.
- Otherwise: edit `resources/config_base.yaml` and `run_g-leaderboard.sh`.

```shell
cd ~/g-leaderboard

# For a cluster with SLURM
AZURE_OPENAI_ENDPOINT=xxx AZURE_OPENAI_KEY=xxx sbatch --partition {partition} run_g-leaderboard.sh {path/to/model} {wandb.run_name}

# For a cluster without SLURM
CUDA_VISIBLE_DEVICES=<num> AZURE_OPENAI_ENDPOINT=xxx AZURE_OPENAI_KEY=xxx bash run_g-leaderboard.sh {path/to/model} {wandb.run_name}
```

#### Sample code

```shell
# For a cluster with SLURM
AZURE_OPENAI_ENDPOINT=xxx AZURE_OPENAI_KEY=xxx sbatch --partition {partition} run_g-leaderboard.sh llm-jp/llm-jp-13b-v2.0 g-leaderboard-$(whoami)

# For a cluster without SLURM
AZURE_OPENAI_ENDPOINT=xxx AZURE_OPENAI_KEY=xxx bash run_g-leaderboard.sh llm-jp/llm-jp-13b-v2.0 g-leaderboard-$(whoami)
```

### About the Azure OpenAI API

To conduct an evaluation, you must configure the Azure OpenAI API by setting the endpoint and key for the deployment named `gpt-4`, which corresponds to `gpt-4-0613`. Please contact the administrator to obtain the necessary endpoint and key.
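One optional convenience (a hypothetical pattern, not part of the installed scripts) is to keep the credentials in a private file and source it before submitting jobs, instead of typing them inline each time:

```shell
# Hypothetical helper: store the Azure OpenAI credentials in a private
# file once, then source it before each job submission.
# The values "xxx" are placeholders for the real endpoint and key.
cat > ~/.azure_openai.env << 'EOF'
export AZURE_OPENAI_ENDPOINT=xxx
export AZURE_OPENAI_KEY=xxx
EOF
chmod 600 ~/.azure_openai.env  # keep the key readable only by you

source ~/.azure_openai.env
# The variables are now set for subsequent sbatch/bash invocations, e.g.:
# sbatch --partition {partition} run_g-leaderboard.sh {path/to/model} {wandb.run_name}
```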
evaluation/installers/g-leaderboard/install.sh (89 additions, 0 deletions)

```bash
#!/bin/bash
#
# g-leaderboard installation script
#
# This script uses only CPU resources on a cluster.
# - In a SLURM environment, it is recommended to use CPU nodes.
#
# Usage:
#   On a cluster with SLURM:
#     Run `sbatch --partition {partition} install.sh TARGET_DIR`
#   On a cluster without SLURM:
#     Run `bash install.sh TARGET_DIR > logs/install-eval.out 2> logs/install-eval.err`
#   - TARGET_DIR: Installation directory
#
#SBATCH --job-name=install-g-leaderboard
#SBATCH --partition={FIX_ME}
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -eux -o pipefail

if [ $# -ne 1 ]; then
  set +x
  >&2 echo Usage: sbatch \(or bash\) install.sh TARGET_DIR
  exit 1
fi

INSTALLER_DIR=$(pwd)
TARGET_DIR=$1
INSTALLER_COMMON=$INSTALLER_DIR/../../../common/installers.sh

>&2 echo INSTALLER_DIR=$INSTALLER_DIR
>&2 echo TARGET_DIR=$TARGET_DIR
>&2 echo INSTALLER_COMMON=$INSTALLER_COMMON
source $INSTALLER_COMMON

mkdir -p $TARGET_DIR
pushd $TARGET_DIR

# Copy basic scripts for g-leaderboard
cp ${INSTALLER_DIR}/scripts/run_g-leaderboard.sh .
mkdir resources
cp ${INSTALLER_DIR}/resources/config_base.yaml resources/
mkdir logs

ENV_DIR=${TARGET_DIR}/environment
mkdir $ENV_DIR
pushd $ENV_DIR

# Copy environment scripts
cp ${INSTALLER_DIR}/install.sh .
mkdir scripts

# Create environment.sh
BASE_ENV_SHELL=${INSTALLER_DIR}/scripts/environment.sh
NEW_ENV_SHELL=scripts/environment.sh
cp $BASE_ENV_SHELL $NEW_ENV_SHELL

source $NEW_ENV_SHELL

# Record current environment variables
set > installer_envvar.log

# src is used to store all resources for from-scratch builds
mkdir src
pushd src

# Install Python (function defined in $INSTALLER_COMMON)
install_python v${PYTHON_VERSION} ${ENV_DIR}/python
popd  # $ENV_DIR

# Prepare venv
python/bin/python3 -m venv venv
source venv/bin/activate

# Install g-leaderboard
pushd src
git clone https://github.com/wandb/llm-leaderboard g-leaderboard -b g-leaderboard
pushd g-leaderboard
pip install --no-cache-dir -r requirements.txt

# Deploy the blended run config
BLENDED_RUN_CONFIG=${INSTALLER_DIR}/resources/blended_run_config.yaml
cp $BLENDED_RUN_CONFIG blend_run_configs/config.yaml

echo "Installation done." | tee >(cat >&2)
```
evaluation/installers/g-leaderboard/logs/.gitignore (4 additions, 0 deletions)

```
# Ignore everything in this directory
*
# Except this file
!.gitignore
```
evaluation/installers/g-leaderboard/resources/blended_run_config.yaml (25 additions, 0 deletions)

```yaml
run_chain: false  # If you want to reuse past evaluation results in a new run, set this to true.

new_run:  # This setting is for blending runs without running new evaluations. If run_chain is set to true, this setting is disabled.
  entity: "your/WANDB/entity"
  project: "your/WANDB/project"
  run_name: "your/WANDB/run_name"

old_runs:  # Specify the tasks you want to carry over from past runs. Multiple runs are permissible.
  - run_path: "your/WANDB/run_path"
    tasks:  # The list of tasks to take over. Comment out tasks that do not need to be taken over.
      - jaster_ja_0_shot
      - jaster_ja_4_shot
      - jaster_en_0_shot
      - jaster_en_4_shot
      - mtbench_ja
      - mtbench_en
  # - run_path: "your/WANDB/run_path"
  #   tasks:
  #     - jaster_ja_0_shot
  #     - jaster_ja_4_shot
  #     - jaster_en_0_shot
  #     - jaster_en_4_shot
  #     - mtbench_ja
  #     - mtbench_en
```
evaluation/installers/g-leaderboard/resources/config_base.yaml (86 additions, 0 deletions)

```yaml
testmode: false  # If you want to test with a small amount of data, set this to true.
model_name: "<<WANDB_RUN_NAME>>"  # will be used in the leaderboard table

wandb:
  entity: "<<WANDB_ENTITY>>"
  project: "<<WANDB_PROJECT>>"
  run_name: "<<WANDB_RUN_NAME>>"  # used as the name of the run in the leaderboard; can be changed later

# Tasks to run
run_llm_jp_eval_ja_0_shot: true
run_llm_jp_eval_ja_few_shots: true
run_llm_jp_eval_en_0_shot: true
run_llm_jp_eval_en_few_shots: true
run_mt_bench_ja: true
run_mt_bench_en: true

model:
  api: false  # if you don't use an API, set "api" to false; if you do, select from "openai", "anthoropic", "google", "cohere", "mistral", "amazon_bedrock"
  use_wandb_artifacts: false  # if you use wandb artifacts, set this to true
  artifacts_path: null  # if you use wandb artifacts, paste the link; if not, leave it as null
  pretrained_model_name_or_path: "<<MODEL>>"  # if you use the OpenAI API, put the name of the model
  device_map: "auto"
  load_in_8bit: false
  load_in_4bit: false

# for llm-jp-eval
llm_jp_eval:
  max_seq_length: 4096
  target_dataset: "all"  # {all, jamp, janli, jcommonsenseqa, jemhopqa, jnli, jsem, jsick, jsquad, jsts, niilc, chabsa, mmlu_en}
  ja_num_shots: 4  # if run_llm_jp_eval_ja_few_shots is true, set the number of few shots (default: 4)
  en_num_shots: 4  # if run_llm_jp_eval_en_few_shots is true, set the number of few shots (default: 4)
  torch_dtype: "bf16"  # {fp16, bf16, fp32}
  # Items that do not need to be changed unless specifically intended.
  dataset_artifact: "wandb-japan/llm-leaderboard/jaster:v11"
  dataset_dir: "/jaster/1.2.6/evaluation/test"
  ja:
    custom_prompt_template: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 入力:\n{input}\n\n### 応答:\n"
    custom_fewshots_template: "\n\n### 入力:\n{input}\n\n### 応答:\n{output}"
  en:
    custom_prompt_template: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 入力:\n{input}\n\n### 応答:\n"
    custom_fewshots_template: "\n\n### 入力:\n{input}\n\n### 応答:\n{output}"

# for mtbench
mtbench:
  model_id: "<<WANDB_RUN_NAME>>"  # cannot use '<', '>', ':', '"', '/', '\\', '|', '?', '*', '.'
  max_new_token: 1024
  num_gpus_per_model: 8
  num_gpus_total: 8
  max_gpu_memory: null
  dtype: bfloat16  # None or float32 or float16 or bfloat16
  use_azure: true  # if you use the Azure OpenAI service for evaluation, set this to true
  # for conv template
  custom_conv_template: true
  # the following variables are used when custom_conv_template is set to true
  conv_name: "custom"
  conv_sep: "\n\n### "
  conv_stop_token_ids: "[2]"
  conv_stop_str: "###"
  conv_role_message_separator: ":\n"
  conv_role_only_separator: ":\n"
  ja:
    conv_system_message: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    conv_roles: "('指示', '応答')"
  en:
    conv_system_message: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    conv_roles: "('指示', '応答')"
  dataset:  # Items that do not need to be changed unless specifically intended.
    ja:
      question_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_ja_question:v3"
      test_question_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_ja_question_small_for_test:v5"
      referenceanswer_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v1"
      test_referenceanswer_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer_small_for_test:v1"
      judge_prompt_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1"
      bench_name: "mt_bench_ja"
    en:
      question_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_en_question:v0"
      test_question_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_en_question_small_for_test:v0"
      referenceanswer_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_en_referenceanswer:v0"
      test_referenceanswer_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_en_referenceanswer_small_for_test:v0"
      judge_prompt_artifacts_path: "wandb-japan/llm-leaderboard/mtbench_en_prompt:v0"
      bench_name: "mt_bench_en"

#==================================================================
# Items that do not need to be changed unless specifically intended.
#==================================================================
github_version: g-eval-v1.0  # for recording
```
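Every `<<...>>` placeholder in the template must be substituted before the config is usable. One quick sanity check (a hypothetical helper, not part of this commit) is to grep a generated config for leftover placeholders:

```shell
# Hypothetical sanity check: detect unreplaced <<VAR>> placeholders in a
# generated config. A throwaway file is used here for illustration; the
# second line deliberately still contains a placeholder.
CONFIG=$(mktemp)
cat > "$CONFIG" << 'EOF'
model_name: "my-run"
pretrained_model_name_or_path: "<<MODEL>>"
EOF

if grep -n '<<[A-Z_]*>>' "$CONFIG"; then
  echo "ERROR: unreplaced placeholders remain in $CONFIG" >&2
  STATUS=1
else
  STATUS=0
fi
```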
evaluation/installers/g-leaderboard/scripts/environment.sh (4 additions, 0 deletions)

```bash
# List of environment variables and module loads for g-leaderboard

export LANG=ja_JP.UTF-8
export PYTHON_VERSION=3.10.14
```
evaluation/installers/g-leaderboard/scripts/run_g-leaderboard.sh (62 additions, 0 deletions)

```bash
#!/bin/bash
#SBATCH --job-name=g-leaderboard
#SBATCH --partition=<partition>
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --gpus=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -eux

# Raise the open-file limit
ulimit -n 65536 1048576

ENV_DIR=environment
source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

# Arguments
MODEL=$1
WANDB_RUN_NAME=$2

# Semi-fixed vars
CONFIG_TEMPLATE=resources/config_base.yaml
TOKENIZER=$MODEL
WANDB_ENTITY=llm-jp-eval
WANDB_PROJECT=test

# Fixed vars
G_LEADERBOARD_DIR=${ENV_DIR}/src/g-leaderboard
CONFIG_DIR=${G_LEADERBOARD_DIR}/configs

# Config settings
NEW_CONFIG=${CONFIG_DIR}/config.${WANDB_PROJECT}.${WANDB_RUN_NAME}.yaml
REPLACE_VARS=("MODEL" "TOKENIZER" "WANDB_ENTITY" "WANDB_PROJECT" "WANDB_RUN_NAME")

# Create a new config file to preserve the config of each run
cp $CONFIG_TEMPLATE $NEW_CONFIG

# Replace variables
for VAR in "${REPLACE_VARS[@]}"; do
  VALUE=$(eval echo \${$VAR})
  sed -i "s|<<${VAR}>>|${VALUE}|g" $NEW_CONFIG
done

# Create a temporary project directory
# NOTE: This is necessary to avoid using incorrect configurations when running multiple jobs at the same time.
TMP_G_LEADERBOARD_DIR=$(mktemp -d "${ENV_DIR}/src/g-leaderboard.XXXXXXXX")
cp -r $G_LEADERBOARD_DIR/* $TMP_G_LEADERBOARD_DIR
cp $NEW_CONFIG $TMP_G_LEADERBOARD_DIR/configs/config.yaml

# Run g-leaderboard
SCRIPT_PATH=scripts/run_eval.py
pushd $TMP_G_LEADERBOARD_DIR
python $SCRIPT_PATH

# Clean up
popd
rm -rf $TMP_G_LEADERBOARD_DIR

echo "Done"
```
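The placeholder substitution that run_g-leaderboard.sh performs with `eval` and `sed` can be exercised in isolation. The sketch below uses a throwaway template instead of `config_base.yaml`; the file contents and values are illustrative only:

```shell
# Minimal sketch of the <<VAR>> substitution used by run_g-leaderboard.sh,
# applied to a throwaway template file instead of config_base.yaml.
CONFIG_TEMPLATE=$(mktemp)
cat > "$CONFIG_TEMPLATE" << 'EOF'
pretrained_model_name_or_path: "<<MODEL>>"
run_name: "<<WANDB_RUN_NAME>>"
EOF

MODEL="llm-jp/llm-jp-13b-v2.0"
WANDB_RUN_NAME="demo-run"

REPLACE_VARS=("MODEL" "WANDB_RUN_NAME")
for VAR in "${REPLACE_VARS[@]}"; do
  VALUE=$(eval echo \${$VAR})                  # indirect lookup: $MODEL, then $WANDB_RUN_NAME
  sed -i "s|<<${VAR}>>|${VALUE}|g" "$CONFIG_TEMPLATE"
done

cat "$CONFIG_TEMPLATE"
# → pretrained_model_name_or_path: "llm-jp/llm-jp-13b-v2.0"
# → run_name: "demo-run"
```

Note that `|` is used as the `sed` delimiter so that values containing `/` (such as model paths) do not break the substitution.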
