fmwork

⚠️ The code is currently being restructured. Release v0.1.0 contains a basic version that supports our ongoing (internal) sweeps. Version 1.0.0 (coming soon) should adopt the new structure, which is intended to better support different hardware and software options.

Quick start

Install conda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Create environment and install deps:

conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllm

Get a model (e.g., https://huggingface.co/ibm-granite/granite-8b-code-base-128k):

pip install huggingface-hub
huggingface-cli download --cache-dir ./ --local-dir-use-symlinks False --revision main --local-dir models/granite-8b ibm-granite/granite-8b-code-base-128k

Clone repo and run experiment:

git clone git@github.com:IBM/fmwork.git
./fmwork/infer/vllm/driver --model_path models/granite-8b --input_size 1024 --output_size 1024 --batch_size 1,2,4 --tensor_parallel 1
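
When scripting several runs, it can help to assemble the driver command programmatically. The following is a minimal sketch that uses only the flags shown in the command above; the helper name `build_driver_command` is hypothetical and is not part of fmwork.

```python
import shlex

def build_driver_command(model_path, input_size, output_size, batch_sizes,
                         tensor_parallel, driver="./fmwork/infer/vllm/driver"):
    """Return the argv list for one driver invocation.

    Only the flags that appear in the quick-start command are used.
    """
    return [
        driver,
        "--model_path", str(model_path),
        "--input_size", str(input_size),
        "--output_size", str(output_size),
        # The quick-start command passes batch sizes as a comma-separated list.
        "--batch_size", ",".join(str(b) for b in batch_sizes),
        "--tensor_parallel", str(tensor_parallel),
    ]

cmd = build_driver_command("models/granite-8b", 1024, 1024, [1, 2, 4], 1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the experiment.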

This should produce blocks of output like:

--------------------------------------------------------------------------------
RUN 1024 / 1024 / 1 / 1
--------------------------------------------------------------------------------

FMWORK REP   1 /   3 : 1727375968.424120936 1727375976.598311213 8.174 8.0 125.3
FMWORK REP   2 /   3 : 1727375976.598364287 1727375984.859228127 8.261 8.1 124.0
FMWORK REP   3 /   3 : 1727375984.859270605 1727375993.005784506 8.147 8.0 125.7

FMWORK RES 20240926-183953.009140 1024 1024 1 1 8.204 8.0 124.8

Input size                = 1024
Output size               = 1024
Batch size                = 1
Tensor parallelism        = 1
Median iteration time (s) = 8.204
Inter-token latency (ms)  = 8.0
Throughput (tok/s)        = 124.8

FMWORK REP lines contain stats per experiment repetition (3 repetitions by default):
- Repetition number
- Total repetitions to run
- Timestamp of rep start
- Timestamp of rep end
- Duration of rep (seconds)
- Inter-token latency for rep (milliseconds per token)
- Throughput for rep (tokens per second)
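
The last two fields appear to follow directly from the duration. The sketch below reproduces the sample REP values under the assumption that inter-token latency is the duration divided by the output size, and that throughput is the number of generated tokens divided by the duration. The `batch_size` factor is an assumption, since the sample above only shows a batch size of 1; `rep_metrics` is a hypothetical helper, not a function from fmwork.

```python
def rep_metrics(duration_s, output_size, batch_size):
    """Derive per-repetition metrics from the repetition duration.

    Assumed formulas (consistent with the sample output, not confirmed
    against the fmwork source):
      inter-token latency (ms) = duration / output_size * 1000
      throughput (tok/s)       = batch_size * output_size / duration
    """
    itl_ms = duration_s / output_size * 1000.0
    throughput = batch_size * output_size / duration_s
    return round(itl_ms, 1), round(throughput, 1)

# Durations taken from the three sample REP lines.
for duration in (8.174, 8.261, 8.147):
    print(duration, rep_metrics(duration, output_size=1024, batch_size=1))
```

For the three sample durations this yields (8.0, 125.3), (8.1, 124.0), and (8.0, 125.7), matching the REP lines shown earlier.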

The FMWORK RES line contains a summary of the experiment:
- Experiment timestamp
- Input size
- Output size
- Batch size
- Tensor parallelism size
- Median iteration duration (seconds)
- Inter-token latency (milliseconds per token)
- Throughput (tokens per second)
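
For analysis beyond grep, the RES fields listed above can be parsed into a typed record. This is a minimal sketch that assumes the fields are whitespace-separated and appear in the order listed; `ResRecord` and `parse_res_line` are hypothetical names, not part of fmwork.

```python
from typing import NamedTuple, Optional

class ResRecord(NamedTuple):
    timestamp: str
    input_size: int
    output_size: int
    batch_size: int
    tensor_parallel: int
    median_time_s: float
    itl_ms: float
    throughput_tok_s: float

def parse_res_line(line: str) -> Optional[ResRecord]:
    """Parse one 'FMWORK RES ...' line; return None for any other line."""
    marker = "FMWORK RES"
    pos = line.find(marker)
    if pos < 0:
        return None
    # Everything after the marker is expected to be the eight summary fields.
    fields = line[pos + len(marker):].split()
    if len(fields) < 8:
        return None
    ts, inp, out, bs, tp, med, itl, thr = fields[:8]
    return ResRecord(ts, int(inp), int(out), int(bs), int(tp),
                     float(med), float(itl), float(thr))

rec = parse_res_line(
    "FMWORK RES 20240926-183953.009140 1024 1024 1 1 8.204 8.0 124.8"
)
print(rec.batch_size, rec.median_time_s, rec.throughput_tok_s)
```

Because the marker is located with `find`, the parser also accepts lines that carry a filename prefix, such as those produced by `grep -R`. As a side observation, the sample summary value of 8.204 s equals the median of the second and third repetition durations (8.261 and 8.147), which may indicate that the first repetition is treated as warm-up; this is inferred from the sample numbers only.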

If the output is saved to files, all RES lines can easily be extracted with grep for further analysis:

grep -R "FMWORK RES" outputs/ | tr / ' ' | column -t
