LLaMa/RWKV onnx

🚀 Please read this issue for LLaMa GPU inference

Download onnx models here:

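| Model | Precision | Size | URL | Demo |
|---|---|---|---|---|
| LLaMa-7B | fp32 | 26GB | huggingface | demo_llama.py |
| LLaMa-7B | fp16 | 13GB | huggingface or 硬件模型库 (hardware model library) | demo_llama.py |
| RWKV-4-palm-430M | fp16 | 920MB | huggingface or 硬件模型库 (hardware model library) | demo_rwkv.py |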

News

05/18 release RWKV-4 onnx models, standalone script and LLM structure comparison

05/09 TensorRT outputs wrong values until issue 2928 is solved

04/19 remove GPTQ zero point guidance

04/18 export mixed-precision quant table from GPTQ-for-LLaMa

04/11 add 13GB onnx-fp16 models

04/11 add memory pool, support 2GB RAM laptop ⭐

04/10 reduce onnx model size to 26GB

04/10 support temperature, add top-k logits warping

04/07 add onnxruntime demo

04/05 init project

Features

  • Release LLaMa-7B and RWKV-400M onnx models and their standalone onnxruntime demos
  • No torch or transformers required
  • Support memory pool, works on a 2GB laptop/PC (very slow 🐢)

Why do this?

  1. Visualization. graphviz crashes on the LLaMa model. An LLM visualization tool must support nesting or operator folding
  2. Quantization. An LLM's structure often repeats itself, like a fractal. For LLaMa quantization, loading one part of the decoder backbone (400MB) is enough, so the model can be quantized part by part
  3. Embedded devices. On small boards, IO errors occur when using dd on a single big file
  4. Distributed systems. Running LLM inference across many hybrid (FPGA/NPU/GPGPU) devices becomes simple
  5. onnx tools. Device manufacturers already support onnx well, so there is no reason to neglect it

Usage

Here is the graph to call LLaMa (RWKV is similar):

Try the LLaMa onnxruntime demo. No torch is required, and the precision has been checked.

$ python3 -m pip install -r requirements.txt
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
..
# If you only have 4GB memory, use `--poolsize`
$ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
..
Bonjour.

# Try more options
$ python3 demo_llama.py --help
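
The `--poolsize` option caps how much model data stays in memory at once. The following is only a minimal sketch of that idea, assuming a least-recently-used eviction policy; the source does not say which policy demo_llama.py uses. The names `MemoryPool`, `loader`, and `fake_loader` are hypothetical and do not come from this repository.

```python
from collections import OrderedDict
from typing import Any, Callable, List, Tuple


class MemoryPool:
    """Keep loaded model parts under a byte budget, evicting the least recently used.

    Hypothetical illustration; not the implementation used by demo_llama.py.
    """

    def __init__(self, budget_bytes: int, loader: Callable[[str], Tuple[Any, int]]):
        self.budget_bytes = budget_bytes
        self.loader = loader  # returns (loaded_object, size_in_bytes) for a name
        self.used_bytes = 0
        self._cache: "OrderedDict[str, Tuple[Any, int]]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name][0]
        obj, size = self.loader(name)
        # Evict the oldest entries until the new part fits (or the pool is empty).
        while self._cache and self.used_bytes + size > self.budget_bytes:
            _, (_, freed) = self._cache.popitem(last=False)
            self.used_bytes -= freed
        self._cache[name] = (obj, size)
        self.used_bytes += size
        return obj

    def loaded(self) -> List[str]:
        """Names currently held, from least to most recently used."""
        return list(self._cache)


# Usage with a stand-in loader; a real loader might create an onnxruntime session.
def fake_loader(name: str) -> Tuple[str, int]:
    return f"session:{name}", 400  # pretend every part needs 400 bytes


pool = MemoryPool(budget_bytes=1000, loader=fake_loader)
for part in ["decoder-0", "decoder-1", "decoder-0", "decoder-2"]:
    pool.get(part)
print(pool.loaded())  # ['decoder-0', 'decoder-2']
```

A smaller budget trades speed for memory: every eviction means the part has to be read from disk again the next time it is needed, which fits the "very slow" caveat for low-memory machines.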

Use demo_rwkv.py to run RWKV:

$ python3 demo_rwkv.py ${FP16_ONNX_DIR}

Export RWKV onnx

  1. git clone RWKV and download its models
  2. copy onnx_RWKV_in_150_lines.py to ChatRWKV
$ git clone https://github.com/BlinkDL/ChatRWKV --depth=1
$ cp llama.onnx/tools/onnx_RWKV_in_150_lines.py  ChatRWKV
$ cd ChatRWKV
$ mkdir models
$ python3 onnx_RWKV_in_150_lines.py

You will then get the onnx files.

$ ls -lah models
..

Export LLaMa onnx

STEP1 Convert to HF format

These models were converted from alpaca huggingface.

  • If you are using LLaMa or llama.cpp, convert it to HF format first. Here are the steps:

    # install transformers master
    $ git clone https://github.com/huggingface/transformers
    $ cd transformers && python3 setup.py install
    ..
    # run from the transformers repository root
    $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py  --input_dir ${LLaMa_PATH}  --model_size 7B  --output_dir ${HF_PATH}
  • If you are using alpaca-lora, use this script to merge LoRA weights.

  • If you are using alpaca, go to STEP2.

STEP2 torch.onnx.export

Check out transformers at this hacking branch, then run a single inference.

$ python3 tools/export-onnx.py ${PATH_ALPACA_7B}

STEP3 convert to fp16/tvm

Use onnxconverter-common.float16

$ cd tools
$ python3 -m pip install -r requirements.txt
$ python3 convert-fp32-to-fp16.py ${FP32_PATH} ${FP16_PATH}

Or use relay.vm to convert to tvm

$ cd tools
$ python3 convert-to-tvm.py ${ONNX_PATH} ${OUT_DIR}
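
For readers who want to see what an fp16 conversion with onnxconverter-common.float16 can look like, here is a minimal sketch. It is not the contents of convert-fp32-to-fp16.py. It assumes the fp32 model is a directory of separate `.onnx` files, and the helper names `plan_conversion` and `convert_dir_to_fp16` are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def plan_conversion(src_dir: str, dst_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every *.onnx file in src_dir with a same-named output path in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [(p, dst / p.name) for p in sorted(src.glob("*.onnx"))]


def convert_dir_to_fp16(src_dir: str, dst_dir: str) -> None:
    """Convert each fp32 onnx file in src_dir to fp16 and save it in dst_dir."""
    # Imported lazily so plan_conversion() works without these packages installed.
    import onnx
    from onnxconverter_common import float16

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src_path, dst_path in plan_conversion(src_dir, dst_dir):
        model = onnx.load(str(src_path))
        model_fp16 = float16.convert_float_to_float16(model)
        onnx.save(model_fp16, str(dst_path))
```

Calling `convert_dir_to_fp16(fp32_dir, fp16_dir)` would mirror the two-argument command above, under the directory-layout assumption stated in the lead-in.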

Notes

  1. For model structure, please read the LLaMa and RWKV structure comparison (LLaMa 和 RWKV 结构对比)
  2. I compared the output values of onnxruntime-cpu and torch-cuda; the maximum error is 0.002, which is not bad
  3. The current demo_llama.py state is equivalent to this configuration:
temperature=0.1
total_tokens=2000
top_p=1.0
top_k=40
repetition_penalty=1.0
  4. Mixed-precision kernel optimization is on the way. Here is a part of the guidance.
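
To make the sampling settings in the notes above concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It is an illustration, not the code in demo_llama.py; the function name `sample_token` is hypothetical. With `top_p=1.0` and `repetition_penalty=1.0` those two steps change nothing, so the sketch omits them.

```python
import math
import random
from typing import List, Optional


def sample_token(
    logits: List[float],
    temperature: float = 0.1,
    top_k: int = 40,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token id after top-k filtering and temperature scaling.

    temperature must be greater than 0.
    """
    rng = rng or random.Random()
    # Keep only the ids of the top_k highest logits.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Usage: at temperature=0.1 the highest logit dominates almost completely.
print(sample_token([0.0, 5.0, 1.0], temperature=0.1, top_k=40))
```

At `temperature=0.1` the output is close to greedy decoding, which is why the demo tends to give stable answers for the same prompt.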

Acknowledgements

License

GPLv3