model_api

Update TensorRT-LLM (NVIDIA#1763)

Jun 11, 2024

db4edea · Jun 11, 2024


Name	Last commit message	Last commit date
README.md	Update TensorRT-LLM (NVIDIA#1358)	Mar 26, 2024
llama.py	Update TensorRT-LLM (NVIDIA#1763)	Jun 11, 2024
llama_multi_gpu.py	Update TensorRT-LLM (NVIDIA#1763)	Jun 11, 2024
llama_quantize.py	Update TensorRT-LLM (NVIDIA#1763)	Jun 11, 2024

README.md

This folder contains examples that demonstrate how to use the `LLaMAForCausalLM` class and the low-level `tensorrt_llm.builder.build` API.

Single GPU

python ./llama.py --hf_model_dir <hf llama dir> --engine_dir ./llama.engine

The bot reads one sentence at a time and generates at most 20 tokens in response. Type "q" or "quit" to stop chatting.
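
The chat behavior described above can be sketched as a small loop. This is only an illustration of the control flow, not the code in llama.py: `chat_loop`, `echo_generate`, and `MAX_NEW_TOKENS` are hypothetical names, and `echo_generate` is a stand-in for the real engine call.

```python
from typing import Callable, Iterable, Iterator

MAX_NEW_TOKENS = 20          # the README states at most 20 generated tokens per turn
STOP_WORDS = {"q", "quit"}   # inputs that end the chat


def chat_loop(lines: Iterable[str],
              generate: Callable[[str, int], str]) -> Iterator[str]:
    """Yield one reply per input sentence until a stop word is read."""
    for line in lines:
        prompt = line.strip()
        if prompt in STOP_WORDS:
            break
        if not prompt:
            continue  # skip empty lines instead of sending them to the model
        yield generate(prompt, MAX_NEW_TOKENS)


def echo_generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical stand-in for the engine: echoes at most max_new_tokens words."""
    return " ".join(prompt.split()[:max_new_tokens])
```

For interactive use, `sys.stdin` can be passed as `lines` and each yielded reply printed as it arrives.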

Multi-GPU

This example uses multi-GPU tensor parallelism to build and run LLaMA, then generates on a pre-defined dataset. Multi-GPU can also support the chat scenario, but that requires additional code to read input on the root process and broadcast the tokens to all worker processes. Because this example only aims to demonstrate TRT-LLM API usage, it uses a pre-defined dataset for simplicity.

python ./llama_multi_gpu.py --hf_model_dir <hf llama path> --engine_dir ./llama.engine.tp2 -c --tp_size 2
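
As a rough picture of what tensor parallelism with `--tp_size 2` means, the sketch below splits a weight matrix by columns across ranks, lets each rank compute its slice of the output, and concatenates the slices. It is a conceptual, pure-Python illustration and assumes a column-parallel split. It does not show how TRT-LLM shards LLaMA internally, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, tp_size: int) -> List[Matrix]:
    """Give each of tp_size ranks an equal, contiguous block of columns of w."""
    n_cols = len(w[0])
    assert n_cols % tp_size == 0, "columns must divide evenly across ranks"
    shard = n_cols // tp_size
    return [[row[r * shard:(r + 1) * shard] for row in w] for r in range(tp_size)]


def column_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Each rank multiplies x by its own column shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    # Stands in for an all-gather along the column dimension.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

The sharded result equals the unsharded product, which is why each GPU only needs to hold its own part of the weights.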

Quantization

This example first quantizes the given Hugging Face LLaMA model with the AWQ INT4 weight-only algorithm and saves it as a TRT-LLM checkpoint, then builds a TRT-LLM engine from that checkpoint and serves it.

python ./llama_quantize.py --hf_model_dir <hf llama path> --cache_dir ./llama.awq/ -c
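
To give a feel for INT4 weight-only quantization, here is a minimal sketch of group-wise quantization and dequantization. It is a simplification under stated assumptions: it uses a symmetric scheme with one scale per group and leaves out the activation-aware scale search that distinguishes AWQ. It is not the code path used by llama_quantize.py, and the function names and the group size are hypothetical.

```python
from typing import List, Tuple


def quantize_int4_group(weights: List[float],
                        group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Symmetric per-group INT4 quantization: integers in [-8, 7], one scale per group."""
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales


def dequantize_int4_group(q: List[int], scales: List[float],
                          group_size: int = 4) -> List[float]:
    """Recover approximate weights by multiplying each integer by its group scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

Only the 4-bit integers and the per-group scales need to be stored. The reconstruction error of each weight is bounded by half of its group's scale.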
