
Commit 551763b

Add TorchServe Huggingface accelerate example (#304)
* Add LLM example for huggingface accelerate
* Add inputs
* Update storage uri
* Add to LLM runtime to index

Signed-off-by: Dan Sun <[email protected]>
1 parent 2257489 commit 551763b

File tree: 7 files changed, +305 -0 lines changed

@@ -0,0 +1,89 @@
# Serve Large Language Model with Huggingface Accelerate

This documentation explains how KServe supports large language model serving via `TorchServe`.
Large language models here refer to models that cannot fit into a single GPU and therefore need
to be sharded into multiple partitions across multiple GPUs.

Huggingface Accelerate can load sharded checkpoints, and the maximum RAM usage is the size of
the largest shard. By setting `device_map` to `auto`, `Accelerate` automatically determines where
to place each layer of the model depending on the available resources.

## Package the model

1. Download the model `bigscience/bloom-7b1` from Huggingface Hub by running the command below (see the note after this step for where `Download_model.py` comes from):
```bash
python Download_model.py --model_name bigscience/bloom-7b1
```
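
`Download_model.py` ships with the TorchServe large-model examples rather than with this repository; the in-repo path below is an assumption, so adjust it to your TorchServe checkout:

```bash
# Fetch Download_model.py from the TorchServe examples
# (the path inside the repository is an assumption for illustration).
git clone https://github.com/pytorch/serve.git
cp serve/examples/large_models/Huggingface_accelerate/Download_model.py .
```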

1. Compress the model
```bash
zip -r model.zip model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
```

1. Package the model
Create the `setup_config.json` file with the accelerate settings:
* Enable `low_cpu_mem_usage` to use accelerate.
* The recommended `max_memory` in `setup_config.json` is the maximum size of a shard.
```json
{
    "revision": "main",
    "max_memory": {
        "0": "10GB",
        "cpu": "10GB"
    },
    "low_cpu_mem_usage": true,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": true,
    "torch_dtype": "float16",
    "max_length": "80"
}
```

Then create the model archive with the custom handler and the accelerate settings:

```bash
torch-model-archiver --model-name bloom7b1 --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json
```

1. Upload the model artifacts to your cloud storage, or use the Bloom model already uploaded to the KServe GCS bucket. A sketch of the expected storage layout follows this list.

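The KServe TorchServe runtime expects the storage URI to contain a `config/config.properties` file and the `.mar` archive under `model-store/`. A minimal upload sketch, assuming a hypothetical bucket `gs://my-bucket` and the `config.properties` included in this commit:

```bash
# Arrange the artifacts in the layout the TorchServe runtime expects
# (the bucket name is hypothetical).
mkdir -p bloom7b1/config bloom7b1/model-store
cp config.properties bloom7b1/config/
cp bloom7b1.mar bloom7b1/model-store/
gsutil cp -r bloom7b1 gs://my-bucket/
```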

## Serve the large language model with InferenceService

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
```
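
Apply the InferenceService and wait for it to become ready. The manifest filename here is an assumption; save the YAML above under any name you like:

```bash
# Create the InferenceService (manifest filename is assumed) and wait until it reports Ready.
kubectl apply -f bloom7b1.yaml
kubectl wait --for=condition=Ready inferenceservice/bloom7b1 --timeout=600s
```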

## Run the Inference

Now, assuming that your ingress can be accessed at `${INGRESS_HOST}:${INGRESS_PORT}`, you can send a prediction request. Follow [this instruction](../../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) to find out your ingress IP and port.

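For example, with the Istio ingress gateway described in the linked guide (a sketch; adjust to your ingress setup), and writing out the `text.json` payload that ships with this example:

```bash
# Resolve the ingress address (assumes the Istio ingress gateway).
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

# Create the request payload used by the curl command below.
cat <<EOF > text.json
{
  "instances": [
    "Today the weather is really nice and I am planning on"
  ]
}
EOF
```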

```bash
SERVICE_HOSTNAME=$(kubectl get inferenceservice bloom7b1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @./text.json \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bloom7b1:predict

{"predictions":["My dog is cute.\nNice.\n- Hey, Mom.\n- Yeah?\nWhat color's your dog?\n- It's gray.\n- Gray?\nYeah.\nIt looks gray to me.\n- Where'd you get it?\n- Well, Dad says it's kind of...\n- Gray?\n- Gray.\nYou got a gray dog?\n- It's gray.\n- Gray.\nIs your dog gray?\nAre you sure?\nNo.\nYou sure"]}
```
@@ -0,0 +1,18 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "bloom-7b1"
spec:
  predictor:
    pytorch:
      runtimeVersion: 0.8.2
      storageUri: gs://kfserving-examples/models/torchserve/llm/Huggingface_accelerate/bloom
      resources:
        limits:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
        requests:
          cpu: "2"
          memory: 32Gi
          nvidia.com/gpu: "2"
@@ -0,0 +1,13 @@
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
number_of_gpu=2
job_queue_size=10
enable_envvars_config=true
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bloom7b1":{"1.0":{"defaultVersion":true,"marName":"bloom7b1.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
@@ -0,0 +1,164 @@
import json
import logging
import os
import zipfile
from abc import ABC

import torch
import transformers
from transformers import BloomForCausalLM, BloomTokenizerFast

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)


TORCH_DTYPES = {
    "float16": torch.float16,
    "float32": torch.float32,
    "float64": torch.float64,
}


class TransformersSeqClassifierHandler(BaseHandler, ABC):
    """
    Transformers handler class for text generation with a sharded Bloom model
    loaded via Huggingface Accelerate.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """In this initialize function, the Bloom model is loaded and sharded
        across the available devices with Huggingface Accelerate, using the
        settings from setup_config.json.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")

        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )
        # Extract the zipped checkpoint shards packaged into the model archive.
        with zipfile.ZipFile(model_dir + "/model.zip", "r") as zip_ref:
            zip_ref.extractall(model_dir + "/model")

        # read configs for the mode, model_name, etc. from setup_config.json
        setup_config_path = os.path.join(model_dir, "setup_config.json")
        if os.path.isfile(setup_config_path):
            with open(setup_config_path) as setup_config_file:
                self.setup_config = json.load(setup_config_file)
        else:
            logger.warning("Missing the setup_config.json file.")

        self.model = BloomForCausalLM.from_pretrained(
            model_dir + "/model",
            revision=self.setup_config["revision"],
            max_memory={
                int(key) if key.isnumeric() else key: value
                for key, value in self.setup_config["max_memory"].items()
            },
            low_cpu_mem_usage=self.setup_config["low_cpu_mem_usage"],
            device_map=self.setup_config["device_map"],
            offload_folder=self.setup_config["offload_folder"],
            offload_state_dict=self.setup_config["offload_state_dict"],
            torch_dtype=TORCH_DTYPES[self.setup_config["torch_dtype"]],
        )

        self.tokenizer = BloomTokenizerFast.from_pretrained(
            model_dir + "/model", return_tensors="pt"
        )

        self.model.eval()
        logger.info("Transformer model from path %s loaded successfully", model_dir)

        self.initialized = True

    def preprocess(self, requests):
        """Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (str): The Input data in the form of text is passed on to the preprocess
            function.
        Returns:
            list : The preprocess function returns a list of Tensor for the size of the word tokens.
        """
        input_ids_batch = None
        attention_mask_batch = None
        for idx, data in enumerate(requests):
            input_text = data.get("data")
            if input_text is None:
                input_text = data.get("body")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")

            max_length = self.setup_config["max_length"]
            logger.info("Received text: '%s'", input_text)

            inputs = self.tokenizer.encode_plus(
                input_text,
                max_length=int(max_length),
                pad_to_max_length=True,
                add_special_tokens=True,
                return_tensors="pt",
            )

            input_ids = inputs["input_ids"].to(self.device)
            attention_mask = inputs["attention_mask"].to(self.device)
            # making a batch out of the received requests
            # attention masks are passed for cases where input tokens are padded.
            if input_ids.shape is not None:
                if input_ids_batch is None:
                    input_ids_batch = input_ids
                    attention_mask_batch = attention_mask
                else:
                    input_ids_batch = torch.cat((input_ids_batch, input_ids), 0)
                    attention_mask_batch = torch.cat(
                        (attention_mask_batch, attention_mask), 0
                    )
        return (input_ids_batch, attention_mask_batch)

    def inference(self, input_batch):
        """Generate text for the received input batch using the
        serialized transformers checkpoint.
        Args:
            input_batch (list): List of Text Tensors from the pre-process function is passed here
        Returns:
            list : It returns a list of the generated text for the input text
        """
        (input_ids_batch, _) = input_batch
        inferences = []
        input_ids_batch = input_ids_batch.to(self.device)
        outputs = self.model.generate(
            input_ids_batch,
            do_sample=True,
            max_new_tokens=int(self.setup_config["max_length"]),
            top_p=0.95,
            top_k=60,
        )
        for i, _ in enumerate(outputs):
            inferences.append(
                self.tokenizer.decode(outputs[i], skip_special_tokens=True)
            )

        logger.info("Generated text: '%s'", inferences)

        print("Generated text", inferences)
        return inferences

    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the predictions.
        """
        return inference_output
@@ -0,0 +1,13 @@
{
    "revision": "main",
    "max_memory": {
        "0": "10GB",
        "cpu": "10GB"
    },
    "low_cpu_mem_usage": true,
    "device_map": "auto",
    "offload_folder": "offload",
    "offload_state_dict": true,
    "torch_dtype": "float16",
    "max_length": "80"
}
@@ -0,0 +1,5 @@
{
  "instances": [
    "Today the weather is really nice and I am planning on"
  ]
}

mkdocs.yml (+3)

@@ -39,6 +39,9 @@ nav:
       - Torchscript: modelserving/v1beta1/triton/torchscript/README.md
       - Tensorflow: modelserving/v1beta1/triton/bert/README.md
       - AMD: modelserving/v1beta1/amd/README.md
+      - LLM Runtime:
+        - TorchServe LLM:
+          - Bloom7b1: modelserving/v1beta1/llm/torchserve/accelerate/README.md
       - How to write a custom predictor: modelserving/v1beta1/custom/custom_model/README.md
       - Multi Model Serving:
         - Overview:
