
Commit bb606f3

Merge pull request #55 from WallarooLabs/wallaroo_llm_managed_endpoints
2 parents d88a9d4 + f9ef48f commit bb606f3

File tree

14 files changed: +1211 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -37,3 +37,4 @@ resnet50_v1.onnx
 wallaroo-llms/llamav2/models/llama_byop_llamav2.zip
 wallaroo-llms/vector-database-embedding-with-ml-orchestrations/models/byop_bge_base2.zip
 wallaroo-llms/vector-database-embedding-with-ml-orchestrations/vector_db_orch.zip
+clip-vit-base-patch-32.zip
@@ -0,0 +1,351 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "22ad614e-2e4c-4635-a167-2b97c3f041af",
   "metadata": {},
   "source": [
    "This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/blob/main/wallaroo-llms/).\n",
    "\n",
    "## Wallaroo Deployment of Managed Inference Endpoint Models with Google Vertex\n",
    "\n",
    "The following tutorial demonstrates uploading, deploying, inferencing on, and monitoring an [LLM with Managed Inference Endpoints](https://staging.docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-package-deployment/wallaroo-llm-monitoring-external-endpoints/).\n",
    "\n",
    "These models leverage LLMs deployed in other services, with Wallaroo providing a single source for inference requests, logging results, monitoring for hate/abuse/racism and other factors, and tracking model drift through Wallaroo assays.\n",
    "\n",
    "## Provided Models\n",
    "\n",
    "The following models are provided:\n",
    "\n",
    "* `byop_llama2_vertex_v2_9.zip`: A [Wallaroo BYOP](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-arbitrary-python/) model that uses Google Vertex as a Managed Inference Endpoint.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "This tutorial requires:\n",
    "\n",
    "* Wallaroo 2024.1 and above\n",
    "* Credentials for authenticating to Google Vertex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "314c3dd4",
   "metadata": {},
   "source": [
    "## Tutorial Steps\n",
    "\n",
    "### Import Library\n",
    "\n",
    "The following libraries are used to upload and perform inferences on the LLM with Managed Inference Endpoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e18447c-59ca-41fc-ae5f-0435849d30fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "\n",
    "import wallaroo\n",
    "from wallaroo.pipeline import Pipeline\n",
    "from wallaroo.deployment_config import DeploymentConfigBuilder\n",
    "from wallaroo.framework import Framework\n",
    "from wallaroo.engine_config import Architecture\n",
    "\n",
    "import pyarrow as pa\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9654ee15",
   "metadata": {},
   "source": [
    "### Connect to the Wallaroo Instance\n",
    "\n",
    "A connection to Wallaroo is opened through the Wallaroo SDK client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.\n",
    "\n",
    "This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.\n",
    "\n",
    "If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`. For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).\n",
    "\n",
    "The `request_timeout` flag is used for Wallaroo BYOP models where the file size may require additional time to complete the upload process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00ea4e3f-c993-4a19-9d11-cb4c1ce300ab",
   "metadata": {},
   "outputs": [],
   "source": [
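    "# If a large BYOP upload times out, a longer request timeout can be set when\n",
    "# creating the client, e.g. wl = wallaroo.Client(request_timeout=600)\n",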
"wl = wallaroo.Client()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25729b33-c6de-4fdc-9467-222c2bac820c",
   "metadata": {},
   "source": [
    "### LLM with Managed Inference Endpoint Model Code\n",
    "\n",
    "The Wallaroo BYOP model `byop_llama2_vertex_v2_9.zip` contains the following artifacts:\n",
    "\n",
    "* `main.py`: Python script that controls the behavior of the model.\n",
    "* `requirements.txt`: Python requirements file that sets the Python libraries used (a sketch follows below).\n",
    "\n",
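    "The following is a sketch of what `requirements.txt` would contain, inferred from the libraries the code below imports (`google-auth` for `Credentials` and `Request`, plus `requests` and `numpy`); the actual file shipped in the zip may pin different packages and versions:\n",
    "\n",
    "```text\n",
    "google-auth\n",
    "requests\n",
    "numpy\n",
    "```\n",
    "\n",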
"The model performs the following.\n",
    "\n",
    "1. Accepts the inference request from the requester.\n",
    "2. Loads the credentials for the Google Vertex session from the provided environment variables. These are supplied during the [Set Deployment Configuration](#set-deployment-configuration) step. The following code shows this process.\n",
    "\n",
    "    ```python\n",
    "    credentials = Credentials.from_service_account_info(\n",
    "        json.loads(os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"].replace(\"'\", '\"')),\n",
    "        scopes=[\"https://www.googleapis.com/auth/cloud-platform\"],\n",
    "    )\n",
    "    ```\n",
    "\n",
    "3. Takes the inference request, connects to Google Vertex, and submits the request to the deployed LLM (`self.model` holds the managed endpoint URL). The inference result is returned to the BYOP model, which passes it back to the requester.\n",
    "\n",
    "    ```python\n",
    "    def _predict(self, input_data: InferenceData):\n",
    "        credentials.refresh(Request())\n",
    "        token = credentials.token\n",
    "\n",
    "        headers = {\n",
    "            \"Authorization\": f\"Bearer {token}\",\n",
    "            \"Content-Type\": \"application/json\",\n",
    "        }\n",
    "        prompts = input_data[\"text\"].tolist()\n",
    "        instances = [{\"prompt\": prompt, \"max_tokens\": 200} for prompt in prompts]\n",
    "\n",
    "        response = requests.post(\n",
    "            f\"{self.model}\",\n",
    "            json={\"instances\": instances},\n",
    "            headers=headers,\n",
    "        )\n",
    "\n",
    "        predictions = response.json()\n",
    "\n",
    "        if isinstance(predictions[\"predictions\"], str):\n",
    "            generated_text = [\n",
    "                prediction.split(\"Output:\\n\")[-1]\n",
    "                for prediction in predictions[\"predictions\"]\n",
    "            ]\n",
    "        else:\n",
    "            generated_text = [\n",
    "                prediction[\"predictions\"][0].split(\"Output:\\n\")[-1]\n",
    "                for prediction in predictions[\"predictions\"]\n",
    "            ]\n",
    "\n",
    "        return {\"generated_text\": np.array(generated_text)}\n",
    "    ```\n",
    "\n",
    "This model is contained in a Wallaroo pipeline, which accepts the inference request and returns the final result to the requester."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f2597e1",
   "metadata": {},
   "source": [
    "### Upload LLM with Managed Inference Endpoint Model\n",
    "\n",
    "Uploading models uses the Wallaroo Client `upload_model` method, which takes the following parameters:\n",
    "\n",
    "| Parameter | Type | Description |\n",
    "|---|---|---|\n",
    "| `name` | `string` (*Required*) | The name of the model. Model names are unique **per workspace**. Models that are uploaded with the same name are assigned as a new **version** of the model. |\n",
    "| `path` | `string` (*Required*) | The path to the model file being uploaded. |\n",
    "| `framework` | `string` (*Required*) | The framework of the model from `wallaroo.framework`. |\n",
    "| `input_schema` | `pyarrow.lib.Schema` (*Required*) | The input schema in Apache Arrow schema format. |\n",
    "| `output_schema` | `pyarrow.lib.Schema` (*Required*) | The output schema in Apache Arrow schema format. |\n",
    "\n",
    "The following shows the upload parameters for the `byop_llama2_vertex_v2_9.zip` Wallaroo BYOP model, which uses this input and output schema:\n",
    "\n",
    "* Input:\n",
    "  * `text` (*String*): The input text.\n",
    "* Output:\n",
    "  * `generated_text` (*String*): The result returned from the LLM deployed on Google Vertex as a Managed Inference Endpoint.\n",
    "\n",
    "The uploaded model reference is saved to the variable `model`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d983d3f0-c19d-44e1-8c05-62200f3e0854",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_schema = pa.schema([\n",
    "    pa.field(\"text\", pa.string()),\n",
    "])\n",
    "\n",
    "output_schema = pa.schema([\n",
    "    pa.field(\"generated_text\", pa.string())\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cba287f-ad6f-4c39-a10e-e4db47404c89",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = wl.upload_model('byop-llama-vertex-v1',\n",
    "                        './models/byop_llama2_vertex_v2_9.zip',\n",
    "                        framework=Framework.CUSTOM,\n",
    "                        input_schema=input_schema,\n",
    "                        output_schema=output_schema,\n",
    ")\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91e052ce-5407-4995-aaf6-e1411a838cf8",
   "metadata": {},
   "source": [
    "### Set Deployment Configuration\n",
    "\n",
    "The deployment configuration sets the resources assigned to the LLM with Managed Inference Endpoint. For this example, the following resources are applied.\n",
    "\n",
    "* `byop_llama2_vertex_v2_9.zip`: 2 cpus, 1 Gi RAM, plus the environment variable `GOOGLE_APPLICATION_CREDENTIALS` loaded from the file `credentials.json`.\n",
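    "\n",
    "Note the round trip the credentials take: `str(json.load(open(\"credentials.json\", 'r')))` renders the service-account JSON as a Python dict string (single-quoted), and the BYOP's `json.loads(os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"].replace(\"'\", '\"'))` shown earlier converts it back to valid JSON. This works for typical service-account files, but would break if any value contained a literal single quote."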
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d8e3996-734c-4f7f-8922-5df049fd28c0",
   "metadata": {},
   "outputs": [],
   "source": [
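    "# Load the service-account JSON and pass it to the model's container as the\n",
    "# GOOGLE_APPLICATION_CREDENTIALS environment variable; the BYOP code re-parses\n",
    "# this string at inference time.\n",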
"deployment_config = DeploymentConfigBuilder() \\\n",
    "    .cpus(1).memory('2Gi') \\\n",
    "    .sidekick_cpus(model, 2) \\\n",
    "    .sidekick_memory(model, '1Gi') \\\n",
    "    .sidekick_env(model, {\"GOOGLE_APPLICATION_CREDENTIALS\": str(json.load(open(\"credentials.json\", 'r')))}) \\\n",
    "    .build()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9151986",
   "metadata": {},
   "source": [
    "### Deploy Model\n",
    "\n",
    "To deploy the model:\n",
    "\n",
    "1. We build a Wallaroo pipeline and assign the model as a pipeline step. For this tutorial, the pipeline is named `llama-vertex-pipe`.\n",
    "2. The pipeline is deployed with the deployment configuration.\n",
    "3. Once the resource allocation is complete, the model is ready for inferencing.\n",
    "\n",
    "See [Model Deploy](https://docs.wallaroo.ai/wallaroo-llm/wallaroo-llm-package-deployment/) for more details on deploying LLMs in Wallaroo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "335576e5-c0ff-4c99-81f5-65e849235d1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = wl.build_pipeline(\"llama-vertex-pipe\")\n",
    "pipeline.add_model_step(model)\n",
    "pipeline.deploy(deployment_config=deployment_config)\n",
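    "\n",
    "# Optional (assumption: standard Wallaroo pipeline API): confirm the deployment\n",
    "# reached the Running state before inferencing, e.g. pipeline.status()"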
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80e00a9a-891e-4c20-9f6e-604d5b6b0276",
   "metadata": {},
   "source": [
    "### Generate Inference Request\n",
    "\n",
    "The inference request is submitted as a pandas DataFrame with one text entry per prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb2fa598-e32f-432e-a7b9-71d247f6de83",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data = pd.DataFrame({'text': ['What happened to the Serge llama?', 'How are you doing?']})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac742d79",
   "metadata": {},
   "source": [
    "### Submit Inference Request\n",
    "\n",
    "The inference request is submitted to the pipeline with the `infer` method, which accepts either:\n",
    "\n",
    "* pandas DataFrame\n",
    "* Apache Arrow Table\n",
    "\n",
    "The results are returned in the same format as submitted. For this example, a pandas DataFrame is submitted, so a pandas DataFrame is returned. The final generated text is displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a418f8fc-9c2a-4b9f-864f-1b2680f971a4",
   "metadata": {},
   "outputs": [],
   "source": [
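    "# Returns a pandas DataFrame; the generated text is typically surfaced in the\n",
    "# `out.generated_text` column, alongside the `in.text` input column.\n",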
"pipeline.infer(input_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30b03c14-0519-4f47-9a91-e47ad5965d84",
   "metadata": {},
   "source": [
    "## Undeploy\n",
    "\n",
    "With the tutorial complete, the pipeline is undeployed to return the resources to the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "481b0019-b217-4af7-862d-ac3cae0fcb37",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline.undeploy()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
