
Commit

feat: deploy llm blogpost part2 (#41)
* feat: deploy llm blogpost part2

* update part 1 with link to part 2
alarv authored Feb 3, 2025
1 parent 501510f commit d07969a
Showing 5 changed files with 173 additions and 2 deletions.
169 changes: 169 additions & 0 deletions app/blog/(posts)/run-and-deploy-an-llm-part2/page.mdx
@@ -0,0 +1,169 @@
import MdxLayout from "../../components/mdx-layout";
import {Link} from "@nextui-org/react";

export const metadata = {
  title: 'Deploy a streaming LLM with Terraform, Kubernetes and vLLM: the complete stack (part 2)',
  publishDate: '2025-02-03T00:00:00Z',
  textPreview: 'Want to deploy your LLM with real-time streaming responses? Here\'s a complete guide covering backend, frontend, and deployment.',
  author: {
    name: 'Alex Arvanitidis',
    avatarUrl: 'https://d2zoqz4gyxc03g.cloudfront.net/avatars/af2a132a-5997-4b52-936b-35695452035b.jpg',
    description: <Link isExternal href="https://app.jaqpot.org/dashboard/user/alarv" size="sm">
      @alarv
    </Link>,
  },
  timeToReadInMin: 10
};

![Deploy LLM](/vllm-logo-text-light.png)

# Deploying LLMs on AWS EKS with GPU Support: A Quick Guide

As promised in [part 1](https://jaqpot.org/blog/run-and-deploy-an-llm), here's how we deployed our LLM workload into production using AWS, Terraform, and Kubernetes with GPU support. Let's dive into the infrastructure setup that made this possible!

## A note on models and hardware requirements

Before we jump into the infrastructure setup, let's talk about what we're actually deploying. We're using pretrained models here - specifically pulling the weights from Hugging Face. Training these models from scratch would require massive computational resources and isn't necessary for most use cases.

However, even for inference (running the model to get predictions), LLMs need serious computational power. A GPU isn't just nice to have - it's a necessity. We're using AWS's g4dn.2xlarge instances which come with NVIDIA T4 GPUs. These instances can handle the load, but they come at a cost - expect to pay around $1 per hour of usage. This might sound steep, but it's actually quite reasonable considering the computational power you're getting.

## Setting up GPU nodes with Terraform

First up, we needed some serious GPU power. Here's the slice of our Terraform configuration that sets up g4dn.2xlarge instances:

```hcl
# GPU node group
gpu_node_group = {
  create_launch_template     = false
  use_custom_launch_template = false
  disk_size                  = 50
  instance_types             = ["g4dn.2xlarge"]
  min_size                   = 1
  subnet_ids                 = slice(dependency.vpc.outputs.private_subnets, 0, 2)
  max_size                   = 2
  desired_size               = 1
  ami_type                   = "AL2_x86_64_GPU"
  iam_role_attach_cni_policy = true

  labels = {
    "environment"    = "production"
    "workload-type"  = "gpu"
    "nvidia.com/gpu" = "true"
  }

  taints = [{
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }]

  update_config = {
    max_unavailable = 1
  }
}
```

This configuration ensures our GPU workloads get their own dedicated nodes - no resource competition here!
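
One prerequisite worth calling out before the next section: the GPU AMI ships with the NVIDIA drivers, but Kubernetes only advertises the `nvidia.com/gpu` resource on a node once the NVIDIA device plugin is running there. Below is a minimal sketch of that DaemonSet; see the upstream [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository for the full manifest and current release (the image tag here is only an example).

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Tolerate the GPU taint so the plugin lands on the tainted nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Optional: restrict it to the GPU node group defined above
      nodeSelector:
        workload-type: gpu
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5   # example tag, check upstream
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```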

## Configuring GPU node selection

Next, we made sure our LLM pods land on those GPU nodes and get a GPU to themselves. Here's how we configured our Helm values:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
    cpu: "8000m"
    memory: "32Gi"
  requests:
    nvidia.com/gpu: 1
    cpu: "6000m"
    memory: "30Gi"

nodeSelector:
  workload-type: gpu

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

## Our LLM container setup

For speed and efficiency, we went with [vLLM](https://github.com/vllm-project/vllm) for our LLM service. Here's our Kubernetes configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
spec:
  template:
    spec:
      containers:
        - name: llm
          image: ghcr.io/vllm-project/vllm:latest
          command: [
            "vllm",
            "serve",
            "mistralai/Mistral-7B-Instruct-v0.3",
            "--trust-remote-code",
            "--enable-chunked-prefill",
            "--max_num_batched_tokens", "1024",
            "--dtype=half",
            "--tokenizer-mode", "mistral",
            "--gpu-memory-utilization", "1.0",
            "--max-model-len", "6944",
            "--enforce-eager"
          ]
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: llm-model-storage
```

Depending on your hardware, you may need to tweak the vLLM arguments a bit. vLLM's warning messages were very helpful here, and searching the project's GitHub issues turned up a solution every time.

Of course, you should understand what each argument does before you change it; for that, I'd suggest the Hugging Face course on NLP and the Transformers library [here](https://huggingface.co/learn/nlp-course/en/chapter1/1).
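
To let the rest of the cluster reach the model, a plain Kubernetes Service in front of the Deployment is enough. The sketch below is a minimal version with assumptions: the `app: llm-service` selector is made up (the Deployment above omits its pod labels), and port 8000 is simply vLLM's default for the OpenAI-compatible server it exposes.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-service    # assumed pod label, match your Deployment's template labels
  ports:
    - name: http
      port: 80
      targetPort: 8000  # vLLM's OpenAI-compatible server listens on 8000 by default
```

With that in place, anything inside the cluster can stream completions from `http://llm-service/v1/chat/completions`.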

## Storage configuration

For those massive model weights, we set up persistent storage:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: jaqpot
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp2
  volumeMode: Filesystem
```
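
How the weights actually land on that volume is up to you: vLLM can download them from Hugging Face on startup, but you can also pre-populate the volume so restarts don't re-download the full weights. Below is one possible sketch of such a one-off Job, not something the setup above prescribes; the Job name, the `hf-token` secret, and the target directory are assumptions, and the claim name should match whatever your Deployment mounts.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-mistral-weights   # hypothetical name
  namespace: jaqpot
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install --quiet "huggingface_hub[cli]" && \
              huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
                --local-dir /models/mistralai/Mistral-7B-Instruct-v0.3
          env:
            - name: HF_TOKEN                 # gated model: token read from an assumed secret
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: mistral-7b            # the PVC defined above
```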

## The grand finale: Deployment time!

Here's how we orchestrated everything (drumroll, please 🥁):

1. First, we let Terraform work its infrastructure magic with the GPU nodes
2. Then, we waited for our new nodes to be ready
3. Next, we created the persistent volume claim
4. Finally, we deployed our LLM service

Some key things we monitor:

- GPU utilization tracking through Prometheus/Grafana
- Autoscaling based on our GPU metrics (sketched below)
- Comprehensive logging and monitoring
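
For the autoscaling bullet above, one common pattern is to scrape GPU metrics with NVIDIA's dcgm-exporter, surface them through a Prometheus Adapter as custom metrics, and drive a HorizontalPodAutoscaler from them. The sketch below assumes that plumbing is already in place and that the adapter publishes the DCGM utilization gauge under the name shown; the metric name and the 80% target are placeholders to adapt, not our exact production values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: 1
  maxReplicas: 2                         # matches the node group's max_size above
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL     # assumed name exposed via a Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "80"             # scale out when average GPU utilization passes ~80%
```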

And there you have it! Our LLM is now running smoothly on AWS EKS with GPU support. If you're planning something similar, definitely check out the vLLM documentation and AWS EKS best practices for more insights.

Here's the link to the mistral-7b model that we've deployed on the Jaqpot platform: [mistral-7b](https://app.jaqpot.org/dashboard/models/1994/description)

In part 3 we will fine-tune our LLM using the Transformers library provided by Hugging Face, stay tuned!

export default function MDXPage({children}) {
  return <MdxLayout metadata={metadata}>{children}</MdxLayout>
}
2 changes: 2 additions & 0 deletions app/blog/(posts)/run-and-deploy-an-llm/page.mdx
@@ -448,6 +448,8 @@ P.S. This exact setup is currently running on the Jaqpot platform at [https://ap

P.S.2 The code for this tutorial is available on GitHub at [https://github.com/ntua-unit-of-control-and-informatics/jaqpot-Llama3.2-1B-Instruct](https://github.com/ntua-unit-of-control-and-informatics/jaqpot-Llama3.2-1B-Instruct).

P.S.3 [Part 2](https://jaqpot.org/blog/run-and-deploy-an-llm-part2) is now published! Check it out!

export default function MDXPage({children}) {
  return <MdxLayout metadata={metadata}>{children}</MdxLayout>
}
4 changes: 2 additions & 2 deletions mdx-components.tsx
@@ -21,7 +21,7 @@ export function useMDXComponents(components: MDXComponents): MDXComponents {
    },
    img(props: any) {
      return (
        <div style={{ position: 'relative', width: '100%', height: '300px' }}>
        <span style={{ display: 'block', position: 'relative', width: '100%', height: '300px' }}>
          <Image
            fill
            sizes="100vw"
@@ -30,7 +30,7 @@ export function useMDXComponents(components: MDXComponents): MDXComponents {
            }}
            {...(props as ImageProps)}
          />
        </div>
        </span>
      );
    },
    p({ children }) {
Binary file added public/vllm-logo-text-light.png
