import MdxLayout from "../../components/mdx-layout";
import {Link} from "@nextui-org/react";

export const metadata = {
  title: 'Building a streaming LLM with Next.js, FastAPI & Docker: the complete stack (part 1)',
  publishDate: '2024-12-19T00:00:00Z',
  textPreview: 'Want to deploy your LLM with real-time streaming responses? Here\'s a complete guide covering backend, frontend, and deployment.',
  author: {
    name: 'Alex Arvanitidis',
    avatarUrl: 'https://d2zoqz4gyxc03g.cloudfront.net/avatars/af2a132a-5997-4b52-936b-35695452035b.jpg',
    description: <Link isExternal href="https://app.jaqpot.org/dashboard/user/alarv" size="sm">
      @alarv
    </Link>,
  }
};

![Deploy LLM](/deploy-llm.svg)

Want to build a modern LLM application with real-time streaming responses? Here's a complete guide covering backend, frontend, and deployment. We'll be using some of the most popular technologies in the modern web development ecosystem: Next.js with the Next UI component library, FastAPI for our backend, and Docker for containerization.

## Preparing your LLM weights

First, let's talk about the model weights. Llama models come in a range of sizes (1B, 3B, 8B, 70B parameters, and more) and variants (base and instruction-tuned). For this tutorial, we'll use the Llama 3.2-1B-Instruct model, which offers a good balance between performance and resource requirements.

Before we can use the model, we need to convert the weights to a format that works with the Hugging Face transformers library. This conversion is necessary because Llama models are originally distributed in their own format, but we want to use them with the transformers ecosystem.

Here's the command to convert the weights:

```bash
python convert_llama_weights_to_hf.py \
  --input_dir ~/.llama/checkpoints/Llama3.2-1B-Instruct/ \
  --model_size 1B \
  --output_dir ./llama \
  --llama_version 3.2
```

Let's break down what's happening:
- `--input_dir`: points to where your original Llama weights are stored
- `--model_size`: specifies the model size (1B in our case)
- `--output_dir`: where the converted weights will be saved
- `--llama_version`: specifies which version of Llama we're using

The conversion process creates a model that's compatible with the transformers library's LlamaForCausalLM architecture. This enables us to use all the powerful features of the transformers library, including:
- Efficient tokenization
- Model parallelism
- Gradient checkpointing
- Different generation strategies

For more details about Llama model usage and optimization tips, check out the [official documentation](https://huggingface.co/docs/transformers/main/en/model_doc/llama#usage-tips).
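
Once the conversion has finished, the `./llama` directory can be loaded like any other Hugging Face checkpoint. Below is a minimal sketch of how you might load it and produce the `model.pkl` and `tokenizer.pkl` files that the backend service in the next section expects; persisting them with joblib is an assumption on my part, and you could just as well load the `./llama` directory directly at service startup.

```python
# Minimal sketch (assumption): load the converted checkpoint with transformers
# and persist it as the model.pkl / tokenizer.pkl files the backend loads.
import joblib
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the converted weights produced by convert_llama_weights_to_hf.py
model = AutoModelForCausalLM.from_pretrained("./llama")
tokenizer = AutoTokenizer.from_pretrained("./llama")

# Save both objects so the service can restore them with joblib.load()
joblib.dump(model, "model.pkl")
joblib.dump(tokenizer, "tokenizer.pkl")
```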

## Setting up the backend

Our backend is built with FastAPI, a modern Python web framework known for its speed and ease of use. Here's our complete model service that handles predictions and streaming:

```python
import asyncio
from queue import Queue
from threading import Thread

import joblib
import pandas as pd
import torch
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest

from src.streamer import CustomStreamer


class ModelService:
    def __init__(self):
        self.model = joblib.load('model.pkl')
        self.tokenizer = joblib.load('tokenizer.pkl')

    def infer(self, request: PredictionRequest) -> StreamingResponse:
        # Convert input list to DataFrame
        input_data = pd.DataFrame(request.dataset.input)

        last_index = input_data.index[-1]
        input_row = input_data.iloc[last_index]

        prompt = input_row['prompt']

        return StreamingResponse(self.response_generator(prompt),
                                 media_type='text/event-stream')

    def start_generation(self, query: str, streamer):
        prompt = f"""{query}"""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(device)
        inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        generation_kwargs = dict(inputs, streamer=streamer,
                                 max_new_tokens=4096, temperature=0.1)
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

    async def response_generator(self, query: str):
        streamer_queue = Queue()
        streamer = CustomStreamer(streamer_queue, self.tokenizer, True)

        self.start_generation(query, streamer)

        while True:
            value = streamer_queue.get()
            if value is None:
                break
            yield value
            streamer_queue.task_done()
            await asyncio.sleep(0.1)
```

Let's examine the key components of our service:

1. **Model initialization**: We load both the model and tokenizer from saved files. This happens once when the service starts, keeping them in memory for quick access.

2. **Inference endpoint**: The `infer` method handles incoming requests. It expects a prompt and returns a streaming response.

3. **Generation process**: Text generation happens in a separate thread to prevent blocking. We use GPU acceleration when available.

4. **Streaming setup**: We use FastAPI's StreamingResponse to send tokens back to the client as they're generated, providing a real-time experience.
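
One piece the service imports but this post doesn't show is `CustomStreamer` from `src.streamer`. Here's a minimal sketch of what such a streamer could look like, assuming it builds on transformers' `TextStreamer` and simply forwards each decoded chunk to the queue consumed by `response_generator`:

```python
# Hypothetical sketch of src/streamer.py; the real implementation may differ.
from queue import Queue

from transformers import TextStreamer


class CustomStreamer(TextStreamer):
    def __init__(self, queue: Queue, tokenizer, skip_prompt: bool, **decode_kwargs):
        super().__init__(tokenizer, skip_prompt=skip_prompt, **decode_kwargs)
        self.queue = queue

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # Push each decoded chunk onto the queue; a None sentinel tells
        # response_generator that generation is finished.
        self.queue.put(text)
        if stream_end:
            self.queue.put(None)
```

Because `model.generate` runs in its own thread, this queue is the only bridge between the generation loop and the async generator that feeds `StreamingResponse`.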
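
The post focuses on the service class itself; how it gets exposed over HTTP isn't shown. As a rough sketch (the route path, app setup, and use of `PredictionRequest` as a request body model are assumptions, not the actual Jaqpot integration), wiring it into a FastAPI app could look like this:

```python
# Hypothetical wiring of ModelService into a FastAPI app; the actual
# Jaqpot integration may look different.
from fastapi import FastAPI
from jaqpot_api_client.models.prediction_request import PredictionRequest

from src.model_service import ModelService  # assumed module path for the class above

app = FastAPI()
service = ModelService()  # loads model.pkl and tokenizer.pkl once at startup


@app.post("/infer")
def infer(request: PredictionRequest):
    # ModelService.infer returns a StreamingResponse, so tokens are flushed
    # to the client as soon as they are generated.
    return service.infer(request)
```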

## Creating the frontend

Our frontend is built with Next.js and the Next UI component library, providing a modern and responsive interface. Next UI gives us a set of beautiful, accessible components that work great with Next.js 13+.

We'll break down our frontend into two main components:

### Navigation component
```typescript
export default function LLMNavigation({ model }: LLMTabsProps) {
  const params = useParams<{ datasetId: string }>();
  const [selectedKeys, setSelectedKeys] = useState<Set<string>>(
    params.datasetId ? new Set([params.datasetId]) : new Set<string>(),
  );

  // handle chat history and navigation
  return (
    <div className="flex flex-row gap-5">
      <div className="flex max-w-48 flex-col gap-3">
        <Button
          color="primary"
          onPress={async () => {
            router.push(`${pathnameWithoutDatasetId}/new`);
          }}
        >
          start new chat
        </Button>
        <Table
          selectedKeys={selectedKeys}
          onSelectionChange={(keys) => {
            router.push(`${pathnameWithoutDatasetId}/${[...keys][0]}`);
          }}
          selectionMode="single"
        >
          {/* table content */}
        </Table>
      </div>
      <LLMForm model={model} datasetId={params.datasetId} />
    </div>
  );
}
```

The navigation component manages chat history and session navigation. It uses Next UI's Table and Button components for a polished look.

### Chat form component
```typescript
export function LLMForm({ model, datasetId }: LLMFormProps) {
  const [isFormLoading, setIsFormLoading] = useState(false);
  const [currentResponse, setCurrentResponse] = useState<string>();
  const [chatHistory, setChatHistory] = useState<Array<ChatMessage>>([]);

  const createStreamingPrediction = async (
    modelId: string,
    datasetId: string,
    streamingPredictionRequestDto: { prompt: string },
  ) => {
    const apiResponse = await fetch(
      `/api/models/${modelId}/predict/stream/${datasetId}`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'text/event-stream' },
        body: JSON.stringify(streamingPredictionRequestDto),
      },
    );

    if (!apiResponse.body) return;

    const reader = apiResponse.body
      .pipeThrough(new TextDecoderStream())
      .getReader();
    let fullResponse = '';

    while (true) {
      const { value, done } = await reader.read();
      if (done) break;

      if (value) {
        fullResponse += value + ' ';
        setCurrentResponse(fullResponse);
      }
    }

    return fullResponse;
  };

  const handleFormSubmit = async () => {
    // handle form submission and streaming
    try {
      const response = await createStreamingPrediction(
        model.id!.toString(),
        dataset.id!.toString(),
        { prompt },
      );

      setChatHistory((prev) => [
        ...prev,
        { prompt, output: response },
      ]);
    } catch (error) {
      // error handling
    }
  };

  return (
    <Card>
      <CardBody>
        <div ref={chatContainerRef}>
          {/* chat messages */}
          <form>
            <Textarea
              onKeyDown={handleKeyPress}
              endContent={
                <Button
                  isIconOnly
                  onPress={handleFormSubmit}
                  isLoading={isFormLoading}
                >
                  <ArrowUpIcon />
                </Button>
              }
            />
          </form>
        </div>
      </CardBody>
    </Card>
  );
}
```

The chat form component is where the magic happens. It:
- Manages the chat interface with Next UI components (Card, Button, Textarea) for a clean, responsive, accessible UI
- Handles real-time streaming with the Web Streams API
- Maintains chat history and UI state, with state persistence
- Shows loading states for a smooth typing experience
- Auto-scrolls the chat container as new messages arrive
- Supports keyboard shortcuts (Enter to send)
- Handles errors and recovers gracefully

## Deployment

Once your application is ready:
1. Build your Docker image with the model service
2. Set up your Next.js frontend
3. Configure your environment variables
4. Deploy using your preferred hosting solution
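
After deployment, it's worth smoke-testing the streaming endpoint from outside the browser. Here's a small Python sketch using `requests`; the URL, model and dataset IDs, and payload shape are assumptions modelled on the frontend fetch call above:

```python
# Hypothetical smoke test for the streaming endpoint; the URL, IDs, and payload
# shape are assumptions based on the frontend fetch call shown earlier.
import json

import requests

MODEL_ID = "1983"          # placeholder model id
DATASET_ID = "my-dataset"  # placeholder dataset id
url = f"https://your-app.example.com/api/models/{MODEL_ID}/predict/stream/{DATASET_ID}"

payload = {"prompt": "Hello! What can you do?"}

with requests.post(url, data=json.dumps(payload), stream=True, timeout=300) as resp:
    resp.raise_for_status()
    # Print chunks as they arrive instead of waiting for the full response.
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8", errors="ignore"), end="", flush=True)
```

If tokens appear one by one rather than all at once, the streaming path is working end to end.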

In part 2, I'll explain how to deploy this setup on AWS: an EKS cluster with a GPU node group of g4dn.xlarge EC2 instances that is used only by your LLM pod.

P.S. This exact setup is currently running on the Jaqpot platform at `https://app.jaqpot.org/dashboard/models/1983/description`. Check it out to see it in action!

export default function MDXPage({children}) {
  return <MdxLayout metadata={metadata}>{children}</MdxLayout>
}