diff --git a/higgs/README.md b/higgs/README.md new file mode 100644 index 00000000..b47c269d --- /dev/null +++ b/higgs/README.md @@ -0,0 +1,134 @@
# Higgs Audio v2 Generation 3B vLLM Truss

Higgs Audio v2 Generation 3B is a multimodal model that generates audio content from text and audio inputs.

This is a [Truss](https://truss.baseten.co/) for Higgs Audio using the vLLM OpenAI-compatible server. Instead of a handwritten `model.py`, this Truss runs `vllm serve` directly at startup and serves requests through the HTTP endpoint exposed by vLLM's OpenAI-compatible server.

## Deployment

First, clone this repository and change into the model directory:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/higgs
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
3. Retrieve your Hugging Face token from the [settings](https://huggingface.co/settings/tokens).
4. Set your Hugging Face token as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_token`. Note that you will *not* be able to successfully deploy Higgs Audio without doing this.

With `higgs` as your working directory, you can deploy the model with:

```sh
truss push --publish
```

Paste your Baseten API key if prompted.

For more information, see the [Truss documentation](https://truss.baseten.co).

## vLLM OpenAI Compatible Server

This Truss demonstrates how to start [vLLM's OpenAI-compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) without a `model.py`, using the `docker_server.start_command` option.

The server is configured with the following parameters:
- **Model**: `bosonai/higgs-audio-v2-generation-3B-base`
- **Audio Tokenizer**: `bosonai/higgs-audio-v2-tokenizer`
- **Max Model Length**: 8192 tokens
- **GPU Memory Utilization**: 80%
- **Audio Limit**: 50 audio inputs per prompt
- **MM Preprocessor Cache**: Disabled for consistent performance across requests
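For reference, the full start command from `config.yaml` is reproduced below. Note that no audio-tokenizer flag is passed on the command line; the `bosonai/higgs-audio-v2-tokenizer` is assumed to be wired up by the `bosonai/higgs-audio-vllm` base image.

```sh
vllm serve bosonai/higgs-audio-v2-generation-3B-base \
  --served-model-name higgs-audio-v2-generation-3B-base \
  --limit-mm-per-prompt audio=50 \
  --max-model-len 8192 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --disable-mm-preprocessor-cache
```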
## API Documentation

The API follows the OpenAI ChatCompletions format, so you can interact with the model through any standard OpenAI client.

### Basic Example

```python
from openai import OpenAI

model_id = "YOUR_MODEL_ID"  # Replace with your model ID

client = OpenAI(
    api_key="YOUR-API-KEY",
    base_url=f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"
)

response = client.chat.completions.create(
    model="higgs-audio-v2-generation-3B-base",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate audio based on this description: upbeat electronic music"
                },
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://example.com/reference-audio.wav"}
                }
            ]
        }
    ],
    max_tokens=512,
    temperature=0.7
)

print(response.choices[0].message.content)
```

### Streaming Example

```python
response = client.chat.completions.create(
    model="higgs-audio-v2-generation-3B-base",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Create ambient background music"
                }
            ]
        }
    ],
    stream=True,
    max_tokens=512
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

## Model Features

- **Audio Generation**: Generate high-quality audio content from text descriptions
- **Multimodal Input**: Accepts both text and audio inputs for context-aware generation
- **Flexible Length**: Supports a context of up to 8192 tokens with configurable output length
- **Streaming Support**: Real-time streaming responses for interactive applications
- **Audio Tokenization**: Uses the dedicated `bosonai/higgs-audio-v2-tokenizer` for audio tokens

## Configuration Options

The model is configured with several performance-oriented settings:

- **GPU Memory Utilization**: Set to 80% of available GPU memory
- **Audio Limit**: Up to 50 audio inputs per prompt for complex audio generation tasks
- **Disabled MM Preprocessor Cache**: Ensures consistent performance across requests
- **Max Model Length**: 8192 tokens for extended context handling

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.

## License

Please refer to the original model's license at [bosonai/higgs-audio-v2-generation-3B-base](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base) for usage terms and conditions.
diff --git a/higgs/config.yaml b/higgs/config.yaml new file mode 100644 index 00000000..36611b4b --- /dev/null +++ b/higgs/config.yaml @@ -0,0 +1,43 @@ +description: Higgs Audio v2 Generation 3B - Audio generation model with vLLM +base_image: + image: bosonai/higgs-audio-vllm:latest +model_metadata: + repo_id: bosonai/higgs-audio-v2-generation-3B-base + example_model_input: { + "model": "higgs-audio-v2-generation-3B-base", + "messages": [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "Generate audio based on this description" + }, + { + "type": "audio_url", + "audio_url": {"url": "https://example.com/sample.wav"} + } + ] + } + ], + "max_tokens": 512, + "temperature": 0.7 + } + tags: + - openai-compatible + - audio-generation + - multimodal +docker_server: + start_command: sh -c "vllm serve bosonai/higgs-audio-v2-generation-3B-base --served-model-name higgs-audio-v2-generation-3B-base --limit-mm-per-prompt audio=50 --max-model-len 8192 --port 8000 --gpu-memory-utilization 0.8 --disable-mm-preprocessor-cache" + readiness_endpoint: /health + liveness_endpoint: /health + predict_endpoint: /v1/chat/completions + server_port: 8000 +resources: + accelerator: H100 + use_gpu: true +runtime: + predict_concurrency: 16 +model_name: Higgs Audio v2 Generation 3B +environment_variables: + VLLM_LOGGING_LEVEL: INFO
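Once deployed, the `predict_endpoint` above maps to the standard ChatCompletions route, so the deployment can be smoke-tested directly over HTTP. A minimal sketch, assuming a published deployment and using `YOUR_MODEL_ID` and `YOUR-API-KEY` as placeholders:

```sh
# Minimal text-only request against the OpenAI-compatible endpoint.
# The /environments/production/sync/v1 base path matches the README example;
# Baseten expects the API key in an "Authorization: Api-Key ..." header.
curl -s "https://model-YOUR_MODEL_ID.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "higgs-audio-v2-generation-3B-base",
        "messages": [
          {"role": "user", "content": "Generate audio based on this description: upbeat electronic music"}
        ],
        "max_tokens": 512
      }'
```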