This Open LLM framework serves as a powerful and flexible tool for generating text embeddings and chat completions using state-of-the-art open-source language models. Built on the Transformers library, it exposes a variety of natural language processing (NLP) tasks through simple HTTP endpoints modeled on the OpenAI API.
- Provides an easy-to-use API for running powerful NLP models locally, without requiring deep machine-learning expertise.
- Integrates with a variety of applications, including chatbots, content-creation tools, and recommendation systems.
- Supports multiple models from the Transformers library, enabling diverse NLP tasks.
- Uses GPU acceleration when available to improve processing speed and efficiency.
- Supports tunneling to expose the endpoints to other hosts.
- Reduces dependency on external APIs, potentially lowering operational costs.
- Gives you control over the computational resources used, allowing you to optimize for cost and performance.
For GraphRAG:
- Python >= 3.10
- Docker >= 23.0.3
```shell
# Clone the repository
git clone https://github.com/rushizirpe/open-llm-server.git

# Install dependencies
cd open-llm-server
pip install -e .

# Launch the server
llm-server start --host 127.0.0.1 --port 8888 --reload
```
Params:

- `start`: Start the server
- `stop`: Stop the server
- `status`: Check the server status
- `--host`: Specify the host IP (default: 127.0.0.1)
- `--port`: Specify the port number (default: 8888)
- `--reload`: Enable auto-reload for development
- Chat Completions: `/v1/chat/completions`
- Embeddings: `/v1/embeddings`
- System Metrics: `/v1/metrics`
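All three endpoints live under the same base URL. A minimal sketch of composing the full URLs client-side, assuming the server's default host and port from above:

```python
# Compose absolute endpoint URLs from the server's default host/port.
BASE_URL = "http://127.0.0.1:8888"

ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "embeddings": "/v1/embeddings",
    "metrics": "/v1/metrics",
}

def endpoint_url(name: str) -> str:
    """Return the absolute URL for a named endpoint."""
    return BASE_URL + ENDPOINTS[name]
```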
```shell
# Pull the Docker image
docker pull thisisrishi/open-llm-server

# Run the container
docker run -it -p 8888:8888 thisisrishi/open-llm-server:latest
```

OR

```shell
# Run on a custom port
docker run -e PORT=8000 -p 8000:8000 thisisrishi/open-llm-server:latest
```

OR

```shell
# Create and start the container with Compose
docker compose up
```
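For the Compose option, a minimal `docker-compose.yml` consistent with the image and port mapping used above might look like the following sketch (the service name is a placeholder; adjust the port mapping as needed):

```yaml
services:
  open-llm-server:    # service name is an arbitrary placeholder
    image: thisisrishi/open-llm-server:latest
    ports:
      - "8888:8888"   # host:container, matching the docker run example
```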
- URL: `/`
- Method: `GET`
- Description: Check the status of the API and the availability of a GPU.
- Usage:

```shell
curl http://localhost:8888/
```

- Response:

```json
{
  "status": "System Status: Operational",
  "gpu": "Available",
  "gpu_details": {
    "GPU 0": {
      "compute_capability": "(8, 9)",
      "device_name": "NVIDIA L4"
    }
  }
}
```
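Client code can inspect this response to decide whether GPU-backed models are worth requesting. A small sketch using the sample payload above (in practice you would fetch it over HTTP rather than from a string literal):

```python
import json

# Sample health-check response, as documented for GET / above.
raw = """{
  "status": "System Status: Operational",
  "gpu": "Available",
  "gpu_details": {
    "GPU 0": {"compute_capability": "(8, 9)", "device_name": "NVIDIA L4"}
  }
}"""

def gpu_available(payload: dict) -> bool:
    """Return True when the server reports a usable GPU."""
    return payload.get("gpu") == "Available"

health = json.loads(raw)
print(gpu_available(health))  # True for the sample above
```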
- URL: `/v1/embeddings`
- Method: `POST`
- Description: Generate embeddings for a list of input texts using a specified model.
- Usage:

```shell
curl http://localhost:8888/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer DUMMY_KEY" \
  -d '{"input": ["the quick brown fox", "the lazy dog"], "model": "nomic-ai/nomic-embed-text-v1.5"}'
```

- Response:

```json
{
  "object": "list",
  "data": [
    {"embedding": [0.56324344, 0.25775233, -0.123355], "index": 0},
    {"embedding": [0.30823462, -0.23636326, 0.543345], "index": 1}
  ],
  "model": "nomic-ai/nomic-embed-text-v1.5",
  "usage": {"total_tokens": 5}
}
```
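A common next step with embeddings is comparing texts by cosine similarity. A minimal sketch using the (truncated) example vectors from the response above:

```python
import math

# Truncated example vectors taken from the sample response above.
a = [0.56324344, 0.25775233, -0.123355]
b = [0.30823462, -0.23636326, 0.543345]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# A vector compared with itself scores ~1.0; unrelated vectors score lower.
print(round(cosine_similarity(a, a), 6))
print(round(cosine_similarity(a, b), 6))
```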
- URL: `/v1/chat/completions`
- Method: `POST`
- Description: Generate chat completions based on conversation history using a specified model.
- Request body:

```json
{
  "model": "openai-community/gpt2",
  "messages": [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
  ],
  "max_tokens": 150,
  "temperature": 0.7,
  "top_p": 1.0,
  "n": 1,
  "stop": null
}
```

- Usage:

```shell
curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer DUMMY_KEY" \
  -d '{"model": "openai-community/gpt2", "messages": [{"role": "user", "content": "Hi!"}], "max_tokens": 150, "temperature": 0.7}'
```

- Response:

```json
{
  "choices": [
    {"index": 0, "text": "Hello, I can help you with a variety of tasks, such as ..."}
  ]
}
```
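The same request body can be built programmatically instead of hand-writing JSON in a curl command. A minimal Python sketch that serializes a request matching the schema above (the helper name is illustrative, not part of the API; sending it requires a running server, so the HTTP call is shown only as a comment):

```python
import json

def build_chat_request(user_message: str,
                       model: str = "openai-community/gpt2",
                       max_tokens: int = 150,
                       temperature: float = 0.7) -> str:
    """Serialize a chat-completions request body matching the documented schema."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return json.dumps(body)

payload = build_chat_request("Hi!")

# To send it against a running server (illustrative, stdlib only):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8888/v1/chat/completions",
#     data=payload.encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer DUMMY_KEY"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```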
This project is licensed under the MIT License. See the LICENSE file for more details.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
For any inquiries or support, please contact me.