
[Issue]: Local LLM and local embedding error? #605

Closed
1444141859 opened this issue Jul 18, 2024 · 4 comments
Labels
community_support Issue handled by community members

Comments

@1444141859

Describe the issue

I use vLLM to launch a local large model with an OpenAI-style API, but it won't work.

Steps to reproduce

step1: python -m vllm.entrypoints.openai.api_server --max-model-len 6144 --gpu-memory-utilization 0.95 --disable-log-stats --served-model-name Qwen2-7B-Instruct --model /mnt/workspace/Qwen2-7B-Instruct
step2: start the embedding service

import os
from contextlib import asynccontextmanager
from typing import List, Union

import tiktoken
import torch
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from sse_starlette.sse import EventSourceResponse

# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', '/mnt/workspace/m3e-base')


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # free GPU memory on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class CompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    data: list
    model: str
    object: str
    usage: CompletionUsage


class EmbeddingRequest(BaseModel):
    input: Union[List[str], str]
    model: str


@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
    if isinstance(request.input, str):
        embeddings = [embedding_model.encode(request.input)]
    else:
        embeddings = [embedding_model.encode(text) for text in request.input]
    embeddings = [embedding.tolist() for embedding in embeddings]

    def num_tokens_from_string(string: str) -> int:
        encoding = tiktoken.get_encoding('cl100k_base')
        num_tokens = len(encoding.encode(string))
        return num_tokens

    response = {
        "data": [
            {
                "object": "embedding",
                "embedding": embedding,
                "index": index
            }
            for index, embedding in enumerate(embeddings)
        ],
        "model": request.model,
        "object": "list",
        "usage": CompletionUsage(
            prompt_tokens=sum(len(text.split()) for text in request.input),
            completion_tokens=0,
            total_tokens=sum(num_tokens_from_string(text) for text in request.input),
        )
    }
    return response


if __name__ == "__main__":
    # load the embedding model and serve it on port 8001
    embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
    uvicorn.run(app, host='0.0.0.0', port=8001, workers=1)
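
For reference (not part of the original report), a minimal sanity check of the embedding service above; it assumes the server is running on localhost:8001 and that the requests package is installed:

# Sanity check for the local /v1/embeddings endpoint above.
# Assumes the server is running on localhost:8001 and `requests` is installed.
import requests

resp = requests.post(
    "http://localhost:8001/v1/embeddings",
    json={"input": ["hello world", "graph rag"], "model": "m3e-base"},
)
resp.raise_for_status()
body = resp.json()

# The response should follow the OpenAI embeddings format defined by
# EmbeddingResponse: an object/model/usage header plus one vector per input.
print(body["object"], body["model"], body["usage"])
print(len(body["data"]), "embeddings; first vector length:", len(body["data"][0]["embedding"]))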

step3: pip install graphrag
step4: mkdir -p ./ragtest/input
step5: curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
step6: python -m graphrag.index --init --root ./ragtest
step7: modify the generated settings.yaml file
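
Before re-running the indexer, a quick check (a sketch, not part of the original steps) that the endpoint GraphRAG will call really serves the configured model name; it assumes the openai Python package (v1.x) is installed and the vLLM server from step 1 is listening on port 8000:

# Confirm the vLLM OpenAI-compatible server exposes the model name used in
# settings.yaml. Assumes `pip install openai` (v1.x) and the server from step 1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The id listed here must match the `model:` value in the llm section
# (Qwen2-7B-Instruct); a mismatch makes every chat request from the indexer fail.
print([m.id for m in client.models.list().data])

reply = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(reply.choices[0].message.content)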

GraphRAG Config Used

No response

Logs and screenshots

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: Qwen2-7B-Instruct
  model_supports_json: false # recommended if this is available for your model.
  max_tokens: 2000
  request_timeout: 180.0
  api_base: http://localhost:8000/v1/
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: m3e-base
    api_base: http://localhost:8001/v1/
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
@1444141859 added the triage label on Jul 18, 2024
@karthik-codex

The local search with embeddings from Ollama now works.
You can read the full guide here:
https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f
Here is the link to the repo:
https://github.com/karthik-codex/autogen_graphRAG

@Nomore0912

Is your port number wrong?

@rushizirpe

If you want to use open-source models, I've created a repository for deploying Hugging Face models to local endpoints, offering functionality similar to OpenAI APIs. You can find the repo here: https://github.com/rushizirpe/open-llm-server

Also, I've prepared a Colab notebook for the GraphRAG demo. You might want to take a look: https://colab.research.google.com/drive/1uhFDnih1WKrSRQHisU-L6xw6coapgR51?usp=sharing.
If you don't have access to GPUs like the A100, you'll need a GROQ_API_KEY (which is free with certain limitations); you can obtain one from https://console.groq.com/keys

@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse closed this as not planned on Jul 22, 2024
@natoverse added the community_support label and removed the triage label on Jul 22, 2024