[Ollama] GraphRAG Community Support for running Ollama #345
Embeddings are not working with Ollama... I was able to get things working with Ollama for the entities and OpenAI for embeddings. |
A working config can be found here: #339 (comment) |
Ollama works as expected:
GRAPHRAG_API_KEY=123
GRAPHRAG_API_BASE=http://172.17.0.1:11434/v1
# GRAPHRAG_LLM_MODEL=llama3:instruct
GRAPHRAG_LLM_MODEL=codestral
GRAPHRAG_LLM_THREAD_COUNT=4
GRAPHRAG_LLM_CONCURRENT_REQUESTS=8
GRAPHRAG_LLM_MAX_TOKENS=2048
GRAPHRAG_EMBEDDING_API_BASE=http://172.17.0.1:11435/v1
GRAPHRAG_EMBEDDING_MODEL=mxbai-embed-large
API shapes
OAI:
JSON.stringify({
  object: "list",
  data: [
    ...results.map((r, i) => ({
      object: "embedding",
      index: i,
      embedding: r.embedding,
    })),
  ],
  model,
  usage: {
    prompt_tokens: 0,
    total_tokens: 0,
  },
})
Ollama:
JSON.stringify({
  model,
  prompt: input,
}) |
Sorry for what might be obvious... but how do you run this proxy? When I run ollama serve it only listens on the default port, not on 11435. What do you use to run this proxy? |
@bmaltais, no worries! 11435 is a proxy server written in JS/Node specifically to map requests/responses between the OAI and Ollama formats. I didn't list the whole code as it's pretty much from the Node docs. |
This is what I was afraid of ;-) I guess I will wait for something to be built by someone. I don't understand enough about node.js to build this. |
Can you please explain how you did this for the embeddings API? |
It works with Ollama embeddings by changing the file /opt/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py with:
from typing_extensions import Unpack
from graphrag.llm.base import BaseLLM
from .openai_configuration import OpenAIConfiguration
import ollama

class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
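The snippet above is cut off. For reference, here is a minimal sketch of what a complete Ollama-backed replacement for that file could look like; it assumes the ollama Python package is installed and that EmbeddingInput, EmbeddingOutput, and LLMInput live in graphrag.llm.types (as in graphrag 0.1.x/0.2.x), and it is not necessarily the exact code the commenter used:

import ollama
from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import EmbeddingInput, EmbeddingOutput, LLMInput

from .openai_configuration import OpenAIConfiguration


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """A text-embedding LLM that calls a local Ollama server instead of OpenAI."""

    def __init__(self, client, configuration: OpenAIConfiguration):
        self.client = client  # kept for interface compatibility; unused when calling Ollama
        self.configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        args = {
            "model": self.configuration.model,
            **(kwargs.get("model_parameters") or {}),
        }
        # Ollama's embeddings endpoint takes one prompt at a time,
        # so loop over the batch and collect one vector per input text.
        embedding_list = []
        for inp in input:
            embedding = ollama.embeddings(model=args["model"], prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list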
|
Can you please provide the complete /opt/anaconda3/envs/graphrag/lib/python3.11/site-packages/graphrag/llm/openai/openai_embeddings_llm.py replacement code and also the settings file? |
@SpaceLearner Does it work when you try to query? I adapted your code to work with LangChain; it creates the embeddings... but when I try to do a local query I get an error. This is my embeddings version: |
This is the error: |
I suspect the query embeddings code also needs to be modified... |
Hack the file with the following contents (tip: only the OllamaEmbeddings import and the embedding calls in _embed_with_retry / _aembed_with_retry differ from the stock file):
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""OpenAI Embedding model implementation."""

import asyncio
from collections.abc import Callable
from typing import Any

import numpy as np
import tiktoken
from tenacity import (
    AsyncRetrying,
    RetryError,
    Retrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

from graphrag.query.llm.base import BaseTextEmbedding
from graphrag.query.llm.oai.base import OpenAILLMImpl
from graphrag.query.llm.oai.typing import (
    OPENAI_RETRY_ERROR_TYPES,
    OpenaiApiType,
)
from graphrag.query.llm.text_utils import chunk_text
from graphrag.query.progress import StatusReporter

from langchain_community.embeddings import OllamaEmbeddings


class OpenAIEmbedding(BaseTextEmbedding, OpenAILLMImpl):
    """Wrapper for OpenAI Embedding models."""

    def __init__(
        self,
        api_key: str | None = None,
        azure_ad_token_provider: Callable | None = None,
        model: str = "text-embedding-3-small",
        deployment_name: str | None = None,
        api_base: str | None = None,
        api_version: str | None = None,
        api_type: OpenaiApiType = OpenaiApiType.OpenAI,
        organization: str | None = None,
        encoding_name: str = "cl100k_base",
        max_tokens: int = 8191,
        max_retries: int = 10,
        request_timeout: float = 180.0,
        retry_error_types: tuple[type[BaseException]] = OPENAI_RETRY_ERROR_TYPES,  # type: ignore
        reporter: StatusReporter | None = None,
    ):
        OpenAILLMImpl.__init__(
            self=self,
            api_key=api_key,
            azure_ad_token_provider=azure_ad_token_provider,
            deployment_name=deployment_name,
            api_base=api_base,
            api_version=api_version,
            api_type=api_type,  # type: ignore
            organization=organization,
            max_retries=max_retries,
            request_timeout=request_timeout,
            reporter=reporter,
        )
        self.model = model
        self.encoding_name = encoding_name
        self.max_tokens = max_tokens
        self.token_encoder = tiktoken.get_encoding(self.encoding_name)
        self.retry_error_types = retry_error_types

    def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )
                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()

    async def aembed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's async function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        embedding_results = await asyncio.gather(*[
            self._aembed_with_retry(chunk, **kwargs) for chunk in token_chunks
        ])
        embedding_results = [result for result in embedding_results if result[0]]
        chunk_embeddings = [result[0] for result in embedding_results]
        chunk_lens = [result[1] for result in embedding_results]
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)  # type: ignore
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()

    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    # Embed via a local Ollama model instead of the OpenAI API
                    embedding = (
                        OllamaEmbeddings(
                            model=self.model,
                        ).embed_query(text)
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)

    async def _aembed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = AsyncRetrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            async for attempt in retryer:
                with attempt:
                    # Use the async aembed_query here; embed_query is synchronous and cannot be awaited
                    embedding = (
                        await OllamaEmbeddings(
                            model=self.model,
                        ).aembed_query(text)
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)
 |
It seems I have it working now. It returns nothing if I set the LLM to llama3, but works OK when switching to mistral. |
To change the OpenAI request format to the one supported by Ollama, the settings only require the base_url parameter, for example: api_base: http://localhost:8000/v1
|
@gdhua, your prompt-fu failed you: this proxy server doesn't transform the embeddings API between the OAI and Ollama formats. @bmaltais, here's the final version of the proxy I ended up using. There was another issue with the fact that GraphRAG sends raw token IDs into the embeddings API rather than non-tokenised raw text. Proxy server for OpenAI <-> Ollama embeddings:
import os
import sys
import json
import logging
import asyncio

from aiohttp import web
import aiohttp
import tiktoken

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

config = {
    "proxy_port": int(os.environ.get("PROXY_PORT", 11435)),
    "api_url": os.environ.get("OLLAMA_ENDPOINT"),
    "tiktoken_encoding": "cl100k_base",
}

encoding = tiktoken.get_encoding(config["tiktoken_encoding"])


async def handle_embeddings(request):
    try:
        body = await request.json()
        model = body["model"]
        input_data = body["input"]
        print(f"/v1/embeddings handler {str(input_data)[:100]}")
        if isinstance(input_data, str):
            input_data = [input_data]
        results = await asyncio.gather(*[fetch_embeddings(model, i) for i in input_data])
        response_data = {
            "object": "list",
            "data": [
                {
                    "object": "embedding",
                    "index": i,
                    "embedding": r["embedding"],
                }
                for i, r in enumerate(results)
            ],
            "model": model,
            "usage": {
                "prompt_tokens": 0,
                "total_tokens": 0,
            },
        }
        return web.json_response(response_data)
    except Exception as e:
        print(f"Error: {str(e)}")
        return web.Response(status=500)


async def fetch_embeddings(model, input_text):
    # If a single int - decode it with tiktoken
    if isinstance(input_text, int):
        input_text = encoding.decode([input_text])
    # If an array of ints - decode the token IDs with tiktoken
    if isinstance(input_text, list):
        input_text = encoding.decode(input_text)
    if not isinstance(input_text, str):
        raise ValueError(f"Input is not a string: {input_text}")
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{config['api_url']}/api/embeddings",
            headers={"Content-Type": "application/json"},
            json={"model": model, "prompt": input_text},
        ) as response:
            text = await response.text()
            json_data = json.loads(text)
            print(f"Embeddings: {input_text[:50]}... -> {text[:50]}...")
            return json_data


def main():
    print("Starting embeddings proxy...")
    if not config["api_url"]:
        raise ValueError("OLLAMA_ENDPOINT environment variable is required")
    app = web.Application()
    app.router.add_post("/v1/embeddings", handle_embeddings)
    web.run_app(app, port=config["proxy_port"], host="0.0.0.0")


if __name__ == "__main__":
    main()

A few caveats:
|
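A possible way to run the proxy above (assuming it is saved as proxy.py, which is my own naming): OLLAMA_ENDPOINT=http://localhost:11434 python proxy.py starts it on port 11435 (configurable via PROXY_PORT), after which GRAPHRAG_EMBEDDING_API_BASE can point at http://localhost:11435/v1. |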
@xiaoquisme, errors when using... However, a disclaimer is that my llama3 sometimes (where GPT rarely does) forgets to answer in JSON structure at all for queries like "Can you give me a joke for people who read about this". I think this may only be fixed by improving the prompts or using a more "obedient" model. |
this worked for me. |
I'm making this thread our official discussion place for Ollama setup and troubleshooting. |
This is a temporary hacked-together solution for Ollama. |
Thanks. For anyone who doesn't use LangChain and just wants to use Ollama's embedding model, you can make these changes and it will work for global query answering (see the sketch below): |
And yes, when doing a local query there will still be an error concerning another function in this same .py file. |
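For illustration, a minimal sketch of calling Ollama's embedding model without LangChain, using the ollama Python package directly (the helper name and the default model are examples, not the commenter's exact change):

import ollama


def embed_with_ollama(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Return the embedding vector for text from a local Ollama server."""
    response = ollama.embeddings(model=model, prompt=text)
    return response["embedding"]


# e.g. used inside _embed_with_retry() in place of the OllamaEmbeddings(...).embed_query(text) call:
# embedding = embed_with_ollama(text, model=self.model) or []
 |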
I was able to get GraphRAG + Ollama up and running. However, indexing took several hours. For example, does it make sense to edit the parallelization section in the YAML file somehow? What is the default num_threads, since it is commented out in the initially created file? PS: Not sure if this is the right place to ask, but the title says '[GraphRAG Community Support for running Ollama]'. Thanks for any ideas / comments! |
Unfortunately, the way GraphRAG works at the moment is very, very GPU-demanding. I think this is what will keep it from being used on local computers. GraphRAG takes an hour to process one book that takes 10 seconds to process on my custom RAG system. And my custom system uses a query-augmentation strategy: RAG answers to the original question are used to produce a better question, along with a list of important keywords, which is fed back as the augmented question. This ensures better embedding matching across the whole document and produces better answers than GraphRAG (and even NotebookLM) most of the time... I can share the code if you are interested. |
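For illustration only, a generic sketch of that query-augmentation idea (not the author's actual code; retrieve and ask_llm are hypothetical placeholders for whatever retriever and LLM client are in use):

def augment_question(question: str, retrieve, ask_llm) -> str:
    """Two-pass RAG: use a draft answer to produce a sharper question plus keywords."""
    # Pass 1: draft an answer to the original question from an initial retrieval
    draft_context = retrieve(question)
    draft_answer = ask_llm(
        f"Answer using only this context:\n{draft_context}\n\nQuestion: {question}"
    )
    # Pass 2: rewrite the question and extract important keywords, informed by the draft answer;
    # the result is used as the augmented query for the final retrieval and answer
    return ask_llm(
        "Rewrite the question below to be more specific and append a list of important keywords, "
        f"based on this draft answer:\n{draft_answer}\n\nOriginal question: {question}"
    )
 |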
That is why I am looking for ways to speed it up. Anyone with ideas that go beyond the default settings? |
Consolidating Ollama-related issues: #657 |
Your process sounds interesting; care to share it? |
Been a while since I touched it. Was working pretty well. Let me see if I can push the code to github so you can have a look at it. |
@Tipik1n Here is the repo: https://github.com/bmaltais/AIResearcher Quick how to use:
To enhance the question, type:
Here is the sample output of a question on "The Project Gutenberg eBook of A Room with a View": |
|
I think this is a discussion about GraphRAG + Ollama...?? |
Hi, unfortunately it is not leveraging GraphRAG. I was just providing a link to a custom RAG solution that performs pretty well. As much as I like GraphRAG, it is too resource-demanding for the added benefits. |
Is there a working example of using Ollama? Or is it not supposed to work? I did try, but without any success.
Thanks in advance