Add MetaData LLM call #20
base: main
Conversation
Related to CatchTheTornado#16

Add an optional LLM call for generating tags and a summary of the file.

* **app/main.py**
  - Add a new endpoint `/llm_tags_summary` to generate tags and a summary using the LLM.
  - Update the `OllamaGenerateRequest` class to include a new field `generate_tags_summary`.
  - Update the `generate_llama` function to handle the new `generate_tags_summary` field.
* **app/tasks.py**
  - Add a new function `generate_tags_summary` to generate tags and a summary using the LLM.
  - Update the `ocr_task` function to include an optional call to `generate_tags_summary` after extracting text.
* **client/cli.py**
  - Add a new command `llm_tags_summary` for generating tags and a summary.
  - Update the `main` function to handle the new `llm_tags_summary` command.
* **.env.example**
  - Add a new environment variable `LLM_TAGS_SUMMARY_API_URL`.
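The optional follow-up call described above could be sketched roughly like this; the `llm_call` callable and the tuple return shape are assumptions for illustration, not the PR's actual code:

```python
def extract_with_tags_summary(extracted_text, prompt, model, llm_call):
    """Optionally run a second LLM call on the extracted text.

    `llm_call` is a hypothetical stand-in for the Ollama client call.
    """
    if prompt and model:
        tags_summary = llm_call(f"{prompt}\n\n{extracted_text}", model)
        return extracted_text, tags_summary
    # Without a prompt and model, only the extraction result is returned.
    return extracted_text, None
```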
@@ -116,3 +117,25 @@ async def generate_llama(request: OllamaGenerateRequest):

    generated_text = response.get("response", "")
    return {"generated_text": generated_text}

@app.post("/llm_tags_summary")
Please name it `llm_metadata` and use this name instead of `tags_summary`.
    return extracted_text


def generate_tags_summary(prompt, model):
Please rename it to `generate_metadata` instead.
@@ -59,7 +59,27 @@ def ocr_task(self, pdf_bytes, strategy_name, pdf_hash, ocr_cache, prompt, model)
        num_chunk += 1
        extracted_text += chunk['response']

    self.update_state(state='DONE', meta={'progress': 100, 'status': 'Processing done!', 'start_time': start_time, 'elapsed_time': time.time() - start_time})  # Example progress update
    # Optional call to generate tags and summary
    if prompt and model:
Please add an option `generate_metadata`, set to true by default; metadata is generated only when this option is set, regardless of whether a prompt was given.
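A minimal sketch of that flag, using a plain dataclass as a stand-in for the actual pydantic request model (field names beyond those mentioned in the PR are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OllamaGenerateRequest:
    model: str
    prompt: Optional[str] = None
    generate_metadata: bool = True  # default true, as requested in review

def should_generate_metadata(request: OllamaGenerateRequest) -> bool:
    # The flag alone decides; a missing prompt does not disable metadata.
    return request.generate_metadata
```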
Thanks, this is cool! I requested minor changes. When they are applied, and when #10 is merged, I'll also ask you to extend this feature so the metadata can be used within storage strategies to format file names (to use the tags or other fields within file names).
One more thing - please use a defined prompt (can be an env variable with a nice default) so the prompt used for metadata is configurable. It should return a JSON object with metadata:

{
  "title": "",
  "filename_title": "",
  "tags": "",
  "summary": ""
}
We can't mix the tags and metadata into the general output - they should be used for the naming strategy, saved within the celery task output and in the storage, and it should be possible to fetch them via the web API.
I can work on these extensions once you apply the other changes I suggested in the PR.
Sorry for so many changes - this task was simply not yet specified 😅
    Endpoint to generate tags and summary using the Llama 3.1 model (and other models) via the Ollama API.
    """
    print(request)
    if not request.prompt:
The metadata request is different from the main request with a prompt.
It should be executed after the main LLM call, on top of its results.
It should use the metadata prompt defined in an env variable, for configuration purposes.
It should then always return a JSON object - I proposed its structure in the other comment.
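That two-step flow could look roughly like this; the `llm_call` callable is a hypothetical stand-in for the Ollama client, and the function names are illustrative:

```python
def run_with_metadata(main_prompt, metadata_prompt, llm_call):
    # Step 1: the main LLM call produces the extraction result.
    main_result = llm_call(main_prompt)
    # Step 2: the metadata call runs on top of the main result,
    # using the separately configured metadata prompt.
    metadata = llm_call(f"{metadata_prompt}\n\n{main_result}")
    return main_result, metadata
```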
    # Optional call to generate tags and summary
    if prompt and model:
        tags_summary = generate_tags_summary(prompt, model)
        extracted_text += "\n\nTags and Summary:\n" + tags_summary
When the `generate_metadata` option is defined, the metadata should not be returned within the general output but stored in a separate JSON file (named according to #10), used for file name strategies, and stored in a separate field of the celery result.
We probably need an additional endpoint to get just the metadata stored for a specific celery task.
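A minimal in-memory sketch of keeping the metadata out of the general output and fetching it by task id; a real implementation would use the celery result backend and the #10 file-naming scheme instead of a dict:

```python
# Hypothetical in-memory store keyed by celery task id.
_metadata_store: dict = {}

def store_task_metadata(task_id: str, metadata: dict) -> None:
    # Metadata lives in its own field, never appended to the text output.
    _metadata_store[task_id] = metadata

def get_task_metadata(task_id: str):
    # Backing function for a GET /metadata/{task_id}-style endpoint.
    return _metadata_store.get(task_id)
```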