Add MetaData LLM call #20
base: main
Conversation
Related to CatchTheTornado#16

Add an optional LLM call for generating tags and a summary of the file.

* **app/main.py**
  - Add a new endpoint `/llm_tags_summary` to generate tags and a summary using the LLM.
  - Update the `OllamaGenerateRequest` class to include a new field `generate_tags_summary`.
  - Update the `generate_llama` function to handle the new `generate_tags_summary` field.
* **app/tasks.py**
  - Add a new function `generate_tags_summary` to generate tags and a summary using the LLM.
  - Update the `ocr_task` function to include an optional call to `generate_tags_summary` after extracting text.
* **client/cli.py**
  - Add a new command `llm_tags_summary` for generating tags and a summary.
  - Update the `main` function to handle the new `llm_tags_summary` command.
* **.env.example**
  - Add a new environment variable `LLM_TAGS_SUMMARY_API_URL`.
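The optional follow-up call described above could be sketched roughly like this; the `llm_call` callable and the tuple return shape are assumptions for illustration, not the PR's actual code:

```python
def extract_with_tags_summary(extracted_text, prompt, model, llm_call):
    """Optionally run a second LLM call on the extracted text.

    `llm_call` is a hypothetical stand-in for the Ollama client call.
    """
    if prompt and model:
        tags_summary = llm_call(f"{prompt}\n\n{extracted_text}", model)
        return extracted_text, tags_summary
    # Without a prompt and model, only the extraction result is returned.
    return extracted_text, None
```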
@@ -116,3 +117,25 @@ async def generate_llama(request: OllamaGenerateRequest):

    generated_text = response.get("response", "")
    return {"generated_text": generated_text}

@app.post("/llm_tags_summary")
Please name it `llm_metadata` and use this name instead of `tags_summary`.
    return extracted_text


def generate_tags_summary(prompt, model):
Please rename it to `generate_metadata` instead.
@@ -59,7 +59,27 @@ def ocr_task(self, pdf_bytes, strategy_name, pdf_hash, ocr_cache, prompt, model)
        num_chunk += 1
        extracted_text += chunk['response']

    self.update_state(state='DONE', meta={'progress': 100, 'status': 'Processing done!', 'start_time': start_time, 'elapsed_time': time.time() - start_time})  # Example progress update
    # Optional call to generate tags and summary
    if prompt and model:
Please add an option `generate_metadata`, set to true by default; metadata is generated only when this option is set, regardless of whether a prompt was given.
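A minimal sketch of that flag, using a plain dataclass as a stand-in for the actual pydantic request model (field names beyond those mentioned in the PR are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OllamaGenerateRequest:
    model: str
    prompt: Optional[str] = None
    generate_metadata: bool = True  # default true, as requested in review

def should_generate_metadata(request: OllamaGenerateRequest) -> bool:
    # The flag alone decides; a missing prompt does not disable metadata.
    return request.generate_metadata
```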
Thanks, this is cool! I requested minor changes. When they are applied, and when #10 is merged, I'll also ask you to extend this feature so the metadata can be used within storage strategies to format file names (to use the tags or other fields within file names).
One more thing - please use a defined prompt (can be an env variable with a nice default) so the prompt used for metadata is configurable. It should return a JSON object with metadata:

{
  "title": "",
  "filename_title": "",
  "tags": "",
  "summary": ""
}
We can't mix the tags and metadata into the general output - they should be used for the naming strategy, saved within the celery task output and in the storage, and it should be possible to fetch them via the web API.
I can work on these extensions once you apply the other changes I suggested in the PR.
Sorry for so many changes - this task was simply not yet specified 😅
    Endpoint to generate tags and summary using the Llama 3.1 model (and other models) via the Ollama API.
    """
    print(request)
    if not request.prompt:
The metadata request is different from the main request with a prompt.
It should be executed after the main LLM call, on top of its results.
It should use the metadata prompt defined in an env variable, for configuration purposes.
It should then always return a JSON object - I proposed its structure in the other comment.
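That two-step flow could look roughly like this; the `llm_call` callable is a hypothetical stand-in for the Ollama client, and the function names are illustrative:

```python
def run_with_metadata(main_prompt, metadata_prompt, llm_call):
    # Step 1: the main LLM call produces the extraction result.
    main_result = llm_call(main_prompt)
    # Step 2: the metadata call runs on top of the main result,
    # using the separately configured metadata prompt.
    metadata = llm_call(f"{metadata_prompt}\n\n{main_result}")
    return main_result, metadata
```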
    # Optional call to generate tags and summary
    if prompt and model:
        tags_summary = generate_tags_summary(prompt, model)
        extracted_text += "\n\nTags and Summary:\n" + tags_summary
When the `generate_metadata` option is defined, the metadata should not be returned within the general output but stored in a separate JSON file (named according to #10), used for file name strategies, and stored in a separate field of the celery result.
We probably need an additional endpoint to get just the metadata stored for a specific celery task.
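A minimal in-memory sketch of keeping the metadata out of the general output and fetching it by task id; a real implementation would use the celery result backend and the #10 file-naming scheme instead of a dict:

```python
# Hypothetical in-memory store keyed by celery task id.
_metadata_store: dict = {}

def store_task_metadata(task_id: str, metadata: dict) -> None:
    # Metadata lives in its own field, never appended to the text output.
    _metadata_store[task_id] = metadata

def get_task_metadata(task_id: str):
    # Backing function for a GET /metadata/{task_id}-style endpoint.
    return _metadata_store.get(task_id)
```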