Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option for passing in longer context fragments, stored in a deduped table #617

Open
simonw opened this issue Nov 6, 2024 · 45 comments
Open
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Nov 6, 2024

The attachments feature makes it easy to attach video and audio from a URL or file while ensuring that even if that prompt component is used many times over it only ends up stored once in the database - in the attachments table using a sha256 hash as the ID:

llm/llm/models.py

Lines 28 to 39 in febbc04

def id(self):
# Hash of the binary content, or of '{"url": "https://..."}' for URL attachments
if self._id is None:
if self.content:
self._id = hashlib.sha256(self.content).hexdigest()
elif self.path:
self._id = hashlib.sha256(open(self.path, "rb").read()).hexdigest()
else:
self._id = hashlib.sha256(
json.dumps({"url": self.url}).encode("utf-8")
).hexdigest()
return self._id

I find myself wanting the same thing for blocks of text: I'd like to construct e.g. a context that contains the full docs for LLM and run multiple prompts against that without storing many copies of the same exact text in the database.

One possible solution would be to re-use the attachments mechanism, with a new default plain text attachment type that all models support:

llm -m gpt-4o-mini \
  -a https://gist.githubusercontent.com/simonw/f7b251a05b834a9b60edff7e06d31572/raw/3ad2fb03d57aad7b616ec40455c355ec51d2e3db/llm-docs.md \
  'how do I embed a file?'
@simonw simonw added the enhancement New feature or request label Nov 6, 2024
@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Pros of this approach:

  • Reuses existing mechanism, no new tables or columns needed
  • Neatly solves the need for reused long text prompts

Cons:

  • A bit confusing: attachments were for multi-modal models, now we have a special type that does something else?
  • Potentially confusing UI: llm -a URL could point to all kinds of text files, how do we know which ones get pasted into the prompt?
  • The attachments table uses a BLOB for the content, but we want to store strings. We can encode those as utf-8 in a blob but that might make querying the database directly confusing.
  • It sucks that you now have to join against attachments just to see what the text version of the prompt was!

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Alternative approaches:

  • Do nothing. Store long context prompts many many times in the same table.
  • Introduce a new non-attachments mechanism for this, so e.g. the responses table can optionally use a foreign key to a new reused_text table which works a bit like attachments do right now.
  • A mechanism where a response can point to some other response and say I used the same prompt as this one.
  • ALL prompt strings are moved to a new table with a SHA256 ID column, and all responses have foreign keys to that for their prompt_id and system_id.

That last option is actually reasonably elegant. It feels a bit nasty to me to have a whole bunch of very short values that are stored in a separate table, but it may be that most prompts and system prompts are long enough that this might be worthwhile anyway.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Ideally whatever mechanism I use here should work for both system and regular prompts.

If using existing attachments we could invent artificial content types of llm/prompt and llm/system for this.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Question: in my current database how many responses have duplicates in prompts or system prompts?

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Simple dupe detection query:

sqlite-utils "$(llm logs path)" "
select substr(prompt, 0, 20), length(prompt), count(*), length(prompt) * count(*) as size
from responses
group by prompt
having count(*) > 1 order by size desc" -t
substr(prompt, 0, 20)      length(prompt)    count(*)    size
-----------------------  ----------------  ----------  ------
<documents>                        332310           2  664620
<docume
src/util.rs                        288313           2  576626
---
use
<documents>                        112753           4  451012
<docume
jasondavies:                        95647           4  382588
shnkr
882542F3884314B:                   152700           2  305400
c
null                               147256           2  294512
<p>Something t
diff --git a/.githu                111221           2  222442
null                               109634           2  219268
<p>Something t
zerojames:                         102411           2  204822
jayniz:
Start Timestamp;End                 92772           2  185544
#!/usr/bin/env pyth                 54755           3  164265
## v2.7.3                           53962           3  161886

*2023-08

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

I think I need to prototype the approach where every prompt and system string is de-duped into a separate foreign key table, then run it against a copy of my database and compare the sizes.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Prototype:

cp "$(llm logs path)" /tmp/logs.db                                                                       
sqlite-utils extract logs.db responses system --table prompt --rename system prompt --fk-column system_id
sqlite-utils extract logs.db responses prompt                                                            

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

The resulting DB was even bigger for some reason. A lot of the prompts/system prompts are duplicated in the prompt_json already.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Oh here's why it's bigger, it put an index on the new table:

CREATE UNIQUE INDEX [idx_prompt_prompt]
    ON [prompt] ([prompt]);

I really need to do the thing where there's a hash column which has an index on it and is used for lookups, but the foreign key remains an integer so we don't bloat the responses table.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Dropping that index and running vacuum dropped the DB size from 170MB to 144MB, but the original is 136MB to we still gained 8MB somehow. Probably just because we needed a bunch of extra pages for the new table.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Given these problems with the prompt_id / system_id approach I'm inclined to consider that original -a long-file.txt attachment idea instead.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Or I could have optional prompt_id and system_id columns which foreign key against either attachments or some new prompts table - but continue to use the existing prompt and system columns for short prompts.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

If I insert the prompts as utf-8 binary data into that content column they look like this in Datasette:

CleanShot 2024-11-06 at 14 01 40@2x

Inserting them as text DOES work, but it feels really gnarly to have a BLOB column that sometimes has text in it, plus in the code Attachment.content is typed as a bytes:

llm/llm/models.py

Lines 20 to 26 in febbc04

@dataclass
class Attachment:
type: Optional[str] = None
path: Optional[str] = None
url: Optional[str] = None
content: Optional[bytes] = None
_id: Optional[str] = None

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

Cleanest option here would be to have a content_text column which is a string and use that instead of content for non-binary attachment content.

Google Gemini models actually DO accept plain text attachments already: https://ai.google.dev/gemini-api/docs/document-processing?lang=rest

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:

  • PDF - application/pdf
  • JavaScript - application/x-javascript, text/javascript
  • Python - application/x-python, text/x-python
  • TXT - text/plain
  • HTML - text/html
  • CSS - text/css
  • Markdown - text/md
  • CSV - text/csv
  • XML - text/xml
  • RTF - text/rtf

Each document page is equivalent to 258 tokens.

@simonw
Copy link
Owner Author

simonw commented Nov 6, 2024

So maybe this ticket is actually about supporting plain text attachments and having them work across ALL models, in addition to whatever is going on here with Gemini.

I wonder if any other models out there support attachments like text/md and text/xml and suchlike?

@simonw
Copy link
Owner Author

simonw commented Nov 7, 2024

I started playing with a llm-docs idea that could benefit from this: https://github.com/simonw/llm-docs

I currently run that like this:

curl -s "https://raw.githubusercontent.com/simonw/llm-docs/refs/heads/main/version-docs/$(llm --version | cut -d' ' -f3).txt" | \
  llm -m gpt-4o-mini 'how do I embed a binary file?'

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

Here's a much simpler idea. I introduce two new llm prompt arguments:

  • llm -f prompt.txt - runs the prompt from that file (alias --file)
  • llm -u https://... - runs the prompt fetched from that URL (alias --url)

And maybe two more that do the same for system prompts:

  • llm --sf prompt.txt - alias --system-file
  • llm --su https://... - alias --system-url

In addition, I treat cat prompt.txt | llm as working the same as llm -f -

In all cases the text is anticipated to be longer than a regular prompt, and hence should be de-duplicated and stored in a separate table.

That table might be called prompts or inputs or maybe even contexts - it will then have two nullable foreign keys from responses for prompt_id and system_id.

This avoids the confusion that would come from reusing the existing attachments mechanism, and the UI for it feels pretty natural to me - especially since -f and -u are not yet used for anything else.

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

Current database tables are:

  • conversations
  • responses
  • attachments
  • prompt_attachments

I'm going to call the new table contexts - I think prompts is confusing as those are already stored as part of responses and I like that it reflects the noun context which is key to thinking about how to use LLMs.

@simonw simonw changed the title Idea: text attachments that are pasted directly into the prompt Reusable contexts that get persisted in a de-duplicated manner Nov 8, 2024
@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

How should these affect the existing templates feature?

One option: templates work exactly as they do right now, so when you do llm -f prompt.txt --save foo a foo.yaml template is created with the contents of that file.

But... this actually brings up the point that templates right now are pretty inefficient. Using a template 100 times will drop 100 copies of that template into the responses table, one for each response!

Fixing this is probably a separate issue. I could start having templates automatically use this mechanism - being treated effectively as piped in / -f prompts and system prompts. But... that won't work for templates that include replacement params.

I could make it so replacement params work against these -u and -f inputs, and store which params were used for a given prompt. I'd need to add a params optional JSON column to responses for this.

I can do that as a separate piece of work.

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

If I'm doing this - supporting longer prompts using --file and --url - I could implement it such that IF one of those things is provided then the prompt is automatically stashed in the contexts table.

But as I started prototyping that out I realized that an alternative rule could simple be that any prompts over the length of X characters are automatically stored in that table instead.

This also isn't a question I can get wrong. If I decide to make all prompts longer than 50 characters stored in contexts instead, and then later decide that was a bad idea, I can change the policy without breaking anything - the code I've written that knows how to read from either responses.prompt or the thing pointed to by responses.prompt_id will continue to work regardless.

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

... in which case the --file and --url options are a distraction. That feature exists already in the form of curl ... | llm and llm < prompt.txt - if the only reason I was adding those is as a signal that longer prompts should be stored independently I don't need the new features at all if I go with a "prompts longer than X" design instead.

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

Here's as far as I got prototyping the --file and --url options:

diff --git a/llm/cli.py b/llm/cli.py
index 6a6fb2c..0f5f112 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -147,6 +147,16 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option("file", "-f", "--file", type=click.File(), help="Read prompt from file")
+@click.option("url", "-u", "--url", help="Read prompt from URL")
+@click.option(
+    "system_file",
+    "--sf",
+    "--system-file",
+    type=click.File(),
+    help="Read system prompt from file",
+)
+@click.option("system_url", "--su", "--system-url", help="Read system prompt from URL")
 @click.option(
     "attachments",
     "-a",
@@ -203,6 +213,10 @@ def prompt(
     prompt,
     system,
     model_id,
+    file,
+    url,
+    system_file,
+    system_url,
     attachments,
     attachment_types,
     options,
@@ -245,6 +259,16 @@ def prompt(
     def read_prompt():
         nonlocal prompt
 
+        prompt_bits = []
+
+        # Is there a file to read from?
+        if file:
+            prompt_bits.append(file.read())
+        if url:
+            response = httpx.get(url)
+            response.raise_for_status()
+            prompt_bits.append(response.text)
+
         # Is there extra prompt available on stdin?
         stdin_prompt = None
         if not sys.stdin.isatty():
@@ -258,10 +282,12 @@ def prompt(
 
         if (
             prompt is None
-            and not save
+            and file is None
+            and url is None
             and sys.stdin.isatty()
             and not attachments
             and not attachment_types
+            and not save
         ):
             # Hang waiting for input to stdin (unless --save)
             prompt = sys.stdin.read()

@simonw simonw changed the title Reusable contexts that get persisted in a de-duplicated manner Store prompts and system prompts longer than X characters in de-duped table Nov 8, 2024
@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

I think I need to change this code that populates the responses table:

llm/llm/models.py

Lines 221 to 238 in 5d1d723

response = {
"id": response_id,
"model": self.model.model_id,
"prompt": self.prompt.prompt,
"system": self.prompt.system,
"prompt_json": self._prompt_json,
"options_json": {
key: value
for key, value in dict(self.prompt.options).items()
if value is not None
},
"response": self.text(),
"response_json": self.json(),
"conversation_id": conversation.id,
"duration_ms": self.duration_ms(),
"datetime_utc": self.datetime_utc(),
}
db["responses"].insert(response)

And code that reads from that table in two places:

llm/llm/cli.py

Lines 685 to 706 in 5d1d723

LOGS_COLUMNS = """ responses.id,
responses.model,
responses.prompt,
responses.system,
responses.prompt_json,
responses.options_json,
responses.response,
responses.response_json,
responses.conversation_id,
responses.duration_ms,
responses.datetime_utc,
conversations.name as conversation_name,
conversations.model as conversation_model"""
LOGS_SQL = """
select
{columns}
from
responses
left join conversations on responses.conversation_id = conversations.id{extra_where}
order by responses.id desc{limit}
"""

llm/llm/models.py

Lines 284 to 319 in 5d1d723

@classmethod
def from_row(cls, db, row):
from llm import get_model
model = get_model(row["model"])
response = cls(
model=model,
prompt=Prompt(
prompt=row["prompt"],
model=model,
attachments=[],
system=row["system"],
options=model.Options(**json.loads(row["options_json"])),
),
stream=False,
)
response.id = row["id"]
response._prompt_json = json.loads(row["prompt_json"] or "null")
response.response_json = json.loads(row["response_json"] or "null")
response._done = True
response._chunks = [row["response"]]
# Attachments
response.attachments = [
Attachment.from_row(arow)
for arow in db.query(
"""
select attachments.* from attachments
join prompt_attachments on attachments.id = prompt_attachments.attachment_id
where prompt_attachments.response_id = ?
order by prompt_attachments."order"
""",
[row["id"]],
)
]
return response

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

I could even define a new responses_full SQL view that inflates the foreign key system and regular prompts automatically, for developer convenience. It could extract out the attachments as JSON too.

@simonw
Copy link
Owner Author

simonw commented Nov 8, 2024

But what do I do about the prompt_json columns storing huge copies of the strings anyway?

I worked up this little system with the help of ChatGPT Code Interpreter: https://chatgpt.com/share/672d84c0-7360-8006-8333-f7e743237942

def apply_replacements(obj, replacements):
    if isinstance(obj, dict):
        return {k: apply_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [apply_replacements(item, replacements) for item in obj]
    elif isinstance(obj, str):
        replaced_parts = []
        last_index = 0
        found = False

        for key, value in replacements.items():
            index = obj.find(key)
            while index != -1:
                found = True
                if index > last_index:
                    replaced_parts.append(obj[last_index:index])
                replaced_parts.append(value)
                last_index = index + len(key)
                index = obj.find(key, last_index)

        if found:
            if last_index < len(obj):
                replaced_parts.append(obj[last_index:])
            return {"$r": replaced_parts}
        else:
            return obj
    else:
        return obj


def reverse_replacements(obj, replacements):
    return _reverse_replacements(obj, {v: k for k, v in replacements.items()})


def _reverse_replacements(obj, replacements):
    if isinstance(obj, dict):
        if "$r" in obj:
            # Reconstruct the original string from the list
            return "".join(
                (replacements[part] if isinstance(part, int) else part)
                for part in obj["$r"]
            )
        else:
            return {k: _reverse_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_reverse_replacements(item, replacements) for item in obj]
    else:
        return obj

So now you can do this:

from llm.utils import apply_replacements, reverse_replacements
replacements = {"This is a pretty long string at this point": 1, "And so is this one": 2}
json_object = {
    "foo": {
        "bar": {
            "baz": "this includes This is a pretty long string at this point",
            "qux": [44, "And so is this one"],
            "quux": "This has This is a pretty long string at this point and And so is this one as well"
        }
    }
}

replaced = apply_replacements(json_object, replacements)
orig = reverse_replacements(replaced, replacements)

And the replaced version looks like this:

{'foo': {'bar': {'baz': {'$r': ['this includes ', 1]},
                 'quux': {'$r': ['This has ', 1, ' and ', 2, ' as well']},
                 'qux': [44, {'$r': [2]}]}}}

@simonw
Copy link
Owner Author

simonw commented Nov 9, 2024

The last remaining challenge is search. The system currently has a responses_fts table that indexes the prompt and response columns from responses:

llm/docs/logging.md

Lines 164 to 168 in 5d1d723

CREATE VIRTUAL TABLE [responses_fts] USING FTS5 (
[prompt],
[response],
content=[responses]
);

That design doesn't work if some of the prompts are in responses.prompt and others are in the new contexts table and related by a foreign key instead.

I could probably denormalize this, but that seems like it could be wasteful since the whole point of this exercise is to avoid storing the same long prompts hundreds of times... but if we store hundreds of copies in the FTS index table have we really improved things?

So I likely need a separate contexts_fts table... but now I'm stuck trying to do relevance scoring across two different FTS tables, which doesn't really work.

I think I'll have to accept a reduction in relevance scoring because of this. That's OK - the search feature isn't really a signature big deal, it's more of a convenience. I don't think it matters too much if the relevance scoring is a bit out of whack.

@simonw
Copy link
Owner Author

simonw commented Nov 9, 2024

Note that response is always stored separately in the responses table - it's only prompt (of the two columns in the responses_fts table) that is affected by this new table design.

@simonw
Copy link
Owner Author

simonw commented Nov 9, 2024

I think the migrations I added broke response_fts entirely, maybe by rewriting the rowid values when I ran .transform() to move some table columns around such that I need a new migration to re-index the table entirely.

simonw added a commit that referenced this issue Nov 13, 2024
@simonw
Copy link
Owner Author

simonw commented Nov 13, 2024

Pushed my WIP to a branch: b2fce50

@simonw
Copy link
Owner Author

simonw commented Nov 13, 2024

I'm second-guessing the design for this feature again now.

As implemented this kicks in by magic if the prompt is longer than a certain threshold:

llm/llm/models.py

Lines 230 to 236 in b2fce50

for context, column in (
(prompt, "prompt"),
(system, "system"),
):
if context is not None and len(context) > PROMPT_THRESHOLD:
hash = hashlib.sha256(context.encode("utf-8")).hexdigest()
rows = list(db.query("select id from contexts where hash = ?", [hash]))

This assumes that prompts and system prompts will be entirely duplicated occasionally.

But maybe that's not the smartest way to do this. What if instead these reusable contexts could be specified by the user directly and could even be concatenated together?

Imagine being able to do something like this:

llm -f readme.md -f docs/usage.md 'How do I install this?'

Here we are concatenating two files together. Those files could be stored in two records in contexts, which would allow them to be combined in different ways in the future while still avoiding duplicating storage for them.

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2024

Relevant: https://twitter.com/alexalbert__/status/1857457290917589509

You can now access all of our docs concatenated as a single plain text file that can be fed in to any LLM.

Here's the url route: http://docs.anthropic.com/llms-full.txt

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2024

... if I DO go the -f route I think it should accept both file paths and URLs, because that's what the -a option does already:

llm -m claude-3.5-sonnet -f http://docs.anthropic.com/llms-full.txt 'how do I request a stream with curl?'

Note that http://docs.anthropic.com/llms-full.txt is a redirect to the https version, so following redirects is important - I just tried this:

curl 'http://docs.anthropic.com/llms-full.txt' | llm -m claude-3-5-haiku-latest 'a haiku about this document'

And got:

Here's a haiku about this HTTP 301 Moved Permanently response:

Cloudflare whispers soft
Redirected web page falls
Elsewhere, silently

The haiku captures the essence of the 301 status code - a gentle, permanent redirection signaled by Cloudflare, with the page moving to a new location in a quiet, understated manner.

@simonw simonw changed the title Store prompts and system prompts longer than X characters in de-duped table Option for passing in longer context fragments, stored in a deduped table Nov 16, 2024
@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

I am going to call these fragments. The CLI option will be -f/--fragment and will accept a file Path or a URL or a -.

I am going to expose these in the Python API as well - as a prompt.fragments list of strings.

Open question: how about letting fragments be specified by ID as well? That way users could store them deliberately in the database.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

There is definitely something interesting about being able to run a prompt that references multiple fragments. Imagine something like this:

llm -f datasette -f llm "ideas for Datasette plugins powered by LLM"

Where datasette is a fragment ID for a copy of the Datasette docs and llm is a fragment ID for the LLM docs.

How to tell the difference between -f datasette meaning the Datasette stored fragment and meaning a file on disk called datasette? Fragment ID wins, if you want the file instead use -f ./datasette instead to clarify.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

The only reason to expose fragments in the Python prompt.fragments = [str] API is the code for persisting a prompt and response to a database knows to treat them specially.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

This design implies the need for a family of CLI commands for managing fragments and their aliases. Maybe:

# list (truncated) stored fragments
llm fragments list
# show one (takes alias, ID or hash)
llm fragments show ALIAS
# search them
llm fragments list -q LLM
# add one with an alias
llm fragments set llm ./llm.txt

Can you remove fragments? Doing so could break existing DB records that rely on them - but maybe there's private information in them that you didn't mean to save?

I thank you can remove them, and doing so will cause those affected records to show that the fragments have been removed while still retuning the logged response.

# Remove stored fragment
llm fragments remove llm

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

So the fragments table has:

  • id - numeric ID used in M2M tables
  • hash - string hash used for deduping
  • content string content
  • alias - null allowed unique indexed alias string

I might have a created date time on there too. And maybe even a source URL or filename.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2024

One thing this new design does not take into account is system prompts.

Maybe --sf X/--system-fragment X?

Would also need a prompt.system_fragments list of strings - and that's a bit weird because of the thing where some models don't let you change system prompt half way through a conversation (a problem we have with prompt.system already so at least it's not a new problem).

@simonw
Copy link
Owner Author

simonw commented Nov 17, 2024

Made a start on this here - I've implemented the schema and the code such that prompts are aware of it, but I'm not yet persisting the fragments to the new database tables.

llm/llm/models.py

Lines 90 to 133 in 6c355c1

@dataclass
class Prompt:
_prompt: str
model: "Model"
fragments: Optional[List[str]]
attachments: Optional[List[Attachment]]
_system: Optional[str]
system_fragments: Optional[List[str]]
prompt_json: Optional[str]
options: "Options"
def __init__(
self,
prompt,
model,
*,
fragments=None,
attachments=None,
system=None,
system_fragments=None,
prompt_json=None,
options=None,
):
self._prompt = prompt
self.model = model
self.attachments = list(attachments or [])
self.fragments = fragments or []
self._system = system
self.system_fragments = system_fragments or []
self.prompt_json = prompt_json
self.options = options or {}
@property
def prompt(self):
return "\n".join(self.fragments + [self._prompt])
@property
def system(self):
bits = [
bit.strip()
for bit in (self.system_fragments + [self._system or ""])
if bit.strip()
]
return "\n\n".join(bits)

@simonw
Copy link
Owner Author

simonw commented Nov 18, 2024

A slight oddity with this design for aliases: if you set the alias of a piece of content to one thing, then set a different alias for the same byte-for-byte piece of content from elsewhere, the original alias will be forgotten.

That feels confusing, and doesn't match with how aliases work for models where one model can have multiple aliases.

Potential solutions:

  1. Do nothing, accept this - it's a rare edge case
  2. Warn the user if they attempt to use llm fragments set ... on content that has an alias according to its hash already
  3. Change the DB schema - instead of alias being a nullable unique column on the fragments table it should instead be a separate fragment_aliases table which can assign multiple aliases to a single hash

That 3rd option is the least surprising in my opinion. It also removes the slightly weird unique-but-nullable alias column from the fragments table.

@simonw
Copy link
Owner Author

simonw commented Nov 18, 2024

db["fragment_aliases"].create(
    {
        "alias": str,
        "fragment_id": int,
    },
    foreign_keys=("fragment_id", "fragments", "id"),
    pk=("alias",),
)

The primary key IS the alias, since you can't have multiple aliases with the same name.

@simonw
Copy link
Owner Author

simonw commented Nov 18, 2024

I think removing an alias should use garbage collection rules: if the aliased fragment is referenced by at least one other alias OR is referenced in the prompt_fragments or system_fragments table it is left alone, if no references to it remain then it is deleted.

@simonw
Copy link
Owner Author

simonw commented Nov 18, 2024

I'm not going to bother implementing garbage collection for the moment.

@simonw
Copy link
Owner Author

simonw commented Nov 18, 2024

This is now mostly working. The prompt_json column is a problem though - it's storing a duplicate of the whole prompt when the point of this exercise is to cut down on how much duplicated junk we save in that table.

CleanShot 2024-11-17 at 21 17 26@2x

@simonw simonw mentioned this issue Nov 18, 2024
10 tasks
simonw added a commit that referenced this issue Nov 18, 2024
@taygetea
Copy link

i see you just touched this but for what its worth a few days ago i tried to add a text file as an attachment assuming it would work and was surprised it didnt. was the mental model i had. so this is good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants