Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dumping and reloading logs.db with sqlite-diffable FTS inconsistencies #680

Open
giuli007 opened this issue Dec 18, 2024 · 0 comments
Open

Comments

@giuli007
Copy link

giuli007 commented Dec 18, 2024

https://github.com/simonw/sqlite-diffable can be used to dump the content of a sqlite db as text files and reloading them, this is useful to keep track of data/schemas using git.

I tried to use it for dumping and re-loading the llm logs db and faced some difficulties. Not sure if these are actual problems that need fixing so I am opening this to ask the questions and provide some context.

$ mkdir dump
$ cp "$(llm logs path)" .
$ sqlite-utils tables logs.db --columns --table --counts  # show contents of original db
table                    count  columns
---------------------  -------  -------------------------------------------------------------------------------------------------------------------------------------------------
_llm_migrations             14  ['name', 'applied_at']
conversations               12  ['id', 'name', 'model']
responses                   15  ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts               15  ['prompt', 'response']
responses_fts_data          21  ['id', 'block']
responses_fts_idx           19  ['segid', 'term', 'pgno']
responses_fts_docsize       15  ['id', 'sz']
responses_fts_config         1  ['k', 'v']
attachments                  0  ['id', 'type', 'path', 'url', 'content']
prompt_attachments           0  ['response_id', 'attachment_id', 'order']
$ sqlite-diffable dump logs.db dump/ --all  # dump all tables

when trying to reload it

$ sqlite-diffable load logs_recovered.db dump/
Error: table 'responses_fts_docsize' already exists

Use the --replace option to over-write existing tables
$ sqlite-utils tables logs_recovered.db --columns --table --counts
table                    count  columns
---------------------  -------  -------------------------------------------------------------------------------------------------------------------------------------------------
conversations               12  ['id', 'name', 'model']
responses                   15  ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts               15  ['prompt', 'response']
responses_fts_data           8  ['id', 'block']
responses_fts_idx            6  ['segid', 'term', 'pgno']
responses_fts_docsize       15  ['id', 'sz']
responses_fts_config         1  ['k', 'v']

AFAIU this happens because the database uses the FTS module for full-text search and this means that, upon creation and population of the responses_fts table, 4 other tables are automatically created and populated (responses_fts_data, responses_fts_idx, responses_fts_docsize, responses_fts_config), this is explained here https://www.sqlite.org/fts5.html#fts5_data_structures

So I tired to using the suggested --replace option and even though it leads to successful completion of the sqlite-diffable load command, it then errors when trying to query the database (redacted content for brevity)

$ sqlite-diffable load logs_recovered.db dump/ --replace
$ sqlite-utils tables logs_recovered.db --columns --table --counts
Traceback (most recent call last):
  File "/Users/giuli077/.local/bin/sqlite-utils", line 8, in <module>
    sys.exit(cli())
             ^^^^^
...
  File "/Users/giuli077/.local/pipx/venvs/sqlite-utils/lib/python3.11/site-packages/sqlite_utils/db.py", line 1315, in count_where
    return self.db.execute(sql, where_args or []).fetchone()[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/giuli077/.local/pipx/venvs/sqlite-utils/lib/python3.11/site-packages/sqlite_utils/db.py", line 533, in execute
    return self.conn.execute(sql, parameters)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.DatabaseError: vtable constructor failed: responses_fts

I haven't dug into this specific error but I assume something goes wrong because the values of the response_fts_* tables that are automatically created when responses_fts is first loaded are replaced by values of the contents of those tables from the dump and that leads to some inconsistency in that table data.

I then decided to exclude from the load the host FTS-specific tables and only load the rest

$ rm dump/responses_fts_*
$ sqlite-diffable load logs_recovered_partial.db dump/
$ sqlite-utils tables logs_recovered_partial.db --columns --table --counts
table                    count  columns
---------------------  -------  -------------------------------------------------------------------------------------------------------------------------------------------------
conversations               12  ['id', 'name', 'model']
responses                   15  ['id', 'model', 'prompt', 'system', 'prompt_json', 'options_json', 'response', 'response_json', 'conversation_id', 'duration_ms', 'datetime_utc']
responses_fts               15  ['prompt', 'response']
responses_fts_data           8  ['id', 'block']
responses_fts_idx            6  ['segid', 'term', 'pgno']
responses_fts_docsize       15  ['id', 'sz']
responses_fts_config         1  ['k', 'v']
prompt_attachments           0  ['response_id', 'attachment_id', 'order']
_llm_migrations             14  ['name', 'applied_at']
attachments                  0  ['id', 'type', 'path', 'url', 'content']

this works but I noticed there is some inconsistency with the number of rows in the responses_fts_data (8 rows) and responses_fts_idx (6 rows) which in the original db had more rows (21 and 19 respectively)

I am not an expert on how these tables work, I had a look at the changes in responses_fts_idx.ndjson (which are vaguely human-readable compared to those in responses_fts_data.ndjson), it seems there are a bunch of removed rows with very similar content "b''" and also some others

diff --git a/dump/responses_fts_idx.ndjson b/dump/responses_fts_idx.ndjson
index f2446b4..285c8e9 100644
--- a/dump/responses_fts_idx.ndjson
+++ b/dump/responses_fts_idx.ndjson
@@ -1,19 +1,6 @@
 [1, "b''", 2]
-[2, "b''", 2]
-[3, "b''", 2]
-[4, "b''", 2]
-[5, "b''", 2]
-[5, "b'0paths'", 4]
-[6, "b''", 2]
-[6, "b'0rou'", 4]
-[7, "b''", 2]
-[7, "b'0returni'", 4]
-[8, "b''", 2]
-[9, "b''", 2]
-[10, "b''", 2]
-[11, "b''", 2]
-[12, "b''", 2]
-[13, "b''", 2]
-[14, "b''", 2]
-[15, "b''", 2]
-[15, "b'0readl'", 4]
+[1, "b'0configured'", 4]
+[1, "b'0fr'", 6]
+[1, "b'0logg'", 8]
+[1, "b'0repr'", 10]
+[1, "b'0timed'", 12]

full-text search still seems to work on the re-loaded logs_recovered_partial.db so I am not sure this is an actual problem.

My question(s): is this behaviour ok? Would the missing data in responses_fts_data and responses_fts_idx cause any potential issue that is not immediately obvious to me? or can I happlily rely on sqlite doing the right thing when it recreates and repopulates the responses_fts_* tables upon re-loading of responses_fts?

Note: I do realise this question might not be closely related to the llm cli tool and it is quite likely just a clarification about FTS and using sqlite-diffable with databases that use it.
Because this is the use-case I encountered I decided to ask in this project anyway :).
It is perhaps an opportunity to consider explicitly pointing people at sqlite-diffable as a way to version-control the conversations they have via llm (maybe to add somewhere in the docs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant