feat: add vector migration script to re-ingest chunks with new models #57
Pull request overview
Adds an operational script to re-embed and re-upload all chunk_index rows from the SQL database into the currently configured vector store, intended to support embedding-model and/or vector-backend migrations.
Changes:
- Introduces scripts/migrate_vector_db.py to load the configured embedding model, iterate over chunk_index in batches, and call vector_repo.create_documents(...).
- Adds readiness checks and basic progress logging around batch uploads.
    tokens_count=cast(int, chunk_sql.tokens_count),
    language=cast(str, chunk_sql.language),
    embedding_model=embedding_model_name,
    created_at=cast(datetime, chunk_sql.created_at or datetime.now()),
ChunkIndexModel.created_at is timezone-aware and non-nullable, but the fallback datetime.now() is naive and can produce inconsistent timestamps in vector metadata. Prefer using chunk_sql.created_at directly (or a UTC-aware fallback such as datetime.now(timezone.utc)), and avoid emitting naive datetimes.
Suggested change:
    - created_at=cast(datetime, chunk_sql.created_at or datetime.now()),
    + created_at=cast(datetime, chunk_sql.created_at),
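To make the naive-versus-aware distinction in this comment concrete, here is a minimal standard-library sketch. The `stored` value and both helper functions are hypothetical illustrations, not code from the PR.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for a timezone-aware created_at value read from SQL.
stored = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

naive_fallback = datetime.now()              # no tzinfo attached
aware_fallback = datetime.now(timezone.utc)  # tzinfo is UTC


def is_aware(dt: datetime) -> bool:
    """Return True if dt carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


def can_compare(a: datetime, b: datetime) -> bool:
    """Return True if a and b can be ordered without raising TypeError."""
    try:
        a < b
        return True
    except TypeError:
        # Python refuses to order a naive datetime against an aware one.
        return False
```

Mixing the two kinds in one metadata field means downstream sorting or range filtering may fail or silently misorder records, which is presumably why the review asks for a single convention.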
    extra_data = (
        cast(Dict[str, Any], chunk_sql.extra)
        if isinstance(chunk_sql.extra, dict)
        else {}
    )
    if chunk_sql.vector_store_type:
        extra_data["original_vector_store_type"] = (
            chunk_sql.vector_store_type
        )
extra_data reuses the dict object from chunk_sql.extra and then mutates it (adds original_vector_store_type). This can mark the ORM row as dirty and can also cause surprising side effects within the session. Make a shallow copy (e.g., dict(chunk_sql.extra) / {**chunk_sql.extra}) before mutating.
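A minimal sketch of the copy-before-mutate pattern described here, written as a standalone helper. The function name `build_extra` and the sample values are assumptions for illustration; the actual script may structure this differently.

```python
from typing import Any, Dict, Optional


def build_extra(extra: Any, vector_store_type: Optional[str]) -> Dict[str, Any]:
    """Return a new metadata dict, leaving the caller's dict untouched."""
    # dict(extra) makes a shallow copy, so adding keys below does not
    # modify the object that the ORM row still references.
    extra_data: Dict[str, Any] = dict(extra) if isinstance(extra, dict) else {}
    if vector_store_type:
        extra_data["original_vector_store_type"] = vector_store_type
    return extra_data


# Hypothetical row data standing in for chunk_sql.extra.
row_extra = {"source": "youtube"}
result = build_extra(row_extra, "chroma")
```

Because the copy is shallow, nested values are still shared with the original; that is sufficient here only because the script adds a top-level key and does not edit nested structures.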
    total_chunks = db.query(ChunkIndexModel).count()
    logger.info(f"Total chunks to migrate: {total_chunks}")

    offset = 0
    while offset < total_chunks:
        chunk_models_sql = (
            db.query(ChunkIndexModel)
            .order_by(ChunkIndexModel.created_at)
            .offset(offset)
            .limit(batch_size)
            .all()
        )
The count() + OFFSET/LIMIT pagination pattern becomes very slow on large tables, and SQLAlchemy will also keep each loaded ORM object in the session identity map (risking unbounded memory growth during long migrations). Consider keyset pagination (e.g., by (created_at, id)) or iterating with yield_per/streaming results, and expunging/clearing the session per batch.
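The keyset approach suggested above can be sketched with the standard-library sqlite3 module so that it runs on its own. This is not the PR's SQLAlchemy code: the simplified three-column `chunk_index` table, the `iter_batches` helper, and the sample rows are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, str, str]  # (created_at, id, content)


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in (created_at, id) order, one batch at a time."""
    last_key: Optional[Tuple[str, str]] = None
    while True:
        if last_key is None:
            rows = conn.execute(
                "SELECT created_at, id, content FROM chunk_index "
                "ORDER BY created_at, id LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            created_at, last_id = last_key
            # Resume strictly after the last row seen; id breaks ties
            # between rows that share the same created_at.
            rows = conn.execute(
                "SELECT created_at, id, content FROM chunk_index "
                "WHERE created_at > ? OR (created_at = ? AND id > ?) "
                "ORDER BY created_at, id LIMIT ?",
                (created_at, created_at, last_id, batch_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last_key = (rows[-1][0], rows[-1][1])


# Hypothetical in-memory data; three rows deliberately share a timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_index (id TEXT PRIMARY KEY, created_at TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO chunk_index VALUES (?, ?, ?)",
    [
        ("c", "2024-01-01T00:00:00", "third"),
        ("a", "2024-01-01T00:00:00", "first"),
        ("b", "2024-01-01T00:00:00", "second"),
        ("d", "2024-01-02T00:00:00", "fourth"),
        ("e", "2024-01-03T00:00:00", "fifth"),
    ],
)
batches = list(iter_batches(conn, batch_size=2))
```

Ordering by created_at alone, as in the reviewed snippet, leaves the order of rows with equal timestamps undefined, so a row could be skipped or repeated across pages; including id in both the ORDER BY and the resume condition avoids that. Each query also starts from an index-friendly position instead of scanning past `offset` rows.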
    # create_documents will internally call the EmbeddingService for the texts and save them
    vector_repo.create_documents(documents)

    offset += batch_size
This script reuses the existing chunk_sql.id as the vector document ID. For backends that don't upsert on insert (e.g., Chroma uses add_texts in create_documents), rerunning this migration against a non-empty collection can fail with duplicate-ID errors or create duplicates. Consider either deleting/clearing the target collection up front (or documenting that scripts/clear_vector_db.py must be run first), or adding an explicit upsert/replace behavior for migration mode.
Applied in commit 81f10f5. Added a --clear CLI flag that calls clear_vector_db() before the migration begins, preventing duplicate-ID errors on backends that don't upsert on insert (e.g. ChromaDB). Run with python scripts/migrate_vector_db.py --clear for a clean migration, or without the flag to preserve existing data.
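One way such a flag could be wired up is sketched below with argparse. Only the `--clear` flag name comes from the reply above; `parse_args`, `run`, and the injected `clear_fn` and `migrate_fn` callables are hypothetical stand-ins for clear_vector_db() and the migration routine, whose real signatures are not shown in this thread.

```python
import argparse
from typing import Callable, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Re-ingest chunks into the configured vector store."
    )
    parser.add_argument(
        "--clear",
        action="store_true",
        help="Clear the target vector collection before migrating.",
    )
    return parser.parse_args(argv)


def run(
    argv: Optional[List[str]],
    clear_fn: Callable[[], None],
    migrate_fn: Callable[[], None],
) -> None:
    """Optionally clear the target collection, then migrate."""
    args = parse_args(argv)
    if args.clear:
        clear_fn()  # stand-in for clear_vector_db()
    migrate_fn()    # stand-in for the batch migration


# Record the call order for both invocation styles.
calls: List[str] = []
run(["--clear"], lambda: calls.append("clear"), lambda: calls.append("migrate"))
run([], lambda: calls.append("clear"), lambda: calls.append("migrate"))
```

Keeping the flag off by default matches the behaviour described in the reply: existing data is preserved unless the operator explicitly asks for a clean migration.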
Codecov Report
❌ Patch coverage is
Additional details and impacted files
    @@            Coverage Diff             @@
    ##             main      #57      +/-   ##
    ==========================================
    - Coverage   80.83%   80.41%   -0.42%
    ==========================================
      Files          86       86
      Lines        6675     6721      +46
      Branches      767      773       +6
    ==========================================
    + Hits         5396     5405       +9
    - Misses       1036     1071      +35
    - Partials      243      245       +2
☔ View full report in Codecov by Sentry.
Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/dba9f09e-11b2-4cbe-90cf-5d4e90521cc6
Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

…y shallow copy
Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/dba9f09e-11b2-4cbe-90cf-5d4e90521cc6
Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

…cate-ID risk
Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/396faf44-8d73-42c9-92f6-f11c05b58f01
Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

This pull request introduces a new migration script to facilitate re-ingesting all chunk data from the SQL database into the configured vector database. The script is particularly useful when migrating to a new embedding model or switching vector database backends. It reads all chunk records, re-embeds them using the current embedding model, and uploads them in batches to the vector store.
Key additions:
Vector database migration utility:
- Adds scripts/migrate_vector_db.py, which reads all chunk_index records from the SQL database, generates embeddings using the current model, and uploads them to the configured vector database in batches. This helps ensure all chunks are stored using the latest embedding model and vector store setup.
- Adds a --clear flag that clears the target vector collection before migrating, preventing duplicate-ID errors on backends that do not upsert on insert (e.g. ChromaDB). Equivalent to running scripts/clear_vector_db.py first.
Performance and correctness:
- Uses keyset pagination on (created_at, id) instead of OFFSET/LIMIT, keeping queries fast on large tables and avoiding unbounded memory growth. The SQLAlchemy session is expunged after each batch.
- chunk_sql.extra is shallow-copied before mutation to avoid marking ORM rows as dirty.
- created_at is taken directly from the SQL record (timezone-aware, non-nullable) with no naive-datetime fallback.
Robustness and logging: