feat: add vector migration script to re-ingest chunks with new models #57

Merged
ericksonlopes merged 9 commits into main from feature/vector-migration-script on Apr 7, 2026
Conversation

@ericksonlopes ericksonlopes commented Apr 7, 2026

Owner

This pull request introduces a new migration script to facilitate re-ingesting all chunk data from the SQL database into the configured vector database. The script is particularly useful when migrating to a new embedding model or switching vector database backends. It reads all chunk records, re-embeds them using the current embedding model, and uploads them in batches to the vector store.

Key additions:

Vector database migration utility:

  • Added a new script scripts/migrate_vector_db.py that reads all chunk_index records from the SQL database, generates embeddings using the current model, and uploads them to the configured vector database in batches. This helps ensure all chunks are stored using the latest embedding model and vector store setup.
  • Added a --clear flag that clears the target vector collection before migrating, preventing duplicate-ID errors on backends that do not upsert on insert (e.g. ChromaDB). Equivalent to running scripts/clear_vector_db.py first.

Performance and correctness:

  • Batches are fetched with keyset pagination on (created_at, id) instead of OFFSET/LIMIT, which keeps queries fast on large tables. The SQLAlchemy session is expunged after each batch so loaded ORM objects do not accumulate in memory.
  • chunk_sql.extra is shallow-copied before mutation to avoid marking ORM rows as dirty.
  • created_at is taken directly from the SQL record (timezone-aware, non-nullable) with no naive-datetime fallback.

Robustness and logging:

  • The script includes logging at each major step, error handling for migration failures, and checks to ensure the vector repository is ready before proceeding.
  • On failure, the script exits with a non-zero status code and logs the full traceback to aid debugging in automated runs.
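
The failure handling described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `run_and_report` and `failing_migration` are hypothetical names, and the real migration body is replaced by a stand-in that raises.

```python
import logging
from typing import Callable

logger = logging.getLogger("migrate_vector_db")


def run_and_report(migrate: Callable[[], None]) -> int:
    """Run a migration callable and translate the outcome into an exit code."""
    try:
        migrate()
    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback.
        logger.exception("Vector migration failed")
        return 1
    logger.info("Vector migration finished")
    return 0


def failing_migration() -> None:
    # Hypothetical stand-in for the real migration body.
    raise RuntimeError("vector store unreachable")


failed_code = run_and_report(failing_migration)
ok_code = run_and_report(lambda: None)
```

In a script entry point, the returned value would typically be passed to `sys.exit(...)` so that CI jobs and other automated runs see the non-zero status.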

@ericksonlopes ericksonlopes self-assigned this Apr 7, 2026
Copilot AI review requested due to automatic review settings April 7, 2026 14:32

Copilot AI left a comment

Contributor

Pull request overview

Adds an operational script to re-embed and re-upload all chunk_index rows from the SQL database into the currently configured vector store, intended to support embedding-model and/or vector-backend migrations.

Changes:

  • Introduces scripts/migrate_vector_db.py to load the configured embedding model, iterate over chunk_index in batches, and call vector_repo.create_documents(...).
  • Adds readiness checks and basic progress logging around batch uploads.

Comment thread scripts/migrate_vector_db.py Outdated
tokens_count=cast(int, chunk_sql.tokens_count),
language=cast(str, chunk_sql.language),
embedding_model=embedding_model_name,
created_at=cast(datetime, chunk_sql.created_at or datetime.now()),

Copilot AI Apr 7, 2026

ChunkIndexModel.created_at is timezone-aware and non-nullable, but the fallback datetime.now() is naive and can produce inconsistent timestamps in vector metadata. Prefer using chunk_sql.created_at directly (or a UTC-aware fallback such as datetime.now(timezone.utc)), and avoid emitting naive datetimes.

Suggested change
- created_at=cast(datetime, chunk_sql.created_at or datetime.now()),
+ created_at=cast(datetime, chunk_sql.created_at),
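
The naive versus timezone-aware distinction raised in this comment can be illustrated with the standard library alone. The sketch below is only a demonstration of the general `datetime` behavior; `is_aware` is a hypothetical helper and none of this is taken from the script.

```python
from datetime import datetime, timezone


def is_aware(dt: datetime) -> bool:
    """Return True when the datetime carries usable timezone information."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


naive_now = datetime.now()               # no tzinfo attached
aware_now = datetime.now(timezone.utc)   # tzinfo is UTC

try:
    # Ordering comparisons between naive and aware values raise TypeError.
    naive_now < aware_now
    mixing_raises = False
except TypeError:
    mixing_raises = True
```

Mixing the two kinds in the same metadata field therefore risks errors as soon as timestamps are sorted or compared, which is one reason to avoid the naive fallback.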

Comment thread scripts/migrate_vector_db.py Outdated
Comment on lines +65 to +73
extra_data = (
    cast(Dict[str, Any], chunk_sql.extra)
    if isinstance(chunk_sql.extra, dict)
    else {}
)
if chunk_sql.vector_store_type:
    extra_data["original_vector_store_type"] = (
        chunk_sql.vector_store_type
    )

Copilot AI Apr 7, 2026

extra_data reuses the dict object from chunk_sql.extra and then mutates it (adds original_vector_store_type). This can mark the ORM row as dirty and can also cause surprising side effects within the session. Make a shallow copy (e.g., dict(chunk_sql.extra) / {**chunk_sql.extra}) before mutating.
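
A minimal sketch of the copy-before-mutate approach suggested here, using a plain dict in place of the ORM attribute. The helper name `build_extra` and the sample values are hypothetical; only the `original_vector_store_type` key comes from the snippet under review.

```python
from typing import Any, Dict, Optional


def build_extra(extra: Any, vector_store_type: Optional[str]) -> Dict[str, Any]:
    """Return metadata for the vector document without touching the input dict."""
    # dict(...) creates a new top-level dict, so adding keys below
    # does not modify the object that was passed in.
    extra_data: Dict[str, Any] = dict(extra) if isinstance(extra, dict) else {}
    if vector_store_type:
        extra_data["original_vector_store_type"] = vector_store_type
    return extra_data


row_extra = {"source": "youtube"}          # stands in for chunk_sql.extra
result = build_extra(row_extra, "chroma")  # stands in for chunk_sql.vector_store_type
```

The copy is shallow: nested values are still shared with the original, which is sufficient here because only a top-level key is added.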


Owner Author

@copilot apply changes based on this feedback

Comment thread scripts/migrate_vector_db.py Outdated
Comment on lines +47 to +58
total_chunks = db.query(ChunkIndexModel).count()
logger.info(f"Total chunks to migrate: {total_chunks}")

offset = 0
while offset < total_chunks:
    chunk_models_sql = (
        db.query(ChunkIndexModel)
        .order_by(ChunkIndexModel.created_at)
        .offset(offset)
        .limit(batch_size)
        .all()
    )

Copilot AI Apr 7, 2026

The count() + OFFSET/LIMIT pagination pattern becomes very slow on large tables, and SQLAlchemy will also keep each loaded ORM object in the session identity map (risking unbounded memory growth during long migrations). Consider keyset pagination (e.g., by (created_at, id)) or iterating with yield_per/streaming results, and expunging/clearing the session per batch.
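
Keyset pagination on (created_at, id) can be sketched as below. To stay self-contained, the example uses the standard-library `sqlite3` module rather than SQLAlchemy, and the table layout (including the `content` column) and the `iter_batches` helper are assumptions for illustration, not the project's schema or code.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, str, str]  # (id, created_at, content)


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in (created_at, id) order, resuming after the last row seen."""
    last: Optional[Tuple[str, str]] = None  # (created_at, id) of the previous batch's last row
    while True:
        if last is None:
            rows = conn.execute(
                "SELECT id, created_at, content FROM chunk_index "
                "ORDER BY created_at, id LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            # Strictly after the last key; id breaks ties between equal timestamps.
            rows = conn.execute(
                "SELECT id, created_at, content FROM chunk_index "
                "WHERE created_at > ? OR (created_at = ? AND id > ?) "
                "ORDER BY created_at, id LIMIT ?",
                (last[0], last[0], last[1], batch_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last = (rows[-1][1], rows[-1][0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_index (id TEXT PRIMARY KEY, created_at TEXT NOT NULL, content TEXT)"
)
conn.executemany(
    "INSERT INTO chunk_index VALUES (?, ?, ?)",
    [
        ("a", "2026-04-01T00:00:00+00:00", "one"),
        ("b", "2026-04-01T00:00:00+00:00", "two"),  # same timestamp as "a"
        ("c", "2026-04-02T00:00:00+00:00", "three"),
        ("d", "2026-04-03T00:00:00+00:00", "four"),
        ("e", "2026-04-03T00:00:00+00:00", "five"),
    ],
)
batches = [[row[0] for row in batch] for batch in iter_batches(conn, 2)]
```

Each query seeks directly past the last key instead of scanning and discarding `offset` rows, and including `id` in the key keeps rows with identical timestamps from being skipped or repeated across batches.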

Comment thread scripts/migrate_vector_db.py Outdated
Comment on lines +97 to +100
# create_documents will internally call the EmbeddingService for the texts and save them
vector_repo.create_documents(documents)

offset += batch_size

Copilot AI Apr 7, 2026

This script reuses the existing chunk_sql.id as the vector document ID. For backends that don't upsert on insert (e.g., Chroma uses add_texts in create_documents), rerunning this migration against a non-empty collection can fail with duplicate-ID errors or create duplicates. Consider either deleting/clearing the target collection up front (or documenting that scripts/clear_vector_db.py must be run first), or adding an explicit upsert/replace behavior for migration mode.
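
The rerun problem described here can be illustrated with a toy in-memory store. `InsertOnlyStore` is entirely hypothetical and only mimics a backend that rejects existing IDs on insert; it is not the API of Chroma or of the project's vector repository.

```python
from typing import Dict, List, Tuple


class InsertOnlyStore:
    """Toy store that refuses to insert a document whose ID already exists."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, docs: List[Tuple[str, str]]) -> None:
        # Validate the whole batch first so a failed call leaves the store unchanged.
        for doc_id, _ in docs:
            if doc_id in self._docs:
                raise ValueError(f"duplicate id: {doc_id}")
        for doc_id, text in docs:
            self._docs[doc_id] = text

    def clear(self) -> None:
        self._docs.clear()

    def __len__(self) -> int:
        return len(self._docs)


chunks = [("chunk-1", "hello"), ("chunk-2", "world")]
store = InsertOnlyStore()
store.add(chunks)  # first migration run succeeds

try:
    store.add(chunks)  # rerun against a non-empty collection
    rerun_failed = False
except ValueError:
    rerun_failed = True

store.clear()      # clearing up front makes the rerun safe
store.add(chunks)
```

An explicit upsert would be the alternative to clearing, at the cost of needing backend-specific support.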


Owner Author

@copilot apply changes based on this feedback

Contributor

Applied in commit 81f10f5. Added a --clear CLI flag that calls clear_vector_db() before the migration begins, preventing duplicate-ID errors on backends that don't upsert on insert (e.g. ChromaDB). Run with python scripts/migrate_vector_db.py --clear for a clean migration, or without the flag to preserve existing data.
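
One way such a flag might be wired up with `argparse` is sketched below. The `--clear` flag and the `clear_vector_db()` name come from this thread; `parse_args`, `run`, and the callables passed in are hypothetical, and the real script may be structured differently.

```python
import argparse
from typing import Callable, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Re-ingest SQL chunks into the configured vector store."
    )
    parser.add_argument(
        "--clear",
        action="store_true",
        help="Clear the target vector collection before migrating.",
    )
    return parser.parse_args(argv)


def run(
    argv: List[str],
    clear_vector_db: Callable[[], None],
    migrate: Callable[[], None],
) -> None:
    """Optionally clear the collection, then run the migration."""
    args = parse_args(argv)
    if args.clear:
        clear_vector_db()
    migrate()


# Record the order of calls using stand-in callables.
calls: List[str] = []
run(["--clear"], lambda: calls.append("clear"), lambda: calls.append("migrate"))
run([], lambda: calls.append("clear"), lambda: calls.append("migrate"))
```

Because `store_true` defaults to `False`, running without the flag leaves existing vector data in place, matching the behavior described above.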

Comment thread scripts/migrate_vector_db.py
codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 71.92982% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.41%. Comparing base (f0ddc5b) to head (6d4a77c).
⚠️ Report is 15 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/application/workers.py | 41.17% | 20 Missing ⚠️ |
| src/application/use_cases/manage_voice_profiles.py | 23.52% | 13 Missing ⚠️ |
| ...tion/api/routes/voice_profile_management_router.py | 78.94% | 3 Missing and 1 partial ⚠️ |
| ...cation/use_cases/diarization_ingestion_use_case.py | 57.14% | 3 Missing ⚠️ |
| src/presentation/api/dependencies.py | 50.00% | 3 Missing ⚠️ |
| src/domain/exception/youtube_exceptions.py | 0.00% | 2 Missing ⚠️ |
| ...on/use_cases/process_audio_diarization_pipeline.py | 85.71% | 1 Missing ⚠️ |
| ...c/infrastructure/services/voice_profile_service.py | 50.00% | 1 Missing ⚠️ |
| src/presentation/api/routes/ingest_router.py | 85.71% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   80.83%   80.41%   -0.42%     
==========================================
  Files          86       86              
  Lines        6675     6721      +46     
  Branches      767      773       +6     
==========================================
+ Hits         5396     5405       +9     
- Misses       1036     1071      +35     
- Partials      243      245       +2     

Copilot AI and others added 2 commits April 7, 2026 14:44
…cate-ID risk

Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/396faf44-8d73-42c9-92f6-f11c05b58f01

Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>
@ericksonlopes ericksonlopes added the sql, vector database, and backend labels Apr 7, 2026
@ericksonlopes ericksonlopes merged commit c0af155 into main Apr 7, 2026
4 checks passed
Labels

backend: Backend services, API development, and server-side logic related issues
sql: SQL queries, stored procedures, and database query optimization issues
vector database: Vector database operations, embeddings, and semantic search implementation issues

3 participants