feat: add vector migration script to re-ingest chunks with new models by ericksonlopes · Pull Request #57 · ericksonlopes/WhatYouSaid

ericksonlopes · 2026-04-07T14:32:06Z

This pull request introduces a new migration script to facilitate re-ingesting all chunk data from the SQL database into the configured vector database. The script is particularly useful when migrating to a new embedding model or switching vector database backends. It reads all chunk records, re-embeds them using the current embedding model, and uploads them in batches to the vector store.

Key additions:

Vector database migration utility:

Added a new script scripts/migrate_vector_db.py that reads all chunk_index records from the SQL database, generates embeddings using the current model, and uploads them to the configured vector database in batches. This helps ensure all chunks are stored using the latest embedding model and vector store setup.
Added a --clear flag that clears the target vector collection before migrating, preventing duplicate-ID errors on backends that do not upsert on insert (e.g. ChromaDB). Equivalent to running scripts/clear_vector_db.py first.
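
As an illustration of the flag described above, here is a minimal sketch of how a `--clear` option could be wired up with `argparse`. The `run` function and the `clear_collection` and `migrate` callables are hypothetical stand-ins for this example, not the script's actual code.

```python
import argparse
from typing import Callable, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Re-embed all chunks and upload them to the vector store."
    )
    parser.add_argument(
        "--clear",
        action="store_true",
        help="Clear the target vector collection before migrating.",
    )
    return parser.parse_args(argv)


def run(
    argv: Optional[List[str]],
    clear_collection: Callable[[], None],  # hypothetical: empties the collection
    migrate: Callable[[], None],           # hypothetical: performs the re-ingest
) -> None:
    args = parse_args(argv)
    if args.clear:
        # Clearing first avoids duplicate-ID errors on insert-only backends.
        clear_collection()
    migrate()
```

Passing the two steps in as callables keeps the sketch testable without a real vector store; an actual script would more likely call its own functions directly.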

Performance and correctness:

The script uses keyset pagination on (created_at, id) instead of OFFSET/LIMIT, which keeps queries fast on large tables. Loaded objects are expunged from the SQLAlchemy session after each batch to avoid unbounded memory growth.
chunk_sql.extra is shallow-copied before mutation to avoid marking ORM rows as dirty.
created_at is taken directly from the SQL record (timezone-aware, non-nullable) with no naive-datetime fallback.

Robustness and logging:

The script includes logging at each major step, error handling for migration failures, and checks to ensure the vector repository is ready before proceeding.
On failure, the script exits with a non-zero status code and logs the full traceback to aid debugging in automated runs.
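
The failure behaviour described above can be sketched roughly as follows. This is an assumed shape, not the script's real entry point: the `main` function, the logger name, and the `migrate` callable are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("migrate_vector_db")  # logger name is an assumption


def main(migrate: Callable[[], None]) -> int:
    """Run the migration and return a process exit code."""
    try:
        migrate()
    except Exception:
        # logger.exception logs at ERROR level and appends the full traceback.
        logger.exception("Vector migration failed")
        return 1
    logger.info("Vector migration finished")
    return 0
```

A script would typically finish with `sys.exit(main(...))` so that CI jobs and schedulers see the non-zero status.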

Copilot

Pull request overview

Adds an operational script to re-embed and re-upload all chunk_index rows from the SQL database into the currently configured vector store, intended to support embedding-model and/or vector-backend migrations.

Changes:

Introduces scripts/migrate_vector_db.py to load the configured embedding model, iterate over chunk_index in batches, and call vector_repo.create_documents(...).
Adds readiness checks and basic progress logging around batch uploads.


Copilot · 2026-04-07T14:37:07Z

+                    tokens_count=cast(int, chunk_sql.tokens_count),
+                    language=cast(str, chunk_sql.language),
+                    embedding_model=embedding_model_name,
+                    created_at=cast(datetime, chunk_sql.created_at or datetime.now()),


ChunkIndexModel.created_at is timezone-aware and non-nullable, but the fallback datetime.now() is naive and can produce inconsistent timestamps in vector metadata. Prefer using chunk_sql.created_at directly (or a UTC-aware fallback such as datetime.now(timezone.utc)), and avoid emitting naive datetimes.

Suggested change

-                    created_at=cast(datetime, chunk_sql.created_at or datetime.now()),
+                    created_at=cast(datetime, chunk_sql.created_at),
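
To illustrate why mixing naive and timezone-aware datetimes is a problem, the following standalone sketch uses only the standard library. The helper names `is_aware` and `can_compare` are made up for this example.

```python
from datetime import datetime, timezone

naive = datetime.now()               # no tzinfo attached
aware = datetime.now(timezone.utc)   # tzinfo is UTC


def is_aware(dt: datetime) -> bool:
    # Definition of "aware" from the datetime module documentation.
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


def can_compare(a: datetime, b: datetime) -> bool:
    # Ordering comparisons between naive and aware datetimes raise TypeError.
    try:
        a < b
        return True
    except TypeError:
        return False
```

Here `can_compare(naive, aware)` returns `False`, which is the kind of inconsistency that a naive fallback value could introduce into otherwise timezone-aware metadata.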

Copilot · 2026-04-07T14:37:08Z

+                extra_data = (
+                    cast(Dict[str, Any], chunk_sql.extra)
+                    if isinstance(chunk_sql.extra, dict)
+                    else {}
+                )
+                if chunk_sql.vector_store_type:
+                    extra_data["original_vector_store_type"] = (
+                        chunk_sql.vector_store_type
+                    )


extra_data reuses the dict object from chunk_sql.extra and then mutates it (adds original_vector_store_type). This can mark the ORM row as dirty and can also cause surprising side effects within the session. Make a shallow copy (e.g., dict(chunk_sql.extra) / {**chunk_sql.extra}) before mutating.
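
A minimal sketch of the copy-before-mutate pattern suggested above. `FakeChunk` is a hypothetical plain-Python stand-in for the ORM row, so this shows only the aliasing behaviour, not SQLAlchemy's change tracking.

```python
from typing import Any, Dict, Optional


class FakeChunk:
    """Hypothetical stand-in for an ORM row with `extra` and `vector_store_type`."""

    def __init__(
        self, extra: Optional[Dict[str, Any]], vector_store_type: Optional[str]
    ) -> None:
        self.extra = extra
        self.vector_store_type = vector_store_type


def build_extra(chunk: FakeChunk) -> Dict[str, Any]:
    # dict(...) creates a new top-level dict, so adding keys leaves chunk.extra alone.
    extra_data: Dict[str, Any] = (
        dict(chunk.extra) if isinstance(chunk.extra, dict) else {}
    )
    if chunk.vector_store_type:
        extra_data["original_vector_store_type"] = chunk.vector_store_type
    return extra_data
```

Because the copy is shallow, nested values inside `extra` are still shared with the original; that is sufficient here since only a top-level key is added.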

@copilot apply changes based on this feedback

Copilot · 2026-04-07T14:37:08Z

+        total_chunks = db.query(ChunkIndexModel).count()
+        logger.info(f"Total chunks to migrate: {total_chunks}")
+
+        offset = 0
+        while offset < total_chunks:
+            chunk_models_sql = (
+                db.query(ChunkIndexModel)
+                .order_by(ChunkIndexModel.created_at)
+                .offset(offset)
+                .limit(batch_size)
+                .all()
+            )


The count() + OFFSET/LIMIT pagination pattern becomes very slow on large tables, and SQLAlchemy will also keep each loaded ORM object in the session identity map (risking unbounded memory growth during long migrations). Consider keyset pagination (e.g., by (created_at, id)) or iterating with yield_per/streaming results, and expunging/clearing the session per batch.
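
The keyset approach mentioned above can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions, not the project's SQLAlchemy code: the two-column table layout, the text-valued timestamps, and the `iter_batches` and `demo` helpers are invented for the example.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, str]  # (created_at, id)


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows ordered by (created_at, id) without using OFFSET."""
    cursor: Optional[Row] = None
    while True:
        if cursor is None:
            rows = conn.execute(
                "SELECT created_at, id FROM chunk_index "
                "ORDER BY created_at, id LIMIT ?",
                (batch_size,),
            ).fetchall()
        else:
            last_created_at, last_id = cursor
            # Resume strictly after the last row seen; id breaks timestamp ties.
            rows = conn.execute(
                "SELECT created_at, id FROM chunk_index "
                "WHERE created_at > ? OR (created_at = ? AND id > ?) "
                "ORDER BY created_at, id LIMIT ?",
                (last_created_at, last_created_at, last_id, batch_size),
            ).fetchall()
        if not rows:
            return
        # Capture the cursor before handing the batch to the caller.
        cursor = rows[-1]
        yield rows


def demo() -> List[List[Row]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunk_index (id TEXT PRIMARY KEY, created_at TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO chunk_index (id, created_at) VALUES (?, ?)",
        [("a", "2026-01-01"), ("b", "2026-01-01"), ("c", "2026-01-02"),
         ("d", "2026-01-03"), ("e", "2026-01-03")],
    )
    return list(iter_batches(conn, batch_size=2))
```

Each query can seek to the cursor position through an index on (created_at, id), so the cost per batch does not grow with how far the migration has progressed, unlike OFFSET, which scans and discards all skipped rows.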

Copilot · 2026-04-07T14:37:08Z

+            # create_documents will internally call the EmbeddingService for the texts and save them
+            vector_repo.create_documents(documents)
+
+            offset += batch_size


This script reuses the existing chunk_sql.id as the vector document ID. For backends that don't upsert on insert (e.g., Chroma uses add_texts in create_documents), rerunning this migration against a non-empty collection can fail with duplicate-ID errors or create duplicates. Consider either deleting/clearing the target collection up front (or documenting that scripts/clear_vector_db.py must be run first), or adding an explicit upsert/replace behavior for migration mode.
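
The difference between insert-only and upsert semantics on a rerun can be shown with a toy in-memory store. `InMemoryStore` and `migrate` below are hypothetical and do not model any real vector backend's API.

```python
from typing import Dict


class InMemoryStore:
    """Toy vector store keyed by document ID; not a real backend."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Insert-only: a second write with the same ID is rejected.
        if doc_id in self.docs:
            raise ValueError(f"duplicate id: {doc_id}")
        self.docs[doc_id] = text

    def upsert(self, doc_id: str, text: str) -> None:
        # Insert-or-replace: reruns overwrite instead of failing.
        self.docs[doc_id] = text

    def clear(self) -> None:
        self.docs.clear()


def migrate(store: InMemoryStore, chunks: Dict[str, str], clear_first: bool) -> None:
    if clear_first:
        store.clear()
    for doc_id, text in chunks.items():
        store.add(doc_id, text)
```

Running `migrate` twice with `clear_first=False` raises on the second pass, while `clear_first=True` makes the rerun safe; switching the loop to `upsert` would be the alternative fix the comment describes.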

@copilot apply changes based on this feedback

Applied in commit 81f10f5. Added a --clear CLI flag that calls clear_vector_db() before the migration begins, preventing duplicate-ID errors on backends that don't upsert on insert (e.g. ChromaDB). Run with python scripts/migrate_vector_db.py --clear for a clean migration, or without the flag to preserve existing data.

codecov · 2026-04-07T14:38:16Z

Codecov Report

❌ Patch coverage is 71.92982% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.41%. Comparing base (f0ddc5b) to head (6d4a77c).
⚠️ Report is 15 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/application/workers.py	41.17%	20 Missing ⚠️
src/application/use_cases/manage_voice_profiles.py	23.52%	13 Missing ⚠️
...tion/api/routes/voice_profile_management_router.py	78.94%	3 Missing and 1 partial ⚠️
...cation/use_cases/diarization_ingestion_use_case.py	57.14%	3 Missing ⚠️
src/presentation/api/dependencies.py	50.00%	3 Missing ⚠️
src/domain/exception/youtube_exceptions.py	0.00%	2 Missing ⚠️
...on/use_cases/process_audio_diarization_pipeline.py	85.71%	1 Missing ⚠️
...c/infrastructure/services/voice_profile_service.py	50.00%	1 Missing ⚠️
src/presentation/api/routes/ingest_router.py	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   80.83%   80.41%   -0.42%     
==========================================
  Files          86       86              
  Lines        6675     6721      +46     
  Branches      767      773       +6     
==========================================
+ Hits         5396     5405       +9     
- Misses       1036     1071      +35     
- Partials      243      245       +2



feat: add vector migration script to re-ingest chunks with new models

c47fbd0

ericksonlopes self-assigned this Apr 7, 2026

Copilot AI review requested due to automatic review settings April 7, 2026 14:32

Copilot started reviewing on behalf of ericksonlopes April 7, 2026 14:32

Copilot AI reviewed Apr 7, 2026

Copilot started work on behalf of ericksonlopes April 7, 2026 14:40

Copilot AI and others added 2 commits April 7, 2026 14:44

fix(scripts): address all review feedback on migrate_vector_db.py

3d0dcdb

Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/dba9f09e-11b2-4cbe-90cf-5d4e90521cc6 Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

fix(scripts): move keyset cursor capture before document loop, clarify shallow copy

9b0926c

Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/dba9f09e-11b2-4cbe-90cf-5d4e90521cc6 Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

Copilot finished work on behalf of ericksonlopes April 7, 2026 14:47

Copilot started work on behalf of ericksonlopes April 7, 2026 14:52

feat(scripts): add --clear flag to migrate_vector_db to resolve duplicate-ID risk

81f10f5

Agent-Logs-Url: https://github.com/ericksonlopes/WhatYouSaid/sessions/396faf44-8d73-42c9-92f6-f11c05b58f01 Co-authored-by: ericksonlopes <62525983+ericksonlopes@users.noreply.github.com>

Copilot finished work on behalf of ericksonlopes April 7, 2026 14:56

ericksonlopes added the sql, vector database, and backend labels Apr 7, 2026

ericksonlopes added 5 commits April 7, 2026 12:02

chore: exclude scripts from testing, coverage and static analysis

b390dd9

style: fix import sorting and formatting with ruff across the repository

509dc7f

chore: increase ruff line-length to 100

0a3b002

feat: implement async voice training to resolve 504 timeouts and cleanup lints

6b7c354

test: fix failing voice profile router test after async refactor

6d4a77c

ericksonlopes merged commit c0af155 into main Apr 7, 2026
4 checks passed

ericksonlopes merged 9 commits into main from feature/vector-migration-script
