feat: Implementation of Duplication Tests and Sidebar UX Polishing#60
Conversation
There was a problem hiding this comment.
Pull request overview
Adds end-to-end duplicate-chunk detection (DB + service + API + UI), expands multi-context filtering UX, and refactors SQL connector imports to the new src.infrastructure.connectors.connector_sql module.
Changes:
- Introduce chunk-duplicate persistence + service logic, API endpoints, and background worker integration (including Alembic migrations).
- Add tests for duplicate repository/service/router; adjust test DB connector wiring.
- Frontend: add “Duplicates” view, global multi-select context sidebar with search, and related navigation/i18n updates.
Reviewed changes
Copilot reviewed 60 out of 62 changed files in this pull request and generated 20 comments.
Show a summary per file
| File | Description |
|---|---|
| tmp/cleanup_db.py | Scratch DB cleanup helper for duplicates/is_active. |
| tests/presentation/api/routes/test_duplicate_router.py | API tests for duplicates endpoints via dependency overrides. |
| tests/infrastructure/services/test_voice_profile_service.py | Whitespace-only cleanup in existing tests. |
| tests/infrastructure/services/test_chunk_duplicate_service.py | Unit tests for duplicate service logic. |
| tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py | SQL repo tests for duplicates CRUD/filtering. |
| tests/conftest.py | Update test DB connector monkeypatch to new connector module. |
| test_playlist.py | Adds a manual YouTube playlist script but named like a pytest test. |
| test_dispatcher.py | Adds a manual dispatcher script but named like a pytest test. |
| src/presentation/api/schemas/duplicate_schemas.py | New Pydantic schemas for duplicates API. |
| src/presentation/api/routes/settings_router.py | Switch Connector import to new connector module. |
| src/presentation/api/routes/ingest_router.py | Ensure background YouTube ingestions always have a job id + metadata. |
| src/presentation/api/routes/duplicate_router.py | New REST routes for listing/updating duplicates + analysis + deactivate. |
| src/presentation/api/routes/audio_diarization_and_recognition_router.py | Support multiple subject_id query params. |
| src/presentation/api/dependencies.py | Wire duplicate repo/service dependencies; switch DBSessionFactory import. |
| src/infrastructure/services/chunk_duplicate_service.py | New service for detecting/registering duplicates and deactivation. |
| src/infrastructure/repositories/sql/user_repository.py | Switch Connector import to new connector module. |
| src/infrastructure/repositories/sql/models/voice_record.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/user.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/knowledge_subject.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/ingestion_job.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/diarization_record.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/content_source.py | Switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/chunk_index.py | Add is_active column; switch Base import to new connector module. |
| src/infrastructure/repositories/sql/models/chunk_duplicate.py | New ORM model for chunk_duplicates. |
| src/infrastructure/repositories/sql/knowledge_subject_repository.py | Switch Connector import to new connector module. |
| src/infrastructure/repositories/sql/ingestion_job_repository.py | Switch Connector import to new connector module. |
| src/infrastructure/repositories/sql/diarization_repository.py | Add multi-subject filtering; safer casts/UUID parsing changes. |
| src/infrastructure/repositories/sql/content_source_repository.py | Switch Connector import to new connector module. |
| src/infrastructure/repositories/sql/chunk_index_repository.py | Switch Connector import; add update_is_active. |
| src/infrastructure/repositories/sql/chunk_duplicate_repository.py | New repository for duplicates table. |
| src/infrastructure/extractors/youtube_extractor.py | Validate downloaded MP3 header/size after yt-dlp download. |
| src/infrastructure/connectors/connector_sql.py | Minor cleanup; centralized SQLAlchemy engine/session/base. |
| src/infrastructure/connectors/init.py | Package init for connectors module. |
| src/domain/entities/chunk_duplicate_entity.py | New domain entity for duplicate groups. |
| src/application/workers.py | Enqueue duplicate detection after ingestions; add duplicate worker; job-status improvements. |
| src/application/use_cases/diarization_ingestion_use_case.py | Refactor into smaller helpers + rollback method; transcript formatting tweaks. |
| scripts/migrate_vector_db.py | Switch DB session import to new connector module. |
| scripts/dump_database.py | Switch engine import (currently incorrect path). |
| scripts/clear_sql_db.py | Switch engine/Base import (currently incorrect path). |
| README.md | Restructure/expand README and quick-start instructions. |
| main.py | Register duplicates router + duplicate worker task. |
| frontend/src/types.ts | Add duplicates view state + ChunkDuplicate type. |
| frontend/src/store/AppContext.tsx | Add duplicates view option; remove auto-subject selection; persist empty selection. |
| frontend/src/services/api.ts | Add duplicates API methods; allow multi subject_id for diarizations. |
| frontend/src/locales/pt-BR.json | Add i18n for sidebar context + duplicates view + misc keys. |
| frontend/src/locales/en.json | Add i18n for sidebar context + duplicates view + misc keys. |
| frontend/src/components/SubjectIcon.tsx | Centralize subject icon mapping (lucide). |
| frontend/src/components/SidebarContext.tsx | New multi-select context sidebar with search + “All means empty selection”. |
| frontend/src/components/Sidebar.tsx | Add duplicates nav item; add expandable group behavior. |
| frontend/src/components/SearchView.tsx | Add local contexts popover; refine search toolbar UX; icon updates. |
| frontend/src/components/LocalContextSelector.tsx | Use shared SubjectIcon instead of emoji fallback. |
| frontend/src/components/KnowledgeAdminView.tsx | Use shared SubjectIcon list instead of duplicating icons. |
| frontend/src/components/DuplicatesView.tsx | New UI for reviewing duplicates + pagination + actions. |
| frontend/src/components/DiarizationView.tsx | Use global sidebar; allow multi-subject filter in API call; small audio cleanup. |
| frontend/src/components/ContextSelector.tsx | Show subject icons in selector UI. |
| frontend/src/components/AddSubjectModal.tsx | Use shared SubjectIcon list instead of duplicating icons. |
| frontend/src/App.tsx | Add duplicates route; add global right sidebar for data views; multi-subject filtering for sources. |
| docs/issues/issue-duplication-tests-ux.md | Tracks the work items included in this PR. |
| alembic/versions/84524e052673_add_content_source_id_to_duplicates.py | Migration to add content_source_id FK to duplicates. |
| alembic/versions/646a175ac845_add_chunk_duplicates_table_and_is_.py | Migration to create duplicates table + chunk_index.is_active. |
| alembic/env.py | Switch Base import; safer context.as_batch access in rewrite hooks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # DiarizationRepository needs a DB session | ||
| from src.infrastructure.repositories.sql.connector import Session as DBSession | ||
| from infrastructure.connectors.connector_sql import Session as DBSession |
There was a problem hiding this comment.
This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). There is no top-level infrastructure package in the repo; the module lives under src/infrastructure/..., so this will raise ModuleNotFoundError at runtime. Use from src.infrastructure.connectors.connector_sql import Session as DBSession here.
| from infrastructure.connectors.connector_sql import Session as DBSession | |
| from src.infrastructure.connectors.connector_sql import Session as DBSession |
|
|
||
| def _audio_diarization_subprocess(cmd_dict: dict): | ||
| """Run audio diarization in a separate process to avoid torch/CUDA thread deadlocks.""" | ||
| from infrastructure.connectors.connector_sql import ( |
There was a problem hiding this comment.
This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.
| from infrastructure.connectors.connector_sql import ( | |
| from src.infrastructure.connectors.connector_sql import ( |
| try: | ||
| from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor | ||
| from src.infrastructure.repositories.sql.connector import ( | ||
| from infrastructure.connectors.connector_sql import ( |
There was a problem hiding this comment.
This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.
| from infrastructure.connectors.connector_sql import ( | |
| from src.infrastructure.connectors.connector_sql import ( |
| logger.error("Audio diarization subprocess exited with code %d", process.exitcode) | ||
| if cmd.diarization_id: | ||
| from src.infrastructure.repositories.sql.connector import ( | ||
| from infrastructure.connectors.connector_sql import ( |
There was a problem hiding this comment.
This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.
| from infrastructure.connectors.connector_sql import ( | |
| from src.infrastructure.connectors.connector_sql import ( |
| sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))) | ||
|
|
||
| from src.infrastructure.repositories.sql.connector import engine, Base | ||
| from infrastructure.connectors.connector_sql import engine, Base |
There was a problem hiding this comment.
This import path is missing the src. prefix (from infrastructure.connectors.connector_sql import engine, Base). The SQL connector module lives under src.infrastructure..., so this will raise ModuleNotFoundError. Import from src.infrastructure.connectors.connector_sql instead.
| from infrastructure.connectors.connector_sql import engine, Base | |
| from src.infrastructure.connectors.connector_sql import engine, Base |
| > | ||
| <span className="flex items-center gap-2 font-medium"> | ||
| <Database className="w-3.5 h-3.5" /> | ||
| {t('sidebar.contexts.title')} • All |
There was a problem hiding this comment.
The contexts popover uses a hardcoded "• All" suffix ({t('sidebar.contexts.title')} • All). Since sidebar.contexts.title_all / common.all translations exist, this should be pulled from i18n to avoid mixed-language UI and duplicated strings.
| {t('sidebar.contexts.title')} • All | |
| {t('sidebar.contexts.title')} • {t('common.all')} |
| # Enrich entities with chunk content if needed for UI | ||
| results = [] | ||
| for entity in entities: | ||
| resp = ChunkDuplicateResponse.model_validate(entity) | ||
| chunks_info = [] | ||
| for cid in entity.chunk_ids: | ||
| chunk = chunk_service.get_by_id(cid) | ||
| if chunk: |
There was a problem hiding this comment.
list_duplicates enriches each duplicate group by calling chunk_service.get_by_id(...) inside nested loops. This creates an N+1 query pattern (up to limit * len(chunk_ids) SQL calls) and can become slow as duplicates grow. Prefer fetching chunks in bulk (e.g., repository WHERE id IN (...)) and mapping them in-memory before building the response.
| from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor | ||
|
|
||
|
|
||
| def test_playlist(): | ||
| # Use the playlist provided by the user | ||
| playlist_url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59" | ||
|
|
||
| print(f"Testing playlist extraction for: {playlist_url}") | ||
| extractor = YoutubeExtractor(language="pt") | ||
|
|
||
| try: | ||
| videos = extractor.extract_playlist_videos(playlist_url) | ||
| print(f"Extracted {len(videos)} videos.") | ||
| for i, url in enumerate(videos[:5]): | ||
| print(f" {i+1}: {url}") | ||
|
|
||
| if not videos: | ||
| print("FAILED: No videos extracted.") | ||
| else: | ||
| print("SUCCESS: Videos extracted.") | ||
| except Exception as e: | ||
| print(f"ERROR: {e}") | ||
|
|
||
| if __name__ == "__main__": | ||
| test_playlist() |
There was a problem hiding this comment.
This file is named like a pytest test module (test_*.py) but appears to be an ad-hoc/manual script (network-dependent behavior, print-only checks, __main__ runner). Pytest will collect and execute it in CI, which is likely to be flaky/offline-failing. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or mark the module/tests as skipped.
| import os | ||
| import sys | ||
| from unittest.mock import MagicMock | ||
|
|
||
| # Add the project root to sys.path | ||
| sys.path.append(os.getcwd()) | ||
|
|
||
| # Mock out dependencies before importing workers | ||
| sys.modules['src.presentation.api.dependencies'] = MagicMock() | ||
| mock_job_service = MagicMock() | ||
|
|
||
| class MockContext: | ||
| def __init__(self, job_svc): | ||
| self.job_service = job_svc | ||
|
|
||
| import src.presentation.api.dependencies as deps # noqa: E402 | ||
|
|
||
| deps.resolve_ingestion_context = MagicMock(return_value=MockContext(mock_job_service)) | ||
|
|
||
| from src.application.dtos.commands.ingest_youtube_command import IngestYoutubeCommand # noqa: E402 | ||
| from src.application.dtos.enums.youtube_data_type import YoutubeDataType # noqa: E402 | ||
| from src.application.workers import run_youtube_dispatcher_worker # noqa: E402 | ||
|
|
||
|
|
||
| def test_dispatcher(): | ||
| # Setup mock app state | ||
| mock_app = MagicMock() | ||
| mock_app.state.task_queue = MagicMock() | ||
|
|
||
| # Patch _get_app to return our mock | ||
| import src.application.workers as workers # noqa: E402 | ||
| workers._get_app = MagicMock(return_value=mock_app) | ||
|
|
||
| # Create command for a small playlist (or the user's one) | ||
| # Using a known small playlist to speed up test if possible | ||
| # But user's URL is fine since we are testing logic | ||
| url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59" | ||
| cmd = IngestYoutubeCommand( | ||
| video_url=url, | ||
| data_type=YoutubeDataType.PLAYLIST, | ||
| ingestion_job_id="test-job-uuid", | ||
| subject_id="test-subject" | ||
| ) | ||
|
|
||
| print(f"Testing dispatcher with URL: {url}") | ||
| run_youtube_dispatcher_worker(cmd) | ||
|
|
||
| # Assertions | ||
| print("\nVerifying Job Service calls:") | ||
| for call in mock_job_service.update_job_status.call_args_list: | ||
| print(f" - Status update: {call[1]}") | ||
|
|
||
| print("\nVerifying Task Queue calls:") | ||
| enqueue_calls = mock_app.state.task_queue.enqueue.call_args_list | ||
| print(f" - Tasks enqueued: {len(enqueue_calls)}") | ||
| if len(enqueue_calls) > 0: | ||
| print(f" - First task URL: {enqueue_calls[0][0][1].video_url}") | ||
|
|
||
| if __name__ == "__main__": | ||
| test_dispatcher() |
There was a problem hiding this comment.
This file is named like a pytest test module (test_dispatcher.py) but is a manual script (prints, __main__ runner, relies on worker internals). It will be collected by pytest in CI and can cause flaky failures. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or explicitly skip it.
Description
Implemented a comprehensive test suite for the chunk duplication feature, covering repository, service, and API layers. Additionally, improved the sidebar UX by enabling simple-toggle multi-selection, fixing indicator icon bugs, and adding a search-by-name field for subjects.
Tasks
tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.pytests/infrastructure/services/test_chunk_duplicate_service.pytests/presentation/api/routes/test_duplicate_router.pySidebarContext.tsxto enable simple toggle selection for multiple bases.SidebarContext.tsx.SidebarContext.tsx.tests/conftest.pyimport path for infrastructure.Additional Context
The sidebar changes eliminate the need for Ctrl+Click, making the multi-knowledge selection more discoverable. The search field ensures usability as the number of knowledge bases grows.