feat: Implementation of Duplication Tests and Sidebar UX Polishing by ericksonlopes · Pull Request #60 · ericksonlopes/WhatYouSaid

ericksonlopes · 2026-04-08T15:11:45Z

Description

Implemented a comprehensive test suite for the chunk duplication feature, covering repository, service, and API layers. Additionally, improved the sidebar UX by enabling simple-toggle multi-selection, fixing indicator icon bugs, and adding a search-by-name field for subjects.

Tasks

Create SQL repository tests for chunk duplicates tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py
Create service tests for duplicate detection logic tests/infrastructure/services/test_chunk_duplicate_service.py
Create API router tests for duplicate endpoints tests/presentation/api/routes/test_duplicate_router.py
Update SidebarContext.tsx to enable simple toggle selection for multiple bases.
Fix Check icon bug in multi-selection in SidebarContext.tsx.
Add search filter field in SidebarContext.tsx.
Fix tests/conftest.py import path for infrastructure.

Additional Context

The sidebar changes eliminate the need for Ctrl+Click, making the multi-knowledge selection more discoverable. The search field ensures usability as the number of knowledge bases grows.

…ons in search view

…ements

codecov · 2026-04-08T15:18:42Z

Codecov Report

❌ Patch coverage is 59.84043% with 151 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.76%. Comparing base (b293786) to head (f6a50a9).
⚠️ Report is 8 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/application/workers.py	19.73%	56 Missing and 5 partials ⚠️
...infrastructure/services/chunk_duplicate_service.py	72.22%	15 Missing and 5 partials ⚠️
...ure/repositories/sql/chunk_duplicate_repository.py	75.38%	14 Missing and 2 partials ⚠️
src/infrastructure/extractors/youtube_extractor.py	27.77%	13 Missing ⚠️
...ructure/repositories/sql/chunk_index_repository.py	13.33%	13 Missing ⚠️
src/presentation/api/routes/duplicate_router.py	67.50%	10 Missing and 3 partials ⚠️
...ructure/repositories/sql/diarization_repository.py	14.28%	6 Missing ⚠️
...cation/use_cases/diarization_ingestion_use_case.py	92.72%	2 Missing and 2 partials ⚠️
src/presentation/api/routes/ingest_router.py	69.23%	2 Missing and 2 partials ⚠️
src/presentation/api/dependencies.py	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   80.86%   79.76%   -1.11%     
==========================================
  Files          86       89       +3     
  Lines        6737     7052     +315     
  Branches      777      821      +44     
==========================================
+ Hits         5448     5625     +177     
- Misses       1041     1164     +123     
- Partials      248      263      +15

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copilot

Pull request overview

Adds end-to-end duplicate-chunk detection (DB + service + API + UI), expands multi-context filtering UX, and refactors SQL connector imports to the new src.infrastructure.connectors.connector_sql module.

Changes:

Introduce chunk-duplicate persistence + service logic, API endpoints, and background worker integration (including Alembic migrations).
Add tests for duplicate repository/service/router; adjust test DB connector wiring.
Frontend: add “Duplicates” view, global multi-select context sidebar with search, and related navigation/i18n updates.

Reviewed changes

Copilot reviewed 60 out of 62 changed files in this pull request and generated 20 comments.

Show a summary per file

File	Description
tmp/cleanup_db.py	Scratch DB cleanup helper for duplicates/is_active.
tests/presentation/api/routes/test_duplicate_router.py	API tests for duplicates endpoints via dependency overrides.
tests/infrastructure/services/test_voice_profile_service.py	Whitespace-only cleanup in existing tests.
tests/infrastructure/services/test_chunk_duplicate_service.py	Unit tests for duplicate service logic.
tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py	SQL repo tests for duplicates CRUD/filtering.
tests/conftest.py	Update test DB connector monkeypatch to new connector module.
test_playlist.py	Adds a manual YouTube playlist script but named like a pytest test.
test_dispatcher.py	Adds a manual dispatcher script but named like a pytest test.
src/presentation/api/schemas/duplicate_schemas.py	New Pydantic schemas for duplicates API.
src/presentation/api/routes/settings_router.py	Switch Connector import to new connector module.
src/presentation/api/routes/ingest_router.py	Ensure background YouTube ingestions always have a job id + metadata.
src/presentation/api/routes/duplicate_router.py	New REST routes for listing/updating duplicates + analysis + deactivate.
src/presentation/api/routes/audio_diarization_and_recognition_router.py	Support multiple `subject_id` query params.
src/presentation/api/dependencies.py	Wire duplicate repo/service dependencies; switch DBSessionFactory import.
src/infrastructure/services/chunk_duplicate_service.py	New service for detecting/registering duplicates and deactivation.
src/infrastructure/repositories/sql/user_repository.py	Switch Connector import to new connector module.
src/infrastructure/repositories/sql/models/voice_record.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/user.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/knowledge_subject.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/ingestion_job.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/diarization_record.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/content_source.py	Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/chunk_index.py	Add `is_active` column; switch Base import to new connector module.
src/infrastructure/repositories/sql/models/chunk_duplicate.py	New ORM model for `chunk_duplicates`.
src/infrastructure/repositories/sql/knowledge_subject_repository.py	Switch Connector import to new connector module.
src/infrastructure/repositories/sql/ingestion_job_repository.py	Switch Connector import to new connector module.
src/infrastructure/repositories/sql/diarization_repository.py	Add multi-subject filtering; safer casts/UUID parsing changes.
src/infrastructure/repositories/sql/content_source_repository.py	Switch Connector import to new connector module.
src/infrastructure/repositories/sql/chunk_index_repository.py	Switch Connector import; add `update_is_active`.
src/infrastructure/repositories/sql/chunk_duplicate_repository.py	New repository for duplicates table.
src/infrastructure/extractors/youtube_extractor.py	Validate downloaded MP3 header/size after yt-dlp download.
src/infrastructure/connectors/connector_sql.py	Minor cleanup; centralized SQLAlchemy engine/session/base.
src/infrastructure/connectors/init.py	Package init for connectors module.
src/domain/entities/chunk_duplicate_entity.py	New domain entity for duplicate groups.
src/application/workers.py	Enqueue duplicate detection after ingestions; add duplicate worker; job-status improvements.
src/application/use_cases/diarization_ingestion_use_case.py	Refactor into smaller helpers + rollback method; transcript formatting tweaks.
scripts/migrate_vector_db.py	Switch DB session import to new connector module.
scripts/dump_database.py	Switch engine import (currently incorrect path).
scripts/clear_sql_db.py	Switch engine/Base import (currently incorrect path).
README.md	Restructure/expand README and quick-start instructions.
main.py	Register duplicates router + duplicate worker task.
frontend/src/types.ts	Add `duplicates` view state + `ChunkDuplicate` type.
frontend/src/store/AppContext.tsx	Add `duplicates` view option; remove auto-subject selection; persist empty selection.
frontend/src/services/api.ts	Add duplicates API methods; allow multi `subject_id` for diarizations.
frontend/src/locales/pt-BR.json	Add i18n for sidebar context + duplicates view + misc keys.
frontend/src/locales/en.json	Add i18n for sidebar context + duplicates view + misc keys.
frontend/src/components/SubjectIcon.tsx	Centralize subject icon mapping (lucide).
frontend/src/components/SidebarContext.tsx	New multi-select context sidebar with search + “All means empty selection”.
frontend/src/components/Sidebar.tsx	Add duplicates nav item; add expandable group behavior.
frontend/src/components/SearchView.tsx	Add local contexts popover; refine search toolbar UX; icon updates.
frontend/src/components/LocalContextSelector.tsx	Use shared `SubjectIcon` instead of emoji fallback.
frontend/src/components/KnowledgeAdminView.tsx	Use shared `SubjectIcon` list instead of duplicating icons.
frontend/src/components/DuplicatesView.tsx	New UI for reviewing duplicates + pagination + actions.
frontend/src/components/DiarizationView.tsx	Use global sidebar; allow multi-subject filter in API call; small audio cleanup.
frontend/src/components/ContextSelector.tsx	Show subject icons in selector UI.
frontend/src/components/AddSubjectModal.tsx	Use shared `SubjectIcon` list instead of duplicating icons.
frontend/src/App.tsx	Add duplicates route; add global right sidebar for data views; multi-subject filtering for sources.
docs/issues/issue-duplication-tests-ux.md	Tracks the work items included in this PR.
alembic/versions/84524e052673_add_content_source_id_to_duplicates.py	Migration to add `content_source_id` FK to duplicates.
alembic/versions/646a175ac845_add_chunk_duplicates_table_and_is_.py	Migration to create duplicates table + `chunk_index.is_active`.
alembic/env.py	Switch Base import; safer `context.as_batch` access in rewrite hooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-08T15:20:40Z


        # DiarizationRepository needs a DB session
-        from src.infrastructure.repositories.sql.connector import Session as DBSession
+        from infrastructure.connectors.connector_sql import Session as DBSession


This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). There is no top-level infrastructure package in the repo; the module lives under src/infrastructure/..., so this will raise ModuleNotFoundError at runtime. Use from src.infrastructure.connectors.connector_sql import Session as DBSession here.

Suggested change

from infrastructure.connectors.connector_sql import Session as DBSession

from src.infrastructure.connectors.connector_sql import Session as DBSession

Copilot · 2026-04-08T15:20:40Z


 def _audio_diarization_subprocess(cmd_dict: dict):
    """Run audio diarization in a separate process to avoid torch/CUDA thread deadlocks."""
+    from infrastructure.connectors.connector_sql import (


This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change

from infrastructure.connectors.connector_sql import (

from src.infrastructure.connectors.connector_sql import (

Copilot · 2026-04-08T15:20:41Z

    try:
-        from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor
-        from src.infrastructure.repositories.sql.connector import (
+        from infrastructure.connectors.connector_sql import (


This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change

from infrastructure.connectors.connector_sql import (

from src.infrastructure.connectors.connector_sql import (

Copilot · 2026-04-08T15:20:41Z

            logger.error("Audio diarization subprocess exited with code %d", process.exitcode)
            if cmd.diarization_id:
-                from src.infrastructure.repositories.sql.connector import (
+                from infrastructure.connectors.connector_sql import (


This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change

from infrastructure.connectors.connector_sql import (

from src.infrastructure.connectors.connector_sql import (

Copilot · 2026-04-08T15:20:45Z

 sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

-from src.infrastructure.repositories.sql.connector import engine, Base
+from infrastructure.connectors.connector_sql import engine, Base


This import path is missing the src. prefix (from infrastructure.connectors.connector_sql import engine, Base). The SQL connector module lives under src.infrastructure..., so this will raise ModuleNotFoundError. Import from src.infrastructure.connectors.connector_sql instead.

Suggested change

from infrastructure.connectors.connector_sql import engine, Base

from src.infrastructure.connectors.connector_sql import engine, Base

Copilot · 2026-04-08T15:20:45Z

+                    >
+                      <span className="flex items-center gap-2 font-medium">
+                        <Database className="w-3.5 h-3.5" />
+                        {t('sidebar.contexts.title')} • All


The contexts popover uses a hardcoded "• All" suffix ({t('sidebar.contexts.title')} • All). Since sidebar.contexts.title_all / common.all translations exist, this should be pulled from i18n to avoid mixed-language UI and duplicated strings.

Suggested change

{t('sidebar.contexts.title')} • All

{t('sidebar.contexts.title')} • {t('common.all')}

Copilot · 2026-04-08T15:20:45Z

+    # Enrich entities with chunk content if needed for UI
+    results = []
+    for entity in entities:
+        resp = ChunkDuplicateResponse.model_validate(entity)
+        chunks_info = []
+        for cid in entity.chunk_ids:
+            chunk = chunk_service.get_by_id(cid)
+            if chunk:


list_duplicates enriches each duplicate group by calling chunk_service.get_by_id(...) inside nested loops. This creates an N+1 query pattern (up to limit * len(chunk_ids) SQL calls) and can become slow as duplicates grow. Prefer fetching chunks in bulk (e.g., repository WHERE id IN (...)) and mapping them in-memory before building the response.

Copilot · 2026-04-08T15:20:46Z

+from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor
+
+
+def test_playlist():
+    # Use the playlist provided by the user
+    playlist_url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59"
+
+    print(f"Testing playlist extraction for: {playlist_url}")
+    extractor = YoutubeExtractor(language="pt")
+
+    try:
+        videos = extractor.extract_playlist_videos(playlist_url)
+        print(f"Extracted {len(videos)} videos.")
+        for i, url in enumerate(videos[:5]):
+            print(f"  {i+1}: {url}")
+
+        if not videos:
+            print("FAILED: No videos extracted.")
+        else:
+            print("SUCCESS: Videos extracted.")
+    except Exception as e:
+        print(f"ERROR: {e}")
+
+if __name__ == "__main__":
+    test_playlist()


This file is named like a pytest test module (test_*.py) but appears to be an ad-hoc/manual script (network-dependent behavior, print-only checks, __main__ runner). Pytest will collect and execute it in CI, which is likely to be flaky/offline-failing. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or mark the module/tests as skipped.

Copilot · 2026-04-08T15:20:46Z

+import os
+import sys
+from unittest.mock import MagicMock
+
+# Add the project root to sys.path
+sys.path.append(os.getcwd())
+
+# Mock out dependencies before importing workers
+sys.modules['src.presentation.api.dependencies'] = MagicMock()
+mock_job_service = MagicMock()
+
+class MockContext:
+    def __init__(self, job_svc):
+        self.job_service = job_svc
+
+import src.presentation.api.dependencies as deps  # noqa: E402
+
+deps.resolve_ingestion_context = MagicMock(return_value=MockContext(mock_job_service))
+
+from src.application.dtos.commands.ingest_youtube_command import IngestYoutubeCommand  # noqa: E402
+from src.application.dtos.enums.youtube_data_type import YoutubeDataType  # noqa: E402
+from src.application.workers import run_youtube_dispatcher_worker  # noqa: E402
+
+
+def test_dispatcher():
+    # Setup mock app state
+    mock_app = MagicMock()
+    mock_app.state.task_queue = MagicMock()
+
+    # Patch _get_app to return our mock
+    import src.application.workers as workers  # noqa: E402
+    workers._get_app = MagicMock(return_value=mock_app)
+
+    # Create command for a small playlist (or the user's one)
+    # Using a known small playlist to speed up test if possible
+    # But user's URL is fine since we are testing logic
+    url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59"
+    cmd = IngestYoutubeCommand(
+        video_url=url,
+        data_type=YoutubeDataType.PLAYLIST,
+        ingestion_job_id="test-job-uuid",
+        subject_id="test-subject"
+    )
+
+    print(f"Testing dispatcher with URL: {url}")
+    run_youtube_dispatcher_worker(cmd)
+
+    # Assertions
+    print("\nVerifying Job Service calls:")
+    for call in mock_job_service.update_job_status.call_args_list:
+        print(f"  - Status update: {call[1]}")
+
+    print("\nVerifying Task Queue calls:")
+    enqueue_calls = mock_app.state.task_queue.enqueue.call_args_list
+    print(f"  - Tasks enqueued: {len(enqueue_calls)}")
+    if len(enqueue_calls) > 0:
+        print(f"  - First task URL: {enqueue_calls[0][0][1].video_url}")
+
+if __name__ == "__main__":
+    test_dispatcher()


This file is named like a pytest test module (test_dispatcher.py) but is a manual script (prints, __main__ runner, relies on worker internals). It will be collected by pytest in CI and can cause flaky failures. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or explicitly skip it.

…int fixes

ericksonlopes added 3 commits April 7, 2026 17:19

feat: harden diarization ingestion and refactor readme

cfe4bf8

fix(frontend): implement shared SubjectIcon and fix missing translati…

fd708b1

…ons in search view

feat: implementation of chunk duplication tests and sidebar UX improv…

f9de053

…ements

Copilot AI review requested due to automatic review settings April 8, 2026 15:11

ericksonlopes linked an issue Apr 8, 2026 that may be closed by this pull request

feat: Implementation of Duplication Tests and Knowledge Context UX Enhancements #59

Closed

7 tasks

Copilot started reviewing on behalf of ericksonlopes April 8, 2026 15:12 View session

style: fix lint and typing errors across project

f4cf6a6

Copilot AI reviewed Apr 8, 2026

View reviewed changes

ericksonlopes added 3 commits April 8, 2026 13:06

feat: complete chunk duplication system with worker integration and l…

99ad8d1

…int fixes

fix: update sql connector imports and fix failing tests

3961904

style: fix linting issues (import sorting and line length)

f6a50a9

ericksonlopes added sql SQL queries, stored procedures, and database query optimization issues backend Backend services, API development, and server-side logic related issues feature New feature or request frontend Frontend UI/UX, web interface, and client-side development issues labels Apr 8, 2026

ericksonlopes merged commit c8119ae into main Apr 8, 2026
5 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implementation of Duplication Tests and Sidebar UX Polishing#60

feat: Implementation of Duplication Tests and Sidebar UX Polishing#60
ericksonlopes merged 7 commits into
mainfrom
feat/duplication-tests-ux

ericksonlopes commented Apr 8, 2026

Uh oh!

codecov Bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	from infrastructure.connectors.connector_sql import Session as DBSession
	from src.infrastructure.connectors.connector_sql import Session as DBSession

	{t('sidebar.contexts.title')} • All
	{t('sidebar.contexts.title')} • {t('common.all')}

Conversation

ericksonlopes commented Apr 8, 2026

Description

Tasks

Additional Context

Uh oh!

codecov Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 8, 2026 •

edited

Loading