Skip to content

feat: Implementation of Duplication Tests and Sidebar UX Polishing#60

Merged
ericksonlopes merged 7 commits into
mainfrom
feat/duplication-tests-ux
Apr 8, 2026
Merged

feat: Implementation of Duplication Tests and Sidebar UX Polishing#60
ericksonlopes merged 7 commits into
mainfrom
feat/duplication-tests-ux

Conversation

@ericksonlopes

Copy link
Copy Markdown
Owner

Description

Implemented a comprehensive test suite for the chunk duplication feature, covering repository, service, and API layers. Additionally, improved the sidebar UX by enabling simple-toggle multi-selection, fixing indicator icon bugs, and adding a search-by-name field for subjects.

Tasks

  • Create SQL repository tests for chunk duplicates tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py
  • Create service tests for duplicate detection logic tests/infrastructure/services/test_chunk_duplicate_service.py
  • Create API router tests for duplicate endpoints tests/presentation/api/routes/test_duplicate_router.py
  • Update SidebarContext.tsx to enable simple toggle selection for multiple bases.
  • Fix Check icon bug in multi-selection in SidebarContext.tsx.
  • Add search filter field in SidebarContext.tsx.
  • Fix tests/conftest.py import path for infrastructure.

Additional Context

The sidebar changes eliminate the need for Ctrl+Click, making the multi-knowledge selection more discoverable. The search field ensures usability as the number of knowledge bases grows.

Copilot AI review requested due to automatic review settings April 8, 2026 15:11
@ericksonlopes ericksonlopes linked an issue Apr 8, 2026 that may be closed by this pull request
7 tasks
@codecov

codecov Bot commented Apr 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 59.84043% with 151 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.76%. Comparing base (b293786) to head (f6a50a9).
⚠️ Report is 8 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/application/workers.py 19.73% 56 Missing and 5 partials ⚠️
...infrastructure/services/chunk_duplicate_service.py 72.22% 15 Missing and 5 partials ⚠️
...ure/repositories/sql/chunk_duplicate_repository.py 75.38% 14 Missing and 2 partials ⚠️
src/infrastructure/extractors/youtube_extractor.py 27.77% 13 Missing ⚠️
...ructure/repositories/sql/chunk_index_repository.py 13.33% 13 Missing ⚠️
src/presentation/api/routes/duplicate_router.py 67.50% 10 Missing and 3 partials ⚠️
...ructure/repositories/sql/diarization_repository.py 14.28% 6 Missing ⚠️
...cation/use_cases/diarization_ingestion_use_case.py 92.72% 2 Missing and 2 partials ⚠️
src/presentation/api/routes/ingest_router.py 69.23% 2 Missing and 2 partials ⚠️
src/presentation/api/dependencies.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   80.86%   79.76%   -1.11%     
==========================================
  Files          86       89       +3     
  Lines        6737     7052     +315     
  Branches      777      821      +44     
==========================================
+ Hits         5448     5625     +177     
- Misses       1041     1164     +123     
- Partials      248      263      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds end-to-end duplicate-chunk detection (DB + service + API + UI), expands multi-context filtering UX, and refactors SQL connector imports to the new src.infrastructure.connectors.connector_sql module.

Changes:

  • Introduce chunk-duplicate persistence + service logic, API endpoints, and background worker integration (including Alembic migrations).
  • Add tests for duplicate repository/service/router; adjust test DB connector wiring.
  • Frontend: add “Duplicates” view, global multi-select context sidebar with search, and related navigation/i18n updates.

Reviewed changes

Copilot reviewed 60 out of 62 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
tmp/cleanup_db.py Scratch DB cleanup helper for duplicates/is_active.
tests/presentation/api/routes/test_duplicate_router.py API tests for duplicates endpoints via dependency overrides.
tests/infrastructure/services/test_voice_profile_service.py Whitespace-only cleanup in existing tests.
tests/infrastructure/services/test_chunk_duplicate_service.py Unit tests for duplicate service logic.
tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py SQL repo tests for duplicates CRUD/filtering.
tests/conftest.py Update test DB connector monkeypatch to new connector module.
test_playlist.py Adds a manual YouTube playlist script but named like a pytest test.
test_dispatcher.py Adds a manual dispatcher script but named like a pytest test.
src/presentation/api/schemas/duplicate_schemas.py New Pydantic schemas for duplicates API.
src/presentation/api/routes/settings_router.py Switch Connector import to new connector module.
src/presentation/api/routes/ingest_router.py Ensure background YouTube ingestions always have a job id + metadata.
src/presentation/api/routes/duplicate_router.py New REST routes for listing/updating duplicates + analysis + deactivate.
src/presentation/api/routes/audio_diarization_and_recognition_router.py Support multiple subject_id query params.
src/presentation/api/dependencies.py Wire duplicate repo/service dependencies; switch DBSessionFactory import.
src/infrastructure/services/chunk_duplicate_service.py New service for detecting/registering duplicates and deactivation.
src/infrastructure/repositories/sql/user_repository.py Switch Connector import to new connector module.
src/infrastructure/repositories/sql/models/voice_record.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/user.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/knowledge_subject.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/ingestion_job.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/diarization_record.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/content_source.py Switch Base import to new connector module.
src/infrastructure/repositories/sql/models/chunk_index.py Add is_active column; switch Base import to new connector module.
src/infrastructure/repositories/sql/models/chunk_duplicate.py New ORM model for chunk_duplicates.
src/infrastructure/repositories/sql/knowledge_subject_repository.py Switch Connector import to new connector module.
src/infrastructure/repositories/sql/ingestion_job_repository.py Switch Connector import to new connector module.
src/infrastructure/repositories/sql/diarization_repository.py Add multi-subject filtering; safer casts/UUID parsing changes.
src/infrastructure/repositories/sql/content_source_repository.py Switch Connector import to new connector module.
src/infrastructure/repositories/sql/chunk_index_repository.py Switch Connector import; add update_is_active.
src/infrastructure/repositories/sql/chunk_duplicate_repository.py New repository for duplicates table.
src/infrastructure/extractors/youtube_extractor.py Validate downloaded MP3 header/size after yt-dlp download.
src/infrastructure/connectors/connector_sql.py Minor cleanup; centralized SQLAlchemy engine/session/base.
src/infrastructure/connectors/init.py Package init for connectors module.
src/domain/entities/chunk_duplicate_entity.py New domain entity for duplicate groups.
src/application/workers.py Enqueue duplicate detection after ingestions; add duplicate worker; job-status improvements.
src/application/use_cases/diarization_ingestion_use_case.py Refactor into smaller helpers + rollback method; transcript formatting tweaks.
scripts/migrate_vector_db.py Switch DB session import to new connector module.
scripts/dump_database.py Switch engine import (currently incorrect path).
scripts/clear_sql_db.py Switch engine/Base import (currently incorrect path).
README.md Restructure/expand README and quick-start instructions.
main.py Register duplicates router + duplicate worker task.
frontend/src/types.ts Add duplicates view state + ChunkDuplicate type.
frontend/src/store/AppContext.tsx Add duplicates view option; remove auto-subject selection; persist empty selection.
frontend/src/services/api.ts Add duplicates API methods; allow multi subject_id for diarizations.
frontend/src/locales/pt-BR.json Add i18n for sidebar context + duplicates view + misc keys.
frontend/src/locales/en.json Add i18n for sidebar context + duplicates view + misc keys.
frontend/src/components/SubjectIcon.tsx Centralize subject icon mapping (lucide).
frontend/src/components/SidebarContext.tsx New multi-select context sidebar with search + “All means empty selection”.
frontend/src/components/Sidebar.tsx Add duplicates nav item; add expandable group behavior.
frontend/src/components/SearchView.tsx Add local contexts popover; refine search toolbar UX; icon updates.
frontend/src/components/LocalContextSelector.tsx Use shared SubjectIcon instead of emoji fallback.
frontend/src/components/KnowledgeAdminView.tsx Use shared SubjectIcon list instead of duplicating icons.
frontend/src/components/DuplicatesView.tsx New UI for reviewing duplicates + pagination + actions.
frontend/src/components/DiarizationView.tsx Use global sidebar; allow multi-subject filter in API call; small audio cleanup.
frontend/src/components/ContextSelector.tsx Show subject icons in selector UI.
frontend/src/components/AddSubjectModal.tsx Use shared SubjectIcon list instead of duplicating icons.
frontend/src/App.tsx Add duplicates route; add global right sidebar for data views; multi-subject filtering for sources.
docs/issues/issue-duplication-tests-ux.md Tracks the work items included in this PR.
alembic/versions/84524e052673_add_content_source_id_to_duplicates.py Migration to add content_source_id FK to duplicates.
alembic/versions/646a175ac845_add_chunk_duplicates_table_and_is_.py Migration to create duplicates table + chunk_index.is_active.
alembic/env.py Switch Base import; safer context.as_batch access in rewrite hooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/application/workers.py Outdated
Comment thread src/application/workers.py Outdated

# DiarizationRepository needs a DB session
from src.infrastructure.repositories.sql.connector import Session as DBSession
from infrastructure.connectors.connector_sql import Session as DBSession

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). There is no top-level infrastructure package in the repo; the module lives under src/infrastructure/..., so this will raise ModuleNotFoundError at runtime. Use from src.infrastructure.connectors.connector_sql import Session as DBSession here.

Suggested change
from infrastructure.connectors.connector_sql import Session as DBSession
from src.infrastructure.connectors.connector_sql import Session as DBSession

Copilot uses AI. Check for mistakes.
Comment thread src/application/workers.py Outdated

def _audio_diarization_subprocess(cmd_dict: dict):
"""Run audio diarization in a separate process to avoid torch/CUDA thread deadlocks."""
from infrastructure.connectors.connector_sql import (

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change
from infrastructure.connectors.connector_sql import (
from src.infrastructure.connectors.connector_sql import (

Copilot uses AI. Check for mistakes.
Comment thread src/application/workers.py Outdated
try:
from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor
from src.infrastructure.repositories.sql.connector import (
from infrastructure.connectors.connector_sql import (

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change
from infrastructure.connectors.connector_sql import (
from src.infrastructure.connectors.connector_sql import (

Copilot uses AI. Check for mistakes.
Comment thread src/application/workers.py Outdated
logger.error("Audio diarization subprocess exited with code %d", process.exitcode)
if cmd.diarization_id:
from src.infrastructure.repositories.sql.connector import (
from infrastructure.connectors.connector_sql import (

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import path is missing the src. prefix (from infrastructure.connectors.connector_sql ...). It should import from src.infrastructure.connectors.connector_sql to avoid ModuleNotFoundError.

Suggested change
from infrastructure.connectors.connector_sql import (
from src.infrastructure.connectors.connector_sql import (

Copilot uses AI. Check for mistakes.
Comment thread scripts/clear_sql_db.py
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from src.infrastructure.repositories.sql.connector import engine, Base
from infrastructure.connectors.connector_sql import engine, Base

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import path is missing the src. prefix (from infrastructure.connectors.connector_sql import engine, Base). The SQL connector module lives under src.infrastructure..., so this will raise ModuleNotFoundError. Import from src.infrastructure.connectors.connector_sql instead.

Suggested change
from infrastructure.connectors.connector_sql import engine, Base
from src.infrastructure.connectors.connector_sql import engine, Base

Copilot uses AI. Check for mistakes.
>
<span className="flex items-center gap-2 font-medium">
<Database className="w-3.5 h-3.5" />
{t('sidebar.contexts.title')} • All

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contexts popover uses a hardcoded "• All" suffix ({t('sidebar.contexts.title')} • All). Since sidebar.contexts.title_all / common.all translations exist, this should be pulled from i18n to avoid mixed-language UI and duplicated strings.

Suggested change
{t('sidebar.contexts.title')} All
{t('sidebar.contexts.title')} {t('common.all')}

Copilot uses AI. Check for mistakes.
Comment on lines +36 to +43
# Enrich entities with chunk content if needed for UI
results = []
for entity in entities:
resp = ChunkDuplicateResponse.model_validate(entity)
chunks_info = []
for cid in entity.chunk_ids:
chunk = chunk_service.get_by_id(cid)
if chunk:

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_duplicates enriches each duplicate group by calling chunk_service.get_by_id(...) inside nested loops. This creates an N+1 query pattern (up to limit * len(chunk_ids) SQL calls) and can become slow as duplicates grow. Prefer fetching chunks in bulk (e.g., repository WHERE id IN (...)) and mapping them in-memory before building the response.

Copilot uses AI. Check for mistakes.
Comment thread test_playlist.py
Comment on lines +1 to +25
from src.infrastructure.extractors.youtube_extractor import YoutubeExtractor


def test_playlist():
# Use the playlist provided by the user
playlist_url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59"

print(f"Testing playlist extraction for: {playlist_url}")
extractor = YoutubeExtractor(language="pt")

try:
videos = extractor.extract_playlist_videos(playlist_url)
print(f"Extracted {len(videos)} videos.")
for i, url in enumerate(videos[:5]):
print(f" {i+1}: {url}")

if not videos:
print("FAILED: No videos extracted.")
else:
print("SUCCESS: Videos extracted.")
except Exception as e:
print(f"ERROR: {e}")

if __name__ == "__main__":
test_playlist()

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is named like a pytest test module (test_*.py) but appears to be an ad-hoc/manual script (network-dependent behavior, print-only checks, __main__ runner). Pytest will collect and execute it in CI, which is likely to be flaky/offline-failing. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or mark the module/tests as skipped.

Copilot uses AI. Check for mistakes.
Comment thread test_dispatcher.py
Comment on lines +1 to +60
import os
import sys
from unittest.mock import MagicMock

# Add the project root to sys.path
sys.path.append(os.getcwd())

# Mock out dependencies before importing workers
sys.modules['src.presentation.api.dependencies'] = MagicMock()
mock_job_service = MagicMock()

class MockContext:
def __init__(self, job_svc):
self.job_service = job_svc

import src.presentation.api.dependencies as deps # noqa: E402

deps.resolve_ingestion_context = MagicMock(return_value=MockContext(mock_job_service))

from src.application.dtos.commands.ingest_youtube_command import IngestYoutubeCommand # noqa: E402
from src.application.dtos.enums.youtube_data_type import YoutubeDataType # noqa: E402
from src.application.workers import run_youtube_dispatcher_worker # noqa: E402


def test_dispatcher():
# Setup mock app state
mock_app = MagicMock()
mock_app.state.task_queue = MagicMock()

# Patch _get_app to return our mock
import src.application.workers as workers # noqa: E402
workers._get_app = MagicMock(return_value=mock_app)

# Create command for a small playlist (or the user's one)
# Using a known small playlist to speed up test if possible
# But user's URL is fine since we are testing logic
url = "https://www.youtube.com/watch?v=dlQG02mwTD0&list=PLG47XsLEf0LdvYtX_zU7E_y_C1TgjgU59"
cmd = IngestYoutubeCommand(
video_url=url,
data_type=YoutubeDataType.PLAYLIST,
ingestion_job_id="test-job-uuid",
subject_id="test-subject"
)

print(f"Testing dispatcher with URL: {url}")
run_youtube_dispatcher_worker(cmd)

# Assertions
print("\nVerifying Job Service calls:")
for call in mock_job_service.update_job_status.call_args_list:
print(f" - Status update: {call[1]}")

print("\nVerifying Task Queue calls:")
enqueue_calls = mock_app.state.task_queue.enqueue.call_args_list
print(f" - Tasks enqueued: {len(enqueue_calls)}")
if len(enqueue_calls) > 0:
print(f" - First task URL: {enqueue_calls[0][0][1].video_url}")

if __name__ == "__main__":
test_dispatcher()

Copilot AI Apr 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is named like a pytest test module (test_dispatcher.py) but is a manual script (prints, __main__ runner, relies on worker internals). It will be collected by pytest in CI and can cause flaky failures. Move it under scripts/ / tmp/, rename it so pytest won’t collect it, or explicitly skip it.

Copilot uses AI. Check for mistakes.
@ericksonlopes ericksonlopes added sql SQL queries, stored procedures, and database query optimization issues backend Backend services, API development, and server-side logic related issues feature New feature or request frontend Frontend UI/UX, web interface, and client-side development issues labels Apr 8, 2026
@ericksonlopes ericksonlopes merged commit c8119ae into main Apr 8, 2026
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Backend services, API development, and server-side logic related issues feature New feature or request frontend Frontend UI/UX, web interface, and client-side development issues sql SQL queries, stored procedures, and database query optimization issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Implementation of Duplication Tests and Knowledge Context UX Enhancements

2 participants