ericksonlopes · Apr 7, 2026 · Apr 8, 2026
diff --git a/README.md b/README.md
@@ -2,115 +2,129 @@
 
 # WhatYouSaid
 
-[![codecov](https://codecov.io/github/ericksonlopes/WhatYouSaid/branch/main/graph/badge.svg?token=8CZJARVJUE)](https://codecov.io/github/ericksonlopes/WhatYouSaid)
+## The Vectorized Intelligence & Diarization Hub
+
 
+[![codecov](https://codecov.io/github/ericksonlopes/WhatYouSaid/branch/main/graph/badge.svg?token=8CZJARVJUE)](https://codecov.io/github/ericksonlopes/WhatYouSaid)
 [![Tests](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/tests.yml)
 [![Code Quality](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/code-quality.yml/badge.svg?branch=main)](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/code-quality.yml)
 [![Security](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/security.yml/badge.svg?branch=main)](https://github.com/ericksonlopes/WhatYouSaid/actions/workflows/security.yml)
 
 ![Python](https://img.shields.io/badge/-Python-3776AB?&logo=Python&logoColor=FFFFFF)
 ![React](https://img.shields.io/badge/-React-61DAFB?&logo=React&logoColor=000000)
-![Pytest](https://img.shields.io/badge/-Pytest-0A9EDC?&logo=Pytest&logoColor=FFFFFF)
-![GitHub Actions](https://img.shields.io/badge/-GitHub%20Actions-2088FF?&logo=GitHub%20Actions&logoColor=FFFFFF)
+![FastAPI](https://img.shields.io/badge/-FastAPI-05998B?&logo=FastAPI&logoColor=FFFFFF)
+![Redis](https://img.shields.io/badge/-Redis-DC382D?&logo=Redis&logoColor=FFFFFF)
+![Postgres](https://img.shields.io/badge/-PostgreSQL-4169E1?&logo=PostgreSQL&logoColor=FFFFFF)
 
 </div>
 
-WhatYouSaid is a vectorized data hub designed to explore any topic or knowledge domain. It extracts, processes, and indexes content from YouTube videos, local files, and remote URLs to enable advanced semantic search and Retrieval-Augmented Generation (RAG) workflows.
-
-This repository provides modular extractors, robust splitting utilities, and a scalable background processing pipeline to build searchable knowledge bases efficiently.
+**WhatYouSaid** is a vectorized data hub designed to explore any knowledge domain. It transforms unstructured audio, video, files, and web content into structured, searchable data using **Speaker Diarization**, **Voice Recognition**, and **RAG** (Retrieval-Augmented Generation).
 
 ---
 
-## 📚 Documentation
+## ✨ Features
 
-Detailed guides for specific topics:
+### 🎧 Diarization & Voice Intelligence
 
-- 🐳 **[Docker Deployment Guide](docs/docker-deployment.md)**: Learn how to use Docker Profiles to run different combinations of databases (MySQL, Postgres, SQLite) and vector stores (FAISS, Weaviate).
+- **Speaker Segmentation**: Automatically split audio/video files by speaker using WhisperX/Whisper.
+- **Voice Recognition**: Identify and label speakers across your entire knowledge base using trained voice profiles.
+- **Diarization Pipeline**: Interactive dashboard to review, edit, and finalize transcripts and speaker assignments before indexing.
 
----
 
-## 🚀 Features
+### 📥 Multi-Source Ingestion
+
+- **YouTube Ecosystem**: Supports individual videos, full playlists, and entire channels.
+- **Document Extractors**: High-fidelity extraction from PDF, DOCX, and TXT files.
+- **Web Intelligence**: Powerful scraping via **Crawl4AI** and **Docling** for websites and remote URLs.
+- **Robust Pipeline**: Step-by-step progress tracking with real-time SSE notifications and full rollback support on failure.
+
+### 🔍 Advanced Semantic Search
 
-- **Multi-source Extraction**: Ingest data from YouTube (transcripts), local files (PDF, DOCX, TXT), **remote URLs** via Docling, and **Websites** via Crawl4AI.
-- **Robust Fallbacks**: Integrated `PlainTextExtractor` ensuring successful ingestion even for formats not supported by specialized extractors.
-- **Async Task Queue**: High-performance background processing powered by **Redis**, ensuring responsive workflows.
-- **Structured Logging & Tracing**: Centralized logging equipped with contextvars and request tracing (Correlation IDs) for end-to-end observability.
-- **Real-time Updates**: Live ingestion status and progress monitoring via a **Redis Event Bus** (SSE-ready).
-- **Advanced Search**: Semantic, keyword (BM25), and **Hybrid Search** with cross-encoder re-ranking for maximum precision.
-- **Pluggable Vector Stores**: Support for **FAISS** (local), **ChromaDB**, **Weaviate** (scalable), and **Qdrant**.
-- **Pluggable SQL Databases**: Support for **SQLite**, **PostgreSQL**, **MySQL**, **MariaDB**, and **MSSQL**.
-- **Modern Dashboard**: A clean React + Tailwind CSS frontend for managing knowledge subjects, content sources, and monitoring background tasks.
+- **Hybrid Search**: Combines vector (FAISS/Weaviate/Chroma) and keyword (BM25) search to improve precision.
+- **Re-Ranking**: Cross-encoder re-ranking moves the most relevant context toward the top of the results.
+- **Pluggable Architecture**: Seamlessly switch between SQL databases (SQLite/Postgres/MySQL) and Vector stores.
 
 ---
 
-## 🛠️ Infrastructure & Deployment
+## 🚀 Quick Start
 
-WhatYouSaid is designed to be flexible, from a lightweight local setup to a scalable production-ready environment.
+WhatYouSaid runs on **Python 3.12** and uses **uv** for dependency management.
 
-### 1. Storage & Messaging Options
+### 1. Prerequisites
 
-| Component | Lightweight (Local) | Scalable / Production |
-| :--- | :--- | :--- |
-| **Relational Database** | **SQLite** (Default, file-based) | **PostgreSQL**, **MySQL**, **MariaDB**, **MSSQL** |
-| **Vector Store** | **FAISS** (Local, file-based) | **Weaviate** (Container or Cloud), **ChromaDB** |
-| **Task Queue & Bus** | **In-memory** (Limited) | **Redis** (Default in Docker) |
+- [uv](https://github.com/astral-sh/uv) (Recommended) or `pip`
+- [Docker](https://www.docker.com/)
+
+### 2. Environment Setup
 
-### 2. Docker Compose Profiles & Dependencies
+```bash
+# Clone the repository
+git clone https://github.com/ericksonlopes/WhatYouSaid.git
+cd WhatYouSaid
 
-We use **Docker Profiles** to keep the environment lean. Only the services you need are started. The project also natively supports both **CPU** and **GPU** environments via optional Python dependencies.
+# Install dependencies (including dev groups)
+uv sync --group dev
+```
 
-> 📘 **Detailed Guide**: For a step-by-step tutorial on different deployment scenarios, see our [Docker Deployment Guide](docs/docker-deployment.md).
+### 3. Spin Up Infrastructure
 
-#### **Scenario A: Lite (Default)**
-Uses **SQLite**, **FAISS**, and **Redis**.
 ```bash
+# Lite mode: SQLite + FAISS + Redis
 docker-compose up -d
+
+# Scalable mode: PostgreSQL + Weaviate + Redis (set SQL__TYPE=postgres and VECTOR__STORE_TYPE=weaviate in .env)
+docker-compose --profile base up -d
 ```
 
-#### **Scenario B: Scalable (Base)**
-Starts **PostgreSQL**, **Weaviate**, and **Redis**.
+### 4. Run Application
+
 ```bash
-docker-compose --profile base up -d
-# Note: Set SQL__TYPE=postgres and VECTOR__STORE_TYPE=weaviate in .env
+# Start Backend (FastAPI)
+python main.py
+
+# Start Frontend (React)
+cd frontend
+npm install
+npm run dev
 ```
 
 ---
 
-## 🏗️ Architecture
+## 🐳 Deployment Profiles
 
-The system follows a clean architecture approach, ensuring separation of concerns:
+We use **Docker Profiles** to keep your environment lean. Only the services you need are started.
 
-- **Application Layer**: Contains use cases (e.g., `FileIngestionUseCase`, `SearchUseCase`) and a `ServiceRegistry` for background worker dependency resolution.
-- **Infrastructure Layer**:
-  - `extractors/`: Fetch raw content (Docling, YouTube, PlainText).
-  - `repositories/`: Data persistence (SQLAlchemy for relational, specialized clients for Vector Stores).
-  - `services/`: Core logic (text splitting, embedding, re-ranking, Redis task queue).
-- **Presentation Layer**: FastAPI-based REST API with real-time SSE notifications.
+| Component | Lite Profile (Default) | Scalable Profile (`base`) |
+| :--- | :--- | :--- |
+| **Relational DB** | SQLite (File-based) | PostgreSQL / MySQL / MariaDB |
+| **Vector Store** | FAISS (Local) | Weaviate / ChromaDB / Qdrant |
+| **Task Queue** | Redis | Redis (Production-ready) |
 
----
+> [!TIP]
+> Use the **Scalable** profile if you require high-concurrency access or plan to manage multi-gigabyte vector indexes.
 
-## 🧪 Quality & Testing
+---
 
-We maintain a high standard of code quality and test coverage:
+## 🏗️ Clean Architecture
 
-- **417+ Automated Tests**: Covering unit, integration, and complex edge cases.
-- **93% Code Coverage**: Verified via `pytest-cov`.
-- **Strict Linting**: Powered by `ruff` for code style and `mypy` for static type checking.
-- **Security Scanning**: Integrated `bandit` scans for vulnerability detection.
+The system follows a layered, modular design that separates concerns and keeps each layer testable on its own:
 
-**Run tests locally:**
-```bash
-uv run pytest
-```
+- **Application Layer**: Orchestrates logic via use cases and resolves background worker dependencies through a `ServiceRegistry`.
+- **Infrastructure Layer**:
+  - `extractors/`: Fetch raw content from specialized sources (Docling, YouTube, Crawl4AI).
+  - `repositories/`: Persistence via SQL (SQLAlchemy) and specialized Vector clients.
+  - `services/`: Core providers for embeddings, text splitting, and re-ranking.
+- **Presentation Layer**: FastAPI-based REST API with real-time event broadcasting and a modern React dashboard.
 
 ---
 
-## 🤝 Contributing
+## 🤝 Contributing & Quality
+
+Contributions are welcome. Please:
 
-Contributions are welcome. Please:
-- Open an issue to discuss major changes.
-- Add tests for any new feature or bug fix.
-- Ensure `ruff check .` and `mypy .` pass before submitting.
+1. Open an **Issue** to discuss proposed changes.
+2. Ensure `uv run ruff check . --fix` and `uv run mypy .` pass.
+3. Run all tests: `uv run pytest`.
 
 ---
 
@@ -119,8 +133,9 @@ Contributions are welcome. Please:
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
 <div align="center">
-    <p>Made with ❤️ by Erickson Lopes </p>
 
-[![LinkedIn](https://img.shields.io/badge/LinkedIn-Erickson_Lopes-blue)](https://www.linkedin.com/in/ericksonlopes/)
+Hand-crafted with ❤️ by **Erickson Lopes**
+
+[![LinkedIn](https://img.shields.io/badge/LinkedIn-Erickson_Lopes-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/ericksonlopes/)
 
 </div>
diff --git a/alembic/env.py b/alembic/env.py
@@ -10,7 +10,7 @@
 
 from alembic import context
 from src.config.settings import settings
-from src.infrastructure.repositories.sql.connector import Base
+from src.infrastructure.connectors.connector_sql import Base
 
 _package_name = "src.infrastructure.repositories.sql.models"
 
@@ -58,15 +58,15 @@ def include_object(obj, name, type_, reflected, compare_to):
 @writer.rewrites(ops.CreateTableOp)
 @writer.rewrites(ops.CreateIndexOp)
 def add_if_not_exists(context, revision, op):
-    if not context.as_batch:
+    if not getattr(context, "as_batch", False):
         op.if_not_exists = True
     return op
 
 
 @writer.rewrites(ops.DropTableOp)
 @writer.rewrites(ops.DropIndexOp)
 def add_if_exists(context, revision, op):
-    if not context.as_batch:
+    if not getattr(context, "as_batch", False):
         op.if_exists = True
     return op
 

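Review note: the switch to `getattr(context, "as_batch", False)` means a context object without an `as_batch` attribute is treated as non-batch instead of raising `AttributeError`. A minimal sketch of that behavior follows, assuming plain `SimpleNamespace` stand-ins; these are hypothetical objects, not Alembic's real context or operation classes, and the hook signature is simplified.

```python
from types import SimpleNamespace


def add_if_not_exists(context, op):
    # Same guard as the patched hook: a missing attribute is read as "not batch".
    if not getattr(context, "as_batch", False):
        op.if_not_exists = True
    return op


# Hypothetical stand-ins for the rewriter's context argument.
plain_context = SimpleNamespace()              # has no `as_batch` attribute at all
batch_context = SimpleNamespace(as_batch=True)

plain_op = add_if_not_exists(plain_context, SimpleNamespace())
batch_op = add_if_not_exists(batch_context, SimpleNamespace())
```

With the previous `context.as_batch` access, the first call would have raised `AttributeError`; with the default it sets the flag, and the batch case is left untouched.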
diff --git a/alembic/versions/646a175ac845_add_chunk_duplicates_table_and_is_.py b/alembic/versions/646a175ac845_add_chunk_duplicates_table_and_is_.py
@@ -0,0 +1,42 @@
+"""Add chunk duplicates table and is_active flag
+
+Revision ID: 646a175ac845
+Revises: b2c3d4e5f6a7
+Create Date: 2026-04-08 09:56:58.625813
+
+"""
+from typing import Sequence, Union
+
+import sqlalchemy as sa
+
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision: str = '646a175ac845'
+down_revision: Union[str, Sequence[str], None] = 'b2c3d4e5f6a7'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    """Upgrade schema."""
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.create_table('chunk_duplicates',
+        sa.Column('id', sa.UUID(), nullable=False),
+        sa.Column('chunk_ids', sa.JSON(), nullable=False),
+        sa.Column('similarity', sa.Float(), nullable=False),
+        sa.Column('status', sa.Text(), nullable=False),
+        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.text('(CURRENT_TIMESTAMP)'), nullable=False),
+        sa.Column('updated_at', sa.DateTime(timezone=True), server_default=sa.text('(CURRENT_TIMESTAMP)'), nullable=False),
+        sa.PrimaryKeyConstraint('id')
+    )
+    op.add_column('chunk_index', sa.Column('is_active', sa.Boolean(), server_default=sa.true(), nullable=False))
+    # ### end Alembic commands ###
+
+
+def downgrade() -> None:
+    """Downgrade schema."""
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.drop_column('chunk_index', 'is_active')
+    op.drop_table('chunk_duplicates')
+    # ### end Alembic commands ###
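Review note: `is_active` is added as a NOT NULL column with a server-side default, so rows that already exist in `chunk_index` are backfilled as active rather than failing the constraint. The sketch below illustrates this with the standard-library `sqlite3` module, assuming a one-column stand-in for `chunk_index` (the real table has more columns) and hand-written SQL rather than the DDL Alembic would emit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in schema; the real chunk_index table is not shown in this diff.
conn.execute("CREATE TABLE chunk_index (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO chunk_index (id) VALUES ('existing-chunk')")

# Equivalent in spirit to the migration: NOT NULL column with a truthy default.
conn.execute(
    "ALTER TABLE chunk_index ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT 1"
)

# Rows inserted after the migration pick up the same default.
conn.execute("INSERT INTO chunk_index (id) VALUES ('new-chunk')")

rows = conn.execute(
    "SELECT id, is_active FROM chunk_index ORDER BY id"
).fetchall()
```

Both the pre-existing row and the new row end up with `is_active = 1`. On other backends the literal used for the default matters; PostgreSQL, for example, expects a boolean literal rather than an integer for a boolean column.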
diff --git a/alembic/versions/84524e052673_add_content_source_id_to_duplicates.py b/alembic/versions/84524e052673_add_content_source_id_to_duplicates.py
@@ -0,0 +1,32 @@
+"""add_content_source_id_to_duplicates
+
+Revision ID: 84524e052673
+Revises: 646a175ac845
+Create Date: 2026-04-08 10:50:39.027257
+
+"""
+from typing import Sequence, Union
+
+import sqlalchemy as sa
+
+from alembic import op
+
+# revision identifiers, used by Alembic.
+revision: str = '84524e052673'
+down_revision: Union[str, Sequence[str], None] = '646a175ac845'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    """Upgrade schema."""
+    with op.batch_alter_table('chunk_duplicates', schema=None) as batch_op:
+        batch_op.add_column(sa.Column('content_source_id', sa.UUID(), nullable=True))
+        batch_op.create_foreign_key('fk_chunk_duplicates_content_source_id_content_sources', 'content_sources', ['content_source_id'], ['id'], initially='IMMEDIATE', deferrable=True)
+
+
+def downgrade() -> None:
+    """Downgrade schema."""
+    with op.batch_alter_table('chunk_duplicates', schema=None) as batch_op:
+        batch_op.drop_constraint('fk_chunk_duplicates_content_source_id_content_sources', type_='foreignkey')
+        batch_op.drop_column('content_source_id')
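Review note: this migration uses `op.batch_alter_table` because SQLite cannot add a foreign key constraint to an existing table with `ALTER TABLE`. On SQLite, Alembic's batch mode works around this with a "move and copy" strategy: build a table with the new shape, copy the rows, drop the old table, and rename. The following is a simplified sketch of that idea using `sqlite3` directly. The reduced column set and the `_tmp_chunk_duplicates` name are assumptions for illustration, and the statements Alembic actually generates will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript(
    """
    CREATE TABLE content_sources (id TEXT PRIMARY KEY);
    CREATE TABLE chunk_duplicates (id TEXT PRIMARY KEY, similarity REAL NOT NULL);
    INSERT INTO chunk_duplicates VALUES ('dup-1', 0.97);

    -- Move and copy: create the new shape, copy rows, then swap names.
    CREATE TABLE _tmp_chunk_duplicates (
        id TEXT PRIMARY KEY,
        similarity REAL NOT NULL,
        content_source_id TEXT,
        CONSTRAINT fk_chunk_duplicates_content_source_id_content_sources
            FOREIGN KEY (content_source_id) REFERENCES content_sources (id)
            DEFERRABLE INITIALLY IMMEDIATE
    );
    INSERT INTO _tmp_chunk_duplicates (id, similarity)
        SELECT id, similarity FROM chunk_duplicates;
    DROP TABLE chunk_duplicates;
    ALTER TABLE _tmp_chunk_duplicates RENAME TO chunk_duplicates;
    """
)

rows = conn.execute(
    "SELECT id, similarity, content_source_id FROM chunk_duplicates"
).fetchall()

# The new constraint rejects references to a content source that does not exist.
try:
    conn.execute("INSERT INTO chunk_duplicates VALUES ('dup-2', 0.9, 'missing')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

Existing rows survive the copy with `content_source_id` left as NULL, which matches the column being declared `nullable=True` in the migration.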
diff --git a/docs/issues/issue-duplication-tests-ux.md b/docs/issues/issue-duplication-tests-ux.md
@@ -0,0 +1,14 @@
+## Description
+Implemented a comprehensive test suite for the chunk duplication feature, covering repository, service, and API layers. Additionally, improved the sidebar UX by enabling simple-toggle multi-selection, fixing indicator icon bugs, and adding a search-by-name field for subjects.
+
+## Tasks
+- [x] Create SQL repository tests for chunk duplicates in `tests/infrastructure/repositories/sql/test_chunk_duplicate_repository.py`.
+- [x] Create service tests for duplicate detection logic in `tests/infrastructure/services/test_chunk_duplicate_service.py`.
+- [x] Create API router tests for duplicate endpoints in `tests/presentation/api/routes/test_duplicate_router.py`.
+- [x] Update `SidebarContext.tsx` to enable simple toggle selection for multiple knowledge bases.
+- [x] Fix Check icon bug in multi-selection in `SidebarContext.tsx`.
+- [x] Add search filter field in `SidebarContext.tsx`.
+- [x] Fix `tests/conftest.py` import path for infrastructure.
+
+## Additional Context
+The sidebar changes eliminate the need for Ctrl+Click, making multi-knowledge-base selection more discoverable. The search field keeps the sidebar usable as the number of knowledge bases grows.
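Review note: the duplicate detection logic itself is not part of this diff, so the following is only a sketch of how records shaped like the `chunk_duplicates` columns (`chunk_ids`, `similarity`, `status`) could be produced. It assumes pairwise cosine similarity over chunk embeddings; the function names, the `0.95` threshold, and the `"pending"` status value are all hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(embeddings, threshold=0.95):
    """Return one record per chunk pair whose similarity reaches the threshold."""
    records = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            records.append(
                {
                    "chunk_ids": [id_a, id_b],  # mirrors the JSON `chunk_ids` column
                    "similarity": score,        # mirrors the Float `similarity` column
                    "status": "pending",        # hypothetical initial status
                }
            )
    return records


# Toy embeddings: chunk-a and chunk-b point in nearly the same direction.
embeddings = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.99, 0.05],
    "chunk-c": [0.0, 1.0],
}
duplicates = find_duplicates(embeddings)
```

Comparing every pair is quadratic in the number of chunks, so a real service would more likely query the vector store for nearest neighbors; the sketch only shows the shape of the output records.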