Feature: Persistent Raw Data Storage for Ingested Content #42

Description

@ericksonlopes

Currently, WhatYouSaid processes ingested content (e.g., YouTube transcripts, text) and stores it directly as vector embeddings and SQL metadata. However, the original "raw" data (the full transcript or raw text) is not persisted in its original form.

This feature aims to introduce a Raw Storage Layer to save the original content before or during the ingestion process. This is crucial for auditing, re-indexing with different chunking strategies without re-fetching from the source, and providing a "source of truth" for the RAG pipeline. The initial MVP will focus on Local File Storage, with an architecture designed to support Cloud Blobs or Google Drive in the future.

Tasks

  • Domain Update: Add raw_storage_path or raw_content_url to ContentSourceEntity in src/domain/entities/content_source_entity.py.
  • Interface Definition: Create a new interface IRawStorageService in src/domain/interfaces/services/ with methods like save(content: str, filename: str) and get(filename: str).
  • Local Implementation: Implement LocalRawStorageService in src/infrastructure/services/ that saves files to a configurable directory (e.g., data/raw_storage/).
  • Use Case Integration: Update IngestYoutubeUseCase in src/application/use_cases/ingest_youtube_use_case.py to call the storage service as soon as the transcript is fetched.
  • Configuration: Add RAW_STORAGE_TYPE (default: local) and RAW_STORAGE_PATH to src/config/settings.py.
  • Frontend: Update SourcesTable.tsx or a new "Raw View" to allow users to download or view the raw content associated with a source.
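
The interface and local implementation tasks above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The return value of save (a path string that could populate raw_storage_path), the UTF-8 encoding, and the path-traversal check are assumptions not stated in the issue.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class IRawStorageService(ABC):
    """Port for persisting the raw form of ingested content."""

    @abstractmethod
    def save(self, content: str, filename: str) -> str:
        """Persist content and return a locator (assumed: a path string)."""

    @abstractmethod
    def get(self, filename: str) -> str:
        """Return the content previously saved under filename."""


class LocalRawStorageService(IRawStorageService):
    """Stores each raw document as a UTF-8 text file in one directory."""

    def __init__(self, base_dir: str = "data/raw_storage") -> None:
        self._base_dir = Path(base_dir).resolve()
        # Create the storage directory on first use.
        self._base_dir.mkdir(parents=True, exist_ok=True)

    def _resolve(self, filename: str) -> Path:
        # Assumption: reject names such as "../x" that escape the base directory.
        path = (self._base_dir / filename).resolve()
        if not path.is_relative_to(self._base_dir):  # Python 3.9+
            raise ValueError(f"filename escapes storage directory: {filename!r}")
        return path

    def save(self, content: str, filename: str) -> str:
        path = self._resolve(filename)
        path.write_text(content, encoding="utf-8")
        return str(path)

    def get(self, filename: str) -> str:
        return self._resolve(filename).read_text(encoding="utf-8")
```

In IngestYoutubeUseCase, the service would presumably be injected through the constructor and called right after the transcript is fetched, with the returned path stored on the ContentSourceEntity.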

Additional Context

  • MVP Goal: A working local storage folder whose file names mirror the ContentSource ID or External ID.
  • Future-Proofing: The IRawStorageService should be generic enough to allow implementations for AWS S3, Azure Blob, or Google Drive via a simple configuration switch, similar to how we handle VectorStoreType.
  • Naming Convention: Files should probably be named after the ContentSource UUID to avoid collisions; a slug derived from the title is more readable but is not guaranteed to be unique.
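
The configuration switch and naming convention above might look something like the sketch below. All names here (RawStorage, LocalFileStorage, raw_filename, create_raw_storage, the _BACKENDS registry) are hypothetical and chosen for illustration. How VectorStoreType is actually wired in the project is not shown in this issue, so the registry approach is an assumption.

```python
import uuid
from pathlib import Path
from typing import Callable, Dict, Protocol


class RawStorage(Protocol):
    """Structural type that any backend (local, S3, Drive, ...) would satisfy."""

    def save(self, content: str, filename: str) -> str: ...
    def get(self, filename: str) -> str: ...


class LocalFileStorage:
    """Deliberately small local backend, just enough to exercise the factory."""

    def __init__(self, base_dir: str) -> None:
        self._base_dir = Path(base_dir)
        self._base_dir.mkdir(parents=True, exist_ok=True)

    def save(self, content: str, filename: str) -> str:
        path = self._base_dir / filename
        path.write_text(content, encoding="utf-8")
        return str(path)

    def get(self, filename: str) -> str:
        return (self._base_dir / filename).read_text(encoding="utf-8")


def raw_filename(content_source_id: uuid.UUID, extension: str = "txt") -> str:
    """Name the file after the source UUID so two sources never collide."""
    return f"{content_source_id}.{extension}"


# Registry keyed by the RAW_STORAGE_TYPE value; future backends register here.
_BACKENDS: Dict[str, Callable[[str], RawStorage]] = {
    "local": LocalFileStorage,
}


def create_raw_storage(storage_type: str = "local",
                       storage_path: str = "data/raw_storage") -> RawStorage:
    """Pick a backend from configuration (RAW_STORAGE_TYPE / RAW_STORAGE_PATH)."""
    try:
        factory = _BACKENDS[storage_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported RAW_STORAGE_TYPE: {storage_type!r}") from None
    return factory(storage_path)
```

With this shape, adding a cloud backend would mean registering one more entry in the registry, while the use cases keep depending only on the save and get methods.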

Metadata

Labels

backend: Backend services, API development, and server-side logic related issues
