Feature: Persistent Raw Data Storage for Ingested Content #42

Currently, `WhatYouSaid` processes ingested content (e.g., YouTube transcripts, text) and stores it only as vector embeddings and SQL metadata. The original "raw" data (the full transcript or raw text) is not persisted anywhere in its original form.

This feature introduces a **Raw Storage Layer** that saves the original content before or during the ingestion process. This is crucial for auditing, for re-indexing with different chunking strategies without re-fetching from the source, and for providing a "source of truth" for the RAG pipeline. The initial MVP will focus on **Local File Storage**, with an architecture designed to support **Cloud Blobs** or **Google Drive** in the future.

## Tasks
- [ ] **Domain Update**: Add `raw_storage_path` or `raw_content_url` to `ContentSourceEntity` in `src/domain/entities/content_source_entity.py`.
- [ ] **Interface Definition**: Create a new interface `IRawStorageService` in `src/domain/interfaces/services/` with methods like `save(content: str, filename: str)` and `get(filename: str)`.
- [ ] **Local Implementation**: Implement `LocalRawStorageService` in `src/infrastructure/services/` that saves files to a configurable directory (e.g., `data/raw_storage/`).
- [ ] **Use Case Integration**: Update `IngestYoutubeUseCase` in `src/application/use_cases/ingest_youtube_use_case.py` to call the storage service as soon as the transcript is fetched.
- [ ] **Configuration**: Add `RAW_STORAGE_TYPE` (default: `local`) and `RAW_STORAGE_PATH` to `src/config/settings.py`.
- [ ] **Frontend**: Update `SourcesTable.tsx`, or add a new "Raw View", to allow users to download or view the raw content associated with a source.
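
The interface and local implementation tasks above could be sketched roughly as follows. This is a minimal sketch, not a settled design: the return value of `save` (a location string suitable for `raw_storage_path`), the UTF-8 encoding, and the path-escape check are assumptions that are not specified in this issue.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class IRawStorageService(ABC):
    """Port for persisting raw ingested content (signatures are a proposal)."""

    @abstractmethod
    def save(self, content: str, filename: str) -> str:
        """Persist content and return a location usable as raw_storage_path."""

    @abstractmethod
    def get(self, filename: str) -> str:
        """Return the previously saved content for filename."""


class LocalRawStorageService(IRawStorageService):
    """Stores each raw document as a UTF-8 text file under base_dir."""

    def __init__(self, base_dir: str = "data/raw_storage") -> None:
        self._base_dir = Path(base_dir).resolve()
        self._base_dir.mkdir(parents=True, exist_ok=True)

    def _resolve(self, filename: str) -> Path:
        # Assumption: reject names that would escape the storage directory.
        path = (self._base_dir / filename).resolve()
        if not path.is_relative_to(self._base_dir):  # Python 3.9+
            raise ValueError(f"filename escapes storage directory: {filename!r}")
        return path

    def save(self, content: str, filename: str) -> str:
        path = self._resolve(filename)
        path.write_text(content, encoding="utf-8")
        return str(path)

    def get(self, filename: str) -> str:
        # Raises FileNotFoundError if nothing was saved under this name.
        return self._resolve(filename).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Usage: save a transcript right after fetching it, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        storage = LocalRawStorageService(tmp)
        location = storage.save("full transcript text", "example-source.txt")
        print(location)
        print(storage.get("example-source.txt"))
```

In `IngestYoutubeUseCase`, the value returned by `save` would be the natural candidate for the new `raw_storage_path` field, so that the entity points at the stored file regardless of which backend wrote it.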

## Additional Context
- **MVP Goal**: A working local storage folder in which each stored file is named after the `ContentSource` ID or External ID of the source it belongs to.
- **Future-Proofing**: The `IRawStorageService` should be generic enough to allow implementations for AWS S3, Azure Blob, or Google Drive via a simple configuration switch, similar to how we handle `VectorStoreType`.
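- **Naming Convention**: Files should probably be named using the `ContentSource` UUID to avoid collisions. A slug derived from the title is more readable, but two sources with the same title would produce the same slug, so a slug should be combined with the UUID rather than used alone.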
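
To make the configuration switch and the naming convention concrete, here is a hedged sketch. `RawStorageType`, `RawStorageSettings`, `slugify`, and `raw_filename` are hypothetical names introduced for illustration, the `.txt` extension and the `data/raw_storage` default are assumptions, and the real `src/config/settings.py` may load settings differently.

```python
import os
import re
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class RawStorageType(str, Enum):
    """Hypothetical enum, analogous in spirit to VectorStoreType."""

    LOCAL = "local"
    # Future backends (S3, Azure Blob, Google Drive) would be added here.


@dataclass(frozen=True)
class RawStorageSettings:
    storage_type: RawStorageType
    path: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "RawStorageSettings":
        env = os.environ if env is None else env
        return cls(
            # An unsupported value raises ValueError, so misconfiguration fails fast.
            storage_type=RawStorageType(env.get("RAW_STORAGE_TYPE", "local")),
            path=env.get("RAW_STORAGE_PATH", "data/raw_storage"),
        )


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def raw_filename(source_id: uuid.UUID, title: Optional[str] = None, ext: str = "txt") -> str:
    """Build a file name keyed by the UUID, optionally prefixed with a title slug."""
    if title:
        return f"{slugify(title)}-{source_id}.{ext}"
    return f"{source_id}.{ext}"


if __name__ == "__main__":
    settings = RawStorageSettings.from_env()
    print(settings.storage_type.value, settings.path)
    print(raw_filename(uuid.uuid4(), "My First Video"))
```

Keeping the UUID in every file name means the slug is purely cosmetic: two sources with identical titles still map to distinct files.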

