Currently, WhatYouSaid processes ingested content (e.g., YouTube transcripts, text) and stores it directly as vector embeddings and SQL metadata. However, the original "raw" data (the full transcript or raw text) is not persisted in its original form.
This feature aims to introduce a Raw Storage Layer to save the original content before or during the ingestion process. This is crucial for auditing, re-indexing with different chunking strategies without re-fetching from the source, and providing a "source of truth" for the RAG pipeline. The initial MVP will focus on Local File Storage, with an architecture designed to support Cloud Blobs or Google Drive in the future.
Additional Context
- MVP Goal: A working local storage folder that mirrors the `ContentSource` ID or External ID.
- Future-Proofing: The `IRawStorageService` should be generic enough to allow implementations for AWS S3, Azure Blob, or Google Drive via a simple configuration switch, similar to how we handle `VectorStoreType`.
- Naming Convention: Files should probably be named using the `ContentSource` UUID or a slug derived from the title to avoid collisions.
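To make the naming convention and the configuration switch concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `RawStorageType`, `slugify`, `raw_filename`, and `resolve_storage_root` are hypothetical names, the `.txt` extension is a guess, and combining the UUID with a title slug is just one way to keep names both unique and readable. How `VectorStoreType` is actually modeled in the codebase is not shown here, so the enum only imitates the general idea.

```python
import re
import uuid
from enum import Enum
from pathlib import Path


class RawStorageType(str, Enum):
    """Hypothetical enum; only LOCAL is in scope for the MVP."""

    LOCAL = "local"
    S3 = "s3"


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def raw_filename(source_id: uuid.UUID, title: str = "") -> str:
    """Build a filename from the UUID, plus an optional readable slug."""
    slug = slugify(title)
    stem = f"{source_id}-{slug}" if slug else str(source_id)
    return f"{stem}.txt"


def resolve_storage_root(storage_type: str, storage_path: str) -> Path:
    """Map the configured type to a local root; reject backends not built yet."""
    kind = RawStorageType(storage_type)  # raises ValueError on unknown values
    if kind is RawStorageType.LOCAL:
        return Path(storage_path)
    raise NotImplementedError(f"raw storage backend {kind.value!r} is not implemented yet")
```

A slug on its own can collide when two sources share a title, which is why this sketch always includes the UUID and treats the slug as a readability aid only.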
Tasks
- Add `raw_storage_path` or `raw_content_url` to `ContentSourceEntity` in `src/domain/entities/content_source_entity.py`.
- Define `IRawStorageService` in `src/domain/interfaces/services/` with methods like `save(content: str, filename: str)` and `get(filename: str)`.
- Implement `LocalRawStorageService` in `src/infrastructure/services/` that saves files to a configurable directory (e.g., `data/raw_storage/`).
- Update `IngestYoutubeUseCase` in `src/application/use_cases/ingest_youtube_use_case.py` to call the storage service as soon as the transcript is fetched.
- Add `RAW_STORAGE_TYPE` (default: `local`) and `RAW_STORAGE_PATH` to `src/config/settings.py`.
- Update `SourcesTable.tsx` or add a new "Raw View" to allow users to download or view the raw content associated with a source.
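As a starting point for the `IRawStorageService` and `LocalRawStorageService` tasks, the sketch below shows one possible shape. Only the class names and the `save`/`get` method names come from the task list. The return type of `save` (a location string that could be stored in `raw_storage_path`), the UTF-8 encoding, and the path-traversal guard are assumptions, and the example filename is made up.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class IRawStorageService(ABC):
    """Storage-agnostic contract, using the method names from the task list."""

    @abstractmethod
    def save(self, content: str, filename: str) -> str:
        """Persist content and return its location (return type is an assumption)."""

    @abstractmethod
    def get(self, filename: str) -> str:
        """Return the stored content for filename."""


class LocalRawStorageService(IRawStorageService):
    """Stores each raw document as a UTF-8 text file under one base directory."""

    def __init__(self, base_dir: str) -> None:
        self._base = Path(base_dir).resolve()
        self._base.mkdir(parents=True, exist_ok=True)

    def _resolve(self, filename: str) -> Path:
        # Reject names that would escape the storage directory (e.g. "../x").
        path = (self._base / filename).resolve()
        if not path.is_relative_to(self._base):  # requires Python 3.9+
            raise ValueError(f"invalid filename: {filename!r}")
        return path

    def save(self, content: str, filename: str) -> str:
        path = self._resolve(filename)
        path.write_text(content, encoding="utf-8")
        return str(path)

    def get(self, filename: str) -> str:
        return self._resolve(filename).read_text(encoding="utf-8")


# Round trip in a throwaway directory; a real setup would pass RAW_STORAGE_PATH.
with tempfile.TemporaryDirectory() as tmp:
    storage = LocalRawStorageService(tmp)
    location = storage.save("full transcript text", "example-source.txt")
    restored = storage.get("example-source.txt")
```

In the ingestion use case, the equivalent of the `save` call would run right after the transcript is fetched and before chunking, so the raw text survives even if a later step fails.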