diff --git a/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md new file mode 100644 index 00000000..3fb99fcb --- /dev/null +++ b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md @@ -0,0 +1,261 @@ +--- +date: 2026-04-02 +topic: video-content-vectorization +--- + +# Video Content Vectorization for Recommendations + +## Problem Frame + +JesusFilm has 50,000+ unique videos ranging from short clips to feature-length films, each available in up to 1,500 language variants. Current recommendations are purely metadata-driven — "you watched Film X, here it is in 1,500 other languages." There is no way to recommend thematically or visually similar content across different films. + +Existing transcript-based text embeddings (already built in the manager pipeline) capture _what was said_ but miss _what was shown_ — visual setting, actions, emotions, cinematography, and mood. A user watching a contemplative scene of someone walking by water should be recommended other reflective moments from entirely different films, not the same film dubbed in Swahili. + +**Validation needed**: Before full investment, confirm that transcript-only embeddings do not already provide adequate cross-film similarity. A quick test (20-50 seed videos, manual evaluation of transcript embedding recommendations) should establish whether visual scene analysis adds meaningful lift. + +**Catalog composition unknown**: The 50K figure includes all video labels (featureFilm, shortFilm, segment, episode, collection, trailer, behindTheScenes). The ratio of feature-length films to short clips dramatically affects scene count, processing time, and cost. A data audit (see R0) is prerequisite to finalizing the approach. + +## Rollout Strategy + +**Phase 1 — English prototype (this scope)**: Process all English-language videos only. 
Prove recommendation quality, validate the pipeline, and establish cost baseline. This is the fundable proof of concept. + +**Phase 2 — Full catalog (future, funding-dependent)**: If Phase 1 demonstrates value, expand to all 50K+ videos across all languages. Phase 2 is explicitly out of scope for this requirements doc. + +All requirements below are scoped to Phase 1 (English videos only) unless stated otherwise. + +## Requirements + +- R0. **Data audit (prerequisite)**: Before committing to the pipeline, query the CMS to determine: (a) video count by label type and duration distribution for English-language videos, (b) how many have existing chapter/scene metadata from the enrichment pipeline, (c) whether the Video → VideoVariant model provides implicit deduplication or whether separate Video records exist for the same content in different languages. +- R1. **Scene segmentation**: Break videos into meaningful narrative scenes with precise start/end timestamps. + - R1a. **Transcript-based segmentation**: Extend the existing `chapters.ts` service output (which already produces titles, start/end timestamps, and summaries via LLM) as the baseline for scene boundaries. For short clips that are a single scene, chapter output may be sufficient without further segmentation. + - R1b. **Visual shot detection + fusion**: For feature-length films, augment transcript-based boundaries with visual shot detection to produce more accurate narrative scene boundaries. This is a research-heavy component — evaluate libraries and approaches during planning. +- R2. **Scene content description**: For each scene, generate a rich multimodal description capturing visual setting, objects, actions, characters, emotional tone, and mood by feeding representative frames + transcript to a multimodal LLM. Note: this requires a new multimodal LLM client — the existing OpenRouter `embeddings.ts` is text-only and cannot send images. +- R3. 
**Scene embedding and storage**: Embed each scene description using the existing text embedding pipeline (`text-embedding-3-small`, 1536 dims) and store in a **separate `scene_embeddings` table** in pgvector with full traceability back to source video and scene. +- R4. **Cross-film recommendation**: Given a scene or video, find visually and thematically similar scenes from _different_ films using vector similarity. Deduplication across language variants uses the Video → VideoVariant parent relationship (embed once per Video, not per variant). This scope includes the vector similarity query capability; the recommendation UI (how results are surfaced in web/mobile) is a separate feature. +- R5. **Backfill worker**: A dedicated worker service to process the English video catalog. Must be resumable/idempotent. Must include: + - Configurable batch size and rate limits + - Cost tracking per video and cumulative + - Automatic pause if cost exceeds a configurable threshold + - Dry-run mode that estimates cost without calling LLMs +- R6. **Incremental pipeline integration**: After backfill, scene vectorization becomes a required step in the existing manager enrichment workflow for new English video uploads. Note: unlike existing parallel steps (translate, chapters, metadata, embeddings) which all consume transcript text, scene vectorization needs video frame access via muxAssetId — it runs as an independent branch, not a simple addition to the existing parallel group. +- R7. **Existing scene metadata**: Where videos already have chapter output from the enrichment pipeline, use it as the starting point for segmentation rather than re-detecting from scratch. 
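R5's dry-run mode is small enough to sketch directly. The following is an illustrative sketch, not existing code: `estimateBackfillCost` and `VideoAudit` are hypothetical names, and the per-scene token counts and Gemini 2.5 Flash prices are the same assumptions used in the cost model later in this doc.

```typescript
// Hypothetical dry-run estimator for the R5 backfill worker. All names are
// illustrative; token counts and prices are cost-model assumptions
// (~2K input / ~500 output tokens per scene, Gemini 2.5 Flash pricing).
type VideoAudit = { label: string; sceneCount: number }

const USD_PER_M_INPUT = 0.15 // assumed input price per 1M tokens
const USD_PER_M_OUTPUT = 0.6 // assumed output price per 1M tokens
const INPUT_TOKENS_PER_SCENE = 2_000 // ~3 frames + transcript chunk
const OUTPUT_TOKENS_PER_SCENE = 500 // one scene description

export function estimateBackfillCost(videos: VideoAudit[]): {
  scenes: number
  usd: number
} {
  const scenes = videos.reduce((sum, v) => sum + v.sceneCount, 0)
  const usd =
    (scenes * INPUT_TOKENS_PER_SCENE * USD_PER_M_INPUT) / 1_000_000 +
    (scenes * OUTPUT_TOKENS_PER_SCENE * USD_PER_M_OUTPUT) / 1_000_000
  return { scenes, usd }
}
```

Feeding it the Phase 1 mix from the cost model (~16K short-clip scenes plus ~150K feature-film scenes) reproduces the ~$100 LLM-call estimate before embedding costs.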
+ +## Storage Schema + +Scene embeddings are stored in a dedicated pgvector table with full traceability to source video and scene boundaries: + +```sql +CREATE TABLE scene_embeddings ( + id SERIAL PRIMARY KEY, + + -- Traceability: which video and scene + video_id INTEGER NOT NULL, -- FK to Strapi video record + core_id TEXT, -- video.coreId for cross-reference + mux_asset_id TEXT NOT NULL, -- which Mux asset frames came from + playback_id TEXT NOT NULL, -- for Mux thumbnail URL construction + + -- Scene boundaries + scene_index INTEGER NOT NULL, -- 0-based order within the video + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, -- NULL for final scene (extends to end) + + -- Content (for debugging, tracing, and quality review) + description TEXT NOT NULL, -- LLM-generated scene description + chapter_title TEXT, -- from chapters.ts if available + frame_count INTEGER, -- how many frames were sent to LLM + + -- The embedding + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + + -- Phase tracking + language TEXT NOT NULL DEFAULT 'en', -- which language transcript was used + + -- Metadata + created_at TIMESTAMPTZ DEFAULT NOW(), + + -- Uniqueness: one embedding per scene per video + UNIQUE(video_id, scene_index) +); + +-- HNSW index for fast similarity search +CREATE INDEX scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); + +-- Lookup by video (for "find scenes in this video" and deduplication) +CREATE INDEX scene_embeddings_video_id ON scene_embeddings(video_id); + +-- Phase filtering (English prototype vs full catalog) +CREATE INDEX scene_embeddings_language ON scene_embeddings(language); +``` + +**How to trace an embedding back to its source:** + +- `video_id` → Strapi Video record (title, slug, label, description) +- `video_id` → Video.variants → VideoVariant records (language-specific playback) +- `mux_asset_id` / `playback_id` → Mux asset (for re-extracting frames) +- `scene_index` + 
`start_seconds` / `end_seconds` → exact moment in the video +- `description` → what the LLM "saw" in this scene (stored for inspection) +- `chapter_title` → link to chapters.ts output if it was the scene source + +**Recommendation query pattern:** + +```sql +-- Find similar scenes from DIFFERENT videos +SELECT se.video_id, se.scene_index, se.description, se.start_seconds, + 1 - (se.embedding <=> $1) AS similarity +FROM scene_embeddings se +WHERE se.video_id != $2 -- exclude current video + AND se.language = 'en' -- Phase 1: English only +ORDER BY se.embedding <=> $1 +LIMIT 10; +``` + +**Why this schema:** + +- **Separate from `video_embeddings`** (feat-009): Different columns (timestamps, description) and different query patterns (scene similarity vs. transcript keyword search). Separate tables let feat-009 proceed as-is. +- **`video_id` as dedup key**: Language variants are VideoVariants under the same Video parent. Embedding once per Video and filtering by `video_id !=` gives implicit cross-variant deduplication. +- **`language` column**: Enables Phase 1 (English only) filtering and future Phase 2 expansion without schema changes. +- **`description` stored**: Enables quality review, debugging, and re-embedding with a different model without re-running the LLM. + +## Rough Cost Model + +**Phase 1 (English only) — order-of-magnitude estimates. Refine after R0 data audit.** + +English subset is likely a fraction of the 50K total. 
Assuming ~5K-10K English videos (the bullets below use the upper end, ~10K):

- Short clips (~80%): 8K × 2 scenes = ~16K scene descriptions
- Feature films (~20%): 2K × 75 scenes = ~150K scene descriptions
- **Total: ~166K multimodal LLM calls**

At Gemini 2.5 Flash pricing (~$0.15/1M input tokens, ~$0.60/1M output tokens):

- Per scene: ~3 frames (thumbnails) + transcript chunk ≈ ~2K tokens input, ~500 tokens output
- **Total input: ~332M tokens → ~$50**
- **Total output: ~83M tokens → ~$50**
- **Embedding cost**: 166K × text-embedding-3-small ≈ ~$3
- **Phase 1 rough total: ~$100-$300**

**Full catalog estimate (Phase 2, for future funding request):**

- ~830K scene descriptions → ~$500-$1,500

Compare: Twelve Labs Embed at ~$0.03/min × estimated 500K+ total minutes = **$15K+**

## Success Criteria

- Recommendations surface genuinely different films/clips based on visual and thematic similarity, not just metadata overlap
- **Measurable quality bar**: Curate 50-100 seed videos with human-labeled "expected similar" results. Scene embeddings must surface at least 3 relevant cross-film results in top 10 for 80% of seed videos, outperforming transcript-only embeddings on the same evaluation set.
- Feature-length films are segmented into meaningful narrative scenes (not raw shot cuts)
- The backfill worker can process the English catalog without manual intervention (resumable on failure, cost-capped)
- New English uploads are automatically scene-vectorized as part of the enrichment pipeline
- Language variants of the same content are deduplicated in recommendation results
- **Phase gate**: Phase 1 results are evaluated before requesting Phase 2 funding

## Scope Boundaries

- **Phase 1 only**: English-language videos. Other languages are Phase 2, out of scope.
- **Not building a user-facing search UI** — this is the recommendation engine layer. Search (feat-010) is a separate concern.
- **Not replacing transcript embeddings** — scene embeddings complement them. 
Both live in pgvector in separate tables. +- **Hybrid approach**: Start with LLM-generated scene descriptions embedded as text vectors (ships faster, reuses existing infra). Native video embedding models (Twelve Labs, Gemini video embeddings) are a future upgrade path, not in scope now. +- **Not building the recommendation UI** — this provides the vector similarity query capability. How recommendations are surfaced in web/mobile is a separate feature. + +## Key Decisions + +- **English-first phased rollout**: Prototype with all English videos (~$100-$300 estimated cost). Prove value before investing in full 50K+ catalog. Phase 2 is a separate funding decision. +- **LLM descriptions over native video embeddings**: At scale, native video embedding APIs (Twelve Labs at ~$15K+) are 10-30x more expensive than LLM scene descriptions (~$500-$1,500 full catalog). LLM descriptions reuse existing infrastructure (text-embedding-3-small + pgvector) and provide good quality. Can upgrade selectively later. +- **Scene-level granularity**: Embeddings are per-scene, not per-frame or per-video. Short clips may be 1-3 scenes; feature films 50-200. This is the right unit for recommendations. +- **Build on existing chapters pipeline**: The `chapters.ts` service already produces transcript-based scene segmentation with timestamps. R1 extends this with visual shot detection for feature films rather than building scene detection from scratch. +- **Separate `scene_embeddings` table**: Scene embeddings have different columns (start/end timestamps, description text) and query patterns than transcript chunk embeddings. Separate tables let feat-009 proceed as-is and keep query logic clean. Resolve before feat-009 starts Apr 7. +- **Hybrid storage: pgvector + lightweight metadata**: Scene data lives in the `scene_embeddings` table with full traceability columns (video_id, mux_asset_id, timestamps, description) rather than as a Strapi content type. 
Keeps it lean for prototype; can promote to CMS entity later if human-in-the-loop editing is needed. +- **Backfill worker separate from manager**: The one-time catalog processing runs as a dedicated worker service (can scale independently, doesn't block the manager pipeline). Can reuse the same workflow code/libraries. New uploads use the integrated manager pipeline step. +- **Deduplication via Video → VideoVariant model**: Scene detection and embedding runs once per Video entity (the parent), not per VideoVariant. Recommendations filter by unique Video ID. Confirm during data audit (R0) that language variants are modeled as VideoVariants, not separate Video records. + +## Dependencies / Assumptions + +- **pgvector must be deployed first** (feat-009, scheduled Apr 7, 14-day duration → ~Apr 21) — R3, R4, R6 are blocked. R0, R1, R2, R5 scaffolding can proceed in parallel. +- **Existing chapters pipeline** in manager is working and produces scene-like segmentation +- **Mux thumbnail API** provides frame extraction at specific timestamps via `image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time=N` — confirm during planning +- **New multimodal LLM client needed** — existing OpenRouter client is text-only; R2 requires sending images alongside text +- **Railway worker constraints** — need to confirm Railway supports long-lived worker processes or design backfill as queue-based with short-lived jobs. Existing `railway.toml` has `restartPolicyMaxRetries: 3` which may not suit multi-day processing. + +## Outstanding Questions + +### Deferred to Planning + +- [Affects R0][Data audit] Query CMS for English video count by label, duration distribution, and chapter metadata coverage. This gates the pipeline sizing. +- [Affects R1b][Needs research] Which visual scene detection libraries work best for narrative film content? PySceneDetect handles shot boundaries; evaluate options for combining with transcript-based scene detection. 
+- [Affects R2][Needs research] Which multimodal LLM gives best scene descriptions for the cost? Gemini 2.5 Flash vs GPT-4o vs others — benchmark quality and pricing at scale. +- [Affects R2][Technical] How many representative frames per scene should be sampled for description? 1 keyframe vs 3-5 frames affects description quality and API cost. +- [Affects R5][Technical] Backfill worker architecture — queue-based (process videos from a job queue) or single long-lived process? Depends on Railway constraints. +- [Affects R5][Needs research] Confirm Mux thumbnail API works for arbitrary timestamps and returns sufficient resolution for multimodal LLM input. +- [Affects R4][Technical] How will scene similarity interact with feat-010 semantic search API? Different query pattern (find similar scenes vs. keyword search). + +## Visual Embedding Technology Research + +**Researched Apr 2, 2026. Use to inform feat-040 (scene descriptions) model selection.** + +### Approach Comparison + +| Approach | Est. Cost (50K videos) | Quality | Infra Complexity | +| ------------------------------------------ | ---------------------- | ----------------------------- | ------------------------- | +| **Gemini 2.5 Flash describe + text-embed** | **$150-300** | **High (narrative + visual)** | **Low (reuses existing)** | +| Gemini Embedding 2 (direct video embed) | $2,000-5,000 | High (native multimodal) | Medium (new index) | +| Twelve Labs Embed (Marengo 3.0) | $10,000+ | Highest (purpose-built) | Medium (new index) | +| CLIP/SigLIP local | ~$0 (compute only) | Medium (visual only) | Medium (new index + GPU) | +| GPT-4o describe + text-embed | $1,200-2,400 | High | Low | + +### Recommended: Gemini 2.5 Flash "Describe then Embed" + +- **Image input**: Accepts multiple images + text per request. ~1,290 tokens per image ≈ $0.000039/image. +- **At 3 frames/scene × 166K scenes (English)**: ~$58 in image tokens + ~$50 output tokens = **~$100-$300 total**. 
+- **Quality**: Strong at visual description, emotional tone, settings, actions. Best cost/quality ratio by a wide margin. +- **Why not GPT-4o**: 8x more expensive ($2.50/1M input vs $0.30/1M). Comparable quality. +- **Why not Claude**: Haiku is 3-4x more expensive, Sonnet 10x. Not justified at scale for scene description. + +### Why Not CLIP/SigLIP Directly? + +CLIP/SigLIP produce embeddings directly from images (512-1152 dims) in a shared text-image space. Strengths: zero marginal cost, text-to-image search works. But: + +- Embeddings capture "what's in this image" not narrative meaning. Will find "beach scene" but miss "baptism at a river" vs "family swimming at a lake." +- **Incompatible vector space** with text-embedding-3-small — cannot mix in the same pgvector index. +- For ministry content requiring semantic nuance, CLIP alone is insufficient. + +### Future Upgrade Path: Gemini Embedding 2 + +Google's multimodal embedding model (public preview, Mar 2026): + +- 3072 dims (Matryoshka down to 768). Can target 1536 to match existing space. +- Accepts text, image, video, audio in one unified embedding space. +- **Video constraint**: max 80-120 seconds per clip → fits our scene-based approach. +- Pricing: ~$0.00079/frame. At 1fps for 60s scenes ≈ $0.047/scene. +- **When to adopt**: Once out of preview and pricing stabilizes. Store as a second signal in a separate column, combine scores at query time. + +### Mux Thumbnail API (Confirmed) + +- **URL**: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.{png|jpg|webp}?time={SECONDS}` +- **Resolution**: Defaults to original video resolution. Supports `?width=512&height=512` for LLM-friendly sizes. +- **Rate limit**: 1 unique thumbnail per 10 seconds of video duration per asset. A 60-min film supports 360 thumbnails — plenty for 3 frames × 20 scenes. +- **Cost**: Included in Mux standard pricing. No per-thumbnail charge. +- **CDN cached**: Repeated requests for the same timestamp are free. 
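The 3-frames-per-scene sampling plan reduces to simple URL construction against the thumbnail format above. A sketch under this doc's assumptions — the helper name and the `frameCount`/`width` defaults are illustrative, not existing manager code:

```typescript
// Illustrative helper: evenly spaced thumbnail URLs for one scene using the
// Mux thumbnail URL format above. frameCount=3 and width=512 follow the
// assumptions in this doc; this is a sketch, not existing code.
export function sceneFrameUrls(
  playbackId: string,
  startSeconds: number,
  endSeconds: number,
  frameCount = 3,
  width = 512,
): string[] {
  const span = Math.max(endSeconds - startSeconds, 0)
  return Array.from({ length: frameCount }, (_, i) => {
    // Evenly spaced samples: scene start, midpoint(s), scene end
    const t =
      frameCount > 1
        ? startSeconds + (span * i) / (frameCount - 1)
        : startSeconds + span / 2
    return `https://image.mux.com/${playbackId}/thumbnail.jpg?time=${t.toFixed(1)}&width=${width}`
  })
}
```

At 3 frames per scene this stays well inside the 1-thumbnail-per-10-seconds budget noted above for feature-length assets.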
+ +## Roadmap Tickets + +This brainstorm produced the following roadmap features in `docs/roadmap/content-discovery/`: + +| ID | Feature | Days | Start | Depends on | +| ------------------------------------------------------------------------------------ | ----------------------------------- | ---- | ------ | ---------------------------- | +| [feat-037](../roadmap/content-discovery/feat-037-video-content-vectorization.md) | Parent: Video Content Vectorization | 42 | Apr 21 | feat-009, feat-031 | +| [feat-038](../roadmap/content-discovery/feat-038-video-vectorization-data-audit.md) | Data Audit | 3 | Apr 21 | feat-037 | +| [feat-039](../roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md) | Chapter-Based Scene Boundaries | 7 | Apr 24 | feat-038 | +| [feat-040](../roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md) | Multimodal Scene Descriptions | 10 | May 1 | feat-039 | +| [feat-041](../roadmap/content-discovery/feat-041-scene-embeddings-table.md) | Scene Embeddings Table + Indexing | 7 | May 11 | feat-009, feat-040 | +| [feat-042](../roadmap/content-discovery/feat-042-backfill-worker.md) | English Backfill Worker | 10 | May 18 | feat-038, feat-040, feat-041 | +| [feat-043](../roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md) | Visual Shot Detection Fusion (P2) | 10 | May 28 | feat-039 | +| [feat-044](../roadmap/content-discovery/feat-044-recommendation-query-api.md) | Recommendation Query API | 7 | May 28 | feat-041, feat-042 | +| [feat-045](../roadmap/content-discovery/feat-045-pipeline-integration.md) | Pipeline Integration | 7 | Jun 4 | feat-041, feat-042 | +| [feat-046](../roadmap/content-discovery/feat-046-recommendations-demo-experience.md) | Recommendations Demo Experience | 7 | Jun 4 | feat-044 | + +## Next Steps + +→ `/ce:plan` for structured implementation planning (R0 data audit is first planning task). 
diff --git a/docs/roadmap/README.md b/docs/roadmap/README.md index 3a72c5c4..3ff3813e 100644 --- a/docs/roadmap/README.md +++ b/docs/roadmap/README.md @@ -30,12 +30,22 @@ Build trusted, scalable AI capabilities that help people discover gospel content ### Content Discovery -| ID | Feature | Owner | Priority | Start | Days | Status | -| --------------------------------------------------------------------- | ------------------------------------- | ----- | -------- | ------ | ---- | ----------- | -| [feat-009](content-discovery/feat-009-pgvector-embedding-indexing.md) | pgvector Setup and Embedding Indexing | nisal | P0 | Apr 7 | 14 | not-started | -| [feat-010](content-discovery/feat-010-semantic-search-api.md) | Semantic Search API | nisal | P0 | Apr 14 | 21 | not-started | -| [feat-011](content-discovery/feat-011-search-ui-web.md) | Search UI — Web | urim | P0 | Apr 14 | 21 | not-started | -| [feat-012](content-discovery/feat-012-search-ui-mobile.md) | Search UI — Mobile | urim | P0 | Apr 14 | 21 | not-started | +| ID | Feature | Owner | Priority | Start | Days | Status | +| ------------------------------------------------------------------------- | ------------------------------------- | ----- | -------- | ------ | ---- | ----------- | +| [feat-009](content-discovery/feat-009-pgvector-embedding-indexing.md) | pgvector Setup and Embedding Indexing | nisal | P0 | Apr 7 | 14 | not-started | +| [feat-010](content-discovery/feat-010-semantic-search-api.md) | Semantic Search API | nisal | P0 | Apr 14 | 21 | not-started | +| [feat-011](content-discovery/feat-011-search-ui-web.md) | Search UI — Web | urim | P0 | Apr 14 | 21 | not-started | +| [feat-012](content-discovery/feat-012-search-ui-mobile.md) | Search UI — Mobile | urim | P0 | Apr 14 | 21 | not-started | +| [feat-037](content-discovery/feat-037-video-content-vectorization.md) | Video Content Vectorization for Recs | nisal | P1 | Apr 21 | 42 | not-started | +| 
[feat-038](content-discovery/feat-038-video-vectorization-data-audit.md) | Vectorization — Data Audit | nisal | P1 | Apr 21 | 3 | not-started | +| [feat-039](content-discovery/feat-039-chapter-based-scene-boundaries.md) | Vectorization — Scene Boundaries | nisal | P1 | Apr 24 | 7 | not-started | +| [feat-040](content-discovery/feat-040-multimodal-scene-descriptions.md) | Vectorization — Scene Descriptions | nisal | P1 | May 1 | 10 | not-started | +| [feat-041](content-discovery/feat-041-scene-embeddings-table.md) | Vectorization — Embeddings Table | nisal | P1 | May 11 | 7 | not-started | +| [feat-042](content-discovery/feat-042-backfill-worker.md) | Vectorization — English Backfill | nisal | P1 | May 18 | 10 | not-started | +| [feat-043](content-discovery/feat-043-visual-shot-detection-fusion.md) | Vectorization — Visual Shot Fusion | nisal | P2 | May 28 | 10 | not-started | +| [feat-044](content-discovery/feat-044-recommendation-query-api.md) | Vectorization — Recommendation API | nisal | P1 | May 28 | 7 | not-started | +| [feat-045](content-discovery/feat-045-pipeline-integration.md) | Vectorization — Pipeline Integration | nisal | P1 | Jun 4 | 7 | not-started | +| [feat-046](content-discovery/feat-046-recommendations-demo-experience.md) | Vectorization — Recommendations Demo | nisal | P1 | Jun 4 | 7 | not-started | ### Topic Experiences diff --git a/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md b/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md index 541111de..09c4640d 100644 --- a/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md +++ b/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md @@ -10,6 +10,7 @@ depends_on: - "feat-002" blocks: - "feat-010" + - "feat-037" tags: - "cms" - "pgvector" diff --git a/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md new file mode 100644 index 
00000000..70901205 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md @@ -0,0 +1,215 @@ +--- +id: "feat-037" +title: "Video Content Vectorization for Recommendations" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-21" +duration: 42 +depends_on: + - "feat-009" + - "feat-031" +blocks: + - "feat-038" +tags: + - "cms" + - "pgvector" + - "ai-pipeline" + - "search" + - "manager" +--- + +## Problem + +Current recommendations are metadata-driven — "you watched Film X, here it is in 1,500 other languages." Transcript embeddings (feat-009/010) capture what was said, but miss what was shown. Visual scene embeddings enable cross-film recommendations based on visual setting, actions, emotional tone, and mood. + +**Phase 1 (this feature)**: All English-language videos. Prove recommendation quality at ~$100-$300 estimated cost. Phase 2 (full 50K+ catalog) is a separate funding decision. + +## Entry Points — Read These First + +1. `apps/manager/src/services/chapters.ts` — existing scene-like segmentation: `Chapter { title, startSeconds, endSeconds, summary }`. This is the baseline for R1a. +2. `apps/manager/src/services/embeddings.ts` — existing text embedding pipeline using `text-embedding-3-small` (1536 dims). Scene descriptions will be embedded through the same model. +3. `apps/manager/src/workflows/videoEnrichment.ts` — enrichment workflow with parallel steps. R6 adds scene vectorization as a new branch. +4. `apps/manager/src/services/storage.ts` — S3 artifact storage pattern (`{assetId}/{type}.json`). +5. `apps/cms/src/api/video/content-types/video/schema.json` — Video content type with `coreId`, `label` enum, `variants` relation. +6. `apps/cms/src/api/video-variant/content-types/video-variant/schema.json` — VideoVariant with `language` and `muxVideo` relations. +7. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — MuxVideo with `assetId` and `playbackId` for frame extraction. +8. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — full requirements doc with storage schema, cost model, and rollout strategy. + +## Grep These + +- `chapters` in `apps/manager/src/` — existing chapter/scene segmentation +- `getOpenrouter` in `apps/manager/src/` — AI model client (text-only; needs multimodal extension) +- `text-embedding-3-small` in `apps/manager/src/` — embedding model +- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL patterns for pgvector +- `muxAssetId` in `apps/manager/src/` — Mux asset references for frame extraction +- `playbackId` in `apps/cms/src/` — Mux playback IDs for thumbnail URLs +- `label` in `apps/cms/src/api/video/` — video type enum (featureFilm, shortFilm, etc.) + +## What To Build + +### R0. Data Audit (first task) + +Query CMS to determine English video landscape: + +```sql +-- Video count by label type +SELECT label, COUNT(*) FROM videos GROUP BY label; + +-- Duration distribution +SELECT label, + COUNT(*) as count, + AVG(duration) as avg_duration, + MAX(duration) as max_duration +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY label; + +-- Chapter metadata coverage +SELECT COUNT(DISTINCT ej.mux_asset_id) +FROM enrichment_jobs ej +WHERE ej.step_statuses->>'chapters' = 'completed'; +``` + +### R1. 
Scene Segmentation

**R1a — Transcript-based (extend chapters.ts)**:

- For each English video, use existing chapter output as scene boundaries
- Short clips (single chapter) → treat as one scene
- Store chapter boundaries as scene candidates

**R1b — Visual fusion (feature films only)**:

- Extract frames at chapter boundaries using Mux thumbnail API: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={SECONDS}`
- Feed frame sequences + transcript to multimodal LLM to refine/merge chapter boundaries into narrative scenes
- Research: evaluate PySceneDetect for shot boundary detection to augment transcript-based boundaries

### R2. Scene Content Description

New service: `apps/manager/src/services/sceneDescription.ts`

```typescript
type SceneDescription = {
  sceneIndex: number
  startSeconds: number
  endSeconds: number | null
  description: string // LLM-generated rich description
  chapterTitle: string | null
  frameCount: number
}

export async function describeScene(
  playbackId: string,
  startSeconds: number,
  endSeconds: number | null,
  transcript: string,
  chapterTitle: string | null,
): Promise<SceneDescription>
```

- Extract 3 representative frames via Mux thumbnail API at scene start, midpoint, and end
- Send frames + transcript chunk to multimodal LLM (Gemini 2.5 Flash via OpenRouter or direct API)
- Prompt: describe visual setting, objects, actions, characters, emotional tone, mood
- **Requires new multimodal client** — existing OpenRouter client is text-only

### R3. 
Scene Embedding + Storage + +Create `scene_embeddings` table via bootstrap SQL (same pattern as feat-009): + +```sql +CREATE TABLE IF NOT EXISTS scene_embeddings ( + id SERIAL PRIMARY KEY, + video_id INTEGER NOT NULL, + core_id TEXT, + mux_asset_id TEXT NOT NULL, + playback_id TEXT NOT NULL, + scene_index INTEGER NOT NULL, + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, + description TEXT NOT NULL, + chapter_title TEXT, + frame_count INTEGER, + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + language TEXT NOT NULL DEFAULT 'en', + created_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE(video_id, scene_index) +); + +CREATE INDEX IF NOT EXISTS scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); +CREATE INDEX IF NOT EXISTS scene_embeddings_video_id + ON scene_embeddings(video_id); +CREATE INDEX IF NOT EXISTS scene_embeddings_language + ON scene_embeddings(language); +``` + +Indexing service: `apps/cms/src/api/scene-embedding/services/indexer.ts` + +```typescript +export async function indexSceneEmbeddings( + videoId: number, + scenes: SceneDescription[], + embeddings: number[][], + meta: { + coreId: string + muxAssetId: string + playbackId: string + language: string + }, +): Promise<{ scenesIndexed: number }> +``` + +### R4. Cross-film Recommendation Query + +```sql +SELECT se.video_id, se.scene_index, se.description, se.start_seconds, + 1 - (se.embedding <=> $1) AS similarity +FROM scene_embeddings se +WHERE se.video_id != $2 + AND se.language = 'en' +ORDER BY se.embedding <=> $1 +LIMIT 10; +``` + +Expose as CMS service or API endpoint for web/mobile consumption. + +### R5. 
Backfill Worker + +Dedicated Railway service (or separate entry point in manager) for one-time English catalog processing: + +- Queue-based: iterate English videos, process each through R1 → R2 → R3 +- Resumable: track processed video IDs, skip on restart +- Cost controls: configurable batch size, rate limits, cost tracking per video, auto-pause at threshold +- Dry-run mode: estimate cost without LLM calls + +### R6. Pipeline Integration + +Add scene vectorization to `videoEnrichment.ts` as an independent branch: + +- Runs after transcription completes (needs transcript) +- Also needs muxAssetId/playbackId (for frames) — different input than other parallel steps +- Triggers R1a → R2 → R3 for the new video + +## Constraints + +- **English only** — filter by language in all queries and processing. `language` column enables future expansion. +- **Separate table from `video_embeddings`** — different columns, different query patterns. Do not extend feat-009's table. +- **Do NOT use a Strapi content type** for scene embeddings — pgvector columns don't work with Strapi ORM. Use raw SQL (same pattern as feat-009). +- **Embed once per Video, not per VideoVariant** — language variants share visual content. Dedup by `video_id`. +- **Cost cap** — backfill worker must auto-pause if cumulative cost exceeds configurable threshold. +- **Mux thumbnail API** for frame extraction — do not download full videos. Confirm API supports arbitrary timestamps during planning. + +## Verification + +1. **Data audit complete**: know English video count by label, duration distribution, chapter coverage +2. **Scene segmentation**: sample 10 feature films, verify scene boundaries align with narrative scenes (not just shot cuts) +3. **Scene descriptions**: sample 20 scenes, verify descriptions capture visual content, not just transcript paraphrasing +4. **Embeddings indexed**: `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` matches expected scene count +5. 
**Recommendation quality**: for 50 seed videos, top-10 similar scenes include at least 3 relevant cross-film results for 80% of seeds +6. **Deduplication**: recommendations never surface the same video (different variant) as the input +7. **Cost tracking**: backfill worker logs cumulative cost, stays within budget +8. **Pipeline integration**: upload a new English video → scene embeddings appear in `scene_embeddings` table automatically diff --git a/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md new file mode 100644 index 00000000..c354dade --- /dev/null +++ b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md @@ -0,0 +1,83 @@ +--- +id: "feat-038" +title: "Video Vectorization — Data Audit" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-21" +duration: 3 +depends_on: + - "feat-037" +blocks: + - "feat-039" + - "feat-042" +tags: + - "cms" + - "pgvector" +--- + +## Problem + +Before building the scene vectorization pipeline, we need to know the shape of the English video catalog: how many videos by type, duration distribution, and existing chapter coverage. This gates all downstream sizing, cost estimates, and architecture decisions. + +## Entry Points — Read These First + +1. `apps/cms/src/api/video/content-types/video/schema.json` — Video schema with `label` enum +2. `apps/cms/src/api/video-variant/content-types/video-variant/schema.json` — VideoVariant with language relation +3. `apps/cms/src/api/enrichment-job/content-types/enrichment-job/schema.json` — tracks chapter completion status +4. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — R0 requirements + +## Grep These + +- `label` in `apps/cms/src/api/video/` — video type enum values +- `bcp47` in `apps/cms/src/` — language code field for filtering English + +## What To Build + +Run diagnostic queries against the CMS database: + +```sql +-- English video count by label +SELECT v.label, COUNT(*) as count +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY v.label ORDER BY count DESC; + +-- Duration distribution for English videos +SELECT v.label, + COUNT(*) as count, + ROUND(AVG(vv.duration)) as avg_duration_sec, + MAX(vv.duration) as max_duration_sec +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY v.label; + +-- Chapter metadata coverage +SELECT COUNT(DISTINCT ej.mux_asset_id) +FROM enrichment_jobs ej +WHERE ej.step_statuses->>'chapters' = 'completed'; + +-- Confirm Video → VideoVariant dedup model +SELECT v.id, COUNT(vv.id) as variant_count +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +GROUP BY v.id ORDER BY variant_count DESC LIMIT 10; +``` + +Deliverable: update the brainstorm doc cost model with actual numbers. Confirm or revise the ~$100-$300 Phase 1 estimate. 
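
The cost-model update can be sketched as a small helper that turns the audit numbers into a scene-count and dollar estimate. This is a minimal sketch with hypothetical names (`estimatePhase1Cost`, `LabelStats`); the one-scene-per-~90-seconds heuristic and the $0.002-per-scene rate are illustrative assumptions to be replaced with the audit's real numbers and the chosen model's pricing:

```typescript
// Hypothetical cost-model sketch for the Phase 1 estimate.
// The scene heuristic and per-scene rate are illustrative assumptions,
// not measured values — substitute the audit's actual figures.

type LabelStats = { label: string; count: number; avgDurationSec: number }

// Rough heuristic: one narrative scene per ~90 seconds of runtime, minimum 1.
function estimatedScenes(avgDurationSec: number): number {
  return Math.max(1, Math.round(avgDurationSec / 90))
}

// costPerScene: assumed blended cost (frames + transcript prompt + output +
// embedding call) for one scene, in USD.
export function estimatePhase1Cost(
  stats: LabelStats[],
  costPerScene = 0.002, // illustrative Gemini-Flash-class rate
): { totalScenes: number; totalCostUsd: number } {
  let totalScenes = 0
  for (const s of stats) {
    totalScenes += s.count * estimatedScenes(s.avgDurationSec)
  }
  return { totalScenes, totalCostUsd: totalScenes * costPerScene }
}
```

Feeding in the per-label counts and average durations from the queries above gives a defensible replacement for (or confirmation of) the ~$100-$300 figure.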
+ +## Constraints + +- Read-only queries — do not modify production data +- Use `strapi.db.connection.raw()` pattern or direct DB access + +## Verification + +- Know exact English video count by label type +- Know duration distribution (what % are short clips vs feature films) +- Know chapter coverage (what % already have scene-like metadata) +- Cost model in brainstorm doc updated with real numbers diff --git a/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md b/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md new file mode 100644 index 00000000..b1c26746 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md @@ -0,0 +1,66 @@ +--- +id: "feat-039" +title: "Video Vectorization — Chapter-Based Scene Boundaries" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-24" +duration: 7 +depends_on: + - "feat-038" +blocks: + - "feat-040" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +The existing `chapters.ts` service produces transcript-based scene segmentation (title, startSeconds, endSeconds, summary). This output needs to be formalized as "scene boundaries" that downstream steps (description, embedding) consume. For short clips that are a single chapter, the chapter IS the scene. + +## Entry Points — Read These First + +1. `apps/manager/src/services/chapters.ts` — `Chapter { title, startSeconds, endSeconds, summary }` type and generation logic +2. `apps/manager/src/services/storage.ts` — artifact storage/retrieval pattern +3. 
`apps/manager/src/workflows/videoEnrichment.ts` — where chapters step runs
+
+## Grep These
+
+- `Chapter` in `apps/manager/src/services/chapters.ts` — existing type definition
+- `chapters` in `apps/manager/src/workflows/` — how chapters are invoked
+
+## What To Build
+
+New service: `apps/manager/src/services/sceneBoundaries.ts`
+
+```typescript
+type SceneBoundary = {
+  sceneIndex: number
+  startSeconds: number
+  endSeconds: number | null
+  chapterTitle: string | null
+  transcriptChunk: string
+}
+
+export async function extractSceneBoundaries(
+  assetId: string,
+  chapters: Chapter[],
+  transcript: string,
+): Promise<SceneBoundary[]>
+```
+
+- Map each chapter to a SceneBoundary with its corresponding transcript chunk
+- Single-chapter videos → one scene
+- Store as `{assetId}/scene-boundaries.json` artifact
+
+## Constraints
+
+- Do not modify `chapters.ts` — consume its output, don't change it
+- Keep the SceneBoundary type simple — visual fusion (feat-043) will extend it later
+
+## Verification
+
+- Process 10 English videos with existing chapters → scene boundaries match chapter structure
+- Short clips produce 1-3 scenes, feature films produce 20-100+
+- Artifact stored successfully in S3
diff --git a/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md
new file mode 100644
index 00000000..4f0554d8
--- /dev/null
+++ b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md
@@ -0,0 +1,83 @@
+---
+id: "feat-040"
+title: "Video Vectorization — Multimodal Scene Descriptions"
+owner: "nisal"
+priority: "P1"
+status: "not-started"
+start_date: "2026-05-01"
+duration: 10
+depends_on:
+  - "feat-039"
+blocks:
+  - "feat-041"
+  - "feat-042"
+tags:
+  - "manager"
+  - "ai-pipeline"
+---
+
+## Problem
+
+Each scene needs a rich description capturing visual setting, objects, actions, emotional tone, and mood. 
This requires a new multimodal LLM client (existing OpenRouter client is text-only) that can send video frames alongside transcript text.
+
+## Entry Points — Read These First
+
+1. `apps/manager/src/lib/openrouter.ts` — existing AI client (text-only)
+2. `apps/manager/src/services/chapters.ts` — example of LLM prompting pattern
+3. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary input (from feat-039)
+4. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — `playbackId` for Mux thumbnail URLs
+
+## Grep These
+
+- `getOpenrouter` in `apps/manager/src/` — existing AI client usage
+- `playbackId` in `apps/manager/src/` — Mux playback ID references
+
+## What To Build
+
+1. **Multimodal LLM client** — extend or add a client that supports sending images + text. Gemini 2.5 Flash recommended for cost/quality.
+
+2. **Frame extraction utility**:
+
+   ```typescript
+   export async function extractFrames(
+     playbackId: string,
+     timestamps: number[],
+   ): Promise<Buffer[]> // one thumbnail image per timestamp
+   ```
+
+   Uses Mux thumbnail API: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={SECONDS}`
+
+3. 
**Scene description service**: `apps/manager/src/services/sceneDescription.ts`
+
+   ```typescript
+   type SceneDescription = {
+     sceneIndex: number
+     startSeconds: number
+     endSeconds: number | null
+     description: string
+     chapterTitle: string | null
+     frameCount: number
+   }
+
+   export async function describeScene(
+     playbackId: string,
+     boundary: SceneBoundary,
+   ): Promise<SceneDescription>
+   ```
+
+   - Extract 3 frames (start, mid, end of scene)
+   - Send frames + transcript chunk to multimodal LLM
+   - Prompt for: visual setting, objects, actions, characters, emotional tone, mood
+   - Store as `{assetId}/scene-descriptions.json` artifact
+
+## Constraints
+
+- Confirm Mux thumbnail API works for arbitrary timestamps and returns sufficient resolution
+- Rate limit LLM calls — respect provider limits
+- Log token usage per call for cost tracking
+
+## Verification
+
+- Sample 20 scenes: descriptions capture visual content, not just transcript paraphrasing
+- Mux thumbnail extraction works for timestamps throughout a video
+- Token usage logged accurately
diff --git a/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md
new file mode 100644
index 00000000..5a86639a
--- /dev/null
+++ b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md
@@ -0,0 +1,82 @@
+---
+id: "feat-041"
+title: "Video Vectorization — Scene Embeddings Table + Indexing"
+owner: "nisal"
+priority: "P1"
+status: "not-started"
+start_date: "2026-05-11"
+duration: 7
+depends_on:
+  - "feat-009"
+  - "feat-040"
+blocks:
+  - "feat-042"
+  - "feat-044"
+tags:
+  - "cms"
+  - "pgvector"
+---
+
+## Problem
+
+Scene descriptions need to be embedded and stored in pgvector for similarity queries. This requires a new `scene_embeddings` table (separate from feat-009's `video_embeddings`) and an indexing service.
+
+## Entry Points — Read These First
+
+1. 
`apps/cms/src/bootstrap.ts` or `apps/cms/src/index.ts` — where pgvector extension and tables are created (feat-009 pattern) +2. `apps/manager/src/services/embeddings.ts` — existing text embedding pipeline +3. `docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — full schema in Storage Schema section + +## Grep These + +- `video_embeddings` in `apps/cms/src/` — feat-009 table creation pattern to follow +- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL execution pattern + +## What To Build + +1. **Bootstrap SQL** — add to CMS bootstrap alongside feat-009's table: + + ```sql + CREATE TABLE IF NOT EXISTS scene_embeddings ( + id SERIAL PRIMARY KEY, + video_id INTEGER NOT NULL, + core_id TEXT, + mux_asset_id TEXT NOT NULL, + playback_id TEXT NOT NULL, + scene_index INTEGER NOT NULL, + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, + description TEXT NOT NULL, + chapter_title TEXT, + frame_count INTEGER, + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + language TEXT NOT NULL DEFAULT 'en', + created_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE(video_id, scene_index) + ); + + CREATE INDEX IF NOT EXISTS scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); + CREATE INDEX IF NOT EXISTS scene_embeddings_video_id + ON scene_embeddings(video_id); + CREATE INDEX IF NOT EXISTS scene_embeddings_language + ON scene_embeddings(language); + ``` + +2. 
**Indexing service**: `apps/cms/src/api/scene-embedding/services/indexer.ts` + - Accept scene descriptions + embeddings + video metadata + - Upsert rows (delete existing for video_id + insert within transaction) + - Return count indexed + +## Constraints + +- Follow exact same pattern as feat-009 for raw SQL in Strapi +- HNSW index, not IVFFlat +- Table name may need adjustment based on Strapi's actual `videos` table name + +## Verification + +- `\d scene_embeddings` shows table with vector(1536) column +- Insert test data → HNSW index used in EXPLAIN ANALYZE of similarity query +- Upsert is idempotent — re-indexing same video replaces rows diff --git a/docs/roadmap/content-discovery/feat-042-backfill-worker.md b/docs/roadmap/content-discovery/feat-042-backfill-worker.md new file mode 100644 index 00000000..234b1c83 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-042-backfill-worker.md @@ -0,0 +1,66 @@ +--- +id: "feat-042" +title: "Video Vectorization — English Backfill Worker" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-05-18" +duration: 10 +depends_on: + - "feat-038" + - "feat-040" + - "feat-041" +blocks: + - "feat-044" +tags: + - "manager" + - "ai-pipeline" + - "infrastructure" +--- + +## Problem + +The full English video catalog needs to be processed through the scene vectorization pipeline (boundaries → descriptions → embeddings → indexing). This is a one-time batch job that must be resumable, cost-tracked, and safe to run against production. + +## Entry Points — Read These First + +1. `apps/manager/src/workflows/videoEnrichment.ts` — existing workflow pattern +2. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary extraction (feat-039) +3. `apps/manager/src/services/sceneDescription.ts` — scene description generation (feat-040) +4. `apps/cms/src/api/scene-embedding/services/indexer.ts` — embedding indexer (feat-041) +5. 
`apps/manager/railway.toml` — Railway service configuration + +## Grep These + +- `restartPolicyType` in `apps/manager/` — Railway restart configuration +- `enrichment-job` in `apps/cms/src/api/` — job tracking pattern + +## What To Build + +Dedicated entry point (separate Railway service or manager CLI command) that: + +1. **Fetches English video queue** — all Videos with English variants, ordered by label (feature films first for early quality signal) +2. **Tracks progress** — store processed video IDs to resume on restart. Use enrichment job pattern or simple DB table. +3. **Per-video pipeline**: scene boundaries → scene descriptions → embed descriptions → index in pgvector +4. **Cost controls**: + - Configurable batch size (default: 100 videos per run) + - Rate limiting (requests per minute to LLM provider) + - Cumulative cost tracking (log tokens used, compute running total) + - Auto-pause at configurable cost threshold (default: $500) +5. **Dry-run mode** — process N videos through boundary extraction only, estimate total LLM cost without making calls +6. 
**Logging** — structured JSON logs: video ID, label, scene count, tokens used, cost, duration per video + +## Constraints + +- Must be resumable — crashing mid-batch loses no completed work +- Must not block the manager pipeline for new uploads +- Railway worker constraints: design as queue-based with configurable batch sizes rather than assuming infinite runtime +- English only: filter by language throughout + +## Verification + +- Dry-run mode reports accurate cost estimate for full English catalog +- Process 100 English videos end-to-end → embeddings appear in `scene_embeddings` +- Kill worker mid-batch, restart → picks up where it left off +- Cost tracking matches actual API billing within 10% +- `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` grows as expected diff --git a/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md b/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md new file mode 100644 index 00000000..7643ffe6 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md @@ -0,0 +1,63 @@ +--- +id: "feat-043" +title: "Video Vectorization — Visual Shot Detection Fusion" +owner: "nisal" +priority: "P2" +status: "not-started" +start_date: "2026-05-28" +duration: 10 +depends_on: + - "feat-039" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +Transcript-based chapter boundaries (feat-039) work well for short clips but may miss visual scene transitions in feature films where a narrative scene contains many camera cuts. Combining visual shot detection with transcript analysis produces more accurate scene boundaries for longer content. + +## Entry Points — Read These First + +1. `apps/manager/src/services/sceneBoundaries.ts` — existing chapter-based boundaries (feat-039) +2. 
`apps/manager/src/services/sceneDescription.ts` — consumer of scene boundaries (feat-040)
+
+## Grep These
+
+- `SceneBoundary` in `apps/manager/src/` — type to extend
+- `chapters` in `apps/manager/src/services/` — existing segmentation
+
+## What To Build
+
+1. **Research phase** — evaluate scene detection approaches:
+   - PySceneDetect (Python, may need microservice or WASM)
+   - Mux frame sampling + LLM-based scene change detection
+   - FFmpeg scene detection filter (`-vf "select=gt(scene\,0.3)"`)
+
+2. **Visual boundary detector**:
+
+   ```typescript
+   export async function detectVisualBoundaries(
+     playbackId: string,
+     duration: number,
+   ): Promise<number[]> // timestamps of visual scene changes
+   ```
+
+3. **Fusion logic** — merge visual boundaries with chapter-based boundaries:
+   - If visual and chapter boundaries align (within N seconds), keep chapter boundary
+   - If visual boundary exists between chapter boundaries, consider splitting
+   - Use LLM to decide: "given this transcript segment, does a scene change at timestamp T make narrative sense?"
+
+4. 
**Update `extractSceneBoundaries`** to optionally use fusion for feature-length videos + +## Constraints + +- This is P2 — only needed if chapter-based boundaries prove insufficient for feature films +- Do not break existing chapter-based flow; fusion is an optional enhancement +- May require Python tooling (PySceneDetect) — evaluate Node.js alternatives first + +## Verification + +- Compare scene boundaries with and without fusion for 10 feature films +- Fusion boundaries align better with narrative scene changes (manual review) +- No regression for short clips (still use chapter-based only) diff --git a/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md new file mode 100644 index 00000000..2882f252 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md @@ -0,0 +1,87 @@ +--- +id: "feat-044" +title: "Video Vectorization — Recommendation Query API" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-05-28" +duration: 7 +depends_on: + - "feat-041" + - "feat-042" +blocks: + - "feat-046" +tags: + - "cms" + - "pgvector" + - "graphql" +--- + +## Problem + +With scene embeddings indexed, we need a queryable API that returns similar scenes from different videos. This is the core recommendation capability that the demo frontend (feat-046) and future recommendation UI will consume. + +## Entry Points — Read These First + +1. `apps/cms/src/api/scene-embedding/services/indexer.ts` — scene embedding storage (feat-041) +2. `apps/cms/src/api/core-sync/services/` — raw SQL patterns in Strapi services +3. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — recommendation query in Storage Schema section
+
+## Grep These
+
+- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL execution
+- `scene_embeddings` in `apps/cms/src/` — table references
+- `register` in `apps/cms/src/api/` — custom route/controller registration pattern
+
+## What To Build
+
+1. **Recommendation service**: `apps/cms/src/api/scene-embedding/services/recommender.ts`
+
+   ```typescript
+   type SceneRecommendation = {
+     videoId: number
+     sceneIndex: number
+     description: string
+     startSeconds: number
+     endSeconds: number | null
+     similarity: number // 0-1
+   }
+
+   export async function getRecommendations(
+     videoId: number,
+     sceneIndex?: number, // specific scene, or aggregate across all scenes
+     limit?: number, // default 10
+   ): Promise<SceneRecommendation[]>
+   ```
+
+2. **Query logic**:
+
+   ```sql
+   -- For a specific scene
+   SELECT se.video_id, se.scene_index, se.description, se.start_seconds, se.end_seconds,
+          1 - (se.embedding <=> $1) AS similarity
+   FROM scene_embeddings se
+   WHERE se.video_id != $2
+     AND se.language = 'en'
+   ORDER BY se.embedding <=> $1
+   LIMIT $3;
+   ```
+
+   For whole-video recommendations: average similarity across all scenes of the input video, or take top scene match per candidate video.
+
+3. **Custom API route**: `GET /api/scene-embeddings/recommendations?videoId=X&sceneIndex=Y&limit=10`
+
+4. 
**GraphQL integration** (if applicable): expose as custom query resolver + +## Constraints + +- Filter `video_id != input` to never recommend the same video +- English only for Phase 1 (`language = 'en'`) +- Response must include enough metadata (videoId, timestamps, description) for the frontend to render + +## Verification + +- Query with a known video → returns different videos with >0.5 similarity +- Never returns the input video in results +- Response time <500ms for top-10 query +- Results are plausibly similar (manual spot-check) diff --git a/docs/roadmap/content-discovery/feat-045-pipeline-integration.md b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md new file mode 100644 index 00000000..0784fe66 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md @@ -0,0 +1,64 @@ +--- +id: "feat-045" +title: "Video Vectorization — Pipeline Integration" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-06-04" +duration: 7 +depends_on: + - "feat-041" + - "feat-042" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +After backfill, new English video uploads need to be automatically scene-vectorized as part of the enrichment workflow. Unlike existing parallel steps that consume transcript text, scene vectorization needs video frame access — it's an independent branch. + +## Entry Points — Read These First + +1. `apps/manager/src/workflows/videoEnrichment.ts` — existing enrichment workflow with parallel steps +2. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary extraction +3. `apps/manager/src/services/sceneDescription.ts` — scene description generation +4. 
`apps/cms/src/api/scene-embedding/services/indexer.ts` — embedding indexer + +## Grep These + +- `"use step"` in `apps/manager/src/workflows/` — workflow step pattern +- `transcribe` in `apps/manager/src/workflows/` — step dependency pattern +- `muxAssetId` in `apps/manager/src/workflows/` — where asset IDs are available + +## What To Build + +Add scene vectorization as a new branch in `videoEnrichment.ts`: + +``` +transcribe +├── [existing parallel] translate, chapters, metadata, embeddings +└── [new branch] sceneVectorize + ├── extractSceneBoundaries (needs transcript + chapters output) + ├── describeScenes (needs playbackId for frames + boundaries) + ├── embedDescriptions (needs descriptions) + └── indexSceneEmbeddings (needs embeddings + video metadata) +``` + +- Runs after both transcription AND chapters complete (needs both) +- Uses `muxAssetId` / `playbackId` from job context for frame extraction +- English-only gate: skip for non-English primary language videos +- Updates enrichment job status with `sceneVectorization` step tracking + +## Constraints + +- Do not block existing parallel steps — scene vectorization runs independently +- Failure in scene vectorization should not fail the overall enrichment job +- English-only check: skip step if video's primary language is not English + +## Verification + +- Upload a new English video → enrichment completes → scene embeddings appear in `scene_embeddings` +- Upload a non-English video → scene vectorization step is skipped +- Scene vectorization failure does not block transcript/translation/chapters from completing +- Enrichment job status shows sceneVectorization step status diff --git a/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md new file mode 100644 index 00000000..69aa82f5 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md @@ -0,0 +1,92 @@ +--- +id: 
"feat-046" +title: "Video Vectorization — Recommendations Demo Experience" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-06-04" +duration: 7 +depends_on: + - "feat-044" +blocks: [] +tags: + - "web" + - "cms" + - "graphql" +--- + +## Problem + +We need a demo frontend to prove the recommendation engine works and to present results for Phase 2 funding decisions. This renders as an Experience on the existing `[slug]/[locale]` route, showing a video with its scene-similar recommendations from other films. + +## Entry Points — Read These First + +1. `apps/web/src/app/[slug]/[locale]/page.tsx` — experience page route (slug + locale) +2. `apps/web/src/app/[slug]/page.tsx` — experience page route (slug only) +3. `apps/web/src/components/sections/index.tsx` — `SectionRenderer` maps block `__typename` to components +4. `apps/web/src/lib/content.ts` — `getWatchExperience()` fetches experience data via GraphQL +5. `apps/cms/src/api/scene-embedding/services/recommender.ts` — recommendation query API (feat-044) + +## Grep These + +- `SectionRenderer` in `apps/web/src/components/` — block type mapping +- `__typename` in `apps/web/src/components/sections/` — how block types are resolved +- `getWatchExperience` in `apps/web/src/lib/` — experience data fetching +- `ExperienceSectionRenderer` in `apps/web/src/` — section rendering pipeline + +## What To Build + +### 1. CMS: Recommendations Block Type + +Add a new block type to the Experience content type in Strapi: + +- **Block name**: `ComponentBlocksVideoRecommendations` +- **Fields**: + - `sourceVideo` — relation to Video (the video to get recommendations for) + - `title` — text (e.g., "Scenes like this") + - `limit` — integer (default 10, max recommendations to show) + +### 2. GraphQL: Expose Recommendations + +Extend the Experience GraphQL query to include the new block type. The block fetches recommendations at render time via the recommendation API (feat-044). + +### 3. 
Web: Recommendations Section Component + +New component: `apps/web/src/components/sections/VideoRecommendations.tsx` + +```typescript +// Renders a grid/carousel of recommended scenes from other videos +// Each card shows: +// - Mux thumbnail at the scene's start timestamp +// - Scene description (truncated) +// - Source video title +// - Similarity score (optional, for demo purposes) +// - Click → navigates to that video at the scene timestamp +``` + +### 4. Register in SectionRenderer + +Add `ComponentBlocksVideoRecommendations` → `VideoRecommendations` mapping in `SectionRenderer`. + +### 5. Create Demo Experience in CMS + +Create an Experience with slug (e.g., `recommendations-demo`) containing: + +- A VideoHero block with a source video +- A VideoRecommendations block for that video +- Accessible at `/recommendations-demo/en` + +## Constraints + +- Use existing Experience / SectionRenderer pattern — do not create custom routes +- Thumbnails via Mux: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={START_SECONDS}` +- Demo purpose: optimize for clarity and showcasing results, not production polish +- Server Component by default (Next.js App Router convention) + +## Verification + +- Navigate to `/recommendations-demo/en` → see source video + grid of recommended scenes +- Recommendations are from different videos (not the same film) +- Each recommendation card shows thumbnail, description, and source video title +- Clicking a recommendation navigates to the video (or plays from scene timestamp) +- Page loads in <3s with recommendations visible
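
The card rendering in step 3 reduces to a pure mapping from a recommendation row to display props. A minimal sketch, assuming the `SceneRecommendation` shape from feat-044 and hypothetical names elsewhere (`toRecommendationCard`, the `?t=` deep-link parameter, the caller-supplied `playbackId`/`slug`/`title` lookup):

```typescript
// Hypothetical helper for the VideoRecommendations component.
// playbackId, slug, and title are assumed to be joined in by the caller —
// they are not part of the raw scene_embeddings query result.

type SceneRecommendation = {
  videoId: number
  sceneIndex: number
  description: string
  startSeconds: number
  endSeconds: number | null
  similarity: number // 0-1
}

type RecommendationCard = {
  thumbnailUrl: string
  title: string
  descriptionPreview: string
  href: string
  similarityPct: number
}

export function toRecommendationCard(
  rec: SceneRecommendation,
  meta: { playbackId: string; slug: string; title: string },
  locale = "en",
): RecommendationCard {
  const time = Math.floor(rec.startSeconds)
  return {
    // Mux thumbnail at the scene's start timestamp (per the constraints above)
    thumbnailUrl: `https://image.mux.com/${meta.playbackId}/thumbnail.jpg?time=${time}`,
    title: meta.title,
    // Truncate long descriptions for the card
    descriptionPreview:
      rec.description.length > 140
        ? `${rec.description.slice(0, 140)}…`
        : rec.description,
    // Deep-link to the experience route at the scene timestamp
    // (the `t` query parameter is an assumed convention, not an existing route)
    href: `/${meta.slug}/${locale}?t=${time}`,
    similarityPct: Math.round(rec.similarity * 100),
  }
}
```

Keeping this mapping pure makes the component trivially testable and leaves data fetching to the server component.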