
GSoC 2026 Candidate Submission: End-to-End Narrative Audio Pipeline #39

Open
meganho456 wants to merge 7 commits into humanai-foundation:master from meganho456:gsoc2026-narrative-audio

Conversation

@meganho456

This PR contains my GSoC 2026 test submission for a complete narrative-audio workflow, including all required tasks and a bonus storytelling analysis component.

What’s included
Task 1: Audio Processing Pipeline
- Loads .wav recordings, normalizes audio, segments clips when needed, and extracts ML-ready features.
- Features include MFCCs, pitch, spectral centroid, RMS energy, and duration.
- Produces a structured feature dataset and normalized audio outputs.
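To illustrate two of the features listed above, here is a minimal NumPy-only sketch of RMS energy and spectral centroid extraction. The function name, return fields, and the synthetic test clip are illustrative assumptions, not the PR's actual API (which presumably uses a library such as librosa):

```python
import numpy as np

def extract_features(signal: np.ndarray, sr: int) -> dict:
    """Compute a few of the listed features (RMS energy, spectral
    centroid, duration) with plain NumPy. Illustrative sketch only."""
    # RMS energy over the whole clip
    rms = float(np.sqrt(np.mean(signal ** 2)))
    # Magnitude spectrum via a real FFT
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency
    centroid = float(np.sum(freqs * mags) / np.sum(mags))
    return {"rms": rms, "centroid_hz": centroid,
            "duration_s": len(signal) / sr}

# Synthetic test clip: 1 s of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feats = extract_features(np.sin(2 * np.pi * 440 * t), sr)
```

For a pure 440 Hz tone the centroid lands at the tone's frequency and the RMS of a unit-amplitude sine is about 0.707, which makes this easy to sanity-check.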
Task 2: Narrative Tone Classification
- Trains a neural-network classifier on labeled emotional-tone data.
- Uses a train/test split and reports evaluation metrics (accuracy, weighted F1, per-class report).
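A compact stand-in for this training-and-evaluation loop, using scikit-learn's MLPClassifier on synthetic two-class data. The real submission trains on the Task 1 feature dataset; the architecture, labels, and data here are illustrative guesses:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for labeled emotional-tone features:
# two well-separated Gaussian clusters in 8 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(4, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)           # overall accuracy
f1 = f1_score(y_test, pred, average="weighted")  # weighted F1, as reported
```

On separable synthetic data both metrics approach 1.0; the interesting numbers are of course the ones on the real tone labels.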
Task 3: AI-Based Transcription
- Implements batch transcription with Whisper.
- Exports transcripts to plain text.
- Measures transcription quality on a subset using word error rate (WER).
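The WER metric used here is standard: the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A small self-contained sketch of the metric (not the PR's actual implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard Levenshtein DP over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)
```

One substituted word in a four-word reference gives a WER of 0.25, which is a useful spot-check when wiring this up against Whisper output.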
Task 4: Narrative Audio Retrieval
- Implements a retrieval prototype for narrative-style queries (e.g., calm narration, high-energy speech, dramatic dialogue).
- Combines structured filtering with semantic ranking to return relevant recordings.
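The two-stage scheme described above (hard metadata filter, then semantic ranking) can be sketched on a toy index. The clip IDs, metadata fields, embedding vectors, and function names are all illustrative assumptions:

```python
import numpy as np

# Toy index: per-clip metadata plus an embedding vector.
clips = [
    {"id": "c1", "energy": 0.2, "vec": np.array([1.0, 0.1, 0.0])},
    {"id": "c2", "energy": 0.8, "vec": np.array([0.0, 1.0, 0.2])},
    {"id": "c3", "energy": 0.3, "vec": np.array([0.9, 0.2, 0.1])},
]

def retrieve(query_vec, max_energy, top_k=2):
    # Stage 1: structured filtering on metadata
    pool = [c for c in clips if c["energy"] <= max_energy]
    # Stage 2: semantic ranking by cosine similarity to the query
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(pool, key=lambda c: cos(query_vec, c["vec"]),
                  reverse=True)[:top_k]

# A "calm narration" query: low-energy filter plus an embedding
# that sits near the calm clips c1 and c3.
results = retrieve(np.array([1.0, 0.0, 0.0]), max_energy=0.5)
```

The filter removes the high-energy clip before ranking ever sees it, which keeps the semantic stage cheap and the results consistent with the structured query.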
Bonus: Storytelling Audio Analysis
- Analyzes storytelling-oriented cues: pacing/pauses, pitch variation, energy dynamics, and sentence-length characteristics.
- Adds a heuristic storytelling score and ranks clips by storytelling-like expressiveness.
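A heuristic score like this is typically a weighted blend of normalized cues. The sketch below combines three of the cues named above; the weights and normalizing constants are illustrative guesses, not the PR's values:

```python
import statistics

def storytelling_score(pitches_hz, pause_ratio, energies):
    """Heuristic expressiveness score in [0, 1]: rewards pitch variation,
    a healthy amount of pausing, and energy dynamics. Weights and the
    saturation constants below are illustrative assumptions."""
    pitch_var = min(statistics.pstdev(pitches_hz) / 50.0, 1.0)  # ~50 Hz std saturates
    pause = min(pause_ratio / 0.3, 1.0)                         # ~30% pauses saturates
    dynamics = min((max(energies) - min(energies)) / 0.5, 1.0)  # energy range
    return 0.4 * pitch_var + 0.3 * pause + 0.3 * dynamics

# Monotone read vs. a lively, paused, dynamic delivery
flat = storytelling_score([120, 121, 120], 0.02, [0.30, 0.31, 0.30])
lively = storytelling_score([110, 180, 140], 0.20, [0.10, 0.55, 0.25])
```

Clipping each cue into [0, 1] before weighting keeps any single feature from dominating the ranking, and makes the score comparable across clips of different lengths.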
Deliverables in this submission
- Full source code for Tasks 1–4 and the bonus task, plus a run_pipeline script that chains all the tasks together
- Technical report (PDF)
- README with setup and run instructions
- Example output artifacts (feature CSVs, transcripts, analysis outputs)

meganho456 and others added 7 commits March 29, 2026 06:27
- New task0_audio_capture/audio_capture.py with RollingBuffer (thread-safe
  circular deque), AudioCaptureStream (sounddevice callback at 64 ms chunks),
  and record_for_duration() convenience helper
- run_pipeline.py now captures 5 s from the microphone at startup and feeds
  the recording into the downstream tasks; falls back to pre-recorded file
  if no mic is available
- requirements.txt: add sounddevice>=0.4.6
- README.md: document Step 1 architecture, parameters, standalone usage,
  library API, and PortAudio install instructions

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
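The commit above describes a thread-safe circular deque for rolling audio capture. A minimal sketch of how such a buffer can look — the class name matches the commit, but the method names and internals here are guesses, not the actual task0_audio_capture code:

```python
import threading
from collections import deque

class RollingBuffer:
    """Keeps only the most recent `capacity` audio chunks.
    Illustrative sketch of a thread-safe circular deque."""

    def __init__(self, capacity: int):
        self._chunks = deque(maxlen=capacity)  # oldest chunks drop off automatically
        self._lock = threading.Lock()          # sounddevice callbacks run on an audio thread

    def push(self, chunk):
        with self._lock:
            self._chunks.append(chunk)

    def snapshot(self):
        # Copy under the lock so readers never observe a half-updated buffer
        with self._lock:
            return list(self._chunks)

buf = RollingBuffer(capacity=3)
for i in range(5):
    buf.push(i)   # after 5 pushes only the last 3 chunks remain
```

Using `deque(maxlen=...)` gives the circular behavior for free; the lock is what makes it safe to call `push` from the capture callback while another thread reads snapshots.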
New modules:
- emotion_classifier/: production EmotionClassifier with mfcc-mlp,
  wav2vec2, and hubert backends; parallel inference via ThreadPoolExecutor
- transcriber/: StreamingTranscriber with faster-whisper and
  openai-whisper backends; pause-triggered utterance segmentation
- utterance_buffer/: pause-triggered and fixed-window segmentation strategies
- vad_engine/: webrtcvad and silero-vad backends for speech detection
- output_generator/: CaptionFormatter, SRTWriter, CaptionBroadcaster
  (WebSocket), AtmosphereMapper, CrossfadeScheduler
- output_generator/overlay.html: browser caption overlay for OBS/streaming
- tests/: full test suite with fixtures for all pipeline steps

Updated:
- README.md: expanded with real-time pipeline architecture and usage
- requirements.txt: added faster-whisper, websockets, transformers
- run_pipeline.py: wired into new module structure
- .gitignore: exclude .claude/, __pycache__, *.pt checkpoints, *_cache.npz

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
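The pause-triggered utterance segmentation mentioned for transcriber/ and utterance_buffer/ can be sketched as follows: given per-frame VAD decisions, an utterance ends once a run of silent frames reaches a threshold. This is an illustrative sketch, not the PR's module:

```python
def segment_utterances(speech_flags, min_pause_frames=3):
    """Split a per-frame VAD sequence (True = speech) into utterances,
    closing one whenever `min_pause_frames` silent frames accumulate.
    Returns (first_frame, last_frame) index pairs. Illustrative only."""
    utterances, current, silence = [], [], 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            current.append(i)
            silence = 0
        else:
            silence += 1
            if silence >= min_pause_frames and current:
                utterances.append((current[0], current[-1]))
                current = []
    if current:  # flush a trailing utterance at end of stream
        utterances.append((current[0], current[-1]))
    return utterances

# Two bursts of speech separated by a 3-frame pause
flags = [True, True, False, False, False, True, True, True]
segs = segment_utterances(flags)
```

The pause threshold is the key latency/quality knob: shorter pauses cut utterances sooner (lower caption latency), longer ones avoid splitting mid-sentence.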
