
Add audio-summary specialized skill #13

Open

MZULALI wants to merge 2 commits into main from skill/audio-summary

Conversation

MZULALI (Contributor) commented Mar 1, 2026

audio-summary (Specialized)

Two-host conversational audio summaries from articles, videos, or raw text. NotebookLM-style short-form dialogue — default ~1 minute, adjustable from 30 seconds to 10+ minutes.

What it does

Takes any source content (YouTube video, article URL, or raw text) and produces a natural two-host audio discussion using ElevenLabs multi-voice dialogue. Host A explains key points, Host B reacts and asks questions.

Workflow

  1. Extract source — supadata for transcripts/web scraping, yt-dlp for video metadata
  2. Determine length — maps duration to word count (~150 wpm)
  3. Write dialogue script — structured two-host conversation (hook → core → wrap)
  4. Select voices — contrasting ElevenLabs voices matched to content tone
  5. Generate audio — elevenlabs dialogue endpoint produces single multi-voice file
  6. Deliver — file path + duration estimate + content summary
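Step 2's mapping from requested duration to script length can be sketched as follows. This is a minimal illustration of the ~150 wpm heuristic the skill describes; the function name is hypothetical, not part of the skill itself.

```python
def target_word_count(duration_seconds: float, wpm: float = 150.0) -> int:
    """Map a requested audio duration to a dialogue word budget,
    assuming an average speaking rate of ~150 words per minute."""
    return round(duration_seconds / 60.0 * wpm)

# A 1-minute summary targets ~150 words; a 30-second one ~75.
print(target_word_count(60))   # 150
print(target_word_count(30))   # 75
```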

Dependencies

Builds on: supadata, yt-dlp, elevenlabs

Handles

  • YouTube videos, articles, raw text, multiple sources
  • Non-English content (via supadata translation + elevenlabs multilingual)
  • Edge cases: very long sources (distills to key points), very short sources (proportional output), listicles, background music overlay guidance

MZULALI added the needs-testing (Builder finished, ready for reviewer to test) and specialized (Builds on foundational skills) labels on Mar 1, 2026
MZULALI (Contributor, Author) commented Mar 1, 2026

🎙️ Skill Review: audio-summary (Specialized)

Reviewer: Choug (automated CI/CD)
PR: #13 (skill/audio-summary)
Tested commit: 5bbe1bf
Date: 2026-03-01 07:40 UTC


Pre-Test Checks

| Check | Result |
| --- | --- |
| Raw foundational API endpoints? | ✅ PASS — No raw curl commands or foundational API URLs. Skill properly references foundational skills by name. |
| Dependencies declared? | ✅ PASS — Lists supadata, yt-dlp, elevenlabs as dependencies |
| Dependencies available? | ✅ PASS — All three foundational skills installed and env vars present (SUPADATA_API_KEY, ELEVENLABS_API_KEY) |
| Metadata well-formed? | ✅ PASS — name, description, emoji, os all present |
| Description quality? | ✅ PASS — 8 descriptive trigger phrases covering the key use cases |

Phase 1: Discovery Testing (2 sessions)

D1: YouTube Video → Listenable Recap

Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."

| Check | Result |
| --- | --- |
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | Yes — Read audio-summary/SKILL.md as its first action |
| Used specialized workflow? | ✅ Followed the 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for transcript + metadata, elevenlabs wrapper for dialogue |
| Result quality? | ✅ Generated valid 1:44 MP3. Correctly handled the edge case of a 19-second source video by focusing on historical significance rather than padding thin content. |
| Voice selection? | ✅ Charlie + Lily (matches "Energetic / fun" pairing from the skill's recommendations) |

Score: ✅ Found and used the specialized skill's workflow naturally

The agent discovered the audio-summary skill purely from the task description ("listenable recap... two people casually discussing"). It followed the full 6-step specialized workflow, not just ad-hoc foundational calls.

D2: Article → Conversational Audio Summary

Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."

| Check | Result |
| --- | --- |
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | Yes — Read audio-summary/SKILL.md first |
| Used specialized workflow? | ✅ Full 6-step workflow followed |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for web scrape, elevenlabs dialogue command |
| Result quality? | ✅ Generated valid 1:46 MP3. Good content coverage of the Paul Graham essay. |
| Voice selection? | ✅ George + Lily (matches "Professional / news" pairing) |

Score: ✅ Found and used the specialized skill's workflow naturally

Excellent discoverability. The description's trigger phrases ("turning an article or blog post into a listenable recap", "creating a conversational audio digest") matched the natural language of the prompt.


Phase 2: Explicit Testing (3 sessions)

E1: YouTube Video — Primary Use Case

Prompt: "You have a skill called 'audio-summary'... Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then supadata + elevenlabs |
| Followed workflow steps? | ✅ All 6 steps executed in order |
| Source extraction worked? | ✅ Got transcript via supadata (text=true), metadata via supadata /youtube/video (noted yt-dlp was blocked by bot detection) |
| Target length correct? | ✅ 1 minute → ~150 words. Output was ~73 seconds |
| Dialogue quality? | ✅ Natural two-host script, hook → core → wrap structure |
| Voice selection? | ✅ Will + Lily (matches "Energetic / fun" pairing) |
| Audio generated? | ✅ Valid MP3, 128kbps, 44.1kHz |
| Delivery format? | ✅ File path + duration + coverage summary |

Score: ✅ Skill works correctly — full workflow executed

Note: The agent mentioned yt-dlp was "blocked by bot detection" for metadata and fell back to supadata's /youtube/video endpoint. The skill's Step 1 says to use yt-dlp for metadata, but the supadata fallback worked fine. This is a minor robustness point — the skill could mention this fallback path explicitly.

E2: Article — Full Workflow with Voice Selection

Prompt: "You have a skill called 'audio-summary'... use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read all three skills (audio-summary, supadata, elevenlabs) |
| Followed workflow steps? | ✅ All 6 steps in order |
| Source extraction worked? | ✅ Supadata web scrape returned full ~11,000-word article |
| Target length correct? | ✅ 2 min → ~300 words. Wrote ~310-word script. Output ~2:27 |
| Dialogue quality? | ✅ Excellent — natural conversation, good distillation of a very long essay to 5 key points |
| Voice selection? | ✅ George + Lily (Professional/news — good match for an essay) |
| Audio generated? | ✅ Valid MP3, 2.3MB, 128kbps 44.1kHz |
| Saved dialogue script? | ✅ Saved to summary-how-to-do-great-work-dialogue.txt (nice touch) |

Score: ✅ Skill works correctly — handled long content well

E3: Raw Text — Edge Case (Short Input)

Prompt: "You have a skill called 'audio-summary'... create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then elevenlabs |
| Followed workflow steps? | ✅ Correctly identified "raw text" path in Step 1 (no extraction needed) |
| Target length correct? | ✅ 30-45 sec → ~75-110 words. Output was ~46 seconds |
| Dialogue quality? | ✅ Clean hook → core → wrap structure in ~100 words |
| Voice selection? | ✅ George + Lily |
| Audio generated? | ✅ Valid MP3, 720KB, 46 seconds |
| Skipped unnecessary foundational skills? | ✅ Didn't read supadata/yt-dlp (not needed for raw text) |

Score: ✅ Skill works correctly — clean edge case handling


Summary

| Session | Type | Skill Found? | Workflow Followed? | Output Valid? |
| --- | --- | --- | --- | --- |
| D1 | Discovery (YouTube) | ✅ | Full 6-step | ✅ 1:44 MP3 |
| D2 | Discovery (Article) | ✅ | Full 6-step | ✅ 1:46 MP3 |
| E1 | Explicit (YouTube) | ✅ | Full 6-step | ✅ 1:13 MP3 |
| E2 | Explicit (Article) | ✅ | Full 6-step | ✅ 2:27 MP3 |
| E3 | Explicit (Raw text) | ✅ | Full 6-step | ✅ 0:46 MP3 |

5/5 sessions passed. Zero failures. Zero bugs.

Specialized Skill Value Assessment

The audio-summary skill clearly adds value beyond just using the foundational skills individually:

  • Discovery agents naturally found and used it — the description is well-written and covers realistic use cases
  • The 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) provides structure that an agent wouldn't organically produce on its own
  • The dialogue writing rules (hook → core → wrap, vary turn length, no filler, attribution) produce notably better scripts than ad-hoc generation
  • The voice pairing recommendations give agents good defaults based on content tone
  • Edge case handling (short sources, long sources) is well-documented and followed
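The voice-pairing defaults mentioned above could be modeled as a simple tone-to-pair lookup. This is a sketch using the pairings observed during the test sessions; the dict name, structure, and fallback behavior are assumptions, not the skill's actual implementation.

```python
# Hypothetical tone -> (host_a, host_b) defaults, based on the pairings
# seen in the review sessions; the real skill keeps its own voice table.
VOICE_PAIRINGS = {
    "energetic": ("Charlie", "Lily"),    # fun / casual content
    "professional": ("George", "Lily"),  # news / essays
}

def pick_voices(tone: str) -> tuple[str, str]:
    # Fall back to the professional pairing for unrecognized tones.
    return VOICE_PAIRINGS.get(tone, VOICE_PAIRINGS["professional"])

print(pick_voices("energetic"))  # ('Charlie', 'Lily')
```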

Observations (Not Bugs)

  1. yt-dlp bot detection: E1 noted yt-dlp was blocked for YouTube metadata. The skill's Step 1 says to use yt-dlp for metadata, but all agents fell back to supadata's /youtube/video endpoint which worked fine. Consider documenting this fallback path explicitly in the skill.

  2. Duration accuracy: All outputs ran slightly over target duration (e.g., 46s for 30-45s target, 2:27 for 2:00 target). This is normal — spoken pauses between dialogue turns add time beyond word count estimates. The skill's "150 wpm" heuristic could note that dialogue format adds ~15-20% overhead for speaker transitions.

  3. File naming collision: D1 and E1 both generated summary-me-at-the-zoo.mp3 (same video). E1's file overwrote D1's. This is expected behavior (same source, same default name) and not a skill bug, but worth noting.
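Observation 2's suggested refinement, adding roughly 15-20% overhead for speaker transitions on top of the 150 wpm base rate, would look something like this. The function name and the exact overhead factor are assumptions for illustration.

```python
def estimate_duration_seconds(word_count: int, wpm: float = 150.0,
                              dialogue_overhead: float = 0.18) -> float:
    """Estimate spoken duration of a two-host script: base time at
    ~150 wpm, plus ~15-20% for pauses at speaker transitions."""
    base_seconds = word_count / wpm * 60.0
    return base_seconds * (1.0 + dialogue_overhead)

# E2's ~310-word script: ~124s base, ~146s with overhead -- close to
# the observed 2:27 (147s).
print(round(estimate_duration_seconds(310)))
```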

These are all minor polish items, not bugs. None warrant blocking the merge.


Verdict: ✅ tested-pass

Zero bugs found. The skill works correctly across all three source types (YouTube, article, raw text), at multiple target durations (30s, 1m, 2m), and with both discovery and explicit usage patterns. The specialized workflow adds clear value over ad-hoc foundational skill usage. Ready to merge.


📋 D1: YouTube video → listenable recap (discovery)

Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."

Tool calls:

  1. read skills/audio-summary/SKILL.md ✅ Found specialized skill
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /youtube/video?id=jNQXAC9IVRw → got metadata (title: "Me at the zoo", 19s, 383M views)
  4. exec → supadata /transcript?url=...&text=true → got transcript ("All right, so here we are, in front of the elephants...")
  5. exec → elevenlabs.sh voices → HTTP 401 (proxy key limitation, expected)
  6. exec → grep VOICE_IDS from script → found static voice table
  7. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3
  8. exec → verify file: 1.6MB, 104s


📋 D2: Article → conversational audio summary (discovery)

Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."

Tool calls:

  1. read skills/audio-summary/SKILL.md ✅ Found specialized skill
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /web/scrape?url=...paulgraham.com/writes.html → got full essay
  4. exec → elevenlabs.sh voices → HTTP 401 (expected)
  5. exec → grep VOICE_IDS → found static table
  6. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-writes-and-write-nots.mp3
  7. exec → verify file: 1.7MB, 106s


📋 E1: YouTube video — primary use case (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/Supadata/SKILL.md + skills/yt-dlp/SKILL.md + skills/elevenlabs/SKILL.md
  3. exec → supadata transcript (text=true) → got transcript
  4. exec → supadata /youtube/video → got metadata (noted yt-dlp blocked by bot detection)
  5. exec → elevenlabs.sh voices → HTTP 401 (expected)
  6. exec → grep VOICE_IDS → found table
  7. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3
  8. exec → verify: 1.2MB, ~73s


📋 E2: Article — full workflow with voice selection (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /web/scrape?url=...paulgraham.com/greatwork.html → got ~11,000-word essay
  4. exec → ls elevenlabs/scripts/ → found elevenlabs.sh
  5. exec → elevenlabs.sh voices → HTTP 401 (expected)
  6. exec → grep VOICE_IDS → found static table
  7. write → summary-how-to-do-great-work-dialogue.txt (310-word dialogue script)
  8. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-how-to-do-great-work.mp3
  9. exec → verify: 2.3MB, ~147s


📋 E3: Raw text — edge case, short input (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/elevenlabs/SKILL.md (correctly skipped supadata/yt-dlp — not needed for raw text)
  3. exec → ls elevenlabs/scripts/ → found script
  4. exec → elevenlabs.sh voices → HTTP 401 (expected)
  5. exec → grep VOICE_IDS → found static table
  6. exec → elevenlabs.sh dialogue --inputs '[...]' --out cern-xi17-summary.mp3
  7. exec → verify via ffprobe: 720KB, 46.03s


MZULALI added the tested-pass (Reviewer verified the skill works) label and removed the needs-testing (Builder finished, ready for reviewer to test) label on Mar 1, 2026
