Conversation
Two-host conversational audio summaries from articles, videos, or text. NotebookLM-style short-form dialogue (default ~1 min, adjustable 30s to 10min+). Builds on: supadata, yt-dlp, elevenlabs.
🎙️ Skill Review:
| Check | Result |
|---|---|
| Raw foundational API endpoints? | ✅ PASS — No raw curl commands or foundational API URLs. Skill properly references foundational skills by name. |
| Dependencies declared? | ✅ PASS — Lists supadata, yt-dlp, elevenlabs as dependencies |
| Dependencies available? | ✅ PASS — All three foundational skills installed and env vars present (SUPADATA_API_KEY, ELEVENLABS_API_KEY) |
| Metadata well-formed? | ✅ PASS — name, description, emoji, os all present |
| Description quality? | ✅ PASS — 8 descriptive trigger phrases covering the key use cases |
Phase 1: Discovery Testing (2 sessions)
D1: YouTube Video → Listenable Recap
Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."
| Check | Result |
|---|---|
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | ✅ Yes — Read audio-summary/SKILL.md as its first action |
| Used specialized workflow? | ✅ Followed the 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for transcript + metadata, elevenlabs wrapper for dialogue |
| Result quality? | ✅ Generated valid 1:44 MP3. Correctly handled the edge case of a 19-second source video by focusing on historical significance rather than padding thin content. |
| Voice selection? | ✅ Charlie + Lily (matches "Energetic / fun" pairing from the skill's recommendations) |
Score: ✅ Found and used the specialized skill's workflow naturally
The agent discovered the audio-summary skill purely from the task description ("listenable recap... two people casually discussing"). It followed the full 6-step specialized workflow, not just ad-hoc foundational calls.
D2: Article → Conversational Audio Summary
Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."
| Check | Result |
|---|---|
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | ✅ Yes — Read audio-summary/SKILL.md first |
| Used specialized workflow? | ✅ Full 6-step workflow followed |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for web scrape, elevenlabs dialogue command |
| Result quality? | ✅ Generated valid 1:46 MP3. Good content coverage of the Paul Graham essay. |
| Voice selection? | ✅ George + Lily (matches "Professional / news" pairing) |
Score: ✅ Found and used the specialized skill's workflow naturally
Excellent discoverability. The description's trigger phrases ("turning an article or blog post into a listenable recap", "creating a conversational audio digest") matched the natural language of the prompt.
Phase 2: Explicit Testing (3 sessions)
E1: YouTube Video — Primary Use Case
Prompt: "You have a skill called 'audio-summary'... Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."
| Check | Result |
|---|---|
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then supadata + elevenlabs |
| Followed workflow steps? | ✅ All 6 steps executed in order |
| Source extraction worked? | ✅ Got transcript via supadata (text=true), metadata via supadata /youtube/video (noted yt-dlp was blocked by bot detection) |
| Target length correct? | ✅ 1 minute → ~150 words. Output was ~73 seconds |
| Dialogue quality? | ✅ Natural two-host script, hook → core → wrap structure |
| Voice selection? | ✅ Will + Lily (matches "Energetic / fun" pairing) |
| Audio generated? | ✅ Valid MP3, 128kbps, 44.1kHz |
| Delivery format? | ✅ File path + duration + coverage summary |
Score: ✅ Skill works correctly — full workflow executed
Note: The agent mentioned yt-dlp was "blocked by bot detection" for metadata and fell back to supadata's /youtube/video endpoint. The skill's Step 1 says to use yt-dlp for metadata, but the supadata fallback worked fine. This is a minor robustness point — the skill could mention this fallback path explicitly.
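The fallback the agents improvised could be made explicit in the skill. A minimal sketch of the pattern, with both extractors injected as callables — the function name and call shape here are hypothetical, not part of either skill:

```python
def get_video_metadata(video_id, primary, fallback):
    """Try the primary extractor (e.g. yt-dlp) first; if it raises
    (e.g. blocked by bot detection), fall back to the secondary
    (e.g. supadata's /youtube/video). Both arguments are callables
    that take a video ID and return a metadata dict."""
    try:
        return primary(video_id)
    except Exception:
        return fallback(video_id)
```

Documenting this shape in Step 1 would make the behavior the agents already exhibit part of the contract rather than an improvisation.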
E2: Article — Full Workflow with Voice Selection
Prompt: "You have a skill called 'audio-summary'... use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."
| Check | Result |
|---|---|
| Read the SKILL.md? | ✅ Read all three skills (audio-summary, supadata, elevenlabs) |
| Followed workflow steps? | ✅ All 6 steps in order |
| Source extraction worked? | ✅ Supadata web scrape returned full ~11,000-word article |
| Target length correct? | ✅ 2 min → ~300 words. Wrote ~310-word script. Output ~2:27 |
| Dialogue quality? | ✅ Excellent — natural conversation, good distillation of a very long essay to 5 key points |
| Voice selection? | ✅ George + Lily (Professional/news — good match for an essay) |
| Audio generated? | ✅ Valid MP3, 2.3MB, 128kbps 44.1kHz |
| Saved dialogue script? | ✅ Saved to summary-how-to-do-great-work-dialogue.txt (nice touch) |
Score: ✅ Skill works correctly — handled long content well
E3: Raw Text — Edge Case (Short Input)
Prompt: "You have a skill called 'audio-summary'... create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."
| Check | Result |
|---|---|
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then elevenlabs |
| Followed workflow steps? | ✅ Correctly identified "raw text" path in Step 1 (no extraction needed) |
| Target length correct? | ✅ 30-45 sec → ~75-110 words. Output was ~46 seconds |
| Dialogue quality? | ✅ Clean hook → core → wrap structure in ~100 words |
| Voice selection? | ✅ George + Lily |
| Audio generated? | ✅ Valid MP3, 720KB, 46 seconds |
| Skipped unnecessary foundational skills? | ✅ Didn't read supadata/yt-dlp (not needed for raw text) |
Score: ✅ Skill works correctly — clean edge case handling
Summary
| Session | Type | Skill Found? | Workflow Followed? | Output Valid? | Score |
|---|---|---|---|---|---|
| D1 | Discovery (YouTube) | ✅ | ✅ Full 6-step | ✅ 1:44 MP3 | ✅ |
| D2 | Discovery (Article) | ✅ | ✅ Full 6-step | ✅ 1:46 MP3 | ✅ |
| E1 | Explicit (YouTube) | ✅ | ✅ Full 6-step | ✅ 1:13 MP3 | ✅ |
| E2 | Explicit (Article) | ✅ | ✅ Full 6-step | ✅ 2:27 MP3 | ✅ |
| E3 | Explicit (Raw text) | ✅ | ✅ Full 6-step | ✅ 0:46 MP3 | ✅ |
5/5 sessions passed. Zero failures. Zero bugs.
Specialized Skill Value Assessment
The audio-summary skill clearly adds value beyond just using the foundational skills individually:
- Discovery agents naturally found and used it — the description is well-written and covers realistic use cases
- The 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) provides structure that an agent wouldn't organically produce on its own
- The dialogue writing rules (hook → core → wrap, vary turn length, no filler, attribution) produce notably better scripts than ad-hoc generation
- The voice pairing recommendations give agents good defaults based on content tone
- Edge case handling (short sources, long sources) is well-documented and followed
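For reference, the 6-step structure reads as a simple pipeline. This is a sketch of the shape only: the stage names mirror the skill's workflow, but the function signatures are hypothetical, and in the real skill each stage delegates to the foundational supadata/yt-dlp/elevenlabs skills:

```python
def run_audio_summary(source, target_seconds, steps):
    """Sketch of the Extract -> Length -> Script -> Voices -> Generate ->
    Deliver pipeline. `steps` maps each stage name to a callable."""
    content = steps["extract"](source)              # 1. transcript / scrape / raw text
    word_budget = steps["length"](target_seconds)   # 2. ~150 wpm heuristic
    script = steps["script"](content, word_budget)  # 3. hook -> core -> wrap dialogue
    voices = steps["voices"](content)               # 4. pairing by content tone
    audio = steps["generate"](script, voices)       # 5. elevenlabs dialogue call
    return steps["deliver"](audio)                  # 6. path + duration + coverage
```

Part of the skill's value is exactly that agents executed these stages in order in all five sessions rather than interleaving ad-hoc foundational calls.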
Observations (Not Bugs)
- yt-dlp bot detection: E1 noted yt-dlp was blocked for YouTube metadata. The skill's Step 1 says to use yt-dlp for metadata, but all agents fell back to supadata's `/youtube/video` endpoint, which worked fine. Consider documenting this fallback path explicitly in the skill.
- Duration accuracy: All outputs ran slightly over target duration (e.g., 46s for a 30-45s target, 2:27 for a 2:00 target). This is normal — spoken pauses between dialogue turns add time beyond word-count estimates. The skill's "150 wpm" heuristic could note that dialogue format adds ~15-20% overhead for speaker transitions.
- File naming collision: D1 and E1 both generated `summary-me-at-the-zoo.mp3` (same video), so E1's file overwrote D1's. This is expected behavior (same source, same default name) and not a skill bug, but worth noting.
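If the skill adopts the overhead note, the length heuristic might look like this — a sketch using only the report's own numbers (150 wpm base rate, ~15% transition overhead); the function name is hypothetical:

```python
WPM = 150                 # skill's spoken-word-rate heuristic
DIALOGUE_OVERHEAD = 0.15  # observed ~15-20% extra from speaker transitions

def target_words(target_seconds, overhead=DIALOGUE_OVERHEAD):
    """Word budget that lands near the target duration once transition
    pauses between the two hosts are budgeted in."""
    speaking_seconds = target_seconds / (1 + overhead)
    return round(speaking_seconds / 60 * WPM)
```

With no overhead this reduces to the skill's current rule (60s → 150 words); with 15% it budgets ~130 words for a 1-minute target, which would have brought E1's 73-second output closer to its 60-second goal.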
These are all minor polish items, not bugs. None warrant blocking the merge.
Verdict: ✅ tested-pass
Zero bugs found. The skill works correctly across all three source types (YouTube, article, raw text), at multiple target durations (30s, 1m, 2m), and with both discovery and explicit usage patterns. The specialized workflow adds clear value over ad-hoc foundational skill usage. Ready to merge.
📋 D1: YouTube video → listenable recap (discovery)
Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."
Tool calls:
- read → `skills/audio-summary/SKILL.md` ✅ Found specialized skill
- read → `skills/Supadata/SKILL.md` + `skills/elevenlabs/SKILL.md` (parallel)
- exec → supadata `/youtube/video?id=jNQXAC9IVRw` → got metadata (title: "Me at the zoo", 19s, 383M views)
- exec → supadata `/transcript?url=...&text=true` → got transcript ("All right, so here we are, in front of the elephants...")
- exec → `elevenlabs.sh voices` → HTTP 401 (proxy key limitation, expected)
- exec → grep VOICE_IDS from script → found static voice table
- exec → `elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3`
- exec → verify file: 1.6MB, 104s
Artifacts:
- 🔊 summary-me-at-the-zoo.mp3 (1.6MB, ~1:44, Charlie + Lily)
📋 D2: Article → conversational audio summary (discovery)
Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."
Tool calls:
- read → `skills/audio-summary/SKILL.md` ✅ Found specialized skill
- read → `skills/Supadata/SKILL.md` + `skills/elevenlabs/SKILL.md` (parallel)
- exec → supadata `/web/scrape?url=...paulgraham.com/writes.html` → got full essay
- exec → `elevenlabs.sh voices` → HTTP 401 (expected)
- exec → grep VOICE_IDS → found static table
- exec → `elevenlabs.sh dialogue --inputs '[...]' --out summary-writes-and-write-nots.mp3`
- exec → verify file: 1.7MB, 106s
Artifacts:
- 🔊 summary-writes-and-write-nots.mp3 (1.7MB, ~1:46, George + Lily)
📋 E1: YouTube video — primary use case (explicit)
Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."
Tool calls:
- read → `skills/audio-summary/SKILL.md`
- read → `skills/Supadata/SKILL.md` + `skills/yt-dlp/SKILL.md` + `skills/elevenlabs/SKILL.md`
- exec → supadata transcript (text=true) → got transcript
- exec → supadata `/youtube/video` → got metadata (noted yt-dlp blocked by bot detection)
- exec → `elevenlabs.sh voices` → HTTP 401 (expected)
- exec → grep VOICE_IDS → found table
- exec → `elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3`
- exec → verify: 1.2MB, ~73s
Artifacts:
- 🔊 summary-me-at-the-zoo.mp3 (1.2MB, ~1:13, Will + Lily)
(Note: same filename as D1 — this overwrote D1's file. Uploaded URL is D1's original.)
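If the overwrite ever mattered, a collision-safe default name would be a small fix. A hypothetical helper, not part of the skill:

```python
from pathlib import Path

def unique_path(path):
    """Return `path` unchanged if it is free; otherwise append -1, -2, ...
    before the extension until an unused name is found."""
    p = Path(path)
    candidate, n = p, 1
    while candidate.exists():
        candidate = p.with_name(f"{p.stem}-{n}{p.suffix}")
        n += 1
    return candidate
```

Applied here, E1 would have written `summary-me-at-the-zoo-1.mp3` instead of clobbering D1's output.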
📋 E2: Article — full workflow with voice selection (explicit)
Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."
Tool calls:
- read → `skills/audio-summary/SKILL.md`
- read → `skills/Supadata/SKILL.md` + `skills/elevenlabs/SKILL.md` (parallel)
- exec → supadata `/web/scrape?url=...paulgraham.com/greatwork.html` → got ~11,000-word essay
- exec → `ls elevenlabs/scripts/` → found `elevenlabs.sh`
- exec → `elevenlabs.sh voices` → HTTP 401 (expected)
- exec → grep VOICE_IDS → found static table
- write → `summary-how-to-do-great-work-dialogue.txt` (310-word dialogue script)
- exec → `elevenlabs.sh dialogue --inputs '[...]' --out summary-how-to-do-great-work.mp3`
- exec → verify: 2.3MB, ~147s
Artifacts:
- 🔊 summary-how-to-do-great-work.mp3 (2.3MB, ~2:27, George + Lily)
- 📄 dialogue script (2.2KB)
📋 E3: Raw text — edge case, short input (explicit)
Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."
Tool calls:
- read → `skills/audio-summary/SKILL.md`
- read → `skills/elevenlabs/SKILL.md` (correctly skipped supadata/yt-dlp — not needed for raw text)
- exec → `ls elevenlabs/scripts/` → found script
- exec → `elevenlabs.sh voices` → HTTP 401 (expected)
- exec → grep VOICE_IDS → found static table
- exec → `elevenlabs.sh dialogue --inputs '[...]' --out cern-xi17-summary.mp3`
- exec → verify via ffprobe: 720KB, 46.03s
Artifacts:
- 🔊 cern-xi17-summary.mp3 (720KB, ~46s, George + Lily)
audio-summary (Specialized)
Two-host conversational audio summaries from articles, videos, or raw text. NotebookLM-style short-form dialogue — default ~1 minute, adjustable from 30 seconds to 10+ minutes.
What it does
Takes any source content (YouTube video, article URL, or raw text) and produces a natural two-host audio discussion using ElevenLabs multi-voice dialogue. Host A explains key points, Host B reacts and asks questions.
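The alternating-host structure can be sketched as a tiny builder. The input shape here — a list of `{text, voice_id}` dicts — is an assumption about what `elevenlabs.sh dialogue --inputs` accepts, inferred from the session logs above, not a documented schema:

```python
def build_dialogue(turns, voice_a, voice_b):
    """Alternate Host A and Host B over an ordered list of dialogue
    lines, producing the assumed --inputs payload shape."""
    return [
        {"text": line, "voice_id": voice_a if i % 2 == 0 else voice_b}
        for i, line in enumerate(turns)
    ]
```

Serialized with `json.dumps`, the result could be passed directly as the `--inputs` argument, with the two voice IDs taken from the skill's static pairing table.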
Workflow
Dependencies
Builds on: supadata, yt-dlp, elevenlabs
Handles