
Add audio-summary specialized skill #13

Open

MZULALI wants to merge 2 commits into main from skill/audio-summary

Conversation

MZULALI (Contributor) commented Mar 1, 2026

audio-summary (Specialized)

Two-host conversational audio summaries from articles, videos, or raw text. NotebookLM-style short-form dialogue — default ~1 minute, adjustable from 30 seconds to 10+ minutes.

What it does

Takes any source content (YouTube video, article URL, or raw text) and produces a natural two-host audio discussion using ElevenLabs multi-voice dialogue. Host A explains key points, Host B reacts and asks questions.

Workflow

  1. Extract source — supadata for transcripts/web scraping, yt-dlp for video metadata
  2. Determine length — maps duration to word count (~150 wpm)
  3. Write dialogue script — structured two-host conversation (hook → core → wrap)
  4. Select voices — contrasting ElevenLabs voices matched to content tone
  5. Generate audio — elevenlabs dialogue endpoint produces single multi-voice file
  6. Deliver — file path + duration estimate + content summary
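Step 2's mapping from requested duration to script length can be sketched as follows. This is a minimal illustration of the ~150 wpm heuristic the skill describes; the function name is hypothetical, not part of the skill itself.

```python
def target_word_count(duration_seconds: float, wpm: float = 150.0) -> int:
    """Map a requested audio duration to a dialogue word budget,
    assuming an average speaking rate of ~150 words per minute."""
    return round(duration_seconds / 60.0 * wpm)

# A 1-minute summary targets ~150 words; a 30-second one ~75.
print(target_word_count(60))   # 150
print(target_word_count(30))   # 75
```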

Dependencies

Builds on: supadata, yt-dlp, elevenlabs

Handles

  • YouTube videos, articles, raw text, multiple sources
  • Non-English content (via supadata translation + elevenlabs multilingual)
  • Edge cases: very long sources (distills to key points), very short sources (proportional output), listicles, background music overlay guidance

MZULALI added the needs-testing (Builder finished, ready for reviewer to test) and specialized (Builds on foundational skills) labels on Mar 1, 2026
MZULALI (Contributor, Author) commented Mar 1, 2026

🎙️ Skill Review: audio-summary (Specialized)

Reviewer: Choug (automated CI/CD)
PR: #13 (skill/audio-summary)
Tested commit: 5bbe1bf
Date: 2026-03-01 07:40 UTC


Pre-Test Checks

| Check | Result |
| --- | --- |
| Raw foundational API endpoints? | ✅ PASS — No raw curl commands or foundational API URLs. Skill properly references foundational skills by name. |
| Dependencies declared? | ✅ PASS — Lists supadata, yt-dlp, elevenlabs as dependencies |
| Dependencies available? | ✅ PASS — All three foundational skills installed and env vars present (SUPADATA_API_KEY, ELEVENLABS_API_KEY) |
| Metadata well-formed? | ✅ PASS — name, description, emoji, os all present |
| Description quality? | ✅ PASS — 8 descriptive trigger phrases covering the key use cases |

Phase 1: Discovery Testing (2 sessions)

D1: YouTube Video → Listenable Recap

Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."

| Check | Result |
| --- | --- |
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | Yes — Read audio-summary/SKILL.md as its first action |
| Used specialized workflow? | ✅ Followed the 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for transcript + metadata, elevenlabs wrapper for dialogue |
| Result quality? | ✅ Generated valid 1:44 MP3. Correctly handled the edge case of a 19-second source video by focusing on historical significance rather than padding thin content. |
| Voice selection? | ✅ Charlie + Lily (matches "Energetic / fun" pairing from the skill's recommendations) |

Score: ✅ Found and used the specialized skill's workflow naturally

The agent discovered the audio-summary skill purely from the task description ("listenable recap... two people casually discussing"). It followed the full 6-step specialized workflow, not just ad-hoc foundational calls.

D2: Article → Conversational Audio Summary

Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."

| Check | Result |
| --- | --- |
| Prompt clean? (no skill/API/brand names) | ✅ Yes |
| Found specialized skill? | Yes — Read audio-summary/SKILL.md first |
| Used specialized workflow? | ✅ Full 6-step workflow followed |
| Used foundational skills correctly? | ✅ Read supadata + elevenlabs SKILLs, used supadata for web scrape, elevenlabs dialogue command |
| Result quality? | ✅ Generated valid 1:46 MP3. Good content coverage of the Paul Graham essay. |
| Voice selection? | ✅ George + Lily (matches "Professional / news" pairing) |

Score: ✅ Found and used the specialized skill's workflow naturally

Excellent discoverability. The description's trigger phrases ("turning an article or blog post into a listenable recap", "creating a conversational audio digest") matched the natural language of the prompt.


Phase 2: Explicit Testing (3 sessions)

E1: YouTube Video — Primary Use Case

Prompt: "You have a skill called 'audio-summary'... Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then supadata + elevenlabs |
| Followed workflow steps? | ✅ All 6 steps executed in order |
| Source extraction worked? | ✅ Got transcript via supadata (text=true), metadata via supadata /youtube/video (noted yt-dlp was blocked by bot detection) |
| Target length correct? | ✅ 1 minute → ~150 words. Output was ~73 seconds |
| Dialogue quality? | ✅ Natural two-host script, hook → core → wrap structure |
| Voice selection? | ✅ Will + Lily (matches "Energetic / fun" pairing) |
| Audio generated? | ✅ Valid MP3, 128kbps, 44.1kHz |
| Delivery format? | ✅ File path + duration + coverage summary |

Score: ✅ Skill works correctly — full workflow executed

Note: The agent mentioned yt-dlp was "blocked by bot detection" for metadata and fell back to supadata's /youtube/video endpoint. The skill's Step 1 says to use yt-dlp for metadata, but the supadata fallback worked fine. This is a minor robustness point — the skill could mention this fallback path explicitly.

E2: Article — Full Workflow with Voice Selection

Prompt: "You have a skill called 'audio-summary'... use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read all three skills (audio-summary, supadata, elevenlabs) |
| Followed workflow steps? | ✅ All 6 steps in order |
| Source extraction worked? | ✅ Supadata web scrape returned full ~11,000-word article |
| Target length correct? | ✅ 2 min → ~300 words. Wrote ~310-word script. Output ~2:27 |
| Dialogue quality? | ✅ Excellent — natural conversation, good distillation of a very long essay to 5 key points |
| Voice selection? | ✅ George + Lily (Professional/news — good match for an essay) |
| Audio generated? | ✅ Valid MP3, 2.3MB, 128kbps 44.1kHz |
| Saved dialogue script? | ✅ Saved to summary-how-to-do-great-work-dialogue.txt (nice touch) |

Score: ✅ Skill works correctly — handled long content well

E3: Raw Text — Edge Case (Short Input)

Prompt: "You have a skill called 'audio-summary'... create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."

| Check | Result |
| --- | --- |
| Read the SKILL.md? | ✅ Read audio-summary SKILL.md, then elevenlabs |
| Followed workflow steps? | ✅ Correctly identified "raw text" path in Step 1 (no extraction needed) |
| Target length correct? | ✅ 30-45 sec → ~75-110 words. Output was ~46 seconds |
| Dialogue quality? | ✅ Clean hook → core → wrap structure in ~100 words |
| Voice selection? | ✅ George + Lily |
| Audio generated? | ✅ Valid MP3, 720KB, 46 seconds |
| Skipped unnecessary foundational skills? | ✅ Didn't read supadata/yt-dlp (not needed for raw text) |

Score: ✅ Skill works correctly — clean edge case handling


Summary

| Session | Type | Skill Found? | Workflow Followed? | Output Valid? |
| --- | --- | --- | --- | --- |
| D1 | Discovery (YouTube) | ✅ | Full 6-step | ✅ 1:44 MP3 |
| D2 | Discovery (Article) | ✅ | Full 6-step | ✅ 1:46 MP3 |
| E1 | Explicit (YouTube) | ✅ | Full 6-step | ✅ 1:13 MP3 |
| E2 | Explicit (Article) | ✅ | Full 6-step | ✅ 2:27 MP3 |
| E3 | Explicit (Raw text) | ✅ | Full 6-step | ✅ 0:46 MP3 |

5/5 sessions passed. Zero failures. Zero bugs.

Specialized Skill Value Assessment

The audio-summary skill clearly adds value beyond just using the foundational skills individually:

  • Discovery agents naturally found and used it — the description is well-written and covers realistic use cases
  • The 6-step workflow (Extract → Length → Script → Voices → Generate → Deliver) provides structure that an agent wouldn't organically produce on its own
  • The dialogue writing rules (hook → core → wrap, vary turn length, no filler, attribution) produce notably better scripts than ad-hoc generation
  • The voice pairing recommendations give agents good defaults based on content tone
  • Edge case handling (short sources, long sources) is well-documented and followed
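The voice-pairing defaults mentioned above could be modeled as a simple tone-to-pair lookup. This is a sketch using the pairings observed during the test sessions; the dict name, structure, and fallback behavior are assumptions, not the skill's actual implementation.

```python
# Hypothetical tone -> (host_a, host_b) defaults, based on the pairings
# seen in the review sessions; the real skill keeps its own voice table.
VOICE_PAIRINGS = {
    "energetic": ("Charlie", "Lily"),    # fun / casual content
    "professional": ("George", "Lily"),  # news / essays
}

def pick_voices(tone: str) -> tuple[str, str]:
    # Fall back to the professional pairing for unrecognized tones.
    return VOICE_PAIRINGS.get(tone, VOICE_PAIRINGS["professional"])

print(pick_voices("energetic"))  # ('Charlie', 'Lily')
```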

Observations (Not Bugs)

  1. yt-dlp bot detection: E1 noted yt-dlp was blocked for YouTube metadata. The skill's Step 1 says to use yt-dlp for metadata, but all agents fell back to supadata's /youtube/video endpoint which worked fine. Consider documenting this fallback path explicitly in the skill.

  2. Duration accuracy: All outputs ran slightly over target duration (e.g., 46s for 30-45s target, 2:27 for 2:00 target). This is normal — spoken pauses between dialogue turns add time beyond word count estimates. The skill's "150 wpm" heuristic could note that dialogue format adds ~15-20% overhead for speaker transitions.

  3. File naming collision: D1 and E1 both generated summary-me-at-the-zoo.mp3 (same video). E1's file overwrote D1's. This is expected behavior (same source, same default name) and not a skill bug, but worth noting.
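Observation 2's suggested refinement, adding roughly 15-20% overhead for speaker transitions on top of the 150 wpm base rate, would look something like this. The function name and the exact overhead factor are assumptions for illustration.

```python
def estimate_duration_seconds(word_count: int, wpm: float = 150.0,
                              dialogue_overhead: float = 0.18) -> float:
    """Estimate spoken duration of a two-host script: base time at
    ~150 wpm, plus ~15-20% for pauses at speaker transitions."""
    base_seconds = word_count / wpm * 60.0
    return base_seconds * (1.0 + dialogue_overhead)

# E2's ~310-word script: ~124s base, ~146s with overhead -- close to
# the observed 2:27 (147s).
print(round(estimate_duration_seconds(310)))
```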

These are all minor polish items, not bugs. None warrant blocking the merge.


Verdict: ✅ tested-pass

Zero bugs found. The skill works correctly across all three source types (YouTube, article, raw text), at multiple target durations (30s, 1m, 2m), and with both discovery and explicit usage patterns. The specialized workflow adds clear value over ad-hoc foundational skill usage. Ready to merge.


📋 D1: YouTube video → listenable recap (discovery)

Prompt: "Here's a YouTube video I found interesting: https://www.youtube.com/watch?v=jNQXAC9IVRw — can you turn it into a short listenable recap? Like two people casually discussing what it's about, around 1-2 minutes of audio."

Tool calls:

  1. read skills/audio-summary/SKILL.md ✅ Found specialized skill
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /youtube/video?id=jNQXAC9IVRw → got metadata (title: "Me at the zoo", 19s, 383M views)
  4. exec → supadata /transcript?url=...&text=true → got transcript ("All right, so here we are, in front of the elephants...")
  5. exec → elevenlabs.sh voices → HTTP 401 (proxy key limitation, expected)
  6. exec → grep VOICE_IDS from script → found static voice table
  7. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3
  8. exec → verify file: 1.6MB, 104s


📋 D2: Article → conversational audio summary (discovery)

Prompt: "I have this article I want to listen to instead of reading: https://paulgraham.com/writes.html — make it into a short conversational audio summary with two different voices, like a quick podcast recap."

Tool calls:

  1. read skills/audio-summary/SKILL.md ✅ Found specialized skill
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /web/scrape?url=...paulgraham.com/writes.html → got full essay
  4. exec → elevenlabs.sh voices → HTTP 401 (expected)
  5. exec → grep VOICE_IDS → found static table
  6. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-writes-and-write-nots.mp3
  7. exec → verify file: 1.7MB, 106s


📋 E1: YouTube video — primary use case (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to turn this YouTube video into a 1-minute two-host audio summary: https://www.youtube.com/watch?v=jNQXAC9IVRw — follow the workflow steps exactly."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/Supadata/SKILL.md + skills/yt-dlp/SKILL.md + skills/elevenlabs/SKILL.md
  3. exec → supadata transcript (text=true) → got transcript
  4. exec → supadata /youtube/video → got metadata (noted yt-dlp blocked by bot detection)
  5. exec → elevenlabs.sh voices → HTTP 401 (expected)
  6. exec → grep VOICE_IDS → found table
  7. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-me-at-the-zoo.mp3
  8. exec → verify: 1.2MB, ~73s


📋 E2: Article — full workflow with voice selection (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a 2-minute conversational audio recap of this article: https://paulgraham.com/greatwork.html — use the full workflow including voice selection and dialogue generation."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/Supadata/SKILL.md + skills/elevenlabs/SKILL.md (parallel)
  3. exec → supadata /web/scrape?url=...paulgraham.com/greatwork.html → got ~11,000-word essay
  4. exec → ls elevenlabs/scripts/ → found elevenlabs.sh
  5. exec → elevenlabs.sh voices → HTTP 401 (expected)
  6. exec → grep VOICE_IDS → found static table
  7. write → summary-how-to-do-great-work-dialogue.txt (310-word dialogue script)
  8. exec → elevenlabs.sh dialogue --inputs '[...]' --out summary-how-to-do-great-work.mp3
  9. exec → verify: 2.3MB, ~147s


📋 E3: Raw text — edge case, short input (explicit)

Prompt: "You have a skill called 'audio-summary' in your available skills. Read its SKILL.md first, then use it to create a short audio discussion from this raw text: 'Scientists at CERN announced today...' Keep it around 30-45 seconds."

Tool calls:

  1. read skills/audio-summary/SKILL.md
  2. read skills/elevenlabs/SKILL.md (correctly skipped supadata/yt-dlp — not needed for raw text)
  3. exec → ls elevenlabs/scripts/ → found script
  4. exec → elevenlabs.sh voices → HTTP 401 (expected)
  5. exec → grep VOICE_IDS → found static table
  6. exec → elevenlabs.sh dialogue --inputs '[...]' --out cern-xi17-summary.mp3
  7. exec → verify via ffprobe: 720KB, 46.03s


MZULALI added the tested-pass (Reviewer verified the skill works) label and removed the needs-testing (Builder finished, ready for reviewer to test) label on Mar 1, 2026
