Product Requirements Document (PRD)

Voice Features for MessageAI - Busy Parent/Caregiver Edition

Version: 1.0
Date: October 25, 2025
Owner: [Your Name]
Status: Draft for Review

1. Executive Summary

Vision

Transform MessageAI into the first truly hands-free family messaging platform by integrating voice AI capabilities that allow busy parents to send, receive, and comprehend messages without ever looking at their phone.

Problem Statement

Busy parents and caregivers face constant communication demands while juggling multiple responsibilities (driving, cooking, childcare, work). Current messaging apps require:

Eyes on screen to read messages
Hands free to type responses
Undivided attention to parse long group threads

This creates dangerous situations (texting while driving), missed information (can't read during tasks), and mental overload (100+ unread messages).

Solution

Three-phase voice AI system:

Voice-to-Text: Speak messages, AI transcribes instantly
Text-to-Voice: Every message playable in sender's cloned voice
Conversation Audio: Listen to entire group chats or AI summaries

Success Metrics

60%+ of users enable voice features within first week
40%+ reduction in average message response time
80%+ user satisfaction with voice quality (>4/5 rating)
Demo video views spike (target: 10k+ views on social)
Rubric score: 100+ points (A+ with innovation bonuses)

2. User Stories

Persona: Sarah - Working Parent with 2 Kids in Activities

Background: 38-year-old marketing manager, two kids (ages 7 and 10) in soccer and dance. Juggles work meetings, school pickups, activity coordination, and household management. In 5 different group chats (soccer team parents, dance moms, school PTA, family, neighbors).

User Stories:

US-1: Hands-Free Messaging While Driving

AS Sarah driving to pick up kids from soccer practice
I WANT TO send a message to the team parents without typing
SO THAT I can notify them I'm running late without endangering my family

Acceptance Criteria:

Can record voice message with single tap-and-hold
Transcription appears within 2 seconds of finishing recording
Can review and edit transcript before sending
Can cancel recording by swiping/sliding
Audio quality sufficient for accurate transcription (95%+ accuracy)

US-2: Listening to Messages While Cooking

AS Sarah preparing dinner with messy hands
I WANT TO hear text messages read aloud
SO THAT I can stay updated on important family coordination without stopping what I'm doing

Acceptance Criteria:

Every text message has visible "play" button/icon
Tapping play speaks message in sender's voice (or default if not cloned)
Audio is clear and natural-sounding
Playback speed adjustable (1x, 1.25x, 1.5x, 2x)
Can pause/resume playback

US-3: Catching Up on Group Chat During Commute

AS Sarah with 47 unread messages in soccer parents group
I WANT TO listen to a summary of the conversation
SO THAT I can understand decisions made without reading 47 individual messages

Acceptance Criteria:

Can trigger summary generation with single tap
AI extracts key decisions, action items, and dates
Summary audio is 2-3 minutes max for 50+ message threads
Includes who's responsible for what
Option to listen to full conversation or just summary

US-4: Personal Voice for Authenticity

AS Sarah who wants her family to hear HER voice in messages
I WANT TO clone my voice once and have it used automatically
SO THAT my messages feel personal even when AI-generated

Acceptance Criteria:

Voice cloning setup takes <5 minutes
Recording sample is 30-60 seconds max
Cloned voice sounds recognizably like user (80%+ similarity)
Can preview voice before confirming
Can re-record if unhappy with result

US-5: Managing Multiple Group Conversations Efficiently

AS Sarah in 5 active group chats simultaneously
I WANT TO quickly understand which conversations need my attention
SO THAT I can prioritize responses without reading everything

Acceptance Criteria:

Can see which threads have new voice messages vs. text
Can play conversation summaries for each group
Visual indicators for urgent/priority messages
Can queue multiple summaries to listen back-to-back

Persona: Marcus - Single Dad with Shared Custody

Background: 42-year-old software engineer, shares custody of 8-year-old daughter Emma. Coordinates heavily with ex-wife, Emma's school, after-school program, and his own parents who help with childcare.

User Stories:

US-6: Quick Voice Responses During Work Meetings

AS Marcus in a work meeting when urgent message arrives
I WANT TO send quick voice response discreetly
SO THAT I can coordinate Emma's pickup without disrupting my meeting

Acceptance Criteria:

Can record voice in low volume/whisper
Transcription works with whispered speech
Can send text-only (no audio attachment) for discretion
<3 second turnaround from record to send

US-7: Bedtime Stories When Traveling

AS Marcus traveling for work and missing Emma's bedtime
I WANT TO send bedtime story read in my voice
SO THAT Emma can hear daddy's voice even when I'm away

Acceptance Criteria:

Can record longer voice messages (3-5 minutes)
High audio quality for emotional content
Ex-wife can easily play for Emma
Audio saved permanently (not auto-deleted)

US-8: Voice Accessibility for Emma (Child User)

AS Emma who is learning to read but not fluent yet
I WANT TO hear daddy's messages read aloud
SO THAT I can understand what he's saying without asking mom to read it

Acceptance Criteria:

Simple, large "play" button child can tap
Auto-plays in parent's voice by default
No complex controls that confuse children
Works offline (cached audio)

Persona: Aisha - Caregiver for Aging Parents

Background: 55-year-old nurse, caring for mother (82) with vision impairment and father (85) with mild cognitive decline. Coordinates with siblings, home health aides, and medical providers.

User Stories:

US-9: Accessibility for Vision-Impaired Parent

AS Aisha's mother who cannot read small text on phone
I WANT TO hear all messages read aloud automatically
SO THAT I can stay connected with family without straining my eyes

Acceptance Criteria:

Settings option for "auto-play all messages"
Messages play immediately upon arrival
High volume and slow speed options
Works with phone accessibility features (VoiceOver, TalkBack)

US-10: Medical Coordination with Multiple Caregivers

AS Aisha coordinating home care schedule with 3 siblings
I WANT TO listen to family group updates during my commute
SO THAT I stay informed without dedicating screen time

Acceptance Criteria:

Can subscribe to specific groups for auto-summaries
Daily digest option (morning or evening summary)
Highlights medical/urgent information
Distinguishes between different family member voices

3. Key Features Required for MVP

Phase 1: Voice-to-Text (Foundation) - MUST HAVE

Feature 1.1: Voice Message Recording

Priority: P0 (Critical)

Description: User can record voice message and receive instant transcription.

Functional Requirements:

Long-press microphone icon to record
Visual feedback (waveform, timer) during recording
Release to stop, swipe left to cancel
Maximum recording length: 2 minutes (safety limit)
Audio format: M4A or MP3
File size limit: 10MB
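
A minimal sketch of how the recording limits above might be enforced before upload. It is written in JavaScript, like the Cloud Functions in this document. The names `LIMITS` and `validateRecording` are hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```javascript
// Limits taken from the functional requirements: 2 minutes, 10MB, M4A or MP3.
const LIMITS = {
  maxSeconds: 120,
  maxBytes: 10 * 1024 * 1024, // assumption: "10MB" means 10 MiB
  formats: ["m4a", "mp3"],
};

// Returns { ok, errors } so the UI can show every problem at once.
function validateRecording({ durationSeconds, sizeBytes, format }) {
  const errors = [];
  if (durationSeconds > LIMITS.maxSeconds) errors.push("too_long");
  if (sizeBytes > LIMITS.maxBytes) errors.push("too_large");
  if (!LIMITS.formats.includes(format.toLowerCase())) errors.push("unsupported_format");
  return { ok: errors.length === 0, errors };
}

console.log(validateRecording({ durationSeconds: 45, sizeBytes: 400000, format: "m4a" }));
```

Returning a list of error codes rather than throwing lets the client decide whether to trim, re-encode, or ask the user to re-record.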

Technical Requirements:

Use device native audio recording APIs
Stream audio to backend for transcription
Display transcription within 2 seconds of completion
Handle network failures gracefully (queue for later transcription)
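
The "queue for later transcription" requirement could look roughly like the sketch below. It is JavaScript for brevity, not the Swift client code. `TranscriptionQueue`, `enqueue`, and `flush` are hypothetical names, and `transcribe` stands in for whatever async call reaches the backend.

```javascript
// Sketch: keep recordings that failed to transcribe and retry them later.
// `transcribe` is any async function that resolves to a transcript or throws.
class TranscriptionQueue {
  constructor(transcribe) {
    this.transcribe = transcribe;
    this.pending = []; // recordings waiting for connectivity
  }

  // Try immediately; on failure keep the recording and return null so the
  // caller can show a "will transcribe when back online" state.
  async enqueue(recording) {
    try {
      return await this.transcribe(recording);
    } catch (err) {
      this.pending.push(recording);
      return null;
    }
  }

  // Call when connectivity returns. Items that fail again stay queued, in order.
  async flush() {
    const results = [];
    const stillPending = [];
    for (const recording of this.pending) {
      try {
        results.push(await this.transcribe(recording));
      } catch (err) {
        stillPending.push(recording);
      }
    }
    this.pending = stillPending;
    return results;
  }
}
```

A production version would also persist `pending` to disk so queued recordings survive an app restart.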

UX Requirements:

Haptic feedback on press/release
Clear "Recording..." indicator
Waveform visualization
Timer display (00:00 format)
Cancel affordance clearly visible

Acceptance Criteria:

95%+ transcription accuracy for clear speech
<2 second latency for transcription
Works on iOS (Android is out of scope for the MVP)
Handles background noise reasonably
Graceful degradation if API fails

Feature 1.2: Transcription Display & Editing

Priority: P0 (Critical)

Description: Show transcription result with editing capability before sending.

Functional Requirements:

Display transcript in message input field
Allow user to edit text before sending
Show confidence indicator (Good/Fair/Poor)
Option to listen to original audio
Can discard and re-record

UX Requirements:

Transcript appears with typing cursor at end
"✓ Transcribed" or "⚠ Low confidence" indicator
Edit button with keyboard icon
Re-record button with microphone icon
Send button (only enabled after review)

Acceptance Criteria:

User can edit any word in transcript
Original audio preserved if user wants to include it
Can choose to send: text-only, audio-only, or both
Low-confidence transcriptions flagged for review
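
One way the Good/Fair/Poor indicator could be derived from the stored 0-1 confidence value is sketched below. The 0.9 and 0.7 cut-offs are illustrative assumptions, not values from this PRD, and the function names are hypothetical.

```javascript
// Map a 0-1 transcription confidence to the indicator shown next to the transcript.
// Thresholds are placeholders to be tuned against real recordings.
function confidenceLabel(confidence) {
  if (confidence >= 0.9) return "Good";
  if (confidence >= 0.7) return "Fair";
  return "Poor";
}

// "Poor" transcripts are the ones flagged for review before sending.
function needsReview(confidence) {
  return confidenceLabel(confidence) === "Poor";
}

console.log(confidenceLabel(0.95), confidenceLabel(0.8), needsReview(0.4));
```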

Feature 1.3: Message Storage (Text + Audio)

Priority: P0 (Critical)

Description: Store both audio file and transcript with message.

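Database Schema (SQL notation for illustration; the Firestore equivalents are listed under Firestore Schema Updates):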

ALTER TABLE messages ADD COLUMN audio_url TEXT;
ALTER TABLE messages ADD COLUMN transcript TEXT;
ALTER TABLE messages ADD COLUMN audio_duration INTEGER; -- seconds
ALTER TABLE messages ADD COLUMN transcription_confidence FLOAT; -- 0-1

Technical Requirements:

Upload audio to cloud storage (S3, Cloudflare R2, Firebase Storage)
Store URL in database
Enable offline access (cache audio locally)
Auto-delete audio after 30 days (configurable) to save storage

Acceptance Criteria:

Audio accessible via URL
Transcript searchable
Both linked to same message entity
Offline playback works

Phase 2: Text-to-Voice (Voice Cloning) - MUST HAVE

Feature 2.1: Voice Profile Setup

Priority: P0 (Critical)

Description: One-time voice cloning setup for each user.

Functional Requirements:

User records 30-60 second voice sample
Sample text provided (pangram or conversational script)
Submit to ElevenLabs for voice cloning
Store voice_id in user profile
Preview cloned voice before confirming

User Flow:

Settings → Voice Profile → "Set Up My Voice"
→ Read sample text (30-60s)
→ Submit for processing (15-30s wait)
→ "Listen to Preview"
→ Confirm or Re-record
→ Voice active

Sample Text (30-second script):

"Hi, this is [Name]. I'm setting up my voice for MessageAI. 
This voice will be used to read my messages aloud to my family. 
I can speak naturally, just like I'm talking to a friend. 
The quick brown fox jumps over the lazy dog. 
Thanks for setting this up with me!"

Technical Requirements:

ElevenLabs Instant Voice Cloning API integration (the add-voice endpoint)
Store voice_id in users table
Handle API failures (retry logic)
Cost: ~$0 per voice (included in ElevenLabs plan)
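
The retry logic mentioned above is not specified further. A common approach is exponential backoff, sketched here under that assumption. `withRetry` and its default values are hypothetical.

```javascript
// Retry an async call with exponential backoff.
// Defaults (3 attempts, 500 ms base delay) are illustrative, not requirements.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Delay doubles each time: baseDelayMs, 2x, 4x, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError; // all attempts failed; caller falls back to the default voice
}
```

Usage would be along the lines of `await withRetry(() => cloneVoiceOnce(sample))`, where `cloneVoiceOnce` is whatever function performs a single cloning request.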

Acceptance Criteria:

Setup takes <5 minutes end-to-end
Cloned voice sounds recognizable (80%+ user satisfaction)
Can re-do voice cloning if unsatisfied
Works for various accents and ages
Fallback to default voice if cloning fails

Feature 2.2: Text-to-Speech Generation

Priority: P0 (Critical)

Description: Generate audio for any text message using sender's cloned voice.

Functional Requirements:

Every text message gets "play" button/icon
On first tap: generate audio (if not cached)
On subsequent taps: play cached audio
Uses sender's voice_id if available, else default voice
Shows "Generating..." indicator (1-3 seconds)

Technical Requirements:

// API call
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
{
  "text": message.content,
  "model_id": "eleven_multilingual_v2",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.8
  }
}

Caching Strategy:

Generate audio on first play request
Upload to cloud storage
Cache URL in messages.audio_url
Serve from cache for future plays
Cache locally on device for offline playback
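
The generate-once, serve-from-cache flow above can be sketched as follows. An in-memory `Map` stands in for `messages.audio_url`, and `generate` is a hypothetical async function that produces the audio, uploads it, and returns its URL. The de-duplication of concurrent requests is an addition that the strategy above does not spell out.

```javascript
// Returns a getAudioUrl(messageId, text) function that calls `generate` at most
// once per message, even if several play taps arrive before generation finishes.
function createAudioCache(generate) {
  const urls = new Map();     // messageId -> cached URL
  const inFlight = new Map(); // messageId -> pending generation promise

  return async function getAudioUrl(messageId, text) {
    if (urls.has(messageId)) return urls.get(messageId);         // cached play
    if (inFlight.has(messageId)) return inFlight.get(messageId); // already generating

    const promise = generate(text)
      .then((url) => {
        urls.set(messageId, url);
        return url;
      })
      .finally(() => inFlight.delete(messageId));
    inFlight.set(messageId, promise);
    return promise;
  };
}
```

A failed generation is not cached, so the next tap simply tries again.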

Cost Considerations:

ElevenLabs TTS: ~$0.30 per 1,000 characters
Average message: 50 characters = $0.015 per message
Cache aggressively to minimize regeneration
Budget: $100/month = ~6,600 message plays
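
The arithmetic behind these figures, as a small sketch that uses only the rate quoted above. The function names are hypothetical.

```javascript
// TTS rate quoted in this section: $0.30 per 1,000 characters.
const TTS_USD_PER_1K_CHARS = 0.30;

function ttsCostUsd(characters) {
  return (characters / 1000) * TTS_USD_PER_1K_CHARS;
}

// Number of first-time plays a monthly budget covers.
// Cached replays are not counted because they trigger no new generation.
function playsPerBudget(budgetUsd, avgCharsPerMessage) {
  return Math.floor(budgetUsd / ttsCostUsd(avgCharsPerMessage));
}

console.log(ttsCostUsd(50).toFixed(3)); // cost of one 50-character message
console.log(playsPerBudget(100, 50));   // plays covered by $100/month
```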

Acceptance Criteria:

Audio generation <3 seconds
Voice quality matches original (80%+ similarity)
Cached plays are instant
Works offline with cached audio
Graceful error handling (show text if audio fails)

Feature 2.3: Audio Playback Controls

Priority: P1 (High)

Description: Full-featured audio player for voice messages.

Functional Requirements:

Play/Pause button
Seek bar (scrub through audio)
Playback speed (1x, 1.25x, 1.5x, 2x)
Time display (current / total)
Waveform visualization (optional, P2)

UX Requirements:

Inline player (doesn't navigate away)
Continues playing while scrolling
Pause if new audio starts
Keyboard shortcuts (spacebar = play/pause)

Technical Requirements:

Native audio player APIs
Background audio support (iOS/Android)
Lock screen controls
Bluetooth/CarPlay integration

Acceptance Criteria:

Smooth playback, no stuttering
Seek bar responsive
Speed changes don't restart audio
Works with device volume controls
Pauses automatically for phone calls

Phase 3: Conversation Audio (Choose One) - SHOULD HAVE

Option A: Multi-Voice Conversation Playback

Priority: P1 (High Impact, but complex)

Description: Generate podcast-style audio of entire group conversation with unique voice for each participant.

Functional Requirements:

"🎧 Play Conversation" button on group chats
Select message range (e.g., "Last 50 messages" or "Since yesterday")
AI generates multi-voice audio with each person's cloned voice
Visual timeline showing current speaker
Scrub through conversation by speaker

Technical Requirements:

Batch TTS generation for all messages
Audio stitching (ffmpeg or similar)
Insert 300ms silence between speakers
Generate speaker timeline (who speaks when)
Total processing time: <30 seconds for 50 messages
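
A sketch of how the speaker timeline could be computed alongside stitching. It assumes each clip's duration is known in milliseconds and reads "300ms silence between speakers" as a gap inserted only when the speaker changes. `buildTimeline` is a hypothetical name.

```javascript
// Input: clips in playback order, each { speaker, durationMs }.
// Output: one { speaker, startMs, endMs } entry per clip, with a gap on speaker change.
function buildTimeline(clips, gapMs = 300) {
  const timeline = [];
  let cursor = 0;
  clips.forEach((clip, i) => {
    if (i > 0 && clips[i - 1].speaker !== clip.speaker) cursor += gapMs;
    timeline.push({ speaker: clip.speaker, startMs: cursor, endMs: cursor + clip.durationMs });
    cursor += clip.durationMs;
  });
  return timeline;
}

console.log(buildTimeline([
  { speaker: "Sarah", durationMs: 2000 },
  { speaker: "Marcus", durationMs: 1500 },
]));
```

The same offsets can drive both the visual timeline and scrubbing by speaker.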

Technical Challenges:

Audio stitching complexity: Need backend processing
Cost: 50 messages × 50 chars = 2,500 chars = $0.75 per conversation
Storage: Large audio files (5-10MB per conversation)
Processing time: May take 20-30 seconds for long threads

Acceptance Criteria:

Distinct voices for each participant (minimum 3 voices)
Smooth transitions between speakers
Visual timeline accurate
Generation completes in <30 seconds
Audio quality matches individual TTS

MVP Simplification:

Limit to 30 messages max per playback
Pre-process on backend (not real-time)
Cache generated audio for 24 hours
Fallback to summary if stitching fails
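
The message cap and the summary fallback could be combined as in this sketch. `stitch` and `summarize` are hypothetical async functions that each return an audio URL. The `type` values mirror the conversationAudio collection described under Tech Stack.

```javascript
const MAX_MESSAGES = 30; // MVP cap from the list above

// Try full multi-voice playback on the latest messages; fall back to a summary on failure.
async function conversationAudio(messages, { stitch, summarize }) {
  const recent = messages.slice(-MAX_MESSAGES);
  try {
    return { type: "full_conversation", audioUrl: await stitch(recent) };
  } catch (err) {
    return { type: "summary", audioUrl: await summarize(recent) };
  }
}
```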

Option B: AI Summary Audio (RECOMMENDED FOR MVP)

Priority: P1 (High Impact, simpler)

Description: Generate narrated summary of long group conversations.

Functional Requirements:

"🎙️ Hear Summary" button appears when thread >20 messages
AI summarizes key points, decisions, action items
Professional narrator voice reads summary
2-3 minute max duration for any thread length
Option to read text summary or listen

Technical Requirements:

// Step 1: GPT-4 Summarization
const completion = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    {
      role: 'system',
      content: `Summarize this group chat for a busy parent. 
                Focus on: decisions made, action items, who's doing what, 
                dates/times, and urgent information.
                Structure: Overview → Decisions → Action Items → Dates.
                Max 300 words. Conversational tone for audio narration.`
    },
    { role: 'user', content: conversationText }
  ]
});
// The summary text is on the first choice, not on the response object itself
const summaryText = completion.choices[0].message.content;

// Step 2: TTS Generation
const audio = await elevenlabs.textToSpeech.convert(
  NARRATOR_VOICE_ID, // placeholder: ID of a pre-made ElevenLabs narrator voice
  {
    text: summaryText,
    model_id: 'eleven_multilingual_v2'
  }
);

Cost Analysis:

GPT-4: ~$0.03 per 1k tokens (input) + $0.06 per 1k tokens (output)
50 messages = ~2k input tokens + ~300 output tokens = $0.078
TTS: 300 words = ~1,500 chars = $0.45
Total per summary: ~$0.53
Budget: $100/month = ~190 summaries
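
The per-summary estimate can be reproduced with the sketch below, which uses only the rates and token counts quoted above. `RATES` and `summaryCostUsd` are hypothetical names.

```javascript
// Rates from the cost analysis above (USD).
const RATES = {
  gpt4InputPer1kTokens: 0.03,
  gpt4OutputPer1kTokens: 0.06,
  ttsPer1kChars: 0.30,
};

// Splits the cost into its LLM and TTS parts so the dominant term is visible.
function summaryCostUsd({ inputTokens, outputTokens, ttsChars }) {
  const llm =
    (inputTokens / 1000) * RATES.gpt4InputPer1kTokens +
    (outputTokens / 1000) * RATES.gpt4OutputPer1kTokens;
  const tts = (ttsChars / 1000) * RATES.ttsPer1kChars;
  return { llm, tts, total: llm + tts };
}

const cost = summaryCostUsd({ inputTokens: 2000, outputTokens: 300, ttsChars: 1500 });
console.log(cost.total.toFixed(2));        // cost of one 50-message summary
console.log(Math.floor(100 / cost.total)); // summaries covered by $100/month
```

Most of the cost is narration rather than summarization, so shorter summaries save more than a cheaper LLM would.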

Acceptance Criteria:

Summary captures all key information
Audio narration is natural and professional
Generation completes in <10 seconds
Summary text also displayed for reading
Works for threads of 20-200 messages

Why This Over Option A:

✅ About 30% cheaper per use (~$0.53 vs. ~$0.75 at the estimates above)
✅ 3x faster generation
✅ Simpler technical implementation (no audio stitching)
✅ More practical for daily use (parents want key info, not full playback)
✅ Still impressive for demo
✅ Can add Option A later as premium feature

4. Tech Stack

✅ CONFIRMED STACK

Frontend: Native iOS (Swift + SwiftUI)

Why Swift:

✅ Already chosen by team
✅ Best performance and battery efficiency
✅ Full access to iOS native features
✅ Smallest app size (~10-15MB)
✅ Superior audio handling with AVFoundation
✅ Seamless integration with iOS ecosystem

Audio Framework:

AVFoundation (Apple's native audio framework)
- AVAudioRecorder for voice recording
- AVAudioPlayer for playback
- AVAudioEngine for advanced audio processing
- Native noise suppression and echo cancellation
- Background audio support built-in
- CarPlay integration ready

Key Swift Libraries:

// Audio Recording & Playback
import AVFoundation

// Firebase Integration
import FirebaseCore
import FirebaseAuth
import FirebaseFirestore
import FirebaseStorage

// Networking
import Alamofire // or native URLSession

// UI
import SwiftUI
import Combine

iOS Version Target:

Minimum: iOS 16.0 (covers 95%+ of users)
Target: iOS 17.0+ for latest features
Voice features work on iOS 16+

Important Considerations:

⚠️ iOS-only for MVP (no Android)
⚠️ Need separate Android implementation later (Kotlin + Jetpack Compose)
⚠️ TestFlight for beta testing (up to 100 internal and 10,000 external testers, free)
⚠️ App Store review time (~24-48 hours)

Backend: Firebase (Serverless)

Why Firebase:

✅ Already chosen by team
✅ Zero server management
✅ Excellent Swift SDK
✅ Real-time database & WebSocket built-in
✅ Integrated auth, storage, and database
✅ Free tier generous (good for MVP)

Firebase Services Used:

1. Firebase Authentication

import FirebaseAuth

// User auth (existing)
Auth.auth().currentUser

2. Firebase Firestore (Database)

import FirebaseFirestore

// Existing schema + voice additions
db.collection("users").document(userId).updateData([
    "elevenLabsVoiceId": voiceId,
    "voiceSampleUrl": url,
    "voiceCloningCompletedAt": Timestamp()
])

db.collection("messages").document(messageId).updateData([
    "audioUrl": url,
    "transcript": text,
    "audioDuration": seconds,
    "transcriptionConfidence": confidence
])

Firestore Schema Updates:

users/
  {userId}/
    name: string
    email: string
    photoUrl: string
    elevenLabsVoiceId: string? (NEW)
    voiceSampleUrl: string? (NEW)
    voiceCloningCompletedAt: timestamp? (NEW)
    createdAt: timestamp

messages/
  {messageId}/
    conversationId: string
    senderId: string
    content: string
    audioUrl: string? (NEW)
    transcript: string? (NEW)
    audioDuration: number? (NEW - seconds)
    transcriptionConfidence: number? (NEW - 0-1)
    sentAt: timestamp
    deliveredAt: timestamp?
    readAt: timestamp?

conversationAudio/ (NEW COLLECTION)
  {audioId}/
    conversationId: string
    type: string ("summary" | "full_conversation")
    audioUrl: string
    summaryText: string?
    messageCount: number
    duration: number
    createdAt: timestamp
    expiresAt: timestamp

3. Firebase Storage

import FirebaseStorage

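// Audio file storage (path matches the audio/recordings/{userId}/ storage rule)
let storageRef = Storage.storage().reference()
let audioRef = storageRef.child("audio/recordings/\(userId)/\(messageId).m4a")

// Upload
audioRef.putFile(from: localURL, metadata: nil) { metadata, error in
    if let error = error {
        print("Upload failed: \(error.localizedDescription)")
        return
    }
    // Get download URL
    audioRef.downloadURL { url, error in
        guard let url = url else { return }
        // Save url.absoluteString to Firestore
    }
}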

Storage Structure:

audio/
  recordings/
    {userId}/
      {messageId}.m4a (voice recordings)
      voice_sample.m4a (for cloning)
  tts/
    {messageId}.mp3 (generated TTS audio)
  summaries/
    {conversationId}_{timestamp}.mp3 (summary audio)

4. Firebase Cloud Functions (Serverless Backend)

// Node.js functions for API calls
const functions = require('firebase-functions');
const admin = require('firebase-admin');

// Transcription endpoint
exports.transcribeAudio = functions.https.onCall(async (data, context) => {
    const { audioUrl } = data;
    
    // Call OpenAI Whisper
    const transcript = await callWhisperAPI(audioUrl);
    
    return { transcript };
});

// Voice cloning endpoint
exports.cloneVoice = functions.https.onCall(async (data, context) => {
    const { audioUrl, userId } = data;
    
    // Call ElevenLabs
    const voiceId = await callElevenLabsCloning(audioUrl);
    
    // Save to Firestore
    await admin.firestore().collection('users').doc(userId).update({
        elevenLabsVoiceId: voiceId
    });
    
    return { voiceId };
});

// TTS generation endpoint
exports.generateTTS = functions.https.onCall(async (data, context) => {
    const { text, voiceId } = data;
    
    // Call ElevenLabs TTS
    const audioBuffer = await callElevenLabsTTS(text, voiceId);
    
    // Upload to Firebase Storage
    const audioUrl = await uploadToStorage(audioBuffer);
    
    return { audioUrl };
});

// Summary generation endpoint
exports.generateSummary = functions.https.onCall(async (data, context) => {
    const { messages } = data;
    
    // Call OpenAI GPT
    const summary = await callGPTSummary(messages);
    
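    // Generate audio, then upload it (callElevenLabsTTS returns a buffer, as in generateTTS)
    const audioBuffer = await callElevenLabsTTS(summary.text);
    const audioUrl = await uploadToStorage(audioBuffer);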
    
    return { summary: summary.text, audioUrl };
});

Why Cloud Functions:

✅ Keep API keys secure (never in client)
✅ Serverless (auto-scales, pay per use)
✅ Node.js = easy to write
✅ Integrated with Firebase ecosystem
✅ No server management

Cloud Functions Setup:

firebase init functions
cd functions
npm install openai elevenlabs
firebase deploy --only functions

Backend: Firebase Alternatives Considered ❌

Option: Custom Node.js Backend

❌ Rejected: Firebase already chosen, adds complexity
❌ Would need separate server hosting
❌ More infrastructure to manage

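Option: Firebase Realtime Database (as the primary database)

❌ Rejected: Firestore is better for complex queries
❌ Less flexible data structure
Note: Realtime Database is still used for presence and typing indicators (see Real-Time Communication).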

Stick with Firebase Firestore + Cloud Functions

AI Services

✅ Speech-to-Text: OpenAI Whisper API

Already have OpenAI key ✅

Implementation:

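// Swift client calling Cloud Function (requires: import FirebaseFunctions)
func transcribeAudio(audioURL: URL) async throws -> String {
    let functions = Functions.functions()
    let transcribe = functions.httpsCallable("transcribeAudio")
    
    let result = try await transcribe.call(["audioUrl": audioURL.absoluteString])
    // result.data is typed Any, so cast to a dictionary before reading fields
    guard let payload = result.data as? [String: Any],
          let transcript = payload["transcript"] as? String else {
        throw URLError(.cannotParseResponse)
    }
    return transcript
}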

// Cloud Function (Node.js)
const fs = require('fs');
const { OpenAI } = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

exports.transcribeAudio = functions.https.onCall(async (data, context) => {
    const { audioUrl } = data;
    
    // Download audio from Firebase Storage
    const audioBuffer = await downloadFromStorage(audioUrl);
    
    // Create temp file
    const tempFile = `/tmp/${Date.now()}.m4a`;
    fs.writeFileSync(tempFile, audioBuffer);
    
    // Transcribe with Whisper
    const transcription = await openai.audio.transcriptions.create({
        file: fs.createReadStream(tempFile),
        model: 'whisper-1',
        language: 'en', // omit to auto-detect
        response_format: 'verbose_json' // includes per-segment avg_logprob
    });
    
    // Cleanup
    fs.unlinkSync(tempFile);
    
    // Whisper returns no direct confidence score. As a heuristic, convert the
    // mean segment log-probability into a 0-1 value.
    const segments = transcription.segments || [];
    const confidence = segments.length
        ? Math.exp(segments.reduce((sum, s) => sum + s.avg_logprob, 0) / segments.length)
        : 1.0;
    
    return { transcript: transcription.text, confidence };
});

Pricing:

$0.006 per minute
Average 30-second voice message = $0.003
Budget: $50/month = 16,666 transcriptions

Pros:

✅ Best accuracy (95%+)
✅ Handles accents, noise well
✅ Fast (1-3 seconds)
✅ 100+ languages
✅ Already have API key

✅ Text-to-Speech & Voice Cloning: ElevenLabs

Need to set up ElevenLabs account

Swift Implementation:

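// Voice cloning setup
func cloneVoice(audioURL: URL) async throws -> String {
    guard let uid = Auth.auth().currentUser?.uid else {
        throw URLError(.userAuthenticationRequired)
    }
    let clone = Functions.functions().httpsCallable("cloneVoice")
    
    let result = try await clone.call([
        "audioUrl": audioURL.absoluteString,
        "userId": uid
    ])
    
    // result.data is typed Any, so cast before reading fields
    guard let payload = result.data as? [String: Any],
          let voiceId = payload["voiceId"] as? String else {
        throw URLError(.cannotParseResponse)
    }
    return voiceId
}

// Generate TTS (messageId lets the Cloud Function check and fill its cache)
func generateTTS(text: String, voiceId: String, messageId: String) async throws -> URL {
    let generate = Functions.functions().httpsCallable("generateTTS")
    
    let result = try await generate.call([
        "text": text,
        "voiceId": voiceId,
        "messageId": messageId
    ])
    
    guard let payload = result.data as? [String: Any],
          let audioUrlString = payload["audioUrl"] as? String,
          let url = URL(string: audioUrlString) else {
        throw URLError(.cannotParseResponse)
    }
    return url
}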

Cloud Function Implementation:

const { ElevenLabsClient } = require('elevenlabs');
const elevenlabs = new ElevenLabsClient({ 
    apiKey: process.env.ELEVENLABS_API_KEY 
});

// Voice cloning
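exports.cloneVoice = functions.https.onCall(async (data, context) => {
    // Only signed-in users may clone a voice, and only for their own profile
    if (!context.auth || context.auth.uid !== data.userId) {
        throw new functions.https.HttpsError('permission-denied', 'Cannot set a voice for another user');
    }
    const { audioUrl, userId } = data;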
    
    // Download audio sample from Firebase Storage
    const audioBuffer = await downloadFromStorage(audioUrl);
    
    // Clone voice with ElevenLabs
    const voice = await elevenlabs.voices.add({
        name: `user_${userId}`,
        files: [audioBuffer],
        description: 'MessageAI user voice'
    });
    
    // Save voice_id to Firestore
    await admin.firestore().collection('users').doc(userId).update({
        elevenLabsVoiceId: voice.voice_id,
        voiceCloningCompletedAt: admin.firestore.FieldValue.serverTimestamp()
    });
    
    return { voiceId: voice.voice_id };
});

// Text-to-Speech
exports.generateTTS = functions.https.onCall(async (data, context) => {
    const { text, voiceId, messageId } = data;
    
    // Check cache first
    const messageRef = admin.firestore().collection('messages').doc(messageId);
    const messageDoc = await messageRef.get();
    
    if (messageDoc.data()?.audioUrl) {
        return { audioUrl: messageDoc.data().audioUrl };
    }
    
    // Generate audio with ElevenLabs
    const audio = await elevenlabs.textToSpeech.convert(voiceId, {
        text: text,
        model_id: 'eleven_multilingual_v2',
        voice_settings: {
            stability: 0.5,
            similarity_boost: 0.8,
            style: 0.0,
            use_speaker_boost: true
        }
    });
    
    // Convert stream to buffer
    const chunks = [];
    for await (const chunk of audio) {
        chunks.push(chunk);
    }
    const audioBuffer = Buffer.concat(chunks);
    
    // Upload to Firebase Storage
    const bucket = admin.storage().bucket();
    const file = bucket.file(`audio/tts/${messageId}.mp3`);
    await file.save(audioBuffer, { contentType: 'audio/mpeg' });
    
    // Make publicly accessible (or use signed URL)
    await file.makePublic();
    const audioUrl = file.publicUrl();
    
    // Cache URL in Firestore
    await messageRef.update({ audioUrl });
    
    return { audioUrl };
});

Pricing:

Voice cloning: Free (included in plan)
TTS: $0.30 per 1,000 characters
Average message: 50 chars = $0.015
Budget: $100/month = 333,333 characters = 6,666 messages

Plan Recommendation:

Start with Starter plan ($5/month)
- 30,000 characters included
- Then $0.30 per 1k characters
Upgrade to Creator ($22/month) if needed
- 100,000 characters included
- Then $0.24 per 1k characters (20% discount)

✅ LLM for Summaries: OpenAI GPT-4o-mini

Already have OpenAI key ✅

Implementation:

// Cloud Function
exports.generateSummary = functions.https.onCall(async (data, context) => {
    const { conversationId, messageRange } = data;
    
    // Fetch messages from Firestore
    const messagesSnapshot = await admin.firestore()
        .collection('messages')
        .where('conversationId', '==', conversationId)
        .where('sentAt', '>=', messageRange.start)
        .where('sentAt', '<=', messageRange.end)
        .orderBy('sentAt', 'asc')
        .get();
    
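    // Format conversation. Assumes senderName is denormalized onto each message;
    // falls back to senderId, the only sender field in the schema above.
    const conversation = messagesSnapshot.docs.map(doc => {
        const msg = doc.data();
        return `${msg.senderName || msg.senderId}: ${msg.content}`;
    }).join('\n');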
    
    // Generate summary with GPT-4o-mini
    const completion = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
            {
                role: 'system',
                content: `You are summarizing a group chat for a busy parent. 
                         Focus on: decisions made, action items, who's doing what, 
                         important dates/times, and urgent information.
                         Structure: Brief overview, then key points.
                         Max 200 words. Conversational tone for audio narration.`
            },
            {
                role: 'user',
                content: `Summarize this conversation:\n\n${conversation}`
            }
        ],
        temperature: 0.7,
        max_tokens: 400
    });
    
    const summaryText = completion.choices[0].message.content;
    
    // Generate audio with ElevenLabs
    const audio = await elevenlabs.textToSpeech.convert(
        'professional_narrator_voice_id', // Use pre-made ElevenLabs voice
        {
            text: summaryText,
            model_id: 'eleven_multilingual_v2'
        }
    );
    
    // Upload audio
    const audioBuffer = await streamToBuffer(audio);
    const audioUrl = await uploadToStorage(audioBuffer, `summaries/${conversationId}_${Date.now()}.mp3`);
    
    // Save to Firestore
    await admin.firestore().collection('conversationAudio').add({
        conversationId,
        type: 'summary',
        summaryText,
        audioUrl,
        messageCount: messagesSnapshot.size,
        createdAt: admin.firestore.FieldValue.serverTimestamp(),
        expiresAt: admin.firestore.Timestamp.fromDate(
            new Date(Date.now() + 7 * 24 * 60 * 60 * 1000) // 7 days
        )
    });
    
    return { summaryText, audioUrl };
});

Pricing:

GPT-4o-mini: $0.15 per 1M input tokens, $0.60 per 1M output tokens
50-message thread: ~2k input tokens, ~300 output tokens
Cost per summary: ~$0.0005 (basically free!)
Budget: $50/month = 100,000 summaries (won't hit this)

Why GPT-4o-mini vs GPT-4:

✅ 100x+ cheaper at the list prices above (~$0.0005 vs. ~$0.078 per summary)
✅ Faster responses
✅ Quality expected to be close to GPT-4 for short chat summaries
✅ Good enough for MVP
Can upgrade to GPT-4 later if needed

Database: Firebase Firestore

Already using Firebase ✅

Schema Design Philosophy:

Denormalize for read performance (Firestore best practice)
Use subcollections for large lists
Index frequently queried fields
Use serverTimestamp() for consistency

Firestore Security Rules:

rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    // Any signed-in user can read a profile; users can write only their own
    match /users/{userId} {
      allow read: if request.auth != null;
      allow write: if request.auth.uid == userId;
    }
    
    // Messages: only conversation participants can read/write
    match /messages/{messageId} {
      allow read: if request.auth != null && 
                     isConversationMember(resource.data.conversationId);
      allow create: if request.auth != null && 
                       request.auth.uid == request.resource.data.senderId;
      allow update: if request.auth.uid == resource.data.senderId;
    }
    
    // Conversation audio: only participants can access
    match /conversationAudio/{audioId} {
      allow read: if request.auth != null && 
                     isConversationMember(resource.data.conversationId);
      allow create: if request.auth != null;
    }
    
    function isConversationMember(conversationId) {
      return exists(/databases/$(database)/documents/conversations/$(conversationId)/members/$(request.auth.uid));
    }
  }
}

File Storage: Firebase Storage

Already using Firebase ✅

Storage Structure:

/audio/
  /recordings/
    /{userId}/
      {messageId}.m4a (voice recordings)
      voice_sample.m4a (for cloning)
  /tts/
    {messageId}.mp3 (generated TTS audio)
  /summaries/
    {conversationId}_{timestamp}.mp3 (summary audio)

Storage Rules:

rules_version = '2';
service firebase.storage {
  match /b/{bucket}/o {
    // Voice recordings: only the owner can write; any signed-in user can read
    match /audio/recordings/{userId}/{allPaths=**} {
      allow write: if request.auth.uid == userId;
      allow read: if request.auth != null;
    }
    
    // TTS audio: anyone authenticated can read (cached)
    match /audio/tts/{allPaths=**} {
      allow read: if request.auth != null;
      allow write: if false; // Only Cloud Functions can write
    }
    
    // Summaries: any signed-in user can read (these rules do not check conversation membership)
    match /audio/summaries/{allPaths=**} {
      allow read: if request.auth != null;
      allow write: if false; // Only Cloud Functions can write
    }
  }
}

Cost Optimization:

Auto-delete old files with lifecycle rules
Compress audio before upload (AAC codec for recordings)
Use Cloud Functions to manage storage (delete expired files)
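
The expiry check behind "delete expired files" could look like the sketch below. It is written in Swift only to match the rest of this document; the real cleanup would run server-side. `StoredAudio` and `expiredPaths` are hypothetical names, and the retention window is a parameter rather than a fixed policy.

```swift
import Foundation

/// Hypothetical metadata record for a stored audio file.
struct StoredAudio {
    let path: String
    let createdAt: Date
}

/// Returns the paths of files created more than `maxAgeDays` before `now`.
func expiredPaths(in files: [StoredAudio], maxAgeDays: Int, now: Date = Date()) -> [String] {
    let cutoff = now.addingTimeInterval(-Double(maxAgeDays) * 86_400)
    return files.filter { $0.createdAt < cutoff }.map { $0.path }
}
```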

Real-Time Communication: Firebase Realtime Database (for WebSocket)

For real-time presence and typing indicators:

import FirebaseDatabase

// Presence system
let presenceRef = Database.database().reference(withPath: "presence/\(userId)")
presenceRef.onDisconnectRemoveValue()
presenceRef.setValue(["online": true, "lastSeen": ServerValue.timestamp()])

// Typing indicators
let typingRef = Database.database().reference(withPath: "typing/\(conversationId)/\(userId)")
typingRef.setValue(true)
DispatchQueue.main.asyncAfter(deadline: .now() + 3) {
    typingRef.removeValue()
}

Why Realtime Database for presence:

✅ Better for real-time ephemeral data
✅ Automatic onDisconnect handling
✅ Lower latency than Firestore for presence
✅ Free tier sufficient for MVP

Tech Stack Summary

Component	Technology	Why
Mobile App	Swift + SwiftUI	Native performance, team choice
Audio Recording	AVFoundation	Best iOS audio framework
Backend	Firebase Cloud Functions	Serverless, secure API key handling
Database	Firebase Firestore	Already in use, real-time sync
File Storage	Firebase Storage	Integrated, CDN included
Authentication	Firebase Auth	Already in use
Real-Time	Firebase Realtime DB	Presence & typing indicators
Speech-to-Text	OpenAI Whisper	Best accuracy, have API key
Text-to-Speech	ElevenLabs	Best voice cloning
Voice Cloning	ElevenLabs	Industry-leading quality
LLM (Summaries)	GPT-4o-mini	Cheap, fast, good enough

Swift + Firebase Architecture

┌─────────────────────────────────────────────────────┐
│                  iOS App (Swift)                     │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────┐ │
│  │ SwiftUI Views│  │ AVFoundation │  │ Firebase  │ │
│  │  - Chat      │  │  - Recording │  │  SDK      │ │
│  │  - Voice     │  │  - Playback  │  │           │ │
│  │  - Settings  │  │              │  │           │ │
│  └──────────────┘  └──────────────┘  └───────────┘ │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│              Firebase Services                       │
│  ┌──────────┐  ┌───────────┐  ┌──────────────────┐ │
│  │   Auth   │  │ Firestore │  │  Cloud Functions │ │
│  │          │  │ (Database)│  │  - transcribe    │ │
│  │          │  │           │  │  - cloneVoice    │ │
│  │          │  │           │  │  - generateTTS   │ │
│  │          │  │           │  │  - summarize     │ │
│  └──────────┘  └───────────┘  └──────────────────┘ │
│  ┌──────────┐  ┌───────────┐                       │
│  │ Storage  │  │ Realtime  │                       │
│  │ (Audio)  │  │ DB        │                       │
│  └──────────┘  └───────────┘                       │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│               External APIs                          │
│  ┌──────────────┐  ┌──────────────────────────────┐ │
│  │   OpenAI     │  │      ElevenLabs              │ │
│  │  - Whisper   │  │  - Voice Cloning             │ │
│  │  - GPT-4o    │  │  - Text-to-Speech            │ │
│  └──────────────┘  └──────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
**Pros:**
- ✅ Enterprise-grade reliability
- ✅ Moderate pricing
- ✅ Voice cloning available (Azure only)

**Cons:**
- ❌ Voice quality inferior to ElevenLabs
- ❌ More complex setup
- ❌ Azure voice cloning requires more data

**Recommendation:** Skip for MVP.

---

#### LLM for Summaries

**Option: OpenAI GPT-4 (ONLY OPTION)**

**Pros:**
- ✅ Best summarization quality
- ✅ Understands context and nuance
- ✅ Structured output
- ✅ Reliable and fast

**Cons:**
- ❌ Most expensive LLM ($0.03 input, $0.06 output per 1k tokens)
- ❌ Requires careful prompt engineering

**Pricing:**
- 50-message thread = ~2k input tokens + ~300 output tokens
- Cost per summary: ~$0.08
- Budget: $50/month = ~625 summaries

**Alternative:** GPT-4o-mini ($0.15 input, $0.60 output per 1M tokens)
- 95% as good, roughly 160x cheaper at these list prices
- Cost per summary: ~$0.0005 (2k input + 300 output tokens)
- Budget: $50/month = ~100,000 summaries

**Recommendation:** Start with GPT-4o-mini, upgrade to GPT-4 if quality insufficient.
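
The per-summary arithmetic can be reproduced with a short calculation. This is a sketch using the list prices and token counts quoted in this section; `ModelPricing` and `summaryCost` are illustrative names, and actual prices may change.

```swift
import Foundation

/// Price in USD per one million tokens (values taken from this section).
struct ModelPricing {
    let inputPerMillion: Double
    let outputPerMillion: Double
}

/// Cost in USD of one summary for the given token counts.
func summaryCost(inputTokens: Int, outputTokens: Int, pricing: ModelPricing) -> Double {
    Double(inputTokens) / 1_000_000 * pricing.inputPerMillion
        + Double(outputTokens) / 1_000_000 * pricing.outputPerMillion
}

// $0.03 / $0.06 per 1k tokens is $30 / $60 per 1M tokens.
let gpt4 = ModelPricing(inputPerMillion: 30, outputPerMillion: 60)
let gpt4oMini = ModelPricing(inputPerMillion: 0.15, outputPerMillion: 0.60)

// 50-message thread: ~2k input tokens + ~300 output tokens.
let gpt4Cost = summaryCost(inputTokens: 2_000, outputTokens: 300, pricing: gpt4)
let miniCost = summaryCost(inputTokens: 2_000, outputTokens: 300, pricing: gpt4oMini)
print(gpt4Cost, miniCost)
```

Dividing a monthly budget by either cost gives the number of summaries that budget covers.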

---

## Swift Code Examples

### Voice Recording (AVFoundation)

```swift
import SwiftUI
import AVFoundation
import FirebaseAuth
import FirebaseStorage
import FirebaseFunctions

class VoiceRecorder: ObservableObject {
    private var audioRecorder: AVAudioRecorder?
    private var audioSession: AVAudioSession?
    
    @Published var isRecording = false
    @Published var recordingTime: TimeInterval = 0
    
    private var timer: Timer?
    
    func setupRecorder() throws {
        // Configure audio session
        audioSession = AVAudioSession.sharedInstance()
        try audioSession?.setCategory(.playAndRecord, mode: .default)
        try audioSession?.setActive(true)
        
        // Setup recorder
        let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.m4a")
        
        let settings = [
            AVFormatIDKey: Int(kAudioFormatMPEG4AAC),
            AVSampleRateKey: 44100,
            AVNumberOfChannelsKey: 1,
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue
        ]
        
        audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
        audioRecorder?.prepareToRecord()
    }
    
    func startRecording() throws {
        try setupRecorder()
        audioRecorder?.record()
        isRecording = true
        
        // Start timer
        timer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { [weak self] _ in
            self?.recordingTime = self?.audioRecorder?.currentTime ?? 0
        }
    }
    
    // The cancel button ignores the returned URL, so allow discarding it
    @discardableResult
    func stopRecording() -> URL? {
        audioRecorder?.stop()
        isRecording = false
        timer?.invalidate()
        timer = nil
        
        return audioRecorder?.url
    }
    
    private func getDocumentsDirectory() -> URL {
        FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
    }
}

// SwiftUI View
struct VoiceMessageView: View {
    @StateObject private var recorder = VoiceRecorder()
    @State private var isTranscribing = false
    @State private var transcript = ""
    
    var body: some View {
        VStack {
            if recorder.isRecording {
                // Recording UI
                VStack {
                    Text("Recording...")
                        .font(.headline)
                    
                    Text(formatTime(recorder.recordingTime))
                        .font(.system(.title, design: .monospaced))
                    
                    // Waveform visualization (custom view)
                    WaveformView()
                        .frame(height: 50)
                    
                    HStack {
                        Button(action: {
                            recorder.stopRecording()
                        }) {
                            Image(systemName: "xmark.circle.fill")
                                .font(.largeTitle)
                                .foregroundColor(.red)
                        }
                        
                        Button(action: {
                            if let audioURL = recorder.stopRecording() {
                                transcribeAudio(audioURL)
                            }
                        }) {
                            Image(systemName: "checkmark.circle.fill")
                                .font(.largeTitle)
                                .foregroundColor(.green)
                        }
                    }
                }
            } else if isTranscribing {
                ProgressView("Transcribing...")
            } else if !transcript.isEmpty {
                // Show transcript with edit option
                TextEditor(text: $transcript)
                    .frame(height: 100)
                    .border(Color.gray)
                
                Button("Send Message") {
                    sendMessage(transcript)
                }
            } else {
                // Record button
                Button(action: {
                    try? recorder.startRecording()
                }) {
                    Image(systemName: "mic.fill")
                        .font(.largeTitle)
                }
                .buttonStyle(.borderedProminent)
            }
        }
    }
    
    func transcribeAudio(_ url: URL) {
        isTranscribing = true
        
        Task {
            do {
                // Upload to Firebase Storage
                let audioURL = try await uploadAudio(url)
                
                // Call Cloud Function
                let functions = Functions.functions()
                let transcribe = functions.httpsCallable("transcribeAudio")
                
                let result = try await transcribe.call(["audioUrl": audioURL])
                
                // Validate the response shape instead of force-casting
                guard let data = result.data as? [String: Any],
                      let text = data["transcript"] as? String else {
                    throw URLError(.cannotParseResponse)
                }
                
                await MainActor.run {
                    transcript = text
                    isTranscribing = false
                }
            } catch {
                print("Transcription error: \(error)")
                // Reset UI state on the main actor
                await MainActor.run {
                    isTranscribing = false
                }
            }
        }
    }
    
    func uploadAudio(_ localURL: URL) async throws -> String {
        let storage = Storage.storage()
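        // Fail with an error instead of crashing when no user is signed in
        guard let userId = Auth.auth().currentUser?.uid else {
            throw URLError(.userAuthenticationRequired)
        }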
        let messageId = UUID().uuidString
        let storageRef = storage.reference().child("audio/recordings/\(userId)/\(messageId).m4a")
        
        _ = try await storageRef.putFileAsync(from: localURL)
        let downloadURL = try await storageRef.downloadURL()
        
        return downloadURL.absoluteString
    }
    
    func sendMessage(_ text: String) {
        // Send to Firestore (existing logic)
    }
    
    func formatTime(_ time: TimeInterval) -> String {
        let minutes = Int(time) / 60
        let seconds = Int(time) % 60
        return String(format: "%02d:%02d", minutes, seconds)
    }
}
```

Audio Playback (Text-to-Speech)

import SwiftUI
import AVFoundation
import FirebaseFunctions

class AudioPlayer: ObservableObject {
    private var audioPlayer: AVAudioPlayer?
    
    @Published var isPlaying = false
    @Published var currentTime: TimeInterval = 0
    @Published var duration: TimeInterval = 0
    @Published var playbackRate: Float = 1.0
    
    private var timer: Timer?
    
    func playTTS(for message: Message) async {
        do {
            // Check if audio already cached
            if let cachedURL = message.audioUrl, let url = URL(string: cachedURL) {
                try await playFromURL(url)
                return
            }
            
            // Generate TTS
            let functions = Functions.functions()
            let generateTTS = functions.httpsCallable("generateTTS")
            
            let result = try await generateTTS.call([
                "text": message.content,
                "voiceId": message.sender.elevenLabsVoiceId ?? "default",
                "messageId": message.id
            ])
            
            // Validate the response instead of force-casting
            guard let data = result.data as? [String: Any],
                  let audioURLString = data["audioUrl"] as? String,
                  let audioURL = URL(string: audioURLString) else {
                throw URLError(.cannotParseResponse)
            }
            
            try await playFromURL(audioURL)
            
        } catch {
            print("TTS playback error: \(error)")
        }
    }
    
    private func playFromURL(_ url: URL) async throws {
        // Download audio data
        let (data, _) = try await URLSession.shared.data(from: url)
        
        await MainActor.run {
            do {
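                audioPlayer = try AVAudioPlayer(data: data)
                // enableRate must be set before prepareToPlay() for rate changes to apply
                audioPlayer?.enableRate = true
                audioPlayer?.prepareToPlay()
                audioPlayer?.rate = playbackRate
                audioPlayer?.play()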
                
                isPlaying = true
                duration = audioPlayer?.duration ?? 0
                
                startTimer()
            } catch {
                print("Audio player error: \(error)")
            }
        }
    }
    
    func pause() {
        audioPlayer?.pause()
        isPlaying = false
        stopTimer()
    }
    
    func play() {
        audioPlayer?.play()
        isPlaying = true
        startTimer()
    }
    
    func seek(to time: TimeInterval) {
        audioPlayer?.currentTime = time
        currentTime = time
    }
    
    func setPlaybackRate(_ rate: Float) {
        playbackRate = rate
        audioPlayer?.rate = rate
    }
    
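    private func startTimer() {
        stopTimer() // avoid stacking timers when play() is called repeatedly
        timer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { [weak self] _ in
            guard let self = self, let player = self.audioPlayer else { return }
            self.currentTime = player.currentTime
            
            // AVAudioPlayer resets currentTime to 0 when playback ends,
            // so detect completion via isPlaying rather than comparing times
            if !player.isPlaying {
                self.isPlaying = false
                self.stopTimer()
            }
        }
    }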
    
    private func stopTimer() {
        timer?.invalidate()
        timer = nil
    }
}

// SwiftUI View
struct MessageAudioPlayer: View {
    let message: Message
    @StateObject private var player = AudioPlayer()
    
    var body: some View {
        HStack {
            Button(action: {
                if player.isPlaying {
                    player.pause()
                } else if player.duration > 0 {
                    player.play()
                } else {
                    Task {
                        await player.playTTS(for: message)
                    }
                }
            }) {
                Image(systemName: player.isPlaying ? "pause.circle.fill" : "play.circle.fill")
                    .font(.title)
            }
            
            if player.duration > 0 {
                // Progress slider
                Slider(
                    value: Binding(
                        get: { player.currentTime },
                        set: { player.seek(to: $0) }
                    ),
                    in: 0...player.duration
                )
                
                Text("\(formatTime(player.currentTime)) / \(formatTime(player.duration))")
                    .font(.caption)
                    .monospacedDigit()
                
                // Playback speed
                Menu {
                    Button("1.0x") { player.setPlaybackRate(1.0) }
                    Button("1.25x") { player.setPlaybackRate(1.25) }
                    Button("1.5x") { player.setPlaybackRate(1.5) }
                    Button("2.0x") { player.setPlaybackRate(2.0) }
                } label: {
                    Text("\(String(format: "%.2f", player.playbackRate))x")
                        .font(.caption)
                }
            }
        }
    }
    
    func formatTime(_ time: TimeInterval) -> String {
        let minutes = Int(time) / 60
        let seconds = Int(time) % 60
        return String(format: "%d:%02d", minutes, seconds)
    }
}

Voice Cloning Setup

import SwiftUI
import FirebaseAuth
import FirebaseFunctions
import FirebaseStorage

class VoiceCloner: ObservableObject {
    @Published var isRecording = false
    @Published var isProcessing = false
    @Published var voiceId: String?
    @Published var previewAudioURL: URL?
    
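    func recordVoiceSample() async throws -> URL {
        // Record 30-60 second sample
        let recorder = VoiceRecorder()
        try recorder.startRecording()
        
        // Placeholder: wait here for the user to finish (or implement 60s auto-stop).
        // As written, recording stops immediately after it starts.
        guard let url = recorder.stopRecording() else {
            throw CocoaError(.fileNoSuchFile)
        }
        return url
    }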
    
    func cloneVoice(sampleURL: URL) async throws {
        isProcessing = true
        
        do {
            // Upload sample to Firebase Storage
            let storage = Storage.storage()
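            // Fail with an error instead of crashing when no user is signed in
            guard let userId = Auth.auth().currentUser?.uid else {
                throw URLError(.userAuthenticationRequired)
            }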
            let storageRef = storage.reference().child("audio/recordings/\(userId)/voice_sample.m4a")
            
            _ = try await storageRef.putFileAsync(from: sampleURL)
            let downloadURL = try await storageRef.downloadURL()
            
            // Call Cloud Function to clone voice
            let functions = Functions.functions()
            let clone = functions.httpsCallable("cloneVoice")
            
            let result = try await clone.call([
                "audioUrl": downloadURL.absoluteString,
                "userId": userId
            ])
            
            let data = result.data as? [String: Any]
            voiceId = data?["voiceId"] as? String
            
            // Generate preview
            try await generatePreview()
            
            isProcessing = false
        } catch {
            isProcessing = false
            throw error
        }
    }
    
    func generatePreview() async throws {
        guard let voiceId = voiceId else { return }
        
        let functions = Functions.functions()
        let generateTTS = functions.httpsCallable("generateTTS")
        
        let result = try await generateTTS.call([
            "text": "Hi, this is a preview of my cloned voice. How does it sound?",
            "voiceId": voiceId,
            "messageId": "preview"
        ])
        
        // Leave previewAudioURL unchanged if the response is malformed
        let data = result.data as? [String: Any]
        if let audioURLString = data?["audioUrl"] as? String {
            previewAudioURL = URL(string: audioURLString)
        }
    }
}

// SwiftUI View
struct VoiceCloningView: View {
    @StateObject private var cloner = VoiceCloner()
    @State private var showingInstructions = true
    
    var body: some View {
        VStack(spacing: 20) {
            if showingInstructions {
                instructionsView
            } else if cloner.isRecording {
                recordingView
            } else if cloner.isProcessing {
                processingView
            } else if let previewURL = cloner.previewAudioURL {
                previewView(url: previewURL)
            }
        }
        .padding()
    }
    
    var instructionsView: some View {
        VStack(spacing: 15) {
            Image(systemName: "waveform")
                .font(.system(size: 60))
                .foregroundColor(.blue)
            
            Text("Set Up Your Voice")
                .font(.title.bold())
            
            Text("Record yourself reading the sample text below. Speak naturally and clearly in a quiet environment.")
                .multilineTextAlignment(.center)
                .foregroundColor(.secondary)
            
            ScrollView {
                Text(sampleText)
                    .padding()
                    .background(Color.gray.opacity(0.1))
                    .cornerRadius(10)
            }
            
            Button("Start Recording") {
                showingInstructions = false
                cloner.isRecording = true
                
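                Task {
                    do {
                        let sampleURL = try await cloner.recordVoiceSample()
                        // Leave the recording state so the processing view can appear
                        cloner.isRecording = false
                        try await cloner.cloneVoice(sampleURL: sampleURL)
                    } catch {
                        cloner.isRecording = false
                        print("Voice cloning error: \(error)")
                    }
                }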
            }
            .buttonStyle(.borderedProminent)
        }
    }
    
    var recordingView: some View {
        VStack {
            Text("Recording...")
                .font(.title)
            
            // Show waveform, timer, etc.
            
            Button("Stop & Process") {
                // Stop recording and process
            }
        }
    }
    
    var processingView: some View {
        VStack {
            ProgressView()
                .scaleEffect(1.5)
            
            Text("Creating your voice...")
                .padding()
            
            Text("This takes about 30 seconds")
                .font(.caption)
                .foregroundColor(.secondary)
        }
    }
    
    func previewView(url: URL) -> some View {
        VStack(spacing: 20) {
            Image(systemName: "checkmark.circle.fill")
                .font(.system(size: 60))
                .foregroundColor(.green)
            
            Text("Voice Created!")
                .font(.title.bold())
            
            Text("Listen to a preview:")
                .foregroundColor(.secondary)
            
            // Audio player for preview
            Button(action: {
                // Play preview
            }) {
                Label("Play Preview", systemImage: "play.circle.fill")
            }
            .buttonStyle(.borderedProminent)
            
            HStack {
                Button("Re-record") {
                    // Start over
                    showingInstructions = true
                    cloner.voiceId = nil
                    cloner.previewAudioURL = nil
                }
                .buttonStyle(.bordered)
                
                Button("Looks Good!") {
                    // Save and continue
                }
                .buttonStyle(.borderedProminent)
            }
        }
    }
    
    let sampleText = """
    Hi, I'm setting up my voice for MessageAI. This voice will be used to read my messages aloud to my family. I can speak naturally, just like I'm talking to a friend. The quick brown fox jumps over the lazy dog. Thanks for setting this up with me!
    """
}

5. Not Included in MVP (Future Roadmap)

Explicitly Out of Scope for First Release:

1. Android Support

❌ iOS only for MVP
❌ Android version requires separate Kotlin codebase
❌ Will add in v2 if iOS proves successful

Rationale: Focus resources on single platform, iterate faster, prove concept before expanding.

2. Advanced Voice Features

❌ Real-time voice translation (too complex for MVP)
❌ Emotion detection in voice (interesting but not essential)
❌ Voice effects/filters (gimmicky for parent persona)
❌ Multi-language voice cloning (stick to English for MVP)
❌ Voice authentication/verification (security feature for later)

3. Offline Voice Processing

❌ On-device STT (the Whisper API runs in the cloud)
❌ On-device TTS (ElevenLabs runs in the cloud)
❌ Offline voice cloning (ElevenLabs cloning is cloud-only)

Rationale: The chosen voice AI providers all require cloud processing. Focus on aggressive caching instead.

4. Advanced Audio Features

❌ Audio effects (reverb, equalization, noise cancellation beyond iOS defaults)
❌ Background music for summaries
❌ Custom audio intros/outros
❌ Audio annotations/bookmarks

Rationale: Nice-to-haves that don't solve core problems.

5. Multi-Voice Conversation Playback (Phase 3A)

❌ Defer to post-MVP (Phase 3B Summary Audio is MVP)
❌ Audio stitching complexity too high for timeline
❌ Cost per use too expensive ($0.75 vs $0.53 for summary)

Rationale: Summary audio is roughly 30% cheaper per use ($0.53 vs $0.75), 3x faster, and more practical for daily use. Full playback can be added as a premium feature later.

6. Social/Sharing Features

❌ Share voice clones with other users
❌ Public voice message library
❌ Voice message reactions/comments beyond text
❌ Voice note playlists

Rationale: Focus on utility, not social features.

7. Analytics/Insights

❌ Voice usage statistics dashboard
❌ Most active speakers in group
❌ Conversation insights (sentiment analysis, etc.)
❌ Voice quality metrics

Rationale: Defer until users demand it.

8. Premium/Monetization Features

❌ Unlimited voice cloning (limit to 1 voice in MVP)
❌ Premium voices (celebrity, character voices)
❌ Advanced summarization options (custom styles, lengths)
❌ Priority processing (faster TTS/STT)

Rationale: Prove value first, monetize later.

9. Integration/Ecosystem

❌ Export audio to podcast apps
❌ Integration with Apple Calendar (auto-add extracted events)
❌ Siri Shortcuts integration
❌ HomeKit/Smart home integration (play on HomePod)
❌ Apple Watch companion app

Rationale: MVP is self-contained. Integrations are v2+.

10. Collaboration Features

❌ Shared voice profiles (family voice)
❌ Voice message threading/replies
❌ Collaborative audio editing
❌ Voice note folders/organization

Rationale: Core messaging + voice is enough complexity for MVP.

6. Technical Risks & Mitigation

Risk 1: Voice Cloning Quality

Risk: Users dissatisfied with cloned voice quality (doesn't sound like them).

Likelihood: Medium
Impact: High (core feature)

Mitigation:

Provide clear sample text that captures voice range
Allow preview before confirming
Allow re-recording unlimited times
Set expectations: "AI voice, ~80% similarity"
Fallback: Use high-quality default voice if user unhappy
Collect feedback early (beta testing critical)

Risk 2: API Costs Spiral Out of Control

Risk: ElevenLabs/OpenAI costs exceed budget as usage grows.

Likelihood: High
Impact: Critical (could kill project)

Mitigation:

Aggressive caching: Never regenerate same audio twice
Rate limiting: Max 50 voice generations per user per day
Smart defaults: Only generate TTS on explicit user request (not auto)
Cost monitoring: Alert if daily spend >$10
Fallback: Disable voice features if monthly budget exceeded
User quotas: Free tier = 100 voice messages/month, paid = unlimited
Batch processing: Generate multiple TTS requests in single API call where possible
Use cheaper models: GPT-4o-mini instead of GPT-4 for summaries
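
The per-user daily cap from the list above could be enforced with a counter keyed by user and day. This is a minimal in-memory sketch; a production version would presumably persist counts server-side. `DailyRateLimiter` is a hypothetical name, and bucketing days by UTC is an assumption.

```swift
import Foundation

/// In-memory sketch of "max 50 voice generations per user per day".
struct DailyRateLimiter {
    let dailyLimit: Int
    private var counts: [String: Int] = [:]   // key: "userId|dayIndex"

    init(dailyLimit: Int = 50) {
        self.dailyLimit = dailyLimit
    }

    /// Days since 1970 in UTC; used to bucket requests by calendar day.
    private func dayIndex(for date: Date) -> Int {
        Int((date.timeIntervalSince1970 / 86_400).rounded(.down))
    }

    /// Records one generation and returns true, or returns false once the limit is reached.
    mutating func allow(userId: String, at date: Date = Date()) -> Bool {
        let key = "\(userId)|\(dayIndex(for: date))"
        let used = counts[key, default: 0]
        guard used < dailyLimit else { return false }
        counts[key] = used + 1
        return true
    }
}
```

Because the key includes the day index, the quota resets without any scheduled cleanup.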

Budget Planning:

Monthly Budget: $200
- ElevenLabs: $100 (335k characters = ~6,700 messages)
- OpenAI Whisper: $50 (~8,300 transcriptions)
- OpenAI GPT-4o-mini: $50 (~100,000 summaries at ~$0.0005 each)

Per-user cost: ~$0.50/month (assumes 30 voice messages + 2 summaries)
Budget capacity: ~400 active users at $0.50/user
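
The monitoring thresholds and the user-capacity figure can be expressed as two small functions. This is a sketch under the numbers stated in this section ($10 daily alert, $200 monthly budget, $0.50 per user); `BudgetStatus`, `budgetStatus`, and `supportedUsers` are illustrative names.

```swift
import Foundation

enum BudgetStatus: Equatable {
    case ok
    case dailyAlert        // daily spend above the alert threshold
    case monthlyExceeded   // voice features should be disabled
}

/// Classifies current spend against the daily alert and monthly budget.
func budgetStatus(spentToday: Double,
                  spentThisMonth: Double,
                  dailyAlertThreshold: Double = 10,
                  monthlyBudget: Double = 200) -> BudgetStatus {
    if spentThisMonth >= monthlyBudget { return .monthlyExceeded }
    if spentToday > dailyAlertThreshold { return .dailyAlert }
    return .ok
}

/// Number of active users a monthly budget can cover at a given per-user cost.
func supportedUsers(monthlyBudget: Double, costPerUser: Double) -> Int {
    Int(monthlyBudget / costPerUser)
}
```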

Risk 3: Audio Processing Latency

Risk: TTS generation takes >5 seconds, feels slow and broken.

Likelihood: Medium
Impact: High (UX suffers)

Mitigation:

Optimistic UI: Show "Generating audio..." with progress indicator
Background generation: Pre-generate audio for recent messages on app open
Streaming: Explore ElevenLabs streaming TTS (plays while generating)
Caching strategy: Cache aggressively on device
Fallback: If generation >10 seconds, offer "Listen later" option
Server-side optimization: Use edge functions for faster API calls

Risk 4: Poor Transcription Accuracy

Risk: Whisper transcribes incorrectly, creates embarrassing/confusing messages.

Likelihood: Medium
Impact: Medium (user can edit, but annoying)

Mitigation:

Always show transcript before sending (user can review/edit)
Confidence scoring: Flag low-confidence transcriptions for review
Noise detection: Warn if background noise detected
Retry mechanism: "Transcription unclear, try recording again?"
User education: "Speak clearly, minimize background noise"
Quality settings: Allow user to choose "High quality" (slower) vs "Fast" mode
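
One way to approximate confidence scoring is to inspect per-segment statistics, assuming the backend requests Whisper's verbose JSON output and forwards its segment fields (`avg_logprob`, `no_speech_prob`) to the client. That pass-through and the thresholds below are assumptions that would need tuning on real recordings.

```swift
import Foundation

/// One transcript segment, assuming the backend forwards Whisper's verbose JSON fields.
struct TranscriptSegment: Decodable {
    let text: String
    let avgLogprob: Double
    let noSpeechProb: Double

    enum CodingKeys: String, CodingKey {
        case text
        case avgLogprob = "avg_logprob"
        case noSpeechProb = "no_speech_prob"
    }
}

/// Flags a transcript for review if any segment looks unreliable.
/// Threshold defaults are illustrative assumptions, not tuned values.
func needsReview(_ segments: [TranscriptSegment],
                 minAvgLogprob: Double = -1.0,
                 maxNoSpeechProb: Double = 0.6) -> Bool {
    segments.contains { $0.avgLogprob < minAvgLogprob || $0.noSpeechProb > maxNoSpeechProb }
}
```

A flagged transcript could trigger the "Transcription unclear, try recording again?" prompt described above.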

Risk 5: Voice Cloning Misuse/Abuse

Risk: Users clone others' voices without consent, use for impersonation.

Likelihood: Low (within family app)
Impact: High (legal/ethical issues)

Mitigation:

Terms of Service: Explicitly prohibit cloning others without consent
One voice per account: Users can only clone their own voice
No voice sharing: Can't export/share voice_id to other users
Synthetic audio detection: Rely on ElevenLabs' detection tooling (such as its AI speech classifier) to identify generated audio; verify what is available before launch
Reporting mechanism: Users can flag suspicious voice messages
Age verification: Require 18+ for voice cloning (protect minors)

Risk 6: Storage Costs for Audio Files

Risk: Audio files accumulate, storage costs balloon.

Likelihood: High
Impact: Medium (manageable with cleanup)

Mitigation:

Auto-deletion: Delete audio files after 30 days (configurable)
Compression: Use efficient codecs (AAC, Opus)
Selective storage: Only store audio for important messages (starred/pinned)
User control: "Delete all audio older than X days" option
Cloud optimization: Use Cloudflare R2 (cheapest storage, free egress)
Lifecycle policies: Automatically move old files to cold storage

Cost Projection:

Average voice message: 30 seconds = ~500KB (compressed)
1,000 users × 10 messages/day × 30 days = 300,000 messages
300,000 × 500KB = 150GB
Storage cost: 150GB × $0.015/GB = $2.25/month (Cloudflare R2)
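
The projection above can be checked with a short calculation. The function name is illustrative, and the figures (500 KB per message, $0.015 per GB) are the ones quoted in this section.

```swift
import Foundation

/// Storage accumulated over `days`, in decimal gigabytes.
func storageGB(users: Int, messagesPerUserPerDay: Int, days: Int, averageKB: Double) -> Double {
    let messages = Double(users * messagesPerUserPerDay * days)
    return messages * averageKB / 1_000_000   // KB to GB (decimal units)
}

let projectedGB = storageGB(users: 1_000, messagesPerUserPerDay: 10, days: 30, averageKB: 500)
let projectedCost = projectedGB * 0.015   // USD per month at $0.015/GB
print(projectedGB, projectedCost)
```

Without auto-deletion the stored volume grows by this amount every month, so the monthly bill rises linearly over time.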

Risk 7: Real-Time Performance Degradation

Risk: WebSocket connection struggles with audio uploads/downloads.

Likelihood: Medium
Impact: Medium (affects core messaging)

Mitigation:

Separate channels: Use HTTP for audio upload/download, WebSocket only for signaling
Chunked uploads: Upload large files in chunks
Progress indicators: Show upload/download progress clearly
Retry logic: Auto-retry failed uploads with exponential backoff
Compression: Compress audio before upload
CDN delivery: Serve audio from CDN, not origin server

Risk 8: Platform Restrictions (iOS/Android)

Risk: Apple/Google reject app for voice features or restrict functionality.

Likelihood: Low
Impact: Critical (could block launch)

Mitigation:

Review guidelines compliance: Study App Store/Play Store voice policies
Privacy disclosures: Clearly state audio recording/processing in privacy policy
Permissions: Request microphone permissions with clear explanation
Data handling: Comply with data retention/deletion requirements
No restricted content: Voice features can't be used for prohibited content
Age gating: Require 13+ (COPPA compliance)

Specific Concerns:

iOS: Apple requires clear disclosure of cloud voice processing
Android: Google requires runtime microphone permission with rationale
Both: Must allow users to delete all voice data

Risk 9: Accessibility Requirements

Risk: Voice features create barriers for deaf/hard-of-hearing users.

Likelihood: Medium
Impact: Medium (ethical + legal concerns)

Mitigation:

Always show transcript: Every voice message has text alternative
Visual indicators: Waveform, captions, speaker labels
Screen reader support: Full VoiceOver/TalkBack compatibility
Subtitles: Auto-generate captions for all audio playback
Haptic feedback: Vibration for audio-related actions
Settings: "Prefer text" mode disables auto-play, shows text by default

Risk 10: Network Failures During Critical Moments

Risk: User tries to send urgent voice message, upload fails due to poor network.

Likelihood: High
Impact: High (defeats purpose of hands-free messaging)

Mitigation:

Offline queueing: Store audio locally, auto-upload when connected
Retry logic: Exponential backoff with max 5 retries
Fallback: If transcription fails, send audio-only message
Network detection: Warn user if on weak connection
Local transcription attempt: Try on-device speech recognition as backup
Status indicators: Clear "Queued", "Uploading", "Sent" states
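
The retry policy can be sketched as a capped exponential backoff around an upload closure. This version is synchronous to keep it short; the base delay, the 30-second cap, and all names are assumptions, and a real client would use async uploads and persist the queue across launches.

```swift
import Foundation

/// Final states shown to the user for a voice message upload.
enum UploadState: Equatable {
    case queued, uploading, sent, failed
}

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
func backoffDelay(attempt: Int, base: TimeInterval = 1, cap: TimeInterval = 30) -> TimeInterval {
    min(cap, base * pow(2, Double(attempt - 1)))
}

/// Runs `upload`, retrying up to `maxRetries` times after the first failure.
/// `sleep` is injectable so tests can skip the real delay.
func uploadWithRetry(maxRetries: Int = 5,
                     sleep: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) },
                     upload: () throws -> Void) -> UploadState {
    var retries = 0
    while true {
        do {
            try upload()
            return .sent
        } catch {
            retries += 1
            if retries > maxRetries { return .failed }
            sleep(backoffDelay(attempt: retries))
        }
    }
}
```

When the function returns `.failed`, the message would stay in the local queue and be retried once connectivity returns.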

7. Success Criteria & KPIs

Launch Goals (First 30 Days)

Adoption Metrics:

60%+ of users try voice-to-text within first week
40%+ of users complete voice cloning setup
30%+ of users listen to at least one message via TTS
20%+ of users try conversation summary feature

Engagement Metrics:

25%+ of messages sent via voice-to-text
15%+ of messages played via TTS
5+ conversation summaries generated per active user
70%+ retention after first voice message sent

Quality Metrics:

<5% transcription error rate (user-reported)
>80% voice cloning satisfaction (survey)
<3 second average TTS generation time
>90% audio playback success rate (no errors)

Technical Metrics:

<2% API failure rate (ElevenLabs, OpenAI)
<1% audio upload failures
<5 second average message send latency (including voice)
Zero security incidents (voice data leaks)

Cost Metrics:

Stay under $200/month API budget
<$0.50 per active user per month
<10% over-budget on any single API

Rubric Score:

100+ points total (A+ with bonuses)
10/10 on Advanced AI Capability
+3 Innovation bonus
+3 Polish bonus

8. Development Timeline

Phase 1: Voice-to-Text (Weeks 1-2)

Week 1:

Day 1-2: Audio recording UI/UX (SwiftUI + AVFoundation)
Day 3-4: Whisper API integration
Day 5: Transcription display & editing
Day 6-7: Message storage (database schema, file upload)

Week 2:

Day 1-3: End-to-end testing (record → transcribe → send)
Day 4-5: Error handling & edge cases
Day 6-7: Polish, optimizations, offline support

Deliverable: Working voice-to-text messaging

Phase 2: Text-to-Voice (Weeks 3-4)

Week 3:

Day 1-2: Voice cloning UI flow
Day 3-4: ElevenLabs Instant Voice Cloning API integration
Day 5-7: User profile voice storage & preview

Week 4:

Day 1-3: TTS generation on demand
Day 4-5: Audio player component (play, pause, speed, seek)
Day 6-7: Caching strategy & offline playback

Deliverable: Every message playable in sender's voice

Phase 3: Conversation Audio (Weeks 5-6)

Week 5 (Summary Path - RECOMMENDED):

Day 1-2: GPT-4 summarization logic
Day 3-4: Summary UI/UX ("Hear Summary" button)
Day 5-7: TTS generation for summaries, storage

Week 6:

Day 1-3: Smart triggering (suggest summary for long threads)
Day 4-5: Polish and optimization
Day 6-7: End-to-end testing, demo prep

Deliverable: AI audio summaries for group chats

Week 7: Integration, Testing, Demo

Integration Week:

Day 1-2: Integrate all 3 phases into main app
Day 3-4: Cross-feature testing (voice-to-text + TTS + summary)
Day 5: Performance testing, cost analysis

Demo Prep:

Day 6: Record demo video (5-7 minutes)
Day 7: Write Persona Brainlift doc, create social post

Buffer: Week 8 for polish, bug fixes, documentation

9. Testing Strategy

Unit Testing

Voice-to-Text:

Audio recording starts/stops correctly
Transcription API called with correct parameters
Transcript displayed and editable
Message saves with audio URL and transcript

Text-to-Voice:

Voice cloning API called correctly
voice_id stored in database
TTS generation uses correct voice_id
Audio caching works (second play is instant)
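
One way the caching check could be exercised is sketched below. This assumes a simple in-memory cache keyed by `voice_id` and message text. `TTSCache` and its `synthesize` callable are hypothetical stand-ins for illustration, not the ElevenLabs API or the app's actual storage layer.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """In-memory cache keyed by (voice_id, text); hypothetical helper."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize  # stand-in for the real TTS call
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key(voice_id: str, text: str) -> str:
        # The separator keeps ("ab", "c") distinct from ("a", "bc").
        raw = f"{voice_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_audio(self, voice_id: str, text: str) -> bytes:
        k = self.key(voice_id, text)
        if k not in self._store:
            # Cache miss: generate once, then reuse on every later play.
            self._store[k] = self._synthesize(voice_id, text)
        return self._store[k]
```

A unit test can pass a counting fake as `synthesize` and assert that playing the same message twice triggers only one generation, which is the "second play is instant" behavior described above.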

Conversation Audio:

Summary generation captures key points
Audio generates for summaries
Trigger logic works (20+ messages)

Integration Testing

Record voice → transcribe → send → other user receives both audio & text
Clone voice → send text message → other user plays in sender's voice
Long group thread → generate summary → play audio → verify accuracy
Offline: record voice → go online → transcription completes

User Acceptance Testing (Beta)

Recruit 10 beta testers (actual busy parents):

5 working parents with kids in activities
2 caregivers for elderly parents
2 single parents with shared custody
1 parent with vision impairment

Testing Scenarios:

Driving scenario: Send voice message while simulating driving
Cooking scenario: Listen to messages while hands are messy
Group chat catchup: 50+ message thread, test summary
Voice cloning: Complete setup, verify satisfaction with voice quality
Multi-day usage: Use app for 7 days, report friction points

Feedback Collection:

Daily diary (what worked, what didn't)
Exit survey (NPS, feature ratings)
Usage analytics (which features used most)

Performance Testing

Transcription latency: Average <2 seconds for 30-second audio
TTS generation: Average <3 seconds for 50-character message
Summary generation: <10 seconds for 50-message thread
Audio playback: No stuttering/buffering on 4G connection
App launch: <2 seconds cold start with voice features enabled

Cost Testing

Simulate usage patterns:

100 users × 10 voice messages/day × 30 days = 30,000 transcriptions
Audio volume: 30,000 × 30 seconds average ÷ 60 = 15,000 minutes
Cost: 15,000 minutes × $0.006/minute = $90 (within budget ✅)
100 users × 30 messages/day × 30 days × 50% play rate = 45,000 TTS generations
Text volume: 45,000 × 50 characters = 2.25M characters
Cost: 2.25M characters × $0.30 per 1k characters = $675 (OVER BUDGET ❌)

Adjustment needed: Aggressive caching reduces this to ~5,000 unique TTS generations (250k characters × $0.30 per 1k = $75 ✅), for a combined total of about $165 against the $200/month budget.
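
The arithmetic above can be reproduced with a short script, which makes it easy to rerun the estimate with different usage assumptions. The function and constant names are illustrative. The rates ($0.006 per minute, $0.30 per 1k characters) are the figures used in this section, not values checked against current provider pricing.

```python
WHISPER_USD_PER_MINUTE = 0.006  # rate used in the estimate above
TTS_USD_PER_1K_CHARS = 0.30     # rate used in the estimate above


def transcription_cost(users: int, msgs_per_day: int, days: int,
                       avg_seconds: float) -> float:
    """Monthly transcription cost in USD, rounded to cents."""
    minutes = users * msgs_per_day * days * avg_seconds / 60
    return round(minutes * WHISPER_USD_PER_MINUTE, 2)


def tts_cost(generations: int, avg_chars: int) -> float:
    """Monthly TTS cost in USD, rounded to cents."""
    return round(generations * avg_chars / 1000 * TTS_USD_PER_1K_CHARS, 2)


if __name__ == "__main__":
    uncached = int(100 * 30 * 30 * 0.5)  # 45,000 plays at a 50% play rate
    print("Transcription:", transcription_cost(100, 10, 30, 30))
    print("TTS, no caching:", tts_cost(uncached, 50))
    print("TTS, ~5,000 unique generations:", tts_cost(5_000, 50))
```

Run as a script, this prints 90.0, 675.0, and 75.0, matching the figures above.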

10. Open Questions for Review

Technical Decisions to Make:

Q1: React Native or Native Development?

Recommendation: React Native (faster, cross-platform)
Your preference: _______________
Rationale: _______________

Q2: Phase 3 - Multi-Voice Playback OR Summary Audio?

Recommendation: Summary Audio (simpler, cheaper, more practical)
Your preference: _______________
Rationale: _______________

Q3: Voice Cloning Required or Optional?

Recommendation: Optional (fallback to default voice)
Your preference: _______________
Rationale: _______________

Q4: Auto-Play TTS or Explicit Play Button?

Recommendation: Explicit button (respect user's attention)
Your preference: _______________
Rationale: _______________

Q5: Audio File Retention Period?

Recommendation: 30 days (balance storage cost and usefulness)
Your preference: _______________
Rationale: _______________

Product Decisions to Make:

Q6: Free Tier Voice Limits?

Recommendation: 100 voice messages/month on the free tier; unlimited after upgrading
Your preference: _______________
Rationale: _______________

Q7: Voice Cloning Age Requirement?

Recommendation: 18+ (protect minors from impersonation)
Your preference: _______________
Rationale: _______________

Q8: Default Voice Selection?

Recommendation: Let user choose from 5 ElevenLabs pre-made voices
Your preference: _______________
Rationale: _______________

Q9: Voice Message Max Length?

Recommendation: 2 minutes (balance cost and usefulness)
Your preference: _______________
Rationale: _______________

Q10: Conversation Summary Max Messages?

Recommendation: 200 messages max per summary (cost control)
Your preference: _______________
Rationale: _______________

11. Dependencies & Prerequisites

Required Before Starting Development:

Accounts & Access:

OpenAI API key (for Whisper + GPT)
ElevenLabs account (Starter plan minimum: $5/month)
Cloud storage account (Cloudflare R2 or AWS S3)
Database hosting (Railway, Supabase, or Render)

Development Environment:

Node.js 20+ installed
Expo CLI available via npx expo (ships with the expo package; the global expo-cli package is deprecated)
iOS Simulator (Mac) or Android Emulator
Physical devices for testing (iOS + Android)

Backend Setup:

PostgreSQL database provisioned
File storage bucket created
Environment variables configured
HTTPS domain for API (required for mobile)

Third-Party Services:

Firebase Auth or Auth0 (existing)
WebSocket server running (existing)
Existing MessageAI backend functional

12. Next Steps After PRD Approval

1. Review & Feedback Session (30 minutes)
   - Go through each section
   - Make decisions on open questions
   - Identify any missing requirements
2. Tech Stack Finalization (1 hour)
   - Confirm React Native + Node.js
   - Set up API accounts (OpenAI, ElevenLabs)
   - Test API calls with sample audio
3. Database Schema Updates (2 hours)
   - Write migration scripts
   - Add voice-related columns
   - Test schema changes
4. Sprint Planning (1 hour)
   - Break down Phase 1 into tickets
   - Assign story points
   - Set up project board (GitHub Projects, Jira, etc.)
5. Kickoff Development (Week 1, Day 1)
   - Start with audio recording UI
   - Parallel: Backend Whisper integration
   - Daily standups to track progress

13. Appendix

A. Sample Voice Cloning Script

"Hi, I'm [Your Name]. I'm setting up my voice for MessageAI so my family 
can hear my messages read aloud. This is my natural speaking voice. I'm 
excited to try this new feature. The quick brown fox jumps over the lazy 
dog. Thanks for listening!"

(30 seconds, covers vocal range)

B. Error Messages & User Communication

Transcription Failed:

"Couldn't transcribe audio. Please try again in a quieter environment."

Voice Cloning Failed:

"We couldn't process your voice sample. Please re-record in a quiet space and speak clearly for 30 seconds."

TTS Generation Failed:

"Audio unavailable. You can still read the message."

Summary Generation Failed:

"Couldn't generate summary right now. Try again in a moment."

Over Quota:

"You've reached your monthly voice limit (100 messages). Upgrade for unlimited."
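
Keeping these strings in one lookup table helps the UI copy stay consistent across screens. The sketch below is illustrative only: the `VoiceFailure` enum and `user_message` helper are hypothetical names, while the message text is copied from the list above.

```python
from enum import Enum


class VoiceFailure(Enum):
    """Hypothetical failure categories, one per message above."""
    TRANSCRIPTION = "transcription"
    VOICE_CLONING = "voice_cloning"
    TTS = "tts"
    SUMMARY = "summary"
    OVER_QUOTA = "over_quota"


USER_MESSAGES = {
    VoiceFailure.TRANSCRIPTION:
        "Couldn't transcribe audio. Please try again in a quieter environment.",
    VoiceFailure.VOICE_CLONING:
        "We couldn't process your voice sample. Please re-record in a quiet "
        "space and speak clearly for 30 seconds.",
    VoiceFailure.TTS:
        "Audio unavailable. You can still read the message.",
    VoiceFailure.SUMMARY:
        "Couldn't generate summary right now. Try again in a moment.",
    VoiceFailure.OVER_QUOTA:
        "You've reached your monthly voice limit (100 messages). "
        "Upgrade for unlimited.",
}


def user_message(failure: VoiceFailure) -> str:
    """Return the user-facing text for a failure category."""
    return USER_MESSAGES[failure]
```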

C. Privacy & Data Handling

What We Store:

Voice recordings (for messages you send)
Voice cloning samples (to create your voice profile)
Generated audio files (for playback)
Transcripts (searchable text)

What We Don't Store:

Real-time audio streams
Voice data from other apps
Biometric voice fingerprints

User Rights:

Delete all voice data at any time
Export voice recordings
Revoke voice cloning (removes voice_id)
Opt-out of voice features entirely

Compliance:

GDPR: Right to erasure, data portability
CCPA: Right to know, delete
COPPA: No voice data from users <13

D. Competitive Analysis

Why MessageAI Voice Features Are Unique:

Feature	MessageAI	WhatsApp	Telegram	Discord
Voice-to-Text	✅ Whisper AI	❌	✅ Basic	❌
Voice Cloning	✅ Your voice	❌	❌	❌
Text-to-Voice	✅ Sender's voice	❌	❌	Basic TTS
AI Summaries	✅ GPT-4	❌	❌	❌
Multi-Voice Playback	Planned (Phase 3 option, pending Q2)	❌	❌	❌
Persona-Specific	✅ Busy parents	❌ General	❌ General	❌ Gamers

Our Moat: Voice cloning + AI summarization for family coordination.

14. Sign-Off

Product Manager: _________________ Date: _______

Engineering Lead: _________________ Date: _______

Design Lead: _________________ Date: _______

Document Status: 🟡 Draft - Awaiting Review

Next Review Date: _________________

Approved for Development: [ ] Yes [ ] No

Questions? Comments? Edits?

Areas that need more detail:

_______________________________________________
_______________________________________________
_______________________________________________

Concerns or blockers:

_______________________________________________
_______________________________________________
_______________________________________________

Additional features to consider:

_______________________________________________
_______________________________________________
_______________________________________________
