[Android] Kotlin API improvements for audio LLM models #19817

@kirklandsign

🚀 The feature, motivation and pitch

ExecuTorch's Android API already has foundational audio support (LlmModule.prefillAudio, AsrModule, C++ MultimodalRunner), but several gaps remain for running decoder-only audio LLMs like Voxtral smoothly from Java/Kotlin.

Current State

- LlmModule has prefillAudio(float[]/byte[], ...) and prefillRawAudio(byte[], ...) -- these work, but accept only arrays and, for raw audio, only byte[]
- AsrModule is encoder-decoder specific (Whisper-style) -- not applicable to decoder-only audio LLMs
- Model type constants: only MODEL_TYPE_TEXT (1) and MODEL_TYPE_MULTIMODAL (2) -- no audio-specific semantics

Proposed Improvements

1. Add ByteBuffer variants for audio prefill (zero-copy parity with image API)

Image prefill has prefillImages(ByteBuffer, ...) for zero-copy JNI, but audio prefill only accepts arrays. For long audio (Voxtral handles multi-minute clips), the extra JNI copy is wasteful.
```kotlin
// Missing today:
fun prefillAudio(audio: ByteBuffer, batchSize: Int, nBins: Int, nFrames: Int)
fun prefillRawAudio(audio: ByteBuffer, batchSize: Int, nChannels: Int, nSamples: Int)
```
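For illustration, a caller-side sketch of preparing audio for such a ByteBuffer overload. It uses only java.nio; `floatsToDirectBuffer` is a hypothetical helper name, and the assumption that the proposed overloads would expect a direct buffer in native byte order is mine, not something the current API states.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper (not part of ExecuTorch): packs float samples into a
// direct, native-order buffer that JNI code could read without copying.
fun floatsToDirectBuffer(samples: FloatArray): ByteBuffer {
    val buffer = ByteBuffer
        .allocateDirect(samples.size * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
    // Writing through the float view leaves the byte buffer's position at 0.
    buffer.asFloatBuffer().put(samples)
    return buffer
}
```

This helper still copies once from the FloatArray into the buffer. A capture source that writes straight into a direct buffer (on Android, AudioRecord has a read overload that takes a ByteBuffer) would avoid that copy as well.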
2. Add WAV file path API to LlmModule

AsrModule can accept WAV paths directly (internally uses load_wav_audio_data()), but LlmModule requires the caller to manually decode audio in Java. For audio LLMs this should be as easy as:
```kotlin
llmModule.prefillAudioFromFile("/path/to/audio.wav")
```
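To make the "manually decode audio" burden concrete, here is a minimal sketch of what a caller has to write today for the simplest case. `WavAudio` and `decodePcm16Wav` are hypothetical names, not ExecuTorch APIs, and the sketch assumes uncompressed 16-bit PCM only; other encodings are rejected rather than handled.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical container for decoded audio; samples are interleaved floats in [-1, 1).
data class WavAudio(val sampleRate: Int, val channels: Int, val samples: FloatArray)

// Hypothetical helper: walks the RIFF chunks of a WAV file held in memory
// and converts the 16-bit PCM payload to floats.
fun decodePcm16Wav(bytes: ByteArray): WavAudio {
    require(bytes.size >= 12) { "too short for a RIFF header" }
    require(String(bytes, 0, 4, Charsets.US_ASCII) == "RIFF") { "not a RIFF file" }
    require(String(bytes, 8, 4, Charsets.US_ASCII) == "WAVE") { "not a WAVE file" }
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)

    var pos = 12
    var sampleRate = 0
    var channels = 0
    var formatSeen = false
    while (pos + 8 <= bytes.size) {
        val id = String(bytes, pos, 4, Charsets.US_ASCII)
        val size = buf.getInt(pos + 4)
        val body = pos + 8
        require(size >= 0 && size <= bytes.size - body) { "chunk '$id' overruns the file" }
        when (id) {
            "fmt " -> {
                require(size >= 16) { "fmt chunk too small" }
                val format = buf.getShort(body).toInt()
                channels = buf.getShort(body + 2).toInt()
                sampleRate = buf.getInt(body + 4)
                val bitsPerSample = buf.getShort(body + 14).toInt()
                require(format == 1 && bitsPerSample == 16) {
                    "this sketch handles 16-bit PCM only"
                }
                formatSeen = true
            }
            "data" -> {
                check(formatSeen) { "data chunk appeared before fmt chunk" }
                val samples = FloatArray(size / 2) { i -> buf.getShort(body + 2 * i) / 32768f }
                return WavAudio(sampleRate, channels, samples)
            }
        }
        // Chunks are padded to an even length; unknown chunks are skipped.
        pos = body + size + (size and 1)
    }
    throw IllegalArgumentException("no data chunk found")
}
```

A path-based API would let callers drop this code, along with the resampling and feature extraction that would normally follow it.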
3. Add MODEL_TYPE_TEXT_AUDIO constant

Currently MODEL_TYPE_MULTIMODAL (aliased as MODEL_TYPE_TEXT_VISION) has vision-centric naming. Adding an explicit audio model type improves discoverability and documentation:
```kotlin
const val MODEL_TYPE_TEXT_AUDIO = 3
```
4. Better raw audio type support

prefillRawAudio(byte[], ...) is awkward -- real PCM audio is typically short[] (16-bit) or float[] (32-bit). Add typed variants:
```kotlin
fun prefillRawAudio(audio: ShortArray, batchSize: Int, nChannels: Int, nSamples: Int) // PCM-16
fun prefillRawAudio(audio: FloatArray, batchSize: Int, nChannels: Int, nSamples: Int) // float32
```
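As a rough idea of what a ShortArray overload might do internally, the sketch below normalizes PCM-16 samples to float32 after checking the declared shape. `pcm16ToFloat32` is a hypothetical name, and both the interleaved layout and the divide-by-32768 scaling are assumptions; the issue does not say what layout or range the native runner expects.

```kotlin
// Hypothetical helper (not part of ExecuTorch): converts interleaved PCM-16
// samples to float32 in [-1, 1), validating the declared shape first.
fun pcm16ToFloat32(pcm: ShortArray, nChannels: Int, nSamples: Int): FloatArray {
    require(nChannels > 0 && nSamples >= 0) { "invalid shape" }
    require(pcm.size == nChannels * nSamples) {
        "expected ${nChannels * nSamples} samples, got ${pcm.size}"
    }
    return FloatArray(pcm.size) { i -> pcm[i] / 32768f }
}
```

With typed overloads, a shape mismatch like this could be reported in Kotlin before the data crosses the JNI boundary.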
5. Audio-specific configuration in LlmModuleConfig

Add fields for audio preprocessing parameters:
```kotlin
data class LlmModuleConfig(
    // ... existing fields ...
    val sampleRate: Int = 16000,
    val preprocessorPath: String? = null, // optional .pte for mel spectrogram extraction
)
```
6. Unified multimodal generation entry point

Currently audio prefill and text generation are separate calls, with no way to express both in a single request. Consider a builder pattern:
```kotlin
llmModule.generate {
    audio("/path/to/audio.wav")
    prompt("Transcribe the above audio:")
    maxSeqLen(512)
    onToken { token -> /* stream */ }
}
```
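One way such a builder could be structured is with a Kotlin lambda-with-receiver. The sketch below only collects the request; `GenerateRequest`, `buildGenerateRequest`, and every property name are hypothetical, and the step that would actually call prefill and generate on LlmModule is deliberately left out.

```kotlin
// Hypothetical request object; none of these names exist in ExecuTorch today.
class GenerateRequest {
    var audioPath: String? = null
        private set
    var promptText: String = ""
        private set
    var sequenceLimit: Int = 128 // placeholder default, chosen arbitrarily for the sketch
        private set
    var tokenCallback: (String) -> Unit = {}
        private set

    fun audio(path: String) { audioPath = path }
    fun prompt(text: String) { promptText = text }
    fun maxSeqLen(value: Int) {
        require(value > 0) { "maxSeqLen must be positive" }
        sequenceLimit = value
    }
    fun onToken(callback: (String) -> Unit) { tokenCallback = callback }
}

// Runs the caller's block with a fresh GenerateRequest as the receiver.
// A real generate() would consume the result: prefill audio, then decode text.
fun buildGenerateRequest(block: GenerateRequest.() -> Unit): GenerateRequest =
    GenerateRequest().apply(block)
```

Collecting everything into one immutable-from-outside request also gives the implementation a single place to validate combinations, such as audio supplied to a text-only model.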

Alternatives

No response

Additional context

No response

RFC (Optional)

No response
