
On-device speech-to-text: extracts clean 16kHz audio, downloads and runs local models, returns transcripts with timing, language and confidence — no server or audio upload.
On-device speech-to-text for iOS and Android, from one Kotlin core.
Docs and quickstart: eknuth.github.io/earshot
Hand Earshot an audio or video file and it gives you back the transcript, running entirely on the phone. No server, no per-user cost, no audio leaving the device. The only network call in the whole pipeline is the one-time model download on first run.
Earshot is the transcription engine extracted out of a working app and cleaned up into a reusable Kotlin Multiplatform library with a small, honest API.
I did not train a speech model. Whisper is OpenAI's. On iOS the Whisper runtime is WhisperKit by Argmax. On Android the model is a Microsoft Olive export of Whisper running on ONNX Runtime.
Earshot is the part around the model: getting it onto the device, extracting clean 16kHz audio from whatever file you started with, running the model inside a phone's memory and battery budget, and exposing one API that behaves the same on both platforms. That integration layer is the hard, unglamorous part of shipping a model onto someone's phone, and it is the part this library is about.
Same model family, two on-device runtimes, one shared Kotlin core orchestrating both.

| Stage | iOS | Android |
|---|---|---|
| Audio extraction | AVFoundation (AVAssetReader) | MediaExtractor + MediaCodec |
| ASR runtime | WhisperKit (CoreML) | ONNX Runtime + Extensions |
| Model | Whisper (CoreML, fetched + cached by WhisperKit) | Whisper (Olive ONNX, int8) |
| Model download | WhisperKit, internal | ModelDownloader (plain HTTPS) |

The cross-platform contract lives in commonMain as expect classes
(AudioExtractor, TranscriptionEngine, ModelDownloader) with platform actual
implementations. OnDeviceTranscriber is a thin facade that wires the extractor and
engine together.
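
As a rough illustration of what the audio extraction stage has to produce, here is a minimal sketch in plain Kotlin that downmixes interleaved 16-bit PCM to mono and resamples it to 16 kHz. It assumes the decoder (AVAssetReader or MediaCodec) has already produced raw PCM. The function names are hypothetical and are not part of Earshot's API; the library's actual extractors may do this differently, for example by asking the platform decoder for the target format directly.

```kotlin
// Hypothetical helpers for illustration only; not Earshot's implementation.

/** Averages interleaved multi-channel 16-bit PCM into mono floats in [-1, 1). */
fun downmixToMono(interleaved: ShortArray, channels: Int): FloatArray {
    require(channels > 0 && interleaved.size % channels == 0)
    val frames = interleaved.size / channels
    return FloatArray(frames) { f ->
        var sum = 0f
        for (c in 0 until channels) sum += interleaved[f * channels + c] / 32768f
        sum / channels
    }
}

/**
 * Linear-interpolation resampler. A production pipeline would low-pass
 * filter before downsampling to avoid aliasing; this sketch does not.
 */
fun resampleLinear(input: FloatArray, fromRate: Int, toRate: Int = 16_000): FloatArray {
    require(fromRate > 0 && toRate > 0)
    if (input.isEmpty() || fromRate == toRate) return input.copyOf()
    val outLen = (input.size.toLong() * toRate / fromRate).toInt()
    val step = fromRate.toDouble() / toRate
    return FloatArray(outLen) { i ->
        val pos = i * step
        val i0 = pos.toInt().coerceAtMost(input.size - 1)
        val i1 = (i0 + 1).coerceAtMost(input.size - 1)
        val frac = (pos - i0).toFloat()
        input[i0] * (1 - frac) + input[i1] * frac
    }
}
```

One second of 48 kHz stereo input (96,000 interleaved samples) comes out of these two steps as 16,000 mono samples.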

    // build.gradle.kts
    dependencies {
        implementation("dev.eknuth:earshot:0.3.0")
    }

In Xcode: File → Add Package Dependencies, enter https://github.com/eknuth/earshot, and pick the latest version. Or add it to a Package.swift:

    dependencies: [
        .package(url: "https://github.com/eknuth/earshot", from: "0.3.0")
    ]

This vends the shared Earshot framework, which is the cross-platform API. The iOS ASR runtime itself is WhisperKit, so also add WhisperKit and register a provider at launch, as shown under iOS below.

Maintainers: the release process (Maven Central + the SPM XCFramework) is in PUBLISHING.md.

    class OnDeviceTranscriber(engine: TranscriptionEngine, audioExtractor: AudioExtractor) {
        suspend fun prepare(config: TranscriptionConfig = TranscriptionConfig()): Boolean
        fun isReady(): Boolean
        fun modelStatus(): ModelStatus
        suspend fun transcribeAudio(wavPath: String): TranscriptionEngineResult
        suspend fun transcribeMedia(mediaPath: String, scratchWavPath: String): TranscriptionEngineResult
        fun release()
    }

TranscriptionEngineResult is a sealed type: either Success(text, language, confidence, processingTimeMs) or Error(message, cause).
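
To show why a sealed result is convenient at the call site, here is a small self-contained sketch. The class below is a stand-in that mirrors the shape described above; the field types (nullable language and confidence, a Long for the timing) are assumptions, and `describe` is a hypothetical helper, not something the library ships.

```kotlin
// Stand-in mirroring the result shape described above; field types are assumptions.
sealed class TranscriptionEngineResult {
    data class Success(
        val text: String,
        val language: String?,
        val confidence: Float?,
        val processingTimeMs: Long,
    ) : TranscriptionEngineResult()

    data class Error(
        val message: String,
        val cause: Throwable? = null,
    ) : TranscriptionEngineResult()
}

/** Hypothetical helper: collapses a result into a single display string. */
fun describe(result: TranscriptionEngineResult): String = when (result) {
    // The compiler checks that every subclass is handled, so no else branch is needed.
    is TranscriptionEngineResult.Success ->
        "${result.text} [${result.language ?: "unknown"}, ${result.processingTimeMs} ms]"
    is TranscriptionEngineResult.Error ->
        "failed: ${result.message}"
}
```

Because the `when` is exhaustive over the sealed hierarchy, adding a third result variant later would turn every unhandled call site into a compile error rather than a silent fallthrough.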

    val modelsDir = File(context.filesDir, "models")
    val downloader = ModelDownloader(modelsDir)
    downloader.downloadModelSync(WhisperModels.WHISPER_TINY_EN)

    val engine = TranscriptionEngine().apply {
        setModelPath(File(modelsDir, WhisperModels.WHISPER_TINY_EN.name).absolutePath)
    }
    val transcriber = OnDeviceTranscriber(engine, AudioExtractor())

    // prepare() and transcribeMedia() are suspend functions: call them from a coroutine.
    transcriber.prepare()
    when (val r = transcriber.transcribeMedia(videoPath, "${context.cacheDir}/clip.wav")) {
        is TranscriptionEngineResult.Success -> println("${r.text} (${r.processingTimeMs}ms)")
        is TranscriptionEngineResult.Error -> println("failed: ${r.message}")
    }

Register the WhisperKit provider once at launch, then use the same shared API. The reference Swift glue is in ios-support/WhisperKitTranscriptionProvider.swift; add WhisperKit via Swift Package Manager.

    // at launch
    NativeTranscriptionProviderHolder.shared.implementation = WhisperKitTranscriptionProvider()

    // anywhere after
    let transcriber = OnDeviceTranscriber(engine: TranscriptionEngine(),
                                          audioExtractor: AudioExtractor())
    _ = try await transcriber.prepare()
    let result = try await transcriber.transcribeAudio(wavPath: wavPath)

The point of on-device is that you can prove it works where it runs, so this library exists to be measured, and the numbers come from real hardware. Scored on 25 LibriSpeech clips, Whisper tiny.en lands at 8.38% word error on an iPad's Neural Engine (WhisperKit / CoreML) and 8.98% on a Pixel 9a (ONNX Runtime). On the iPad it transcribes at about 0.02x real time, roughly a minute of speech per second.
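
To make the speed figure concrete, here is a small sketch of the arithmetic. It assumes the quoted number is a real-time factor in the usual sense, processing time divided by audio duration, which is consistent with the "minute of speech per second" reading above. Both functions are hypothetical helpers, not part of the library or its benchmark harness.

```kotlin
/** Real-time factor: processing time divided by audio duration (lower is faster). */
fun realTimeFactor(processingTimeMs: Long, audioDurationMs: Long): Double {
    require(audioDurationMs > 0) { "audio duration must be positive" }
    return processingTimeMs.toDouble() / audioDurationMs
}

/** Seconds of audio transcribed per second of wall-clock time at a given factor. */
fun audioSecondsPerSecond(rtf: Double): Double {
    require(rtf > 0.0) { "real-time factor must be positive" }
    return 1.0 / rtf
}
```

A 10-second clip that takes 200 ms to transcribe has a factor of 0.02, which works out to 50 seconds of audio per second; with the published figure rounded to two decimals, that is in line with "roughly a minute".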
For a model that shares one runtime across both platforms, Earshot also runs NVIDIA Parakeet-TDT-0.6b-v3 (600M params) through sherpa-onnx. On the same Pixel 9a it cuts word error to 2.59%, more than a 3x improvement over Whisper tiny.en, and decodes faster per clip because a transducer runs in one pass rather than Whisper's beam search. The cost is memory: about 1.17GB peak versus 241MB. That trade, far more accurate against far heavier, is the kind of thing this harness exists to measure on hardware you own, not on a datacenter GPU.
Earshot also runs NVIDIA Nemotron-Speech-Streaming-En-0.6b, the streaming sibling of Parakeet. It is a cache-aware streaming model (the benchmarked export uses a 1120ms chunk), so it runs through sherpa's online recognizer rather than the offline one. On the iPad it scores 13.77% word error at about 0.11x real time with a 720MB peak. Streaming buys incremental, low-latency output as audio arrives, and the accuracy gap against the offline Parakeet is the price of decoding in chunks rather than seeing the whole clip at once.
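
For a sense of scale, the sketch below works out what a 1120 ms chunk means in samples at 16 kHz and splits a buffer accordingly. This is only the arithmetic: the names are hypothetical, and it does not describe how sherpa's online recognizer buffers or caches audio internally.

```kotlin
/** Number of samples in a chunk of the given duration. */
fun samplesPerChunk(chunkMs: Int, sampleRate: Int = 16_000): Int =
    (chunkMs.toLong() * sampleRate / 1000).toInt()

/** Splits mono samples into consecutive chunks; the last chunk may be shorter. */
fun splitIntoChunks(
    samples: FloatArray,
    chunkMs: Int,
    sampleRate: Int = 16_000,
): List<FloatArray> {
    val size = samplesPerChunk(chunkMs, sampleRate)
    require(size > 0) { "chunk must contain at least one sample" }
    return (samples.indices step size).map { start ->
        samples.copyOfRange(start, minOf(start + size, samples.size))
    }
}
```

At 16 kHz a 1120 ms chunk is 17,920 samples, so a 2.5-second clip (40,000 samples) splits into two full chunks and a 4,160-sample remainder.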
Word error rate is scored offline by one algorithm over identical references, so the
runtimes are comparable by construction. The harness, per-clip results, speed and memory
are in benchmark/ and rendered at
the benchmarks page; see
benchmark/README.md to reproduce on your own device.
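
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a generic implementation, not the scorer in benchmark/; in particular the text normalization (lowercasing, stripping punctuation) is an assumption, and the real harness may normalize differently.

```kotlin
/** Lowercases, replaces punctuation with spaces, and splits on whitespace. */
fun tokenize(s: String): List<String> =
    s.lowercase()
        .replace(Regex("[^\\p{L}\\p{N}'\\s]"), " ")
        .trim()
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }

/** WER = (substitutions + deletions + insertions) / reference word count. */
fun wordErrorRate(reference: String, hypothesis: String): Double {
    val ref = tokenize(reference)
    val hyp = tokenize(hypothesis)
    require(ref.isNotEmpty()) { "reference must contain at least one word" }

    // Two-row Levenshtein distance over words rather than characters.
    var prev = IntArray(hyp.size + 1) { it }
    for (i in 1..ref.size) {
        val cur = IntArray(hyp.size + 1)
        cur[0] = i
        for (j in 1..hyp.size) {
            val cost = if (ref[i - 1] == hyp[j - 1]) 0 else 1
            cur[j] = minOf(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = cur
    }
    return prev[hyp.size].toDouble() / ref.size
}
```

Running every runtime's output through one function like this, against the same reference transcripts, is what makes the percentages comparable across engines.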
See MODELS.md for each model, where it comes from, and its license.
sample-android/ is a minimal Compose app: pick a file, transcribe on-device, see the text and timing. Build it with ./gradlew :sample-android:assembleDebug.

ios-sample/ is the SwiftUI equivalent, wired with XcodeGen and WhisperKit. See its README to build and run.

Transcription is real and working on both platforms today. Audio extraction, the Whisper runtimes, and model download all run on-device.
MIT. See LICENSE. The bundled integration code is MIT; the speech models it loads carry their own licenses (see MODELS.md).