How MediaFind transcribes your media entirely on-device with Whisper

Every other capability in MediaFind — semantic search, Ask, chapters, speaker labels — is downstream of one thing: a good transcript with accurate timestamps. If the transcript is wrong or the timing drifts, “jump to the exact moment” stops being magic and becomes a guessing game. So transcription is the part of the pipeline we sweat the most.

It's also the part most products quietly send to the cloud. MediaFind doesn't. The entire path — decode, resample, transcribe, align — runs on your Mac.

Step 1: Decode anything into clean 16 kHz audio

Your library is messy: .mp4, .mov, .mkv, .m4a, .wav, weird codecs, variable sample rates, multiple audio tracks. Whisper-class models want exactly one thing: mono PCM at 16 kHz. So the first step is normalization with ffmpeg, which we bundle with the macOS app so there's nothing to install.

# -vn drops the video stream, -ac 1 downmixes to mono,
# -ar 16000 resamples to 16 kHz, -f f32le - streams raw float32 to stdout
ffmpeg -i input.mov -vn -ac 1 -ar 16000 -f f32le -

We stream the decoded audio rather than writing a temporary WAV to disk. For a two-hour recording that's the difference between small buffers flowing through a pipe and a temp file of roughly 460 MB (16,000 samples per second × 4 bytes × 7,200 seconds) being written to your SSD and read back.
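As a rough illustration of the streaming idea, the Python sketch below reads the float32 stream from ffmpeg's stdout one block at a time instead of buffering the whole file. It is not MediaFind's actual code: the names `ffmpeg_command`, `read_blocks`, and `stream_file` are hypothetical, and it assumes an `ffmpeg` binary is reachable on the PATH (the app would point at its bundled copy instead).

```python
import subprocess
import sys
from array import array
from typing import BinaryIO, Iterator

SAMPLE_RATE = 16_000     # Hz, matches -ar 16000
BYTES_PER_SAMPLE = 4     # float32, matches -f f32le


def ffmpeg_command(path: str) -> list[str]:
    """Argument list equivalent to the ffmpeg invocation shown above."""
    return ["ffmpeg", "-i", path, "-vn", "-ac", "1",
            "-ar", str(SAMPLE_RATE), "-f", "f32le", "-"]


def read_blocks(stream: BinaryIO,
                block_samples: int = SAMPLE_RATE) -> Iterator[array]:
    """Yield blocks of float32 samples from a binary stream until EOF."""
    block_bytes = block_samples * BYTES_PER_SAMPLE
    while True:
        data = stream.read(block_bytes)
        # Ignore a trailing partial sample (fewer than 4 bytes).
        usable = len(data) - len(data) % BYTES_PER_SAMPLE
        if usable == 0:
            return
        samples = array("f")
        samples.frombytes(data[:usable])
        if sys.byteorder == "big":   # f32le is little-endian
            samples.byteswap()
        yield samples


def stream_file(path: str) -> Iterator[array]:
    """Decode `path` with ffmpeg and yield sample blocks as they arrive."""
    proc = subprocess.Popen(ffmpeg_command(path),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    assert proc.stdout is not None
    try:
        yield from read_blocks(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()
```

With the default block size, only about one second of audio (64 KB) is held in memory per read, regardless of how long the recording is.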

Why bundle ffmpeg? A search tool that only works if the user already ran brew install ffmpeg isn't a product. The macOS app ships a static, self-contained ffmpeg/ffprobe so decoding works out of the box on a clean machine.

Step 2: Transcribe with a Whisper-class model

Transcription uses an OpenAI Whisper-class model running locally. Whisper is an encoder-decoder transformer trained on a very large, very diverse set of audio (about 680,000 hours of multilingual recordings in the original release), which is why it's robust to accents, background noise, music beds, and code-switching in a way that older ASR systems aren't.

The model family gives us a quality/speed dial:

| Model | Relative speed | Best for |
| --- | --- | --- |
| `tiny` / `base` | Fastest | Quick first pass, huge libraries |
| `small` / `medium` | Balanced | The everyday default |
| `large-v3` | Slowest | Maximum accuracy, hard audio |

On Apple Silicon we lean on quantized, CPU/Neural-Engine-friendly implementations so a laptop can chew through a backlog without a discrete GPU. The work is embarrassingly parallel across files, so indexing a folder scales with your cores.

Step 3: Chunking and word-level timestamps

Whisper natively operates on 30-second windows. Naively cutting audio every 30 seconds slices words in half and produces garbage at the seams, so we segment on silence and energy boundaries instead — splitting where a person actually pauses. Each chunk carries its absolute offset in the file, so when we stitch results back together the global timeline stays correct.
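One simple way to approximate this kind of segmentation is sketched below in Python. It is a simplified stand-in rather than MediaFind's segmenter: it measures RMS energy in 20 ms frames and, whenever a chunk would exceed the 30-second window, cuts at the last quiet frame before the limit. The threshold, frame size, and function names are all assumptions chosen for illustration.

```python
import math
from typing import Sequence

SAMPLE_RATE = 16_000
FRAME = 320            # 20 ms analysis frame at 16 kHz
MAX_CHUNK_S = 30.0     # Whisper's window length


def frame_rms(samples: Sequence[float], frame: int = FRAME) -> list[float]:
    """Root-mean-square energy of each non-overlapping frame."""
    out = []
    for i in range(0, len(samples), frame):
        block = samples[i:i + frame]
        out.append(math.sqrt(sum(x * x for x in block) / len(block)))
    return out


def split_on_silence(samples: Sequence[float],
                     threshold: float = 0.01,
                     max_chunk_s: float = MAX_CHUNK_S,
                     sample_rate: int = SAMPLE_RATE,
                     frame: int = FRAME) -> list[tuple[float, float]]:
    """Return (start_s, end_s) chunks, each at most max_chunk_s long.

    Times are absolute offsets into the file, so per-chunk timestamps
    can later be shifted back onto the global timeline.
    """
    quiet = [energy < threshold for energy in frame_rms(samples, frame)]
    max_frames = int(max_chunk_s * sample_rate) // frame
    total = len(quiet)
    chunks = []
    start = 0
    while start < total:
        limit = min(start + max_frames, total)
        cut = limit                      # hard cut if no pause is found
        if limit < total:
            # Walk backwards to the last quiet frame inside the window.
            for j in range(limit - 1, start, -1):
                if quiet[j]:
                    cut = j
                    break
        end_sample = min(cut * frame, len(samples))
        chunks.append((start * frame / sample_rate,
                       end_sample / sample_rate))
        start = cut
    return chunks
```

A production segmenter would likely prefer the middle of a pause and use a noise-adaptive threshold, but the key property is the same: every chunk records where it starts in the original file.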

The model emits segments, and we keep the word-level timestamps. That granularity is what powers the product's signature move:

You search a phrase → we rank the matching segment → the result links to the exact second it was spoken, and the clip exporter can cut precisely around those word boundaries.
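The last step of that flow, going from a matched segment to exact clip boundaries, can be sketched as follows. This Python example assumes the ranking has already picked a segment and only shows the word-level lookup; the `Word` type and `locate_phrase` function are hypothetical names, and matching here is a plain case-insensitive comparison rather than anything semantic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    text: str
    start: float   # seconds from the start of the file
    end: float


def _norm(token: str) -> str:
    """Lowercase and drop punctuation so 'Two,' matches 'two'."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def locate_phrase(words: list[Word],
                  phrase: str) -> Optional[tuple[float, float]]:
    """Return (start, end) of the first occurrence of `phrase`, or None."""
    target = [_norm(t) for t in phrase.split()]
    if not target:
        return None
    tokens = [_norm(w.text) for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            # Clip from the first word's start to the last word's end.
            return words[i].start, words[i + len(target) - 1].end
    return None
```

The returned pair is what a clip exporter would need: a start and end that fall on word boundaries instead of on arbitrary segment edges.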

Step 4: Persist segments for everything downstream

Each transcript is stored as a list of timestamped segments — roughly:

{
  "start": 134.20,
  "end": 138.75,
  "text": "three, two, one — and we have liftoff",
  "speaker": "SPEAKER_01"
}

This single structure feeds the rest of the system: semantic search embeds each segment, Ask retrieves and cites segments, chapters are derived by grouping them, and diarization annotates each one with a speaker label. Get the transcript right once and every other feature inherits the timing for free.
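To show how one structure can serve several features, the Python sketch below loads segments in the JSON shape shown above and groups them wherever the speech pauses for a while. The pause-based grouping is only an illustrative heuristic, not the chapter algorithm MediaFind uses, and `Segment`, `load_segments`, and `group_by_pause` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    start: float
    end: float
    text: str
    speaker: Optional[str] = None   # filled in by diarization, if any


def load_segments(payload: str) -> list[Segment]:
    """Parse a JSON array of segment objects, sorted by start time."""
    items = json.loads(payload)
    return sorted((Segment(**item) for item in items),
                  key=lambda seg: seg.start)


def group_by_pause(segments: list[Segment],
                   min_gap_s: float = 5.0) -> list[list[Segment]]:
    """Start a new group whenever the silence between segments is long."""
    groups: list[list[Segment]] = []
    for seg in segments:
        if groups and seg.start - groups[-1][-1].end < min_gap_s:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups
```

Each group keeps the original `start` and `end` values, so anything derived from it (a chapter marker, a citation, a search hit) points back to the same timeline.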

Why on-device is the whole point

Cloud transcription means uploading your raw recordings — depositions, therapy sessions, unreleased footage, family videos — to someone else's servers, usually metered per minute. MediaFind's core path opens zero external sockets: you can confirm it yourself.

$ mediafind audit
✓ core path opened 0 external sockets.

The only network activity is opt-in and predictable: a one-time download of the model weights on first run, after which transcription is fully offline. Your audio never leaves the machine — not the first time, not ever.

That timestamped transcript is the foundation. In the next post we cover what we build on top of it: turning text and pixels into vectors so you can search your library by meaning instead of keywords.

Make your own library searchable

Free for up to 10 files. No account, no API keys, nothing uploaded.

Download for macOS

Keep reading

Search by meaning: embeddings, CLIP and a local vector index · Search
Who said it, who's in it — diarization & face recognition, privately · People & privacy