Who said it, who's in it — diarization & face recognition, privately
Knowing what was said is half the story. The other half is who — which speaker, which person on screen. These are the most privacy-sensitive features we build, so they're also the most carefully sandboxed. Here's how they work, and how we keep them on-device.
Biometric features are where “private by default” stops being a slogan and becomes an engineering constraint. Voice prints and face embeddings are some of the most sensitive data a person owns. So in MediaFind both are computed locally, the face library is strictly opt-in, and you can delete everything with a single command. Nothing about a face or a voice ever touches a server.
Part 1 — Speaker diarization: “who spoke when”
Diarization answers a question transcription doesn't: a transcript tells you the words, but not that the first paragraph was Alice and the reply was Bob. The pipeline has three stages:
- Segmentation: split the audio into short, single-speaker chunks at speech boundaries.
- Speaker embeddings: turn each chunk into a vector that captures vocal identity (timbre, pitch, cadence), not words.
- Clustering: group chunks with similar embeddings; each cluster becomes SPEAKER_00, SPEAKER_01, and so on.
The embedding model is the heart of it. We use a neural speaker encoder (Resemblyzer-style, derived from the GE2E voice-verification line of work) that maps a few seconds of speech to a fixed-length d-vector. Two clips of the same person land close together regardless of what they're saying.
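To make the clustering stage concrete, here is a minimal sketch that assumes each chunk already has an embedding. The greedy centroid strategy, the 0.75 threshold, the toy 3-dimensional vectors, and the function names are all assumptions for illustration, not MediaFind's actual algorithm; production diarizers more commonly use agglomerative or spectral clustering on much longer vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_chunks(embeddings, threshold=0.75):
    """Greedy clustering (illustrative only).

    Each chunk joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it opens a new cluster.
    """
    centroids = []  # running sum of member vectors, one per cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            best = len(centroids) - 1
        else:
            # Cosine ignores scale, so a running sum works as a centroid.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
        labels.append(f"SPEAKER_{best:02d}")
    return labels

# Toy embeddings: chunks 0 and 2 point one way, chunks 1 and 3 another.
chunks = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.2],
]
labels = cluster_chunks(chunks)
print(labels)
```

The threshold is the sensitive knob in a scheme like this: set it too high and one speaker splits into several labels, too low and different speakers merge.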
--num-speakers 2. The lesson: a real neural embedder is non-negotiable, and silent fallbacks should be loud. The encoder is now a core dependency, bundled with the app, and warns instead of degrading quietly.
The speaker labels are written back onto the transcript segments, so they flow through to the rest of the product: you can read a meeting as a labeled dialogue, or filter search to “only when this speaker was talking.”
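One plausible way to write labels back onto a transcript is to give each segment the speaker whose turns overlap it the most. The sketch below assumes that approach; the dictionary shapes (start, end, text, speaker) and the function names are hypothetical, not MediaFind's real data model.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach to each transcript segment the speaker with the most overlap."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": speaker})
    return labeled

def spoken_by(labeled, speaker):
    """Filter segments to those attributed to one speaker."""
    return [seg for seg in labeled if seg["speaker"] == speaker]

# Hypothetical transcript segments and diarization turns.
segments = [
    {"start": 0.0, "end": 4.0, "text": "Shall we start?"},
    {"start": 4.0, "end": 9.0, "text": "Yes, I have the numbers."},
    {"start": 9.0, "end": 12.0, "text": "Great, go ahead."},
]
turns = [
    {"start": 0.0, "end": 4.2, "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 8.8, "speaker": "SPEAKER_01"},
    {"start": 8.8, "end": 12.0, "speaker": "SPEAKER_00"},
]

labeled = label_segments(segments, turns)
for seg in labeled:
    print(f'{seg["speaker"]}: {seg["text"]}')
```

Because transcript and diarization boundaries rarely line up exactly, majority overlap is a forgiving rule: a segment that spills a fraction of a second into the next turn still gets the right label.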
Part 2 — Faces: an opt-in, on-device people library
Face recognition is off until you turn it on (--faces). When enabled, it runs the standard recognition pipeline locally, built on the open InsightFace stack:
- Detect faces in sampled keyframes and find facial landmarks.
- Align each face to a canonical pose so lighting and angle matter less.
- Embed it into a face vector (an ArcFace-style model trained so the same identity clusters tightly and different identities sit far apart).
- Cluster the vectors so every appearance of the same person is grouped — no manual tagging.
The result drives the product's nicest interaction: click a face, jump to every moment that person appears across your whole library, each with a timestamp. It's the visual analogue of semantic search.
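Behind a "click a face" interaction there is usually just an index from each cluster to the places it was seen. Here is a minimal sketch of such an index; the detection records, file names, and field names are invented for illustration and say nothing about how MediaFind stores this internally.

```python
from collections import defaultdict

def build_people_index(detections):
    """Map each face cluster to a sorted list of (file, timestamp) moments."""
    index = defaultdict(list)
    for det in detections:
        index[det["cluster"]].append((det["file"], det["timestamp"]))
    for moments in index.values():
        moments.sort()  # by file name, then by time within the file
    return dict(index)

# Hypothetical face detections after clustering.
detections = [
    {"cluster": "Person 3", "file": "standup.mp4", "timestamp": 72.5},
    {"cluster": "Person 1", "file": "standup.mp4", "timestamp": 3.0},
    {"cluster": "Person 3", "file": "demo.mov", "timestamp": 15.0},
    {"cluster": "Person 3", "file": "standup.mp4", "timestamp": 10.0},
]

index = build_people_index(detections)
for file, timestamp in index["Person 3"]:
    print(f"{file} @ {timestamp:.1f}s")
```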
Auto-naming public figures, keylessly
Clusters start anonymous — “Person 3” isn't very useful. So MediaFind ships a bundled gallery of roughly a thousand notable public figures (sourced from Wikidata, with attribution), each represented as a reference face embedding. When one of your clusters matches a gallery embedding closely enough, it's auto-named and badged with a ⭐.
The important part: this is a local nearest-neighbor lookup against a file shipped inside the app. There's no facial-recognition API, no upload, no “search the web for this face.” Match or no match, your images never leave the device.
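A local nearest-neighbor lookup of this kind can be very small. The sketch below assumes cosine similarity, a 0.6 cutoff, and a gallery already loaded into memory as a name-to-vector dictionary; the placeholder names, toy 3-dimensional vectors, and threshold are illustrative assumptions, not values taken from MediaFind.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def name_cluster(centroid, gallery, min_similarity=0.6):
    """Return the best-matching gallery name, or None if nothing is close enough."""
    best_name, best_sim = None, min_similarity
    for name, reference in gallery.items():
        sim = cosine(centroid, reference)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

# Stand-in for the bundled gallery file; real face embeddings are far longer.
gallery = {
    "Public Figure A": [0.9, 0.1, 0.2],
    "Public Figure B": [0.1, 0.95, 0.1],
}

known = name_cluster([0.85, 0.15, 0.25], gallery)
unknown = name_cluster([0.1, 0.1, 0.95], gallery)
print(known, unknown)
```

A cluster that matches nothing simply stays anonymous. The whole lookup is arithmetic over data already on disk, which is why it needs no network access.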
Why these features are the privacy litmus test
It would be easy — and cheap — to send a thumbnail to a cloud face API. We don't, on principle:
| Decision | What it protects |
|---|---|
| Faces are opt-in (--faces) | No biometric processing unless you ask for it |
| All embeddings computed on-device | Voice prints & face vectors never transmitted |
| Bundled celebrity gallery | Auto-naming needs no network call |
| mediafind people --forget | One command erases all stored face & people data |
And as with everything else, it's checkable — the core path opens zero external sockets:
$ mediafind audit
✓ core path opened 0 external sockets.
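How mediafind audit works internally isn't described here, but the idea of counting outbound connections can be illustrated in a few lines. The sketch below is an assumption-laden stand-in: it uses Python's audit hooks, which raise a socket.connect event for each connection attempt, and treats anything that is not loopback or a Unix socket as external. The SocketAudit class and is_external helper are hypothetical names.

```python
import ipaddress
import sys

def is_external(address):
    """True if a socket address points somewhere other than this machine."""
    if not isinstance(address, tuple):
        return False  # Unix-domain sockets use a filesystem path: local
    host = address[0]
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        return host != "localhost"  # hostname rather than an IP literal

class SocketAudit:
    """Audit hook that records external connection attempts."""

    def __init__(self):
        self.external = []

    def __call__(self, event, args):
        if event != "socket.connect":
            return
        _sock, address = args
        if is_external(address):
            self.external.append(address)

def install():
    """Register a hook for the rest of the process (hooks cannot be removed)."""
    audit = SocketAudit()
    sys.addaudithook(audit)
    return audit

# Demo with synthetic events instead of real network traffic.
audit = SocketAudit()
audit("socket.connect", (None, ("127.0.0.1", 5000)))  # loopback: ignored
audit("socket.connect", (None, "/tmp/app.sock"))      # Unix socket: ignored
audit("open", ("notes.txt", "r", 0))                  # unrelated event: ignored
print(f"core path opened {len(audit.external)} external sockets.")
```

A hook like this only sees connections made by the Python process it runs in, and only connect calls; a thorough audit would also need to cover DNS lookups, datagram sends, and child processes.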
The throughline
Across all three posts the pattern is the same: take a best-in-class open model — Whisper for speech, CLIP for vision, neural encoders for voices and faces — and run it locally, so the magic of cloud media tools comes with none of the cloud's exposure. Your recordings, your transcripts, your faces, your searches: all yours, all on your Mac.
Private people-search, on your machine
Diarization comes free; the opt-in face library is part of Pro. Either way, nothing is uploaded.