Who said it, who's in it — diarization & face recognition, privately

Biometric features are where “private by default” stops being a slogan and becomes an engineering constraint. Voice prints and face embeddings are among the most sensitive data a person has. So in MediaFind both are computed locally, the face library is strictly opt-in, and you can erase all stored face and people data with a single command (`mediafind people --forget`). Nothing about a face or a voice ever touches a server.

Part 1 — Speaker diarization: “who spoke when”

Diarization answers a question transcription doesn't: a transcript tells you the words, but not that the first paragraph was Alice and the reply was Bob. The pipeline has three stages:

Segmentation — split the audio into short, single-speaker chunks at speech boundaries.
Speaker embeddings — turn each chunk into a vector that captures vocal identity (timbre, pitch, cadence), not words.
Clustering — group chunks with similar embeddings; each cluster becomes SPEAKER_00, SPEAKER_01, and so on.
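To make the clustering stage concrete, here is a minimal sketch, not MediaFind's actual implementation. It assumes each chunk already has an embedding (the toy 3-dimensional vectors below stand in for real d-vectors), that the speaker count is known, and it uses average-linkage agglomerative clustering on cosine distance. The function names are illustrative.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; small when two chunks sound like the same voice."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_speakers(embeddings, num_speakers):
    """Merge the two closest clusters until num_speakers clusters remain."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > max(num_speakers, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise distance between the two clusters.
                total = sum(cosine_distance(embeddings[a], embeddings[b])
                            for a in clusters[i] for b in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    # Number the speakers in order of first appearance in the audio.
    clusters.sort(key=min)
    labels = [None] * len(embeddings)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = f"SPEAKER_{k:02d}"
    return labels

# Toy embeddings: chunks 0 and 2 are one voice, chunks 1 and 3 another.
chunks = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.85, 0.15, 0.05], [0.05, 0.95, 0.0]]
labels = cluster_speakers(chunks, num_speakers=2)
print(labels)
```

The brute-force merge loop is fine for a handful of chunks; a long recording would call for a library implementation and, when the speaker count is unknown, a distance threshold instead of a fixed target.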

The embedding model is the heart of it. We use a neural speaker encoder (Resemblyzer-style, derived from the GE2E speaker-verification line of work) that maps a few seconds of speech to a fixed-length d-vector. Two clips of the same person land close together regardless of what they're saying.

A bug worth remembering: when the real speaker encoder was missing, the system silently fell back to a weak hand-rolled MFCC embedder, which collapsed everyone into a single speaker, even with --num-speakers 2. The lesson: a real neural embedder is non-negotiable, and fallbacks should be loud, not silent. The encoder is now a core dependency bundled with the app, and if it cannot be loaded the app warns instead of degrading quietly.
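One way to keep a missing encoder from going unnoticed, sketched under assumptions: the `loader` callable, the warning class, and the choice to skip diarization when loading fails are all hypothetical, not MediaFind's real code. The point is only that the failure surfaces as a visible warning rather than as quietly worse output.

```python
import warnings

class EncoderUnavailableWarning(UserWarning):
    """Emitted when the neural speaker encoder cannot be loaded."""

def load_speaker_encoder(loader):
    """Return the encoder built by `loader`, or None with a visible warning."""
    try:
        return loader()
    except (ImportError, OSError) as exc:
        warnings.warn(
            f"speaker encoder unavailable ({exc}); diarization will be skipped "
            "rather than run on a weaker embedder",
            EncoderUnavailableWarning,
            stacklevel=2,
        )
        return None

def broken_loader():
    # Hypothetical loader that simulates the missing-dependency case.
    raise ImportError("model weights not found")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoder = load_speaker_encoder(broken_loader)
```

A caller that receives `None` can leave segments unlabeled, which is easier to spot and debug than every line being attributed to one speaker.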

The speaker labels are written back onto the transcript segments, so they flow through to the rest of the product — you can read a meeting as a labeled dialogue, or filter search to “only when this speaker was talking.”
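As a rough illustration of that write-back, the sketch below gives each transcript segment the speaker whose diarization turn overlaps it most in time, then filters by speaker. The dictionary fields (`start`, `end`, `text`, `speaker`) are assumed for the example and are not MediaFind's actual schema.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach to each segment the speaker of the most-overlapping turn."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

segments = [
    {"start": 0.0, "end": 4.0, "text": "Shall we start?"},
    {"start": 4.2, "end": 7.5, "text": "Yes, go ahead."},
]
turns = [
    {"start": 0.0, "end": 4.1, "speaker": "SPEAKER_00"},
    {"start": 4.1, "end": 8.0, "speaker": "SPEAKER_01"},
]

labeled = label_segments(segments, turns)
# "Only when this speaker was talking" becomes a simple filter.
only_second = [s["text"] for s in labeled if s["speaker"] == "SPEAKER_01"]
```

Segments that overlap no turn keep `speaker` as `None`, so they can be shown unlabeled instead of being guessed.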

Part 2 — Faces: an opt-in, on-device people library

Face recognition is off until you turn it on (--faces). When enabled, it runs the standard recognition pipeline locally, built on the open InsightFace stack:

Detect faces in sampled keyframes and find facial landmarks.
Align each face to a canonical pose so head tilt, scale, and position matter less.
Embed it into a face vector (an ArcFace-style model trained so the same identity clusters tightly and different identities sit far apart).
Cluster the vectors so every appearance of the same person is grouped — no manual tagging.
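The alignment step is the least intuitive of the four, so here is a simplified sketch of the idea. It uses only the two eye landmarks to find the rotation and scale that level the eye line and normalize eye spacing; production pipelines typically fit a similarity transform to more landmarks. The function names and the 40-pixel target spacing are assumptions for illustration.

```python
import math

def eye_alignment(left_eye, right_eye, target_distance=40.0):
    """Rotation (degrees) and scale that level the eyes and fix their spacing."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    tilt = math.degrees(math.atan2(dy, dx))      # current angle of the eye line
    scale = target_distance / math.hypot(dx, dy)
    return -tilt, scale                          # rotate by -tilt to undo it

def apply_alignment(point, center, rotation_deg, scale):
    """Rotate and scale `point` about `center`; result is relative to center."""
    theta = math.radians(rotation_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    xr = scale * (x * math.cos(theta) - y * math.sin(theta))
    yr = scale * (x * math.sin(theta) + y * math.cos(theta))
    return xr, yr

# A tilted face: the right eye sits 60 px lower than the left in the image.
left, right = (100.0, 100.0), (180.0, 160.0)
rotation, scale = eye_alignment(left, right)
aligned_right = apply_alignment(right, left, rotation, scale)
```

After alignment the right eye lands level with the left at the target spacing, so the embedding model sees faces in a consistent frame regardless of how the head was tilted in the source video.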

The result drives the product's nicest interaction: click a face, jump to every moment that person appears across your whole library, each with a timestamp. It's the visual analogue of semantic search.
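Under the hood, "click a face" can be as simple as an index from cluster ID to the moments where that cluster was seen. The sketch below is an assumed data layout, not MediaFind's storage format: each detection records a cluster ID, a file name, and a time in seconds.

```python
from collections import defaultdict

def build_people_index(detections):
    """Map each face cluster to its sorted, de-duplicated (file, seconds) hits."""
    index = defaultdict(list)
    for det in detections:
        index[det["cluster"]].append((det["file"], det["t"]))
    return {cluster: sorted(set(hits)) for cluster, hits in index.items()}

def format_timestamp(seconds):
    """Render seconds as HH:MM:SS for display next to each hit."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

detections = [
    {"cluster": 3, "file": "trip.mov", "t": 75.0},
    {"cluster": 1, "file": "trip.mov", "t": 12.0},
    {"cluster": 3, "file": "party.mp4", "t": 3725.0},
    {"cluster": 3, "file": "trip.mov", "t": 75.0},   # same keyframe seen twice
]

index = build_people_index(detections)
moments = [(name, format_timestamp(t)) for name, t in index[3]]
```

Clicking "Person 3" then amounts to reading `index[3]` and seeking each file to the listed timestamp.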

Auto-naming public figures, keylessly

Clusters start anonymous — “Person 3” isn't very useful. So MediaFind ships a bundled gallery of roughly a thousand notable public figures (sourced from Wikidata, with attribution), each represented as a reference face embedding. When one of your clusters matches a gallery embedding closely enough, it's auto-named and badged with a ⭐.

The important part: this is a local nearest-neighbor lookup against a file shipped inside the app. There's no facial-recognition API, no upload, no “search the web for this face.” Match or no match, your images never leave the device.
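A minimal sketch of that lookup, assuming the gallery is a mapping from name to reference embedding and that "closely enough" means cosine similarity above a threshold. The 0.6 threshold, the placeholder names, and the toy 3-dimensional vectors are all illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def auto_name(cluster_centroid, gallery, threshold=0.6):
    """Return the best-matching gallery name, or None if nothing is close enough."""
    best_name, best_score = None, -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(cluster_centroid, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Stand-in for the bundled gallery file; everything here is in local memory.
gallery = {
    "Public Figure A": [1.0, 0.0, 0.0],
    "Public Figure B": [0.0, 1.0, 0.0],
}

match = auto_name([0.9, 0.1, 0.0], gallery)      # close to A
no_match = auto_name([0.0, 0.0, 1.0], gallery)   # close to nobody
```

A linear scan over roughly a thousand reference vectors is cheap, so no index structure or network service is needed for this step. An unmatched cluster simply stays anonymous.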

Why these features are the privacy litmus test

It would be easy — and cheap — to send a thumbnail to a cloud face API. We don't, on principle:

Decision	What it protects
Faces are opt-in (`--faces`)	No biometric processing unless you ask for it
All embeddings computed on-device	Voice prints & face vectors never transmitted
Bundled public-figure gallery	Auto-naming needs no network call
`mediafind people --forget`	One command erases all stored face & people data

And as with everything else, it's checkable — the core path opens zero external sockets:

$ mediafind audit
✓ core path opened 0 external sockets.
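How might a check like this work? The sketch below is not the real `mediafind audit`; it is one plausible approach that wraps Python's `socket.socket.connect` for the duration of a run and records any attempt to reach a non-loopback address. The helper names are hypothetical.

```python
import ipaddress
import socket
from contextlib import contextmanager

def is_external(address):
    """True if a connect() target is anything other than this machine."""
    if not isinstance(address, tuple):
        return False                      # e.g. a Unix domain socket path
    host = address[0]
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        return host != "localhost"        # a hostname rather than an IP literal

@contextmanager
def count_external_connects():
    """Record every non-loopback connect() attempted inside the block."""
    attempts = []
    original = socket.socket.connect

    def recording_connect(self, address):
        if is_external(address):
            attempts.append(address)
        return original(self, address)

    socket.socket.connect = recording_connect
    try:
        yield attempts
    finally:
        socket.socket.connect = original  # always restore the real method

with count_external_connects() as attempts:
    sum(range(10))                        # stand-in for the local pipeline

print(f"core path opened {len(attempts)} external sockets.")
```

This only covers `connect` calls made through Python's `socket` module in the same process; a thorough audit would also watch `connect_ex`, `sendto`, and child processes, or observe traffic at the OS level.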

The throughline

Across all three posts the pattern is the same: take a best-in-class open model — Whisper for speech, CLIP for vision, neural encoders for voices and faces — and run it locally, so the magic of cloud media tools comes with none of the cloud's exposure. Your recordings, your transcripts, your faces, your searches: all yours, all on your Mac.

Private people-search, on your machine

Diarization comes free; the opt-in face library is part of Pro. Either way, nothing is uploaded.

