Search by meaning: embeddings, CLIP and a local vector index

Keyword search has a fatal flaw for media: you have to remember the words. But you rarely remember the transcript verbatim — you remember the idea. “The bit where she explained why the deal fell through.” “The shot of the city skyline at night.” None of those are strings you can grep for.

MediaFind solves this by searching in meaning space. Everything in your library — spoken words and on-screen visuals alike — gets turned into a vector, and so does your query. Similar meanings land close together, so the right clip surfaces even when the wording is completely different.

Embeddings, briefly

An embedding is a list of numbers (a vector) that captures the meaning of a piece of content. A good text model maps “we have liftoff” and “a rocket blasting off” to nearby vectors, because they mean nearly the same thing — while “liftoff” the word and “elevator lift” land far apart despite sharing letters.

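“Close” is measured with cosine similarity: the cosine of the angle between two vectors, which ignores their lengths and compares only their directions. Search becomes geometry: embed the query as a vector q, compare it against each segment vector d, and rank by similarity.

score = (q · d) / (‖q‖ · ‖d‖)   # cosine similarity; 1.0 = same direction, the closest possible match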
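As a rough illustration of the formula above, here is a minimal pure-Python sketch of cosine similarity and a top-k ranking. The function names and the toy 3-d vectors are invented for this example; this is not MediaFind's actual code, and a real index would use an optimized numeric library.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    if len(q) != len(d):
        raise ValueError(f"dimension mismatch: {len(q)} vs {len(d)}")
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_q * norm_d)

def rank(query_vec, doc_vecs, k=3):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
docs = [[2.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
top = rank(query, docs, k=2)
```

The first document points in the same direction as the query (it is just twice as long), so it scores about 1.0; the third shares one of two components and scores about 0.5.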

The text side: sentence embeddings over transcript segments

Recall from the transcription post that every file becomes a list of timestamped segments. We embed each segment with a compact Sentence-Transformers model that produces 384-dimensional vectors — small enough to run fast on a CPU, strong enough to capture sentence-level meaning.

A hard-won lesson: the query and the index must use the same embedding model and dimensionality. We once shipped a build where a fallback encoder produced 512-d query vectors against a 384-d index, and every search silently returned nothing. In our pipeline the dimension mismatch didn't raise an error; it just quietly failed. Now the encoder is pinned and verified at startup.
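A startup check of this kind could look roughly like the sketch below. Everything here is hypothetical: `EncoderSpec`, `verify_encoder`, the model name, and the stand-in lambda encoders are made up for illustration and say nothing about how MediaFind stores its index metadata.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderSpec:
    model_name: str   # identifier of the model the index was built with
    dim: int          # length of the vectors stored in the index

class EncoderMismatchError(RuntimeError):
    pass

def verify_encoder(index_spec, encode):
    """Fail fast if the live encoder cannot serve the stored index.

    `encode` is any callable mapping a string to a list of floats.
    """
    probe = encode("startup probe")
    if len(probe) != index_spec.dim:
        raise EncoderMismatchError(
            f"index built with {index_spec.model_name} ({index_spec.dim}-d), "
            f"but encoder returned {len(probe)}-d vectors"
        )

index_spec = EncoderSpec(model_name="text-encoder-384", dim=384)  # hypothetical metadata
good_encoder = lambda text: [0.0] * 384   # stand-in for the pinned model
bad_encoder = lambda text: [0.0] * 512    # stand-in for a fallback encoder

verify_encoder(index_spec, good_encoder)  # passes without raising
try:
    verify_encoder(index_spec, bad_encoder)
    mismatch_caught = False
except EncoderMismatchError:
    mismatch_caught = True
```

A dimension probe alone cannot tell apart two different models that happen to share a dimensionality, which is one reason to record the model identifier alongside the index as well.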

Because embeddings are attached to segments, a hit doesn't just tell you which file — it points at the exact timestamp, the same granularity the player and clip exporter use.
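To make the segment-level granularity concrete, here is a small sketch of a segment record and a lookup that returns a timestamped hit. The `Segment` fields, the helper names, the sample text, and the 2-d vectors are all assumptions for illustration; the vectors are assumed to be unit length, so a dot product equals cosine similarity.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    file: str
    start: float   # seconds from the start of the file
    end: float
    text: str
    vector: list   # embedding of `text`, assumed unit length

def format_timestamp(seconds):
    """Render seconds as H:MM:SS."""
    total = int(seconds)
    return f"{total // 3600}:{(total % 3600) // 60:02d}:{total % 60:02d}"

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_segment(query_vec, segments):
    """Return the segment whose vector has the highest dot product with the query."""
    return max(segments, key=lambda s: dot(query_vec, s.vector))

# Toy data: two segments of one file with made-up 2-d embeddings.
segments = [
    Segment("interview.mp4", 12.0, 19.5, "Thanks for joining us.", [1.0, 0.0]),
    Segment("interview.mp4", 754.2, 771.0,
            "The deal fell through because financing collapsed.", [0.0, 1.0]),
]
hit = best_segment([0.6, 0.8], segments)
label = f"{hit.file} @ {format_timestamp(hit.start)}"
```

Because the hit carries its own `start` and `end`, a player or exporter can seek straight to that range without any further lookup.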

The visual side: CLIP for searching what's on screen

Audio is only half your media. For the picture, MediaFind uses CLIP (Contrastive Language–Image Pre-training). CLIP's trick is that it embeds images and text into the same shared space, having been trained to pull matching image–caption pairs together and push mismatches apart.

The practical payoff: you can search images with words. We sample keyframes from each video, embed them with CLIP's image encoder, and at query time embed your text with CLIP's text encoder. A search for “city skyline at night” compares your text vector directly against keyframe vectors — no tags, no manual labels.

Keyframe sampling: frames are extracted at scene changes and at intervals, so we cover the footage without embedding every single frame.
Open-vocabulary: CLIP isn't limited to a fixed label set, so “a person holding a coffee cup” works as well as “dog.”
Reused for more: the same keyframe embeddings power zero-shot category tagging and brand-logo detection.
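The keyframe sampling idea in the list above can be sketched as follows. This is one plausible way to combine scene-change and interval sampling, not a description of MediaFind's sampler: the per-frame change scores, the `scene_threshold` and `max_gap_s` parameters, and their default values are all assumptions.

```python
def pick_keyframes(diffs, fps, scene_threshold=0.5, max_gap_s=5.0):
    """Choose frame indices to embed.

    diffs[i] is a change score in [0, 1] between frame i and frame i-1
    (diffs[0] is ignored; frame 0 is always kept).
    A frame is kept when it looks like a scene cut (score >= scene_threshold)
    or when max_gap_s seconds have passed since the last kept frame.
    """
    if not diffs:
        return []
    max_gap_frames = max(1, int(max_gap_s * fps))
    keep = [0]
    for i in range(1, len(diffs)):
        scene_cut = diffs[i] >= scene_threshold
        too_long = i - keep[-1] >= max_gap_frames
        if scene_cut or too_long:
            keep.append(i)
    return keep

# 12 frames at 2 fps with one hard cut at frame 3.
diffs = [0.0, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
keyframes = pick_keyframes(diffs, fps=2, scene_threshold=0.5, max_gap_s=2.0)
```

Here the cut at frame 3 is kept because of its change score, and frames 7 and 11 are kept only because two seconds (four frames) have passed since the previous keyframe.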

OCR: the text inside the picture

Slides, lower-thirds, whiteboards, captions, license plates — a lot of meaning lives as text within the video. We run OCR over keyframes and index the recognized strings, so searching “Q3 revenue” can match a slide title that was never spoken aloud.
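One simple way to index recognized strings is an inverted index from tokens to keyframes, sketched below. The OCR step itself is not shown. The function names, the input tuple shape, and the all-tokens-must-match rule are assumptions for this example, not MediaFind's actual matching logic.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so "Q3 Revenue:" yields ["q3", "revenue"]."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_ocr_index(ocr_results):
    """Map each token to the set of (file, timestamp) keyframes where it appeared.

    ocr_results: iterable of (file, timestamp_seconds, recognized_text).
    """
    index = defaultdict(set)
    for file, ts, text in ocr_results:
        for token in tokenize(text):
            index[token].add((file, ts))
    return index

def search_ocr(index, query):
    """Return keyframes containing every query token, sorted by file then time."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)

# Made-up OCR output for three keyframes.
ocr_results = [
    ("allhands.mp4", 95.0, "Q3 Revenue: Highlights"),
    ("allhands.mp4", 310.0, "Roadmap"),
    ("demo.mov", 42.0, "Revenue model"),
]
index = build_ocr_index(ocr_results)
matches = search_ocr(index, "Q3 revenue")
```

Only the slide at 95 seconds contains both tokens, so it is the single match even though the phrase was never spoken.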

One query, three signals, ranked together

A single search fans out across modalities and the results are merged into one ranked list:

Modality	What it matches
Semantic transcript	What was said, by meaning
CLIP visual	What's on screen, by description
OCR	Text shown in the frame

We apply a relevance floor so weak, barely-related matches don't pad the results — an empty, honest result set beats ten irrelevant ones. When semantic recall comes up short, search also falls back to keyword matching so a literal phrase you do remember never returns nothing.
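The merge, the relevance floor, and the keyword fallback could be sketched like this. The sketch assumes scores from each modality have already been normalized to a comparable 0..1 range, and it triggers the fallback only when nothing clears the floor. Both are simplifying assumptions, and the `Hit` type, function names, and floor value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    file: str
    timestamp: float
    modality: str   # "transcript", "visual", "ocr", or "keyword"
    score: float    # assumed normalized to a comparable 0..1 range

def merge_results(hit_lists, floor=0.3, limit=10):
    """Merge per-modality hits, drop weak matches, keep the best score per location."""
    best = {}
    for hits in hit_lists:
        for h in hits:
            if h.score < floor:
                continue  # relevance floor: weak matches never reach the list
            key = (h.file, h.timestamp)
            if key not in best or h.score > best[key].score:
                best[key] = h
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:limit]

def keyword_fallback(query, segments):
    """Literal, case-insensitive substring match over (file, timestamp, text) tuples."""
    needle = query.lower()
    return [Hit(f, ts, "keyword", 1.0) for f, ts, text in segments if needle in text.lower()]

def search(query, hit_lists, segments, floor=0.3):
    merged = merge_results(hit_lists, floor=floor)
    return merged if merged else keyword_fallback(query, segments)

transcript = [Hit("a.mp4", 10.0, "transcript", 0.82), Hit("b.mp4", 5.0, "transcript", 0.12)]
visual = [Hit("c.mp4", 30.0, "visual", 0.55)]
ocr = [Hit("a.mp4", 10.0, "ocr", 0.40)]
results = search("skyline", [transcript, visual, ocr], segments=[])

fallback = search("Q3 revenue", [[Hit("b.mp4", 5.0, "transcript", 0.12)]],
                  segments=[("d.mp4", 61.0, "Our Q3 revenue grew.")])
```

In the first query the 0.12 hit is dropped by the floor and the two hits at the same location collapse to the stronger one; in the second, nothing clears the floor, so the literal phrase match is returned instead.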

From search to Ask

The same vector index powers Ask: a retrieval-augmented question-answer flow. Your question is embedded, the most relevant segments are retrieved, and the answer is grounded in — and cites — those source segments with their timestamps. It's the difference between “here are some clips” and “here's the answer, and here's exactly where it came from.”
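The grounding step can be pictured as assembling a prompt in which every retrieved segment carries a citation marker that maps back to a file and timestamp. The sketch below shows only that assembly; the prompt wording, the function names, and the sample segment are invented for illustration, and the retrieval and answer-generation steps are left out.

```python
def format_ts(seconds):
    """Render seconds as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"

def build_ask_prompt(question, retrieved):
    """Assemble a prompt whose context lines carry citation markers.

    retrieved: list of (file, start_seconds, text), best match first.
    Returns (prompt, citations), where citations maps marker -> (file, start_seconds).
    """
    lines = []
    citations = {}
    for n, (file, start, text) in enumerate(retrieved, start=1):
        marker = f"[{n}]"
        citations[marker] = (file, start)
        lines.append(f"{marker} {file} @ {format_ts(start)}: {text}")
    prompt = (
        "Answer using only the sources below and cite them by marker.\n"
        "If the sources do not contain the answer, say so.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, citations

retrieved = [("interview.mp4", 754.2, "The deal fell through because financing collapsed.")]
prompt, citations = build_ask_prompt("Why did the deal fall through?", retrieved)
```

Keeping the `citations` map separate from the prompt text means a marker in the generated answer can be resolved back to an exact file and timestamp.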

Local index, local privacy

Every vector — text, visual, OCR — is computed and stored on your Mac. There's no vector database in the cloud, no embeddings API, no per-query cost. The first run downloads the model weights once; after that, both indexing and searching are fully offline. Your library stays yours.

Searching what happened is powerful. Next we add who: how MediaFind labels speakers and, opt-in, recognizes faces — without ever phoning home.

Try meaning-based search on your own files

Point it at a folder and describe what you remember. Free for up to 10 files.

