Behind the Scenes: How FrameQuery Indexes Your Videos
A look at how FrameQuery turns raw footage into an instantly searchable index, covering transcription, object detection, face recognition, and scene understanding.
When you drop a video into FrameQuery, the goal is simple: turn an opaque media file into something you can search with plain English.
Your originals never leave your machine
FrameQuery extracts frames and audio on your device and sends only those for analysis. Your originals never leave your machine, and the extracted data is discarded the moment analysis completes.
This keeps upload times short even for large camera originals. A 40 GB RED file produces a sampled set of keyframes and an audio stream that together carry every signal our models need, without shipping the source footage anywhere.
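To make the extraction step concrete, here is a minimal sketch that builds two ffmpeg command lines: one that samples still frames at a fixed rate and one that pulls out a mono audio track. The article does not say which tool FrameQuery uses or how it picks keyframes, so ffmpeg, the fixed sampling rate, the function name build_extraction_commands, and the output file names are all assumptions chosen for illustration.

```python
from pathlib import Path


def build_extraction_commands(source: Path, out_dir: Path, frames_per_second: float = 1.0):
    """Return two ffmpeg argument lists: sampled frames and an audio track.

    Illustrative only: fixed-rate sampling stands in for whatever keyframe
    selection the real product performs.
    """
    frames_cmd = [
        "ffmpeg", "-i", str(source),
        "-vf", f"fps={frames_per_second}",  # keep this many frames per second
        "-q:v", "4",                        # JPEG quality scale (lower is better)
        str(out_dir / "frame_%06d.jpg"),    # numbered still images
    ]
    audio_cmd = [
        "ffmpeg", "-i", str(source),
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        str(out_dir / "audio.wav"),
    ]
    return frames_cmd, audio_cmd


if __name__ == "__main__":
    frames, audio = build_extraction_commands(Path("clip.mov"), Path("extracted"))
    print(" ".join(frames))
    print(" ".join(audio))
```

The function only builds the argument lists, so it can be inspected without ffmpeg installed. Passing either list to subprocess.run would perform the actual extraction.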
What happens during indexing
The extracted frames and audio flow through multiple AI models in parallel:
- Transcription: Speech-to-text with word-level timestamps. When you search for a phrase, FrameQuery points you to the exact second it was spoken. It handles multiple speakers and can distinguish between them.
- Object detection: Identifying what appears in the frame, from people and products to vehicles and text overlays.
- Face detection and recognition: Detecting faces and clustering them across the video so you can search by person, regardless of camera angle or lighting. Recognition runs 100% locally on your device: biometric embeddings and person labels never leave your machine and are never included in shared indexes.
- Scene descriptions: Generating natural-language summaries of what is happening, so you can search by describing a scene in your own words.
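The parallel fan-out described above can be sketched as follows. The three analyzer functions are hypothetical stubs that return hard-coded records; they stand in for real models, whose interfaces the article does not describe. Face recognition is left out of the sketch because it is described as running separately on the device.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stubs: each takes the extracted media and returns
# timestamped records (start times in seconds).
def transcribe(media):
    return [{"start": 12.4, "text": "welcome to the launch", "speaker": "S1"}]


def detect_objects(media):
    return [{"start": 12.0, "label": "laptop"}]


def describe_scenes(media):
    return [{"start": 10.0, "text": "a presenter on stage"}]


ANALYZERS = {
    "transcript": transcribe,
    "objects": detect_objects,
    "scenes": describe_scenes,
}


def analyze(media):
    """Run every analyzer on the same media concurrently and collect results."""
    with ThreadPoolExecutor(max_workers=len(ANALYZERS)) as pool:
        futures = {name: pool.submit(fn, media) for name, fn in ANALYZERS.items()}
        # result() blocks until that analyzer has finished.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(analyze({"frames": [], "audio": b""}))
```

Because the analyzers do not depend on one another, total wall-clock time is bounded by the slowest one rather than the sum of all of them.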
FrameQuery uses proprietary techniques to determine which parts of your footage need the most analysis and which can be handled more efficiently. This is how we keep indexing fast and costs low without sacrificing search quality.
The result: a local search index
All of this data (transcript segments, detected objects with timestamps, face clusters, scene descriptions) gets assembled into a compact, searchable index file.
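As a rough illustration of the assembly step, the sketch below merges per-analyzer results into one time-ordered list and serializes it as JSON. The real index format is not described in this article, so the field names (video, kind, start), the version field, and the choice of JSON are assumptions.

```python
import json


def build_index(video_id, results):
    """Flatten per-analyzer results into one list sorted by start time."""
    records = []
    for kind, items in results.items():
        for item in items:
            # Tag each record with its source video and the kind of signal.
            records.append({"video": video_id, "kind": kind, **item})
    records.sort(key=lambda record: record["start"])
    return {"version": 1, "records": records}


if __name__ == "__main__":
    sample = {
        "transcript": [{"start": 12.4, "text": "welcome to the launch"}],
        "objects": [{"start": 12.0, "label": "laptop"}],
        "faces": [{"start": 11.0, "cluster": 3}],
        "scenes": [{"start": 10.0, "text": "a presenter on stage"}],
    }
    index = build_index("clip-001", sample)
    print(json.dumps(index, indent=2))
```

Keeping every record keyed by a start time is what lets a search hit be turned into a position in the video.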
The index is downloaded to your machine. From this point on, all search is local. No network requests, no API calls, no per-search costs. The index files are compact, typically a few megabytes per hour of video, so they do not meaningfully impact your storage.
Search that understands what you mean
The index supports both exact keyword matching and semantic search. That means "person talking about launch" can match even if nobody in the video said the word "launch." You get the precision of exact search and the flexibility of natural-language queries, working together.
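One way to picture keyword and semantic matching working together is a blended score. The sketch below is a toy under loud assumptions: the CONCEPTS lexicon is a hand-written stand-in for a learned embedding model, and the 50/50 weighting (alpha) is arbitrary. It is not FrameQuery's ranking method, which the article does not describe.

```python
import math

# Hypothetical two-dimensional "embedding": axis 0 = product launch, axis 1 = money.
CONCEPTS = {
    "launch": (1.0, 0.0), "release": (1.0, 0.0), "unveiling": (1.0, 0.0),
    "budget": (0.0, 1.0), "costs": (0.0, 1.0),
}

RECORDS = [
    {"start": 7.0, "text": "the launch went well"},
    {"start": 42.0, "text": "we are unveiling the new camera today"},
    {"start": 95.5, "text": "the budget covers travel costs"},
]


def embed(text):
    """Sum the concept vectors of every known word in the text."""
    x = y = 0.0
    for word in text.lower().split():
        cx, cy = CONCEPTS.get(word, (0.0, 0.0))
        x, y = x + cx, y + cy
    return (x, y)


def cosine(a, b):
    norm = math.hypot(*a) * math.hypot(*b)
    return (a[0] * b[0] + a[1] * b[1]) / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that appear verbatim in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words) if query_words else 0.0


def search(query, records, alpha=0.5):
    """Blend exact and semantic scores; return (score, start, text), best first."""
    query_vec = embed(query)
    hits = []
    for record in records:
        semantic = cosine(query_vec, embed(record["text"]))
        score = alpha * keyword_score(query, record["text"]) + (1 - alpha) * semantic
        if score > 0:
            hits.append((score, record["start"], record["text"]))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for score, start, text in search("person talking about launch", RECORDS):
        print(f"{start:>6.1f}s  {score:.3f}  {text}")
```

In this toy data, the segment that says "launch" outright ranks first, the "unveiling" segment still matches on meaning alone, and the budget segment is filtered out because it shares neither words nor concepts with the query.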
Try it yourself
After indexing, you can search across all your indexed videos instantly. Type a query, get timestamped results, click to jump to the exact moment. It is the experience text search has had for decades, finally applied to video.
Join the waitlist to be among the first to try it.