Processing
Select files or folders, and FrameQuery handles the rest. Your original media never leaves your machine; only a lightweight proxy is uploaded for analysis, then deleted.
Pick files or entire folders. FrameQuery detects duplicates with Blake3 hashing so nothing gets processed twice.
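A minimal sketch of that dedupe check, assuming the blake3 crate; the chunked read and in-memory seen-set are illustrative, not FrameQuery's actual implementation:

```rust
use std::collections::HashSet;
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hash a file's contents with Blake3, streaming in 64 KiB chunks
/// so large media files never need to fit in memory.
fn blake3_of(path: &Path) -> io::Result<blake3::Hash> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize())
}

/// Skip any file whose content hash has been seen before.
fn is_duplicate(path: &Path, seen: &mut HashSet<[u8; 32]>) -> io::Result<bool> {
    let hash = blake3_of(path)?;
    Ok(!seen.insert(*hash.as_bytes()))
}
```

Hashing content rather than comparing paths means the same clip is caught even when it lives in two folders under different names.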
A compressed proxy is created locally (GPU-accelerated where available) and uploaded. The original file stays on your machine.
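FrameQuery's proxy settings aren't documented here, but as a sketch, proxy creation could shell out to ffmpeg along these lines (the codec, scale, and quality settings below are assumptions; a hardware encoder such as h264_nvenc or h264_videotoolbox would replace libx264 where a GPU is available):

```rust
use std::path::Path;
use std::process::Command;

/// Create a small H.264 proxy of `src` at `dst`; settings are illustrative.
fn make_proxy(src: &Path, dst: &Path) -> std::io::Result<()> {
    let status = Command::new("ffmpeg")
        .args(["-y", "-i"])
        .arg(src)
        .args([
            "-vf", "scale=-2:720",      // downscale to 720p, keep aspect ratio
            "-c:v", "libx264",          // software encoder; swap for a GPU encoder where available
            "-preset", "veryfast",
            "-crf", "28",               // quality-targeted, small output
            "-c:a", "aac", "-b:a", "96k",
        ])
        .arg(dst)
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "ffmpeg exited with an error",
        ));
    }
    Ok(())
}
```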
Video and audio are analysed in parallel: scene detection, object recognition, and speech-to-text with speaker separation run simultaneously.
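The fan-out shape, as a minimal sketch: one scoped thread per analysis stage, all joined before results move on. The stage functions and result types below are placeholders, not FrameQuery's internals:

```rust
use std::thread;

// Placeholder result types; the real pipeline's types are not public.
struct Scenes;
struct Objects;
struct Transcript;

fn detect_scenes(_proxy: &str) -> Scenes { Scenes }
fn recognise_objects(_proxy: &str) -> Objects { Objects }
fn transcribe(_proxy: &str) -> Transcript { Transcript }

/// Run the three analyses concurrently and wait for all of them.
fn analyse(proxy: &str) -> (Scenes, Objects, Transcript) {
    thread::scope(|s| {
        let scenes = s.spawn(|| detect_scenes(proxy));
        let objects = s.spawn(|| recognise_objects(proxy));
        let transcript = s.spawn(|| transcribe(proxy));
        (
            scenes.join().unwrap(),
            objects.join().unwrap(),
            transcript.join().unwrap(),
        )
    })
}
```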
Results are written to a local Tantivy search index. The proxy is deleted. Search is instant and offline from here on.
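A sketch of the indexing step using the tantivy crate; the field names and buffer size are assumptions, not FrameQuery's real schema:

```rust
use std::path::Path;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index, IndexWriter};

fn build_index(dir: &Path) -> tantivy::Result<()> {
    let mut builder = Schema::builder();
    let filename = builder.add_text_field("filename", TEXT | STORED);
    let transcript = builder.add_text_field("transcript", TEXT | STORED);
    let scene = builder.add_text_field("scene", TEXT | STORED);
    let objects = builder.add_text_field("objects", TEXT | STORED);
    let index = Index::create_in_dir(dir, builder.build())?;

    let mut writer: IndexWriter = index.writer(50_000_000)?; // 50 MB indexing buffer
    writer.add_document(doc!(
        filename => "interview_a001.mov",
        transcript => "let's review the quarterly goals",
        scene => "medium close-up, office interior, warm light",
        objects => "person laptop desk",
    ))?;
    writer.commit()?; // flush to disk; searchable from here on
    Ok(())
}
```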
Processing hours are included with paid plans: 5 hrs/mo on Starter, 50 hrs/mo on Pro, and 300 hrs/mo on Max. Hours are metered by video duration, not processing time; most videos finish in 5-10 minutes. See pricing.
What Gets Extracted
People, vehicles, animals, props, text overlays, and hundreds of other classes are identified frame by frame and stored as searchable metadata. Search for “red car” or “laptop” and jump straight to the frame.
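One plausible shape for those per-frame records (field names and layout are assumptions):

```rust
/// Illustrative shape of one frame-level detection; FrameQuery's
/// real record layout is internal.
struct Detection {
    label: String,      // e.g. "car", "laptop"
    confidence: f32,    // 0.0..=1.0
    timestamp_ms: u64,  // position in the video
    bbox: [f32; 4],     // normalised x, y, width, height
}
```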
Natural-language summaries of each scene including setting, action, mood, and composition. Also extracts shot type (wide, close-up, medium), shot angle, and dominant colour, so you can search by how a shot looks, not just what's in it.
Full speech-to-text with word-level timestamps and automatic speaker diarisation. Each segment is tagged with a speaker so you can search by who said something, not just what was said.
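A plausible shape for a diarised transcript segment (field names are assumptions):

```rust
/// One recognised word with its position in the video.
struct Word {
    text: String,
    start_ms: u64,
    end_ms: u64,
}

/// A run of speech attributed to one speaker; word-level
/// timestamps are what make jump-to-moment search possible.
struct Segment {
    speaker: String, // diarisation label, e.g. "SPEAKER_01", or a tagged name
    words: Vec<Word>,
}
```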
No transcoding required: point FrameQuery at your media as-is. Professional camera RAW formats use native SDKs with GPU acceleration; everything else goes through FFmpeg.
On-Device Recognition
Tag a person once and FrameQuery finds every appearance across your entire library, by what they look like and what they sound like. Both models run entirely on your machine. Your biometric data never leaves your device.
Powered by InsightFace Buffalo-L (ONNX) with CUDA and Metal acceleration. FrameQuery extracts key frames, detects faces, generates 512-dimensional embeddings, and clusters similar faces automatically.
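A minimal sketch of the clustering step, assuming L2-normalised embeddings and greedy assignment to the nearest existing centroid; the real method and threshold are internal details:

```rust
/// Greedy clustering of unit-length face embeddings by cosine similarity.
/// Returns one cluster label per input embedding.
fn cluster(embeddings: &[[f32; 512]], threshold: f32) -> Vec<usize> {
    let mut centroids: Vec<[f32; 512]> = Vec::new();
    let mut labels = Vec::with_capacity(embeddings.len());
    for e in embeddings {
        // Cosine similarity reduces to a dot product for unit-length vectors.
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, e.iter().zip(c).map(|(a, b)| a * b).sum::<f32>()))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, sim)) if sim >= threshold => labels.push(i),
            _ => {
                // No close match: start a new cluster (a new "person").
                centroids.push(*e);
                labels.push(centroids.len() - 1);
            }
        }
    }
    labels
}
```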
Tag a face once with a name and search with @Sarah to find every appearance of that person.
Powered by ECAPA-TDNN (ONNX). Speaker embeddings are generated from the audio track and clustered to identify unique voices. When a face and a voice co-occur at the same moment in a video, FrameQuery automatically links them to the same person.
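A sketch of one way such linking could work: vote for the voice cluster whose speech segments most often overlap a face's on-screen appearances. The heuristic is an assumption, not FrameQuery's documented behaviour:

```rust
use std::collections::HashMap;

/// One appearance of a face track or speech segment, tagged with its cluster.
struct Appearance {
    cluster: usize,
    start_ms: u64,
    end_ms: u64,
}

/// Map each face cluster to the voice cluster it most often overlaps with.
fn link_face_to_voice(faces: &[Appearance], voices: &[Appearance]) -> HashMap<usize, usize> {
    let mut votes: HashMap<(usize, usize), u32> = HashMap::new();
    for f in faces {
        for v in voices {
            // Count temporal overlap between a face track and a speech segment.
            if f.start_ms < v.end_ms && v.start_ms < f.end_ms {
                *votes.entry((f.cluster, v.cluster)).or_insert(0) += 1;
            }
        }
    }
    // Keep the highest-voted voice per face.
    let mut best: HashMap<usize, (usize, u32)> = HashMap::new();
    for (&(face, voice), &n) in &votes {
        let e = best.entry(face).or_insert((voice, 0));
        if n > e.1 {
            *e = (voice, n);
        }
    }
    best.into_iter().map(|(face, (voice, _))| (face, voice)).collect()
}
```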
Filter transcript search results to only show what a specific person said, across both visual and audio content.
Available on every plan, including Free. Both are optional and consent-gated: enable them when you're ready, and FrameQuery will backfill any videos you've already processed.
Search
Once processed, your index is stored locally using Tantivy (a Rust search library inspired by Lucene). Search runs entirely on your machine: no internet required, no API calls, no per-query cost.
Results are ranked with BM25 relevance scoring and field-based boosting across transcripts, scene descriptions, objects, and filenames. The most relevant moments surface first.
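A sketch of an equivalent query against a tantivy index (assuming tantivy 0.21+; field names match the indexing sketch above, and the boost value is illustrative):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::{Index, TantivyDocument};

fn search(index: &Index, query: &str) -> tantivy::Result<()> {
    let schema = index.schema();
    let transcript = schema.get_field("transcript")?;
    let scene = schema.get_field("scene")?;
    let objects = schema.get_field("objects")?;
    let filename = schema.get_field("filename")?;

    let mut parser =
        QueryParser::for_index(index, vec![transcript, scene, objects, filename]);
    parser.set_field_boost(transcript, 2.0); // favour transcript hits
    let query = parser.parse_query(query)?;

    let searcher = index.reader()?.searcher();
    // BM25-scored results, best first.
    for (score, addr) in searcher.search(&query, &TopDocs::with_limit(10))? {
        let doc: TantivyDocument = searcher.doc(addr)?;
        println!("{score:.2}: {}", doc.to_json(&schema));
    }
    Ok(())
}
```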
The index is yours to keep even if you cancel your plan. Search on a plane, on set, or anywhere else; results return in milliseconds.
@Sarah – Filter by person, face, or voice. Use @"full name" for multi-word names.
"quarterly goals" – Exact phrase match. Finds the precise moment someone said this.
-interview – Exclude a term. Combine with other queries to narrow results.
codec:prores res:4k – Metadata filters: codec, resolution, camera, FPS, ISO, lens, and more.
Also filter by time range, dominant colour, match type (transcript, object, scene), date range, source folder, project, or named index. Combine any of these with free-text search.
NLE Export
Export search results or selections as timeline-ready files. Free on every plan.
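The exact export formats aren't listed here; as an illustration of what "timeline-ready" means, below is a minimal CMX3600 EDL writer, a format most NLEs can import. The event layout and AX reel name follow the CMX convention; the hit tuple shape is an assumption:

```rust
/// Frames -> SMPTE timecode (non-drop-frame).
fn tc(frames: u64, fps: u64) -> String {
    let (s, f) = (frames / fps, frames % fps);
    format!("{:02}:{:02}:{:02}:{:02}", s / 3600, s / 60 % 60, s % 60, f)
}

/// Minimal CMX3600 EDL with one cut per search hit, laid end to end.
/// `hits` holds (clip_name, source_in, source_out) in frames at `fps`.
fn to_edl(title: &str, fps: u64, hits: &[(&str, u64, u64)]) -> String {
    let mut edl = format!("TITLE: {title}\nFCM: NON-DROP FRAME\n\n");
    let mut record = 0; // record timecode advances as clips are appended
    for (i, (name, src_in, src_out)) in hits.iter().enumerate() {
        let dur = src_out - src_in;
        edl += &format!(
            "{:03}  AX       V     C        {} {} {} {}\n* FROM CLIP NAME: {}\n\n",
            i + 1,
            tc(*src_in, fps), tc(*src_out, fps),
            tc(record, fps), tc(record + dur, fps),
            name,
        );
        record += dur;
    }
    edl
}
```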