The Challenge
Searchable video takes transcription, object detection, scene analysis, face recognition, and a search engine to tie them together. Each piece is a project in itself. FrameQuery bundles the entire pipeline into a single API.
Capabilities
Speech-to-text with word-level timestamps and automatic speaker diarization.
Identify people, vehicles, animals, props, and other objects frame by frame.
Natural language descriptions of scenes including shot type, composition, and dominant color.
Detect, cluster, and identify faces across videos with on-device biometric processing.
Integration
Send a video to the processing endpoint. 50+ formats supported, up to 50 GB per file.
Cloud processing extracts transcripts, detects objects, analyses scenes, and clusters faces. About five minutes per hour of video.
Query the results programmatically. Full-text search, semantic search, and visual similarity across your processed library.
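The three steps above (send, process, query) can be sketched as a small request planner. This is only an illustration: the API is not yet public, so the base URL, the endpoint paths, the parameter names, and the `PlannedRequest` type are all hypothetical placeholders, and the code builds request descriptions without sending anything over the network. The only values taken from this page are the 50 GB per-file limit and the three search modes.

```python
from dataclasses import dataclass

# From the text above: up to 50 GB per file. Whether "GB" means 10**9 or
# 2**30 bytes is not stated; decimal gigabytes are assumed here.
MAX_UPLOAD_BYTES = 50 * 10**9

# Hypothetical placeholder, not a real FrameQuery URL.
BASE_URL = "https://api.example.invalid/v1"

# Search modes named on this page; the identifiers themselves are assumptions.
SEARCH_MODES = {"full_text", "semantic", "visual_similarity"}


@dataclass(frozen=True)
class PlannedRequest:
    """A description of an HTTP request that has not been sent."""
    method: str
    url: str
    params: dict


def plan_upload(filename: str, size_bytes: int) -> PlannedRequest:
    """Step 1: describe sending a video to a (hypothetical) processing endpoint."""
    if size_bytes <= 0:
        raise ValueError(f"{filename} is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"{filename} exceeds the 50 GB per-file limit")
    return PlannedRequest(
        "POST", f"{BASE_URL}/videos",
        {"filename": filename, "size_bytes": size_bytes},
    )


def plan_status(video_id: str) -> PlannedRequest:
    """Step 2: describe polling a (hypothetical) status endpoint while processing runs."""
    return PlannedRequest("GET", f"{BASE_URL}/videos/{video_id}", {})


def plan_search(query: str, mode: str = "full_text") -> PlannedRequest:
    """Step 3: describe a search over the processed library."""
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown search mode: {mode}")
    return PlannedRequest("GET", f"{BASE_URL}/search", {"q": query, "mode": mode})
```

Checking the size limit before upload avoids starting a multi-gigabyte transfer that would be rejected; the real API may enforce the limit differently.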
Format Support
Submit R3D, BRAW, ProRes, MXF, H.264, H.265, and 50+ other formats directly. No transcoding needed. See all supported formats.
Pricing
Every plan includes the full feature set.
Free: free, search only
Starter: $19/mo, 10 hrs processing
Pro: $54/mo, 50 hrs processing
Max: $228/mo, 300 hrs processing
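To compare the paid plans, it can help to work out the effective price per included processing hour. The short sketch below uses only the prices and hour allowances listed above; it assumes the allowance is a monthly quota that is fully used, which this page does not state explicitly.

```python
# Plan name -> (monthly price in USD, included processing hours).
# Figures are copied from the plan list above; treating the hours as a
# fully used monthly allowance is an assumption.
PLANS = {
    "Starter": (19, 10),
    "Pro": (54, 50),
    "Max": (228, 300),
}


def cost_per_hour(price_usd: float, hours: float) -> float:
    """Effective USD per processing hour, rounded to cents."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return round(price_usd / hours, 2)


for name, (price, hours) in PLANS.items():
    print(f"{name}: ${cost_per_hour(price, hours):.2f} per processing hour")
```

Under that assumption the effective rate falls as the plan grows: $1.90 per hour on Starter, $1.08 on Pro, and $0.76 on Max.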
FAQ
What video formats are supported?
Over 50 formats, including R3D, BRAW, ARRIRAW, ProRes, ProRes RAW, DNxHR, XAVC, MXF, CinemaDNG, H.264, H.265, AV1, and more. See the full list on our compatibility page.
How long does processing take?
Processing takes approximately five minutes per hour of video. Exact times vary by resolution, codec complexity, and queue load.
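As a rough planning aid, the five-minutes-per-hour figure can be turned into a back-of-the-envelope estimate. The function name below is hypothetical, and the result is only a nominal value: it ignores resolution, codec complexity, and queue load, all of which the answer above says affect the real time.

```python
# Nominal rate from the text above: about five minutes of processing
# per hour of source video. Actual times will vary.
PROCESSING_MINUTES_PER_VIDEO_HOUR = 5.0


def estimate_processing_minutes(video_minutes: float) -> float:
    """Nominal processing time in minutes for a video of the given length."""
    if video_minutes < 0:
        raise ValueError("video length cannot be negative")
    return video_minutes / 60 * PROCESSING_MINUTES_PER_VIDEO_HOUR


# A 90-minute video works out to a nominal 7.5 minutes of processing.
print(estimate_processing_minutes(90))
```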
Where is my video processed?
Video is processed in the cloud. Lightweight proxies are used for analysis and deleted after processing. Your original files are never stored on our servers.
Does the API include face and voice recognition?
Face and voice recognition run on the desktop application using on-device models. The API provides transcription, object detection, and scene analysis.
When will the API be available?
FrameQuery is currently in pre-launch. Join the waitlist to be notified when the API becomes available.