How Video Indexing Works: From Raw Footage to Searchable Library

Video indexing turns opaque media files into searchable data through five steps: file detection, proxy generation, multi-modal analysis, local index building, and instant search. Here is what happens at each stage.

FrameQuery Team · 21 April 2026 · 5 min read

Video files are opaque by default. Your operating system sees a filename, a file size, and a date. It does not see the interview at minute 23, the product shot at minute 7, or the CEO's comments about next quarter. To make video content searchable, you need to extract what is inside each file and store it in a format a search engine can query.

That extraction process is video indexing. Here is how it works, step by step.

The five-stage video indexing pipeline from file detection to searchable index (example dashboard state: 47 files discovered, 3 queued, 4 processing, 38 indexed, 2 errors; success rate 81%)

Step 1: File detection

Indexing starts with knowing what files exist. FrameQuery scans the folders you configure and catalogs every video file it finds. It supports over 50 formats natively, including professional cinema formats like R3D, BRAW, CinemaDNG, and XAVC, alongside common formats like MP4, MOV, and MKV.

The initial scan builds a file manifest: path, format, duration, resolution, codec, creation date, and other technical metadata. This metadata is immediately indexed, so you can filter by date, resolution, or codec before any content analysis runs.
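A folder scan like this can be sketched in a few lines. The extension list, `ManifestEntry` fields, and `scan_folder` helper below are illustrative, not FrameQuery's actual implementation; a real scanner would also probe each file (e.g. with ffprobe) for duration and codec.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Hypothetical subset of the supported container formats.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".r3d", ".braw", ".dng", ".mxf"}

@dataclass
class ManifestEntry:
    path: str
    container: str   # normalized file extension, e.g. "mp4"
    size_bytes: int
    modified: float  # POSIX timestamp; duration/codec would come from a probe

def scan_folder(root: Path) -> Iterator[ManifestEntry]:
    """Walk a folder tree and catalog every video file found."""
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS:
            stat = p.stat()
            yield ManifestEntry(str(p), p.suffix.lstrip(".").lower(),
                                stat.st_size, stat.st_mtime)
```

Because the manifest carries technical metadata only, it can be built and indexed long before any AI analysis runs.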

Auto-scan watches your folders continuously. When you dump a new camera card or add footage from a shoot, FrameQuery detects the new files automatically and queues them for processing. No manual import step required.

Step 2: Proxy generation

Professional video files are large. A single hour of R3D footage can be 300 GB or more. Sending that to a processing server would take hours and waste bandwidth, because AI analysis does not need the full resolution original.

FrameQuery generates lightweight proxy files on your machine. These are compressed, lower-resolution copies that retain enough visual and audio quality for accurate analysis. A 300 GB original might produce a proxy of a few hundred megabytes.

Only the proxy is sent to the processing servers. Your original files never leave your machine. This approach keeps upload times reasonable and means you do not need to worry about full-resolution footage sitting on someone else's infrastructure.
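A proxy transcode of this kind is commonly done with ffmpeg. The sketch below builds such a command; the target height, CRF value, and audio bitrate are assumptions for illustration, not FrameQuery's actual settings.

```python
def proxy_command(src: str, dst: str, height: int = 540) -> list[str]:
    """Build an ffmpeg command that transcodes a full-resolution original
    into a small H.264 proxy. Settings here are illustrative only."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",       # downscale, preserve aspect ratio
        "-c:v", "libx264", "-crf", "28",   # heavy but analysis-safe compression
        "-c:a", "aac", "-b:a", "96k",      # keep audio for transcription
        dst,
    ]
```

Running the command with `subprocess.run(proxy_command(src, dst))` would produce the proxy locally, before anything is uploaded.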

Step 3: Multi-modal analysis

This is where the actual intelligence happens. The proxy is analyzed by multiple AI models running in parallel, each extracting a different layer of information.

Transcription. A speech-to-text model converts every spoken word into timestamped text. Timestamps are at the word level, meaning each individual word has a precise start and end time. Speaker diarization runs alongside transcription, segmenting the transcript by who is speaking. The result is a complete, speaker-attributed transcript of every video.
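Merging word-level timestamps with diarization turns is conceptually simple: assign each word to the speaker whose turn contains it. The `Word` type and `attribute_speakers` helper below are a minimal sketch of that merge, not FrameQuery's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def attribute_speakers(words, turns):
    """Merge word-level timestamps with diarization turns.
    `turns` is a list of (speaker, start, end); each word is
    assigned to the turn containing its temporal midpoint."""
    out = []
    for w in words:
        mid = (w.start + w.end) / 2
        speaker = next((s for s, t0, t1 in turns if t0 <= mid < t1), "unknown")
        out.append((speaker, w))
    return out
```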

Object detection. Computer vision models scan frames and identify objects: people, vehicles, electronics, furniture, food, animals, signage, and hundreds of other categories. Each detection is tied to a specific timecode, so you can search for "laptop" and find the exact frames where a laptop is visible.
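The key property is that every detection carries a timecode. A minimal sketch of how detection records resolve a label query straight to frame times (field names here are assumptions):

```python
from collections import defaultdict

def build_detection_index(detections):
    """Group (video_id, timecode_s, label) detection records by label,
    so a query like 'laptop' resolves directly to timestamped hits."""
    index = defaultdict(list)
    for video_id, t, label in detections:
        index[label].append((video_id, t))
    return index
```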

Scene description. A vision language model generates natural-language summaries of what is happening visually. Instead of just labeling individual objects, it describes the scene: "person presenting to a group in a conference room," "aerial shot of a coastline at sunset," "close-up of hands assembling a circuit board." These descriptions capture context that object labels alone miss.

Face detection. Faces are detected in video frames and converted into numerical embeddings (vector representations of facial features). These embeddings are used to cluster appearances of the same person across your library. Face processing runs 100% on your device, not in the cloud. Biometric data never leaves your machine, and face embeddings and person labels are never included in shared or exported indexes. Recipients of a shared index can run their own recognition locally and label people themselves.
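Clustering embeddings of the same person usually comes down to a similarity threshold in vector space. The greedy centroid-based approach below is one simple way to do it, shown purely for illustration; production systems typically use more robust methods such as agglomerative clustering.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_faces(embeddings, threshold=0.8):
    """Greedy clustering: each embedding joins the first cluster whose
    centroid it matches above `threshold`, else starts a new cluster.
    Threshold and strategy are illustrative assumptions."""
    clusters = []  # list of lists of member embeddings
    labels = []
    for e in embeddings:
        for i, members in enumerate(clusters):
            centroid = [sum(c) / len(members) for c in zip(*members)]
            if cosine(e, centroid) >= threshold:
                members.append(e)
                labels.append(i)
                break
        else:
            clusters.append([e])
            labels.append(len(clusters) - 1)
    return labels
```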

Each of these analysis passes produces structured data: transcript segments, object labels, scene descriptions, and face clusters. All timestamped.

Step 4: Local index building

The structured data from all four analysis passes is assembled into a search index on your machine. FrameQuery uses Tantivy, a Rust-based search engine, to build and maintain this index.

Tantivy is the same class of technology as Elasticsearch or Apache Lucene, but compiled to run locally with minimal overhead. The index stores every transcript word, every detected object, every scene description, and every face cluster, all cross-referenced with timestamps and video IDs.

The index is compact. An hour of video typically produces a few megabytes of index data. A 1,000-hour library might have an index of a few gigabytes, which is trivial compared to the footage itself.

Once built, the index is self-contained. It does not need a server, an internet connection, or any ongoing processing. It is a file on your drive that a search engine can query instantly.
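The core data structure behind any such index is an inverted index: a map from terms to the places they occur. Tantivy maintains compressed, on-disk postings with far more machinery, but the basic shape can be sketched in a few lines (the tuple layout here is an assumption for illustration):

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index over timestamped segments. Each posting maps a
    term to (video_id, timestamp, modality). This only shows the shape;
    a real engine like Tantivy compresses and persists its postings."""
    postings = defaultdict(list)
    for video_id, timestamp, modality, text in docs:
        for term in text.lower().split():
            postings[term].append((video_id, timestamp, modality))
    return postings

def lookup(postings, term):
    """Return every timestamped occurrence of a term, across modalities."""
    return postings.get(term.lower(), [])
```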

Step 5: Search

With the index built, searching is instant. You type a query and the search engine checks it against all four modalities simultaneously. Results are ranked by relevance using BM25 scoring and returned with timestamps, thumbnails, and the matched text or description.
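BM25 ranks a document higher the more often it contains a query term, discounted by document length and by how common the term is across the corpus. A compact implementation of the standard Okapi formulation (with the usual `k1`/`b` defaults) looks like this; it is a sketch of the scoring function, not Tantivy's internals:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized documents against query terms
    using the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = [0.0] * n
    for term in query_terms:
        df = sum(1 for d in tokenized if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)      # rarity weight
        for i, d in enumerate(tokenized):
            tf = d.count(term)
            if tf:
                denom = tf + k1 * (1 - b + b * len(d) / avgdl)
                scores[i] += idf * (tf * (k1 + 1)) / denom
    return scores
```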

A search for "Sarah explaining the product roadmap" could match across multiple modalities: Sarah's face (face recognition), the words "product roadmap" in the transcript (transcription), and a description of someone presenting (scene description). The search engine combines these signals into a single ranked result set.

Searches run entirely locally. There are no network requests, no API calls, and no per-query costs. Whether you run ten searches or ten thousand searches in a day, the cost is the same: zero.

Processing speed

The full four-modality analysis takes roughly five minutes per hour of video. This varies based on the content (dialogue-heavy footage takes slightly longer for transcription, visually complex footage takes slightly longer for scene analysis) but five minutes per hour is a reliable estimate for planning purposes.
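As a back-of-envelope check on that estimate, the arithmetic is straightforward; the helper name below is made up for illustration:

```python
def estimated_processing_hours(library_hours, minutes_per_hour=5.0):
    """At ~5 minutes of processing per hour of footage, estimate the
    total wall-clock hours to index a library end to end."""
    return library_hours * minutes_per_hour / 60.0
```

By this estimate, a 1,000-hour library works out to roughly 83 hours of continuous processing, or about three and a half days of overnight batches.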

You do not need to process your entire library at once. Start with active projects. Queue older footage for overnight processing. The index is additive, so newly processed videos are immediately searchable alongside everything already indexed.

What gets stored where

Understanding where data lives matters, especially for sensitive footage.

| Data | Location | Leaves your machine? |
|------|----------|----------------------|
| Original video files | Your drives | No |
| Proxy files | Your machine (temporary) | Sent to processing servers, deleted after |
| Transcripts and metadata | Local search index | No (returned by the processing servers, then stored locally) |
| Face and voice embeddings | Local encrypted storage | Never |
| Search index (Tantivy) | Your boot drive | No |

The processing servers see proxy files and return structured metadata. They do not store your proxies after processing completes. Biometric data (face embeddings, voice profiles) is generated and stored exclusively on your device.

From opaque files to instant answers

Video indexing is the process that transforms a folder of opaque media files into a searchable knowledge base. Each step strips away one layer of opacity: file detection catalogs what you have, proxy generation makes processing practical, multi-modal analysis extracts what is inside, index building makes it queryable, and search makes it instant.

The result is that your footage becomes as accessible as your email. You think of what you need, type it, and get a timestamped answer.

Join the waitlist to index your footage library when FrameQuery launches.