How Video Scene Detection Breaks Footage Into Searchable Segments
A 30-minute clip might contain 50 distinct shots. Scene detection identifies each transition and turns every segment into a searchable unit with its own description, objects, and metadata. Here is how it works.
A single video file is not a single piece of content. A 30-minute interview clip might contain 50 distinct shots: wide establishing shot, medium two-shot, close-up of the interviewer, close-up of the subject, cutaway to hands, B-roll insert, and back again. A 10-minute highlight reel might contain 80 or more shots cut together.
If you index the entire file as one unit, a search for "close-up of hands" has to return the whole 30-minute file. You still need to scrub to find the actual moment. The file is technically "found," but the specific shot is not.
Scene detection solves this by breaking each file into its constituent shots. Each shot becomes its own searchable segment with its own description, objects, timestamps, and metadata. Search stops returning files and starts returning moments.
For example, individual segments might carry descriptions like:
- Interview subject seated against a dark background, warm key light from camera left
- Conference stage with presenter at podium, blue and white lighting, audience visible
- Establishing shot of a harbor at golden hour, boats moored along the dock, warm light across the water
What scene detection identifies
Scene detection analyzes the visual content of a video frame by frame and identifies points where the content changes significantly. These changes fall into several categories.
Hard cuts. An abrupt transition from one shot to another. Frame 1000 shows an office interior; frame 1001 shows an outdoor landscape. This is the easiest type of transition to detect because the visual difference between adjacent frames is dramatic.
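A minimal sketch of that idea, using normalized grayscale histograms as a stand-in for real frame features. The frame representation and threshold here are illustrative, not any particular product's implementation:

```python
# Sketch: hard-cut detection via histogram differencing.
# Frames are represented as normalized grayscale histograms
# (lists of bin frequencies); the threshold is illustrative.

def hist_distance(h1, h2):
    """L1 distance between two normalized histograms (0.0 to 2.0)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def find_hard_cuts(histograms, threshold=0.5):
    """Return frame indices where the histogram jumps sharply,
    i.e. candidate hard-cut boundaries."""
    cuts = []
    for i in range(1, len(histograms)):
        if hist_distance(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts

# Two "shots": three dark frames followed by three bright frames.
dark = [0.9, 0.1, 0.0, 0.0]
bright = [0.0, 0.0, 0.1, 0.9]
frames = [dark] * 3 + [bright] * 3
print(find_hard_cuts(frames))  # [3] — the dark-to-bright jump
```

Because adjacent frames within a shot produce near-zero distances, a single fixed threshold catches hard cuts reliably; it is the gradual transitions that need more machinery.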
Dissolves and crossfades. A gradual blending from one shot into another over several frames. The visual change is spread across a range of frames rather than concentrated at a single point. Detection requires analyzing the rate of visual change over a window of frames, not just the difference between two adjacent ones.
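The windowed approach can be sketched on a toy one-dimensional brightness signal: flag windows where cumulative change is large even though no single frame-to-frame step looks like a hard cut. All numbers here are illustrative:

```python
def frame_diffs(values):
    """Per-frame absolute change for a 1-D brightness signal
    (a stand-in for real frame-difference metrics)."""
    return [abs(b - a) for a, b in zip(values, values[1:])]

def find_dissolves(values, window=5, total_thresh=0.5, cut_thresh=0.3):
    """Flag windows whose cumulative change is large even though
    no single step exceeds the hard-cut threshold."""
    diffs = frame_diffs(values)
    boundaries = []
    for i in range(len(diffs) - window + 1):
        chunk = diffs[i:i + window]
        if sum(chunk) >= total_thresh and max(chunk) < cut_thresh:
            boundaries.append(i + window // 2 + 1)  # middle of the blend
    return boundaries

# Brightness holds at 0.1, dissolves to 0.9 over 5 frames, then holds.
signal = [0.1] * 5 + [0.26, 0.42, 0.58, 0.74, 0.9] + [0.9] * 5
print(find_dissolves(signal))  # [6, 7, 8]
```

Note that several overlapping windows flag the same blend; a real detector would merge that run into a single boundary.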
Significant camera movements. A static shot that transitions into a pan, tilt, or dolly movement can indicate a new compositional intent even without a cut. Similarly, the end of a camera move into a static hold often marks a new shot in editorial terms. These are harder to detect reliably and depend on the sensitivity of the detection algorithm.
Fade to black or white. Common in edited content. A shot fades to black, holds, then a new shot fades in. Detection identifies both the fade-out and fade-in as boundaries.
The goal is not to identify every possible visual change. It is to find the boundaries that correspond to distinct editorial shots: the points where, if you were logging the footage manually, you would say "new shot here."
Why per-scene indexing beats per-file indexing
The difference between indexing per file and indexing per scene is the difference between a book index that lists chapter titles and one that lists every topic on every page.
With per-file indexing, a search for "aerial shot of highway" returns a list of files. One of them is a 45-minute assembly that contains a 6-second aerial shot somewhere in the middle. You found the file. Now you need to find the moment. You are back to scrubbing.
With per-scene indexing, the same search returns the specific 6-second segment with a thumbnail, timestamp, and direct playback link. You found the moment. No scrubbing required.
Per-scene indexing also produces more accurate search results. When an entire file is described as a single unit, the description has to generalize across all the content in the file. A 10-minute clip containing both indoor and outdoor shots might get a description like "mixed indoor and outdoor footage." That description matches many queries loosely and none of them precisely.
When each shot is described individually, the indoor shots get indoor descriptions and the outdoor shots get outdoor descriptions. A search for "outdoor" returns only the outdoor segments, not the entire file that happens to contain some outdoor footage.
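A toy illustration of the difference, using hypothetical segment records (any real index would store similar fields, but these names and timecodes are made up):

```python
# Sketch: why per-scene indexing returns moments, not files.
segments = [
    {"file": "assembly.mov", "start": "00:00:00", "end": "00:00:15",
     "description": "wide aerial shot of a coastal highway at dawn"},
    {"file": "assembly.mov", "start": "00:14:32", "end": "00:14:38",
     "description": "aerial shot of a highway interchange"},
    {"file": "assembly.mov", "start": "00:20:05", "end": "00:24:10",
     "description": "medium interview shot, office interior"},
]

def search_segments(query, index):
    """Naive keyword match over per-segment descriptions."""
    terms = query.lower().split()
    return [s for s in index
            if all(t in s["description"].lower() for t in terms)]

for hit in search_segments("aerial highway", segments):
    print(f'{hit["file"]} {hit["start"]}-{hit["end"]}: {hit["description"]}')
```

The same query against a single per-file description ("mixed aerial and interview footage") would return the whole 45-minute file; against per-segment descriptions it returns only the two aerial shots, each with its own timecode range.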
How granularity affects search precision
Scene detection granularity is the number of segments a file gets divided into. More segments means finer granularity, which means more precise search results. But there are trade-offs.
Too coarse (too few segments): multiple shots get grouped together, descriptions become vague, and search results point to ranges of footage rather than specific moments. You still need to scrub within the result.
Too fine (too many segments): individual camera moves or minor lighting changes create separate segments, descriptions become redundant, and search results contain many near-duplicate entries for what is functionally the same shot.
The sweet spot is editorial granularity: segments that correspond to the distinct shots a human editor would identify. A 30-minute multicam interview might produce 40 to 60 segments. A 5-minute montage might produce 30 to 50. A single locked-off wide shot with no cuts might produce just one.
Good scene detection adapts to the content. A fast-cut music video needs higher sensitivity than a static interview. A single-take documentary scene needs lower sensitivity than an edited highlight reel.
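One simple way to get that adaptation, sketched with illustrative numbers: derive the cut threshold from the footage's own frame-difference statistics, so busy footage gets a higher bar for what counts as a cut. This is a generic heuristic, not a description of any specific detector:

```python
# Sketch: adapting cut sensitivity to the footage. A fast-cut music
# video has a high baseline of frame-to-frame change, so a fixed
# threshold either over- or under-segments.

def adaptive_threshold(diffs, k=3.0):
    """Threshold = median frame difference times a sensitivity factor,
    with a floor so static footage still needs a real jump."""
    ordered = sorted(diffs)
    median = ordered[len(ordered) // 2]
    return max(k * median, 0.05)

static_interview = [0.01] * 20 + [0.8] + [0.01] * 20  # one hard cut
music_video = [0.2] * 10 + [0.9] + [0.2] * 10         # constant motion, one cut

for diffs in (static_interview, music_video):
    t = adaptive_threshold(diffs)
    cuts = [i for i, d in enumerate(diffs) if d > t]
    print(f"threshold={t:.2f}, cuts at {cuts}")
```

With a fixed threshold tuned for the interview, the music video's baseline motion would register as dozens of spurious cuts; scaling by the median absorbs it.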
What happens after detection
Scene detection is not the end of the process. It is the foundation. Once shot boundaries are identified, each segment goes through additional analysis.
Scene description. A vision model generates a natural language description of what happens in the segment: "Close-up of a person's hands adjusting a camera lens on a wooden workbench." This description is indexed for text and semantic search.
Object detection. Objects visible in the segment are identified and listed: camera, lens, workbench, hands, tools. These are indexed as structured data that can be searched or filtered.
Transcript alignment. If the segment contains speech, the corresponding portion of the transcript is associated with it. A search for a spoken phrase returns the specific segment where it was said, not the entire file.
Metadata extraction. Timestamps, duration, and position within the source file are recorded. When you find a segment, you can jump directly to its location in the original file.
The result is that each segment becomes a self-contained searchable unit with multiple layers of indexed information. The search engine can match against any combination of description, objects, transcript, and metadata.
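Such a segment record might look like the following sketch. The field names are illustrative, not an actual schema:

```python
# Sketch: a segment as a self-contained searchable unit with
# multiple indexed layers.
from dataclasses import dataclass, field

@dataclass
class Segment:
    source_file: str
    start_sec: float
    end_sec: float
    description: str                             # from the vision model
    objects: list = field(default_factory=list)  # detected objects
    transcript: str = ""                         # aligned speech, if any

    def matches(self, term: str) -> bool:
        """Match a term against any indexed layer."""
        t = term.lower()
        return (t in self.description.lower()
                or t in (o.lower() for o in self.objects)
                or t in self.transcript.lower())

seg = Segment("shoot.mov", 312.0, 318.5,
              "Close-up of hands adjusting a camera lens on a workbench",
              objects=["camera", "lens", "workbench", "hands"],
              transcript="so the focus ring here is fully manual")

print(seg.matches("lens"), seg.matches("manual"), seg.matches("drone"))
```

"lens" hits the description and object list, "manual" hits only the transcript, and "drone" hits nothing; the layers widen what a single query can find without widening what it falsely matches.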
A practical example
Consider a 20-minute corporate shoot where the camera rolled continuously, producing a single file. The footage contains:
- A 15-second establishing shot of the building exterior
- A 30-second walk-and-talk through the lobby
- A 4-minute interview segment (medium shot)
- A 2-minute cutaway to the product demo
- A return to the interview (different angle, close-up)
- 3 minutes of B-roll: office spaces, team meetings, whiteboard sessions
- A closing wide shot
Without scene detection, this is one 20-minute file. A search for "whiteboard" returns the whole file. A search for "building exterior" returns the whole file. Every query returns the same result, and you still need to find the moment manually.
With scene detection, this becomes roughly 15 to 20 separate segments. The establishing shot is its own segment described as "wide shot of a modern office building exterior." The whiteboard B-roll is its own segment with "whiteboard" in both the description and object list. Each is independently searchable, and each result points to the exact timecode range.
How FrameQuery handles scene detection
FrameQuery runs scene detection as the first step of the processing pipeline. The detection algorithm analyzes visual content to identify shot boundaries, producing a list of segments with start and end timecodes for each source file.
Each segment then passes through the analysis pipeline: scene description via a cloud-hosted vision model, object detection, and transcript alignment. Face and voice recognition run locally on your device. All results are stored in a local search index built on Tantivy, with both BM25 text matching and MiniLM semantic embeddings.
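As a rough sketch of how a text score and a semantic score can be blended in a hybrid index like this: the weights and toy vectors below are illustrative, and a real BM25 score (Tantivy's included) is unbounded, so it would need normalizing before the blend:

```python
# Sketch: hybrid scoring that blends a keyword score with
# embedding cosine similarity. Assumes bm25 is pre-scaled to 0-1.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_score(bm25, query_vec, seg_vec, w_text=0.5):
    """Weighted blend of a normalized BM25 score and the
    cosine similarity between query and segment embeddings."""
    return w_text * bm25 + (1 - w_text) * cosine(query_vec, seg_vec)

# Segment A matches the query's keywords strongly; segment B is a
# weaker keyword match but closer in embedding space.
q = [0.0, 1.0, 0.0]
print(hybrid_score(bm25=0.9, query_vec=q, seg_vec=[0.1, 0.3, 0.2]))
print(hybrid_score(bm25=0.1, query_vec=q, seg_vec=[0.0, 0.9, 0.1]))
```

The blend lets exact phrasing ("whiteboard") and paraphrased intent ("brainstorming session") both surface the same segments.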
The system handles the full range of professional footage. A locked-off interview produces a small number of long segments. A fast-cut montage produces many short segments. Multi-camera shoots where each camera ran continuously produce segments that correspond to the distinct shots visible from each angle.
Processing runs at roughly five minutes per hour of footage. Your originals stay on your machine. The resulting index is local, works offline, and supports 50+ formats natively, so you do not need to transcode before processing.
The foundation of useful video search
Scene detection is not the most visible part of a video search system. Editors interact with the search bar, the results list, and the preview player. But detection is the step that determines whether those results are useful. Without it, search points to files. With it, search points to moments.
The difference is the difference between "the clip you need is somewhere in this 45-minute file" and "the shot you need starts at 14:32 and runs for 6 seconds." The first answer saves you from checking every file. The second saves you from scrubbing through the one you found.
Join the waitlist to search your footage at the shot level when FrameQuery launches.