
How to Find Every Clip of a Specific Person Across Hours of Footage

Whether it is a talent pull, a compliance check, or an edit request, finding every appearance of one person across a large footage library is one of the most time-consuming tasks in post-production. Here is how face recognition changes the workflow.

FrameQuery Team · 8 April 2026 · 5 min read

The producer calls at 4 PM. "We need every shot of Sarah from the three-day shoot. The client wants a sizzle reel focused on her by tomorrow."

Three days of shooting. Four cameras. Roughly 20 hours of footage across interviews, B-roll, behind-the-scenes, and event coverage. Sarah appears intermittently throughout all of it. Sometimes on camera A, sometimes caught in the background of camera C, sometimes only her voice in an interview cutaway.

Finding every appearance of one specific person across a large footage library is one of the most tedious tasks in post-production. It is also one of the most common.

The manual approach

Without any tooling, this is a scrubbing job. Open each clip, watch it (usually at 1.5x or 2x speed), and note every timecode where Sarah appears. Then go back and verify your marks, because at 2x speed you inevitably miss a few. Background appearances, brief cutaways, shots where she walks through frame for two seconds.

For 20 hours of footage at 1.5x speed, that is roughly 13 hours of watching. Add time for note-taking, re-checking, and organising the results, and you are looking at two full days of work for one person request.

Now imagine the client asks for the same thing for three more people.

The metadata approach

Some teams try to solve this at ingest. Tag every clip with who appears in it. Build a database. Search the tags later.

In theory, this works. In practice, it requires someone to watch every clip at ingest and manually tag every person who appears. That is the same scrubbing problem, just moved earlier in the pipeline. And it only works if the tagging was done consistently, if the tagger knows everyone by name, and if nobody skipped a clip.

It breaks the moment the team is under time pressure, which is always.

How face recognition changes the workflow

Face recognition automates the hardest part: watching. Instead of a human scanning every frame for a familiar face, a model does it. The process works in stages.

Detection finds faces in frames and returns bounding boxes. At this stage, the system does not know who anyone is. It just knows where faces are.

Embedding takes each detected face and passes it through a recognition model that produces a mathematical representation: a compact numerical description of the face's features. Two photos of the same person produce similar embeddings. Two different people produce dissimilar ones.
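To make the detection and embedding stages concrete, here is a minimal sketch using the open-source InsightFace library and its buffalo_l model pack (the same recognition model FrameQuery uses, as described below). The frame paths are placeholders, and a real pipeline would sample frames from video rather than load stills:

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# Load the buffalo_l model pack: a face detector plus a recognition model.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 selects the first GPU; -1 for CPU

frame_a = cv2.imread("frame_a.jpg")  # placeholder frames extracted from footage
frame_b = cv2.imread("frame_b.jpg")

faces_a = app.get(frame_a)  # each result carries a bounding box and an embedding
faces_b = app.get(frame_b)

print("bbox:", faces_a[0].bbox)  # [x1, y1, x2, y2] in pixel coordinates

# normed_embedding is a unit-length 512-d vector, so the dot product is the
# cosine similarity. Same person -> close to 1, different people -> near 0.
similarity = float(np.dot(faces_a[0].normed_embedding, faces_b[0].normed_embedding))
print(f"cosine similarity: {similarity:.2f}")
```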

Clustering groups all detected faces by similarity. Without anyone providing names, the system can determine that the face in clip 7 at 00:04:12 is the same person as the face in clip 43 at 01:22:08. This works across different cameras, lighting conditions, and angles.
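The clustering step is, at its core, grouping those embedding vectors by distance. A minimal sketch, assuming all_face_embeddings (one vector per detected face) and face_locations (the clip and timecode each face came from) were collected during a detection pass like the one above; DBSCAN over cosine distance is one common choice, and the threshold here is illustrative, not tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Assumed inputs from the detection pass:
#   all_face_embeddings: list of unit-length 512-d vectors, one per detected face
#   face_locations: parallel list of (clip, timecode) for each face
embeddings = np.vstack(all_face_embeddings)  # shape: (n_faces, 512)

# Cosine distance = 1 - cosine similarity. eps is the "same person" threshold;
# 0.4 is illustrative, a real system would tune it against labelled footage.
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)

clusters = {}
for label, location in zip(labels, face_locations):
    if label == -1:
        continue  # noise: a face that matched nothing confidently
    clusters.setdefault(label, []).append(location)

# clusters[3] might now hold ("clip_07.mov", "00:04:12") and
# ("clip_43.mov", "01:22:08") -- the same face across different cameras.
```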

Naming is the one manual step. You review the clusters and assign names. "This cluster is Sarah. This one is the CEO." Once named, every appearance in that cluster becomes searchable.

Now the original request is trivial. Search for Sarah. Every clip where she appears comes back with timestamps. The 13-hour scrubbing job becomes a 30-second search.
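Once the clusters exist, naming and search reduce to bookkeeping. A sketch, continuing from the clusters dict above: assign a name to each reviewed cluster, invert into a person-to-appearances index, and look it up:

```python
# The one manual step: a human reviews each cluster and assigns a name.
cluster_names = {0: "Lena Moreau", 1: "Sarah Chen", 2: "James Park"}

# Invert into a searchable index: person -> every (clip, timecode) appearance.
appearances = {}
for cluster_id, sightings in clusters.items():
    name = cluster_names.get(cluster_id)
    if name is None:
        continue  # unreviewed or unknown cluster
    appearances.setdefault(name, []).extend(sightings)

# The 13-hour scrubbing job is now a dictionary lookup.
for clip, timecode in appearances.get("Sarah Chen", []):
    print(f"{clip} @ {timecode}")
```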

[Screenshot: the People panel after clustering, 5 people identified: Lena Moreau (7 face, 2 voice matches), Sarah Chen (5 face, 1 voice), James Park (4 face, 1 voice), Dr. Amara Osei (3 face, 1 voice), plus one unknown face. Face recognition runs 100% locally. Embeddings and labels are never included in shared indexes.]

Privacy matters more than you might think

Face recognition produces biometric data. The embeddings generated from someone's face are legally classified as biometric information under laws like BIPA (Illinois), GDPR (EU), and a growing list of state and international regulations. These laws carry real enforcement: BIPA allows individuals to sue directly, with statutory damages of $1,000 per negligent violation and $5,000 per intentional or reckless one, and GDPR fines can reach 4% of global annual revenue.

For video teams, the practical question is where biometric data is processed and stored. A cloud tool that keeps embeddings on its servers creates a compliance surface you need to manage: jurisdiction, access controls, retention, breach risk. On-device processing eliminates that entire category. If embeddings are computed locally and never leave your machine, there is no server-side biometric data to worry about.

For more on the legal landscape, see our posts on people matching and privacy, and our video editor's guide to biometric privacy law.

How FrameQuery handles person search

FrameQuery uses InsightFace Buffalo-L for face recognition and ECAPA-TDNN for voice recognition. Both models run 100% on your device. Biometric embeddings are computed locally and stored in an encrypted local database. They are never sent to FrameQuery's servers. When you share an index, embeddings and person labels are automatically stripped; recipients can run their own recognition and label people themselves.

The workflow:

  1. Process your footage. Lightweight proxies are sent to the cloud for transcription, object detection, and scene description. Face detection (finding where faces are) also runs in the cloud, but recognition (generating identifiable embeddings) runs locally.
  2. Review face clusters. FrameQuery groups detected faces into clusters. You review them in the app and assign names.
  3. Search by person. Use the @ syntax to filter results by person. @Sarah returns every moment where Sarah appears on screen or is heard speaking. Combine it with other search terms: @Sarah product demo finds moments where Sarah appears and the product demo is happening (conceptually an intersection of time ranges; see the sketch after this list).
  4. Export your selections. Found what you need? Export as FCPXML, EDL, Premiere XML, or LosslessCut CSV. The clips land on your NLE timeline pointing to your original source files.
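FrameQuery's query internals are not public, but conceptually a combined query like @Sarah product demo intersects time ranges: moments where the person appears, narrowed to moments where the transcript or scene description matches the remaining terms. A hypothetical sketch with invented data:

```python
def overlaps(a, b):
    """True if two (start_sec, end_sec) ranges overlap."""
    return a[0] < b[1] and b[0] < a[1]

# Assumed inputs: person sightings (from face/voice clusters) and transcript
# segments, both as time ranges in seconds.
sarah_ranges = [(252.0, 261.5), (4928.0, 4951.0)]
transcript = [
    ((250.0, 258.0), "so for the product demo we start here"),
    ((900.0, 910.0), "lunch is at noon"),
]

# "@Sarah product demo": keep keyword-matching transcript segments that
# overlap a range where Sarah appears.
hits = [
    (seg_range, text)
    for seg_range, text in transcript
    for person_range in sarah_ranges
    if "product demo" in text and overlaps(seg_range, person_range)
]
print(hits)  # [((250.0, 258.0), 'so for the product demo we start here')]
```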

Voice recognition adds another dimension. If Sarah is speaking off-camera or in a voiceover, her voice embedding matches even when her face is not visible. Face and voice together cover both on-screen and off-screen appearances.
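On the voice side, ECAPA-TDNN speaker models are available through the open-source SpeechBrain toolkit; here is a minimal sketch of speaker verification with placeholder file paths (FrameQuery's actual pipeline may differ):

```python
from speechbrain.inference.speaker import SpeakerRecognition
# (on older SpeechBrain releases: from speechbrain.pretrained import SpeakerRecognition)

# Load a pretrained ECAPA-TDNN speaker-verification model.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Compare a known sample of Sarah's voice against audio from a cutaway.
# Returns a similarity score and a same-speaker decision.
score, same_speaker = verifier.verify_files("sarah_reference.wav", "cutaway_audio.wav")
print(float(score), bool(same_speaker))
```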

Processing takes roughly five minutes per hour of footage. For the 20-hour shoot in our opening scenario, that is under two hours of processing (running in the background while you work on something else), followed by a few minutes of cluster review, followed by instant search results.

Compare that to two days of manual scrubbing.

Beyond talent pulls

Person search is useful well beyond sizzle reels. Compliance teams need to verify that specific individuals do not appear in published content. Documentary editors need to track subjects across months of footage. Corporate video teams need to pull every appearance of a departing executive before re-editing materials.

The core problem is always the same: find one person across a large volume of footage. Let face recognition do the watching, so you can focus on the editorial decisions.


Find anyone, across everything you have ever shot. Join the waitlist to try on-device face and voice search with your own footage.