How to Search Video by Who Said What

Transcript search finds the words. Speaker diarization identifies who said them. Combined, they let you search for specific statements by specific people across your entire footage library.

FrameQuery Team · 11 April 2026 · 5 min read

You know the CEO said something about quarterly targets. You also know it is buried somewhere in 30 hours of board meeting recordings spread across an entire year. Transcript search would find every mention of "quarterly targets" across all that footage, but the CEO was not the only person who discussed them. The CFO, the VP of Sales, and half the board mentioned targets too. You need to find what one specific person said, not every instance of the topic.

This is the gap between transcript search and speaker-aware search. Transcript search answers "where was this said?" Speaker-aware search answers "where did this person say this?"

The problem with transcripts alone

Full-text transcript search is already a significant improvement over manual scrubbing. Processing your footage to generate searchable transcripts means you can find any spoken word across your entire library in seconds. For single-speaker content like narration or solo presentations, that is usually sufficient.
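
To see the mechanics, here is a minimal sketch of plain full-text transcript search using SQLite's FTS5 module (assuming an SQLite build with FTS5 enabled; the schema and table names are invented for illustration, not FrameQuery's actual storage):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Illustrative schema: one row per transcript segment.
db.execute("""
    CREATE VIRTUAL TABLE segments USING fts5(
        file, start_time UNINDEXED, text
    )
""")
db.executemany(
    "INSERT INTO segments (file, start_time, text) VALUES (?, ?, ?)",
    [
        ("board_q1.mp4", "00:04:10", "The quarterly targets need to be revised."),
        ("board_q1.mp4", "00:07:32", "Marketing spend is tracking under budget."),
        ("board_q2.mp4", "00:12:05", "We should revisit the quarterly targets in June."),
    ],
)

# Phrase search returns every matching segment, regardless of who spoke.
for row in db.execute(
    "SELECT file, start_time, text FROM segments WHERE segments MATCH ?",
    ('"quarterly targets"',),
):
    print(row)
```

The query finds the words everywhere they occur; nothing in this index knows who said them.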

The problem surfaces with multi-speaker content, which describes most professional footage. Interviews have an interviewer and a subject. Meetings have a table full of participants. Panel discussions have moderators and guests. Documentaries intercut multiple subjects across dozens of sessions. Podcasts regularly feature two, three, or more voices.

In all of these scenarios, searching for a topic returns results from every speaker who mentioned it. If six people discussed budget projections across ten recordings, you get dozens of results to sift through. The search found the words. You still need to find the person.

What speaker diarization does

Speaker diarization is the process of analyzing an audio recording and segmenting it by who is speaking. The AI model listens to the audio, identifies distinct voices based on vocal characteristics, and labels each segment of the transcript with a speaker tag: Speaker 1, Speaker 2, Speaker 3.

This happens automatically during processing. No manual setup is required. The model (ECAPA-TDNN) analyzes the acoustic properties of each voice, including pitch, timbre, cadence, and spectral characteristics, to distinguish between speakers. It does not need to know who the speakers are in advance. It simply recognizes that the voice in one segment is different from the voice in another.
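
As a rough sketch of the technique (a conceptual approximation, not FrameQuery's production pipeline), diarization can be built from the public SpeechBrain ECAPA-TDNN checkpoint: embed short windows of audio, then cluster windows whose voices are similar. The window length and clustering threshold below are illustrative:

```python
import torch
import torchaudio
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

# Public ECAPA-TDNN speaker-embedding checkpoint from SpeechBrain.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def diarize(wav_path, window_s=1.5, hop_s=0.75, threshold=0.6):
    """Toy diarization: embed fixed-length windows, cluster by voice similarity.

    Real pipelines add voice-activity detection and handle overlapping speech;
    the window length and distance threshold here are illustrative.
    """
    signal, sr = torchaudio.load(wav_path)
    signal = signal.mean(dim=0, keepdim=True)          # downmix to mono
    win, hop = int(window_s * sr), int(hop_s * sr)
    windows = [signal[:, i:i + win]
               for i in range(0, signal.shape[1] - win + 1, hop)]

    # One fixed-size ECAPA-TDNN embedding per window.
    embs = torch.cat([encoder.encode_batch(w).squeeze(1) for w in windows])
    embs = embs.detach().cpu().numpy()

    # Windows whose voices sound alike land in the same cluster;
    # each cluster becomes one "Speaker N" tag.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold,
        metric="cosine", linkage="average",
    ).fit_predict(embs)

    return [(round(i * hop_s, 2), round(i * hop_s + window_s, 2), f"Speaker {l + 1}")
            for i, l in enumerate(labels)]
```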

The result is a transcript where every sentence is attributed to a specific speaker. Instead of a flat block of text, you get a structured conversation:

  • Speaker 1: "The quarterly targets need to be revised."
  • Speaker 2: "I agree, but we should wait for the final numbers."
  • Speaker 1: "We cannot wait. The board meeting is next week."

Each of these segments is independently searchable, and each is tied to a specific speaker identity.
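
Conceptually, each segment is a small record: file, timestamps, speaker, text. The field names below are hypothetical, not FrameQuery's schema, but they show why filtering by speaker becomes as cheap as matching on words:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    file: str        # source recording
    start: float     # seconds into the recording
    end: float
    speaker: str     # diarization tag, later replaced by a real name
    text: str

def search(segments, query, speaker=None):
    """Return segments containing the query, optionally from one speaker."""
    q = query.lower()
    return [s for s in segments
            if q in s.text.lower() and (speaker is None or s.speaker == speaker)]

segments = [
    Segment("board.mp4", 250.0, 253.1, "Speaker 1", "The quarterly targets need to be revised."),
    Segment("board.mp4", 253.4, 256.9, "Speaker 2", "I agree, but we should wait for the final numbers."),
    Segment("board.mp4", 257.2, 260.0, "Speaker 1", "We cannot wait. The board meeting is next week."),
]

search(segments, "quarterly targets")            # one hit, from Speaker 1
search(segments, "wait", speaker="Speaker 1")    # skips Speaker 2's reply
```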

  • [04:10] Sarah Chen: "Welcome everyone, thank you for joining us today for our annual company conference."
  • [04:28] Sarah Chen: "Before we dive in, I want to acknowledge the incredible work everyone has done this past year."
  • [08:15] James Park: "Our company has grown by 35% year-over-year and we have expanded into three new international markets."
  • [14:22] Dr. Amara Osei: "The migration patterns shifted dramatically. We tracked over two hundred species across the delta."

Speaker diarization attributes each transcript segment to the person who said it.

Naming your speakers

Speaker tags like "Speaker 1" and "Speaker 2" are functional but not memorable. FrameQuery lets you name your speakers so that generic tags become recognizable identities.

When you review processed footage, you can assign names to detected speakers. Speaker 1 becomes "Sarah Chen." Speaker 2 becomes "James Liu." Once named, those identities persist across your library.

This is where speaker diarization connects with face recognition. FrameQuery runs on-device face recognition (InsightFace Buffalo-L) to identify people visually. When a recognized face is speaking, the system can link the visual identity to the voice identity. Sarah's face on screen corresponds to Sarah's voice in the audio. The connection is automatic once both modalities have been processed.
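
One plausible mechanism for that link, sketched here as an assumption rather than a description of FrameQuery's internals, is overlap voting: whenever a recognized, named face is on screen while a diarized segment plays, that name earns a vote for the segment's speaker tag:

```python
from collections import Counter, defaultdict

def link_voices_to_faces(speech_segments, face_tracks, min_votes=3):
    """Vote a name onto each voice tag via temporal overlap with named faces.

    speech_segments: (start, end, speaker_tag) tuples from diarization
    face_tracks:     (start, end, person_name) tuples from face recognition
    Heuristic only: it assumes the person on screen is usually the one
    talking, so a production system would also weight lip activity,
    overlap duration, and recognition confidence.
    """
    votes = defaultdict(Counter)
    for s_start, s_end, tag in speech_segments:
        for f_start, f_end, name in face_tracks:
            if min(s_end, f_end) - max(s_start, f_start) > 0:
                votes[tag][name] += 1

    # Keep only confident matches; unmatched tags stay "Speaker N".
    return {tag: names.most_common(1)[0][0]
            for tag, names in votes.items()
            if names.most_common(1)[0][1] >= min_votes}
```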

With named speakers, you can search using a direct syntax. Searching for "@Sarah quarterly targets" returns only the moments where Sarah specifically discussed quarterly targets. Not every mention of the topic by anyone. Just Sarah's comments.
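
A query like that splits into a speaker filter plus ordinary search terms. Here is a hypothetical parser for the syntax (the real tokenization rules may differ; quoting multi-word names is this sketch's convention):

```python
import re

# "@Name terms" or '@"Multi Word Name" terms'; illustrative, not
# FrameQuery's actual grammar.
SPEAKER_QUERY = re.compile(r'^@(?:"([^"]+)"|(\S+))\s+(.*)$')

def parse_query(query):
    """Split '@Sarah quarterly targets' into (speaker, terms).

    Returns (None, query) when no speaker filter is present.
    """
    m = SPEAKER_QUERY.match(query.strip())
    if not m:
        return None, query.strip()
    return m.group(1) or m.group(2), m.group(3)

print(parse_query("@Sarah quarterly targets"))
# ('Sarah', 'quarterly targets')
print(parse_query('@"Dr. Martinez" vaccine development'))
# ('Dr. Martinez', 'vaccine development')
```

The speaker half drives a filter like the one sketched earlier; the remaining terms run through ordinary transcript search.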

The workflow in practice

The day-to-day workflow is straightforward. Process your footage. FrameQuery automatically transcribes the audio, identifies distinct speakers, and assigns speaker tags. You name the speakers you care about, which takes a few seconds per person. From that point on, every search can be filtered by speaker.

For a documentary editor working with 40 hours of multi-subject interviews, this transforms the research phase. Instead of watching every interview to find specific comments, you search for a topic and filter by the subject you want. The search returns timestamped results showing exactly where that person discussed that topic, with enough surrounding context to evaluate whether it is the moment you need.

For a podcast producer cutting a multi-guest episode, searching "@Dr. Martinez vaccine development" finds only the medical expert's comments on the topic, skipping the host's questions and the other guest's tangential remarks.

For a legal team reviewing deposition footage, the ability to search for a specific witness's statements about a specific topic is not just convenient. It is a fundamental requirement that previously demanded hours of manual review.

Working across sessions

One of the most useful aspects of speaker identification is that it works across multiple recordings. If the same person appears in five different interview sessions recorded on different days, their voice profile carries across all of them.
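
Mechanically, persistence across sessions amounts to comparing new voice embeddings against stored profiles for named speakers. The sketch below assumes a centroid embedding per person and an illustrative cosine-similarity threshold:

```python
import numpy as np

def match_known_speaker(embedding, profiles, threshold=0.75):
    """Match a new voice embedding against stored speaker profiles.

    profiles: {name: centroid embedding} averaged from previously named
    segments. The 0.75 threshold is illustrative; real systems tune it
    per model and per library.
    """
    best_name, best_sim = None, threshold
    for name, centroid in profiles.items():
        sim = np.dot(embedding, centroid) / (
            np.linalg.norm(embedding) * np.linalg.norm(centroid))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name  # None means "new speaker, assign a fresh tag"
```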

This matters for any project where you record the same people repeatedly. A documentary that follows subjects over months or years. A corporate archive of board meetings where the same executives appear regularly. A podcast with recurring guests.

Search once, find every instance across every recording. The system does not treat each file as an isolated unit. Your entire library is one searchable index, and speaker identities span the whole thing.

Privacy by design

Speaker voice analysis involves biometric data, which requires careful handling. FrameQuery computes all voice embeddings on your device with a locally running ECAPA-TDNN model. Voice fingerprint data never leaves your machine. It is not uploaded to any server, not stored in any cloud, and not accessible to anyone other than you.

Face recognition follows the same principle. The InsightFace Buffalo-L model runs entirely on-device with GPU acceleration (CUDA on Windows and Linux, Metal on macOS, with CPU fallback). Face embeddings stay local. Face thumbnails are encrypted at rest.
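
That fallback order is the standard pattern for local inference. Assuming a PyTorch runtime, device selection looks like the sketch below (InsightFace itself is commonly run through ONNX Runtime execution providers, which follow the same priority idea):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA on Windows/Linux, Metal (MPS) on macOS, else CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

# model.to(pick_device())  # inference never leaves the local machine
```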

This on-device approach means you can use speaker and face identification on sensitive footage (legal depositions, medical interviews, confidential corporate recordings) without biometric data ever leaving your control.

From "what was said" to "who said what"

Transcript search was the first step: making spoken words findable. Speaker diarization is the second: making those words attributable. The combination changes search from a blunt instrument that returns every mention of a topic into a precise tool that returns exactly the person and the statement you need.

For any workflow involving multi-speaker recordings, this is the difference between search results you can use immediately and search results you still need to manually filter. The filtering happens automatically, before you see the results.

Join the waitlist to search your footage by speaker when FrameQuery launches.