How to Search Your Video Library by Dialogue and Spoken Words
Automatic transcription with word-level timestamps and speaker diarization turns your footage library into something you can search by dialogue. Here is how transcript-first search works in practice.
You know someone said it. You can almost hear the phrasing. But you have 400 interview clips across six projects and no idea which file contains the line. So you start scrubbing. Forty-five minutes later, you either find it or give up.
Transcript search eliminates this entirely. If someone said it on camera, you type the words and land on the exact moment. No scrubbing, no guessing, no asking colleagues if they remember which interview it was.
This is not new technology. Automatic speech-to-text has been reliable for years. What has been missing is a search tool that applies it across your entire footage library (not just one project), preserves word-level timing, identifies who said what, and runs fast enough to feel like a conversation with your footage.
[Sample transcript excerpts: "Welcome everyone, thank you for joining us today for our annual company conference." · "Before we dive in, I want to acknowledge the incredible work everyone has done this past year." · "Our company has grown by 35% year-over-year and we have expanded into three new international markets." · "The migration patterns shifted dramatically. We tracked over two hundred species across the delta."]
How automatic transcription works
When FrameQuery processes a video file, one of the four analysis passes is transcription. The audio track is extracted and sent through a speech-to-text model that returns a complete transcript with timestamps at the word level.
"Word-level timestamps" means the transcript does not just say "this sentence was spoken between 00:01:15 and 00:01:22." It records that "quarterly" started at 00:01:17.340 and ended at 00:01:17.890. When you search for a specific word or phrase, the result links to the exact point in the timeline where it was spoken. Not the beginning of the sentence. Not the start of the clip. The word itself.
This precision matters when you are pulling selects for an edit. You do not want to land "somewhere near" the quote. You want the playhead on it.
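The word-level structure can be pictured as a list of timed words. The schema below is illustrative only (the field names and times are invented, not FrameQuery's actual format), but it shows why a search can land the playhead on the word itself rather than the start of the sentence:

```python
# Hypothetical word-level transcript data; field names are illustrative,
# not FrameQuery's actual schema. Times are in seconds.
words = [
    {"word": "our",       "start": 75.120, "end": 75.300},
    {"word": "quarterly", "start": 77.340, "end": 77.890},
    {"word": "results",   "start": 77.950, "end": 78.400},
]

def find_word(words, target):
    """Return the start time of the first occurrence of a word."""
    for w in words:
        if w["word"].lower() == target.lower():
            return w["start"]
    return None

print(find_word(words, "quarterly"))  # 77.34 — i.e. 00:01:17.340
```

Because each word carries its own start time, the search result can seek the player directly to 00:01:17.340 instead of the sentence or clip boundary.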
Speaker diarization adds who-said-what
Transcription alone tells you what was said. Speaker diarization tells you who said it. The diarization model analyzes the audio to distinguish between different speakers, labeling each segment of dialogue with a speaker identifier.
In FrameQuery, diarization runs as part of the transcription pass. The result is a transcript where each line is attributed to a specific speaker. Combined with face and voice recognition (which clusters people across your library), this means you can search not just for a phrase but for a phrase said by a specific person.
The practical difference: searching for "we need to revisit the timeline" might return hits from twelve different interviews. Searching for that phrase spoken by a particular person narrows it to the two interviews where they said it.
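Filtering by speaker then becomes a simple predicate over diarized segments. The sketch below uses generic diarization labels (SPEAKER_00 and so on) and invented field names to show the idea; it is not FrameQuery's internal code:

```python
# Illustrative diarized transcript: each segment carries a speaker label
# assigned by the diarization model. All names and times are invented.
segments = [
    {"speaker": "SPEAKER_00", "text": "we need to revisit the timeline", "start": 12.5},
    {"speaker": "SPEAKER_01", "text": "agreed, the budget is tight",     "start": 18.2},
    {"speaker": "SPEAKER_00", "text": "let's regroup on Thursday",       "start": 25.0},
]

def search_by_speaker(segments, speaker, phrase):
    """Return only the segments where a given speaker says a given phrase."""
    return [s for s in segments
            if s["speaker"] == speaker and phrase in s["text"]]

hits = search_by_speaker(segments, "SPEAKER_00", "revisit the timeline")
# One hit: the phrase spoken by SPEAKER_00, not the other speaker's lines.
```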
Exact phrase matching
FrameQuery's transcript search supports exact phrase matching using quotation marks. Searching "final delivery date" finds clips where those three words appear consecutively in that order. Without quotes, the search finds clips containing all three words, even if they appear at different points in the conversation.
This distinction matters in interview-heavy projects. A broad search for delivery date final might surface dozens of clips where those words appear independently. An exact phrase search for "final delivery date" narrows to the moments where that specific phrase was used.
Both modes are useful. Broad search is good for exploration when you are not sure of the exact wording. Exact phrase search is good when you remember the quote and need to find it quickly.
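The two modes can be sketched as follows, assuming a plain-text transcript. This is a simplified illustration of the matching logic, not FrameQuery's implementation:

```python
def broad_match(transcript, query):
    """Broad search: all query words appear somewhere, in any order."""
    words = set(transcript.lower().split())
    return all(q in words for q in query.lower().split())

def exact_match(transcript, phrase):
    """Exact phrase search: the words appear consecutively, in order."""
    return phrase.lower() in transcript.lower()

t = "we agreed the final delivery date is in March"
broad_match(t, "delivery date final")   # True: all three words present
exact_match(t, "final delivery date")   # True: consecutive, in order
exact_match(t, "delivery final date")   # False: wrong order
```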
The @mention syntax for speaker search
FrameQuery uses an @mention syntax for searching by speaker. Once people are identified in your library through face and voice recognition, you can reference them in search queries:
@Sarah finds every clip where Sarah speaks.
@Sarah budget finds clips where Sarah says something about the budget.
@Sarah @David finds clips where both Sarah and David are speaking.
This is particularly useful in documentary and corporate video work where you have recurring subjects across multiple shooting days. Instead of remembering which interview covered which topic, you search by person and keyword together.
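A query like this splits naturally into speaker mentions and keywords. The minimal parser below mirrors the @mention examples in this article, but the parsing itself is an illustrative sketch, not FrameQuery's code:

```python
import re

def parse_query(query):
    """Split a search query into @speaker mentions and plain keywords.

    Illustrative only: real queries may need quoted phrases and
    multi-word names, which this sketch does not handle.
    """
    speakers = re.findall(r"@(\w+)", query)
    keywords = [tok for tok in query.split() if not tok.startswith("@")]
    return speakers, keywords

parse_query("@Sarah budget")   # (["Sarah"], ["budget"])
parse_query("@Sarah @David")   # (["Sarah", "David"], [])
```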
Practical examples
Interview selects. You shot eight hours of interviews for a documentary across four days. The director wants every moment where any subject mentions "the factory closing." You search that phrase and get timestamped results across all eight hours in seconds. No logging sheets. No timecoded notes. Just a search.
Meeting recordings. Your company records weekly standups. Six months later, someone needs to know what was decided about the API migration. Search "API migration" and find every meeting where it came up, with the exact moment highlighted.
Documentary research. You are assembling a rough cut and need every instance of a particular subject discussing their childhood. Search @SubjectName childhood and pull every relevant moment from 30 hours of footage.
Multilingual footage. If your footage includes dialogue in multiple languages, transcription handles each language independently. You search in the language that was spoken. A Spanish-language interview is searchable in Spanish.
What transcript search does not cover
Transcript search is powerful but it only covers the audio dimension. A B-roll shot of a sunset, a product close-up with no narration, a time-lapse of a city skyline: these clips have no dialogue and produce no transcript data. They are invisible to transcript-only search.
This is why FrameQuery runs four analysis passes rather than one. Object detection, scene description, and face recognition cover the visual content that transcripts miss. A search like "sunset over the ocean" matches through scene descriptions even though nobody said those words on camera.
For footage that does contain speech, though, transcript search is usually the fastest and most precise way to find what you need. Spoken words are specific. Visual descriptions are interpretive. When someone said exactly the phrase you are looking for, transcript search gets you there instantly.
From transcript to timeline
Finding the moment is only half the workflow. Once you locate the right clips, FrameQuery lets you export selections as FCPXML, EDL, Premiere XML, or LosslessCut CSV. The export includes timecodes that reference your original source files, so the clips drop directly into your NLE timeline without any relinking or manual trimming.
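As a rough sketch, a list of selected moments can be serialized as start,end,label rows, which is the shape LosslessCut's segment CSV uses. Treat the exact column conventions as an assumption and verify against your tool before importing:

```python
import csv

def export_segments_csv(segments, path):
    """Write selected moments as start,end,label rows (times in seconds).

    Assumption: a LosslessCut-style segment CSV with three columns and
    no header. Check your tool's import format before relying on this.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for seg in segments:
            writer.writerow([seg["start"], seg["end"], seg["label"]])

selects = [
    {"start": 77.34, "end": 92.10, "label": "quarterly results quote"},
]
export_segments_csv(selects, "selects.csv")
```

Because the rows reference source timecodes rather than rendered media, the NLE can cut the original files directly, which is what makes relink-free import possible.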
The path from "I need the clip where she said X" to having that clip on your timeline takes seconds instead of the hour it used to take.
Join the waitlist to search your footage by dialogue when FrameQuery launches.