Workflows

How to Search Video by What Appears on Screen

Transcript search finds what was said. Object detection finds what was shown. For B-roll, product shots, and any footage without dialogue, searching by visible objects is the only way to find what you need without scrubbing.

FrameQuery Team · 22 May 2026 · 5 min read

You need every shot where a laptop appears on screen. Not shots where someone talks about laptops. Shots where a laptop is visually present in the frame. There are 300 clips from a two-day office shoot, and maybe 40 of them have a laptop visible somewhere. The only way to find those 40 is to watch all 300, unless you can search by what appears on screen.

This is the gap that object detection fills. Transcript search covers the spoken word. Scene description covers the broader context. Object detection covers the specific, tangible things that are visually present in your footage.

[Example search results: Product_Shoot_v3.mov at 01:15, 91% match — detected objects: laptop, coffee cup, notebook, pen, desk — "Product tabletop arrangement with laptop and accessories, soft studio lighting." B004_C003_BTS.R3D at 12:30, 78% match — detected objects: camera, tripod, monitor, lights, dolly — "Behind the scenes camera rig setup, crew preparing equipment on location."]

Object detection indexes every visible item, making visual-only footage searchable

The problem with dialogue-free footage

Most video search discussions focus on transcript search, and for good reason. Transcripts are powerful. But they have a fundamental blind spot: footage without dialogue.

B-roll is the most obvious example. Establishing shots, cutaways, product close-ups, lifestyle footage, atmospheric shots. These clips are the connective tissue of almost every edit, and none of them contain spoken words. Search a transcript index for "laptop" and you will find every time someone said the word "laptop." You will not find the 15 B-roll clips where a laptop sits silently on a desk.

Product shots fall into the same category. A 30-second hero shot of a product on a clean background contains no dialogue. Neither does a tracking shot through a showroom, a slow-motion capture of machinery in operation, or a drone flyover of a building.

For all of this footage, transcript search returns nothing. You need a different search modality, one that works with visual content rather than audio.

How object search works in practice

During processing, FrameQuery runs object detection models on sampled frames from your footage. Each frame is analyzed, and recognized objects are labeled: laptop, phone, car, dog, bottle, chair, monitor, and hundreds of other categories. These labels are stored in your local search index with timecodes.
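To make that step concrete, here is a minimal sketch of frame sampling and labeling, using OpenCV for frame grabs and an off-the-shelf ultralytics YOLO model as a stand-in detector. FrameQuery's actual models, sampling rate, and storage format are not public, so every name and threshold below is an assumption for illustration.

```python
import cv2  # pip install opencv-python
from ultralytics import YOLO  # pip install ultralytics

def index_video(path, sample_every_s=1.0, min_conf=0.5):
    """Sample frames at a fixed interval and record detected object labels."""
    model = YOLO("yolov8n.pt")  # stand-in model, pretrained on COCO's 80 everyday categories
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * sample_every_s))  # e.g. one sampled frame per second
    records = []  # (timecode_seconds, label, confidence)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            for box in model(frame, verbose=False)[0].boxes:
                conf = float(box.conf)
                if conf >= min_conf:
                    records.append((frame_idx / fps, model.names[int(box.cls)], conf))
        frame_idx += 1
    cap.release()
    return records
```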

When you search for an object, the engine queries those labels. Searching "laptop" returns every clip and timecode range where a laptop was detected on screen. Results come back with thumbnail previews so you can quickly scan the visual matches and pick the clips you need.

The search is fast because it queries pre-computed metadata, not the video files themselves. Your footage was analyzed once during processing. Every subsequent search is a database lookup that returns results in seconds, regardless of library size.
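The "database lookup" half is just as simple to sketch. Assuming the records from processing land in a local SQLite table (a hypothetical schema; the real index format is not documented here), a search is one indexed query that never touches the video files:

```python
import sqlite3

# Assumption: detections from processing were written to a local SQLite table.
conn = sqlite3.connect("framequery_index.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS detections ("
    "  clip TEXT, timecode REAL, label TEXT, confidence REAL)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_label ON detections (label)")

def search_object(label, min_conf=0.5):
    """Return (clip, timecode, confidence) for every detection of `label`."""
    return conn.execute(
        "SELECT clip, timecode, confidence FROM detections"
        " WHERE label = ? AND confidence >= ?"
        " ORDER BY clip, timecode",
        (label, min_conf),
    ).fetchall()

hits = search_object("laptop")  # an index lookup, never a video decode
```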

Combining object search with other modalities

Object search becomes more powerful when combined with other search types.

Object plus scene description. Searching "laptop" returns every clip with a laptop. Searching "laptop office meeting" combines the object detection result with scene description, returning clips where a laptop appears specifically in a meeting context. This filters out laptops in home settings, outdoor cafes, and other locations you do not need.

Object plus person. Searching "laptop @Sarah" returns clips where a laptop is visible and Sarah is on screen. Useful when you need shots of a specific person working at a computer, not just any laptop appearance.

Object plus transcript. Searching "laptop product demo" combines object presence with spoken content. This finds clips where a laptop is visible and someone is discussing a product demo. The object grounds the search in what is shown while the transcript grounds it in what is said.

Object plus time or metadata filters. Filter object search results by date range, resolution, duration, or source folder. Find all laptop appearances from last Tuesday's shoot, or all product shots in 4K resolution.

These combinations let you build precise queries that no single search type could handle alone.
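In spirit, a combined query is an AND across per-clip metadata. The sketch below uses a hypothetical in-memory record shape and plain substring matching for the scene and transcript fields; a real engine would likely use ranked or semantic matching, but the filtering logic is the same idea.

```python
# Hypothetical shape for a pre-computed index entry; fields are assumptions.
clips = [
    {"clip": "Product_Shoot_v3.mov", "t": 75.0,
     "objects": {"laptop", "coffee cup", "desk"},
     "people": set(),
     "scene": "product tabletop arrangement, soft studio lighting",
     "transcript": ""},
    {"clip": "Standup_A.mov", "t": 312.0,  # hypothetical clip name
     "objects": {"laptop", "chair"},
     "people": {"Sarah"},
     "scene": "office meeting around a conference table",
     "transcript": "let me pull up the product demo"},
]

def search(objects=(), person=None, scene_terms=(), transcript_terms=()):
    """AND together object, person, scene, and transcript constraints."""
    for entry in clips:
        if not set(objects) <= entry["objects"]:
            continue  # every requested object must be on screen
        if person is not None and person not in entry["people"]:
            continue
        if any(term not in entry["scene"] for term in scene_terms):
            continue
        if any(term not in entry["transcript"] for term in transcript_terms):
            continue
        yield entry["clip"], entry["t"]

# "laptop @Sarah" narrows laptop hits to shots where Sarah is on screen.
print(list(search(objects=["laptop"], person="Sarah")))
```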

What objects can be found

Object detection models recognize common, well-represented object categories. The practical vocabulary includes hundreds of everyday items: electronics (laptop, phone, monitor, keyboard, mouse), furniture (chair, table, desk, couch, bed), vehicles (car, truck, bus, bicycle, motorcycle), animals (dog, cat, bird, horse), kitchen items (cup, bottle, fork, knife, oven), and many more.

The model works at the category level. It detects "car" but does not distinguish between a Toyota and a Honda. It detects "bottle" but does not know if it is water or soda. For most editing workflows, category-level detection is exactly what you need. You are looking for shots containing a car, not shots containing a specific VIN.

Where object search has limitations

Small or distant objects. An object that occupies only a few pixels in the frame may not be detected. A pen on a desk in a wide shot, a bird in a distant sky, a logo on a far-away building. The closer and larger the object, the more reliably it is detected.

Unusual or specialized items. The model recognizes objects it was trained on, which skews toward common, widely photographed categories. Standard office supplies, vehicles, animals, and household items are covered. Specialized industrial equipment, custom prototypes, or niche items may not have a matching category.

Ambiguous objects. Items that look similar to other items may be misclassified. A tablet might be detected as a phone or a monitor depending on the angle. A thick book might be detected as a laptop when viewed from certain perspectives. Confidence scores help, but some ambiguity is inherent.

Quantity and arrangement. The model detects that an object is present, but counting exact quantities or understanding spatial relationships between objects is less reliable. It knows there are cups in the frame. It may not reliably tell you there are exactly four cups arranged in a row.
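These limitations are why detections carry confidence scores, and why it can help to discount very small detections. A simple post-filter like the following keeps only detections that are both confident and large enough in frame to be trustworthy; the thresholds are illustrative assumptions, not product defaults.

```python
def keep_reliable(detections, frame_w, frame_h, min_conf=0.6, min_area_frac=0.005):
    """Drop low-confidence and tiny detections, the least reliable cases.

    detections: [{"label": str, "conf": float, "box": (x1, y1, x2, y2)}, ...]
    """
    frame_area = frame_w * frame_h
    kept = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        area_frac = ((x2 - x1) * (y2 - y1)) / frame_area
        if det["conf"] >= min_conf and area_frac >= min_area_frac:
            kept.append(det)
    return kept

# A low-confidence pen filling 0.01% of a wide shot is dropped;
# a confident, frame-filling laptop is kept.
reliable = keep_reliable(
    [{"label": "pen", "conf": 0.55, "box": (10, 10, 40, 20)},
     {"label": "laptop", "conf": 0.92, "box": (400, 300, 1100, 800)}],
    frame_w=1920, frame_h=1080,
)
```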

Where this fills the gap

The footage types that benefit most from object search are precisely the ones where other search methods fall short.

B-roll libraries. Thousands of clips with no dialogue, minimal metadata, and vague folder names. Object detection makes every clip searchable by its visual contents.

Product and commercial footage. Finding every shot of the product across a multi-day shoot without watching everything. The product is the subject of the footage but rarely mentioned in any audio track.

Establishing shots and location footage. Searching for clips containing specific architectural or environmental elements: buildings, bridges, trees, water, vehicles.

Archival footage. Old footage with no metadata, no transcripts, and no one left who remembers what is in it. Object detection generates searchable metadata retroactively, making years of accumulated footage findable.

For all of these, the alternative to object search is manual scrubbing: watching footage at 2x speed and hoping you do not miss what you are looking for. Object detection does the watching for you, once, and makes the results permanently searchable.

Join the waitlist to search your footage by what appears on screen when FrameQuery launches.