Workflows
Searching Broadcast Archives: Making Decades of News Footage Findable
Broadcast stations sit on decades of archive footage that is barely catalogued. Retroactive AI indexing makes old footage valuable again without requiring a migration or re-tagging project.
Every broadcast station has an archive problem. Decades of footage sitting on SANs, nearline storage, and LTO tape libraries. Tens of thousands of hours spanning hard news, features, investigative pieces, weather coverage, live events, and B-roll. The footage is technically preserved, but practically inaccessible. Finding a specific clip from 2014 requires either exceptional institutional memory or a very lucky search through incomplete metadata.
The archive is simultaneously one of the station's most valuable assets and one of its least utilized. The cost of the footage has already been paid. The crews shot it, the editors cut it, the stories aired. All that content retains value for future stories, retrospectives, and context pieces, but only if someone can find it.
Most of it cannot be found. Not because it was lost, but because it was never adequately catalogued in the first place.
The cataloguing gap
Broadcast archive metadata was typically entered at the point of ingest. A tape operator or editor would log the date, story slug, reporter name, and a brief description. On a good day, the description included key subjects, locations, and topics. On a busy news day, it might include nothing beyond the slug and date.
This inconsistency compounds over years. An archive built by a dozen different operators across two decades has wildly varying metadata quality. Some clips have detailed descriptions. Others have a two-word slug. The footage from the year the station transitioned from tape to file-based workflows may have gaps from when the new process was still being worked out.
The result is an archive where searching by metadata produces unpredictable results. You find some things. You miss others. You never know what you missed.
The cost of not searching
When archive footage cannot be found, the alternative is to re-shoot or go without. Both have real costs.
Re-shooting means sending a crew to capture something that already exists in the archive. An establishing shot of a courthouse. B-roll of a busy intersection. An exterior of a business that has since closed. For a daily news operation, dispatching a crew costs time and resources that could be allocated to the current story. For historical footage, re-shooting is impossible. You cannot recapture an event that happened ten years ago.
Going without means the story airs with less context. A report on a policy change lacks footage from when the policy was first debated. A profile piece on a public figure omits earlier appearances that would add depth. The story is weaker because the footage was inaccessible, not because it did not exist.
Stations that can effectively search their archives produce richer, more contextualized coverage. The archive is a competitive advantage, but only when it is searchable.
Retroactive AI indexing
The traditional approach to making an archive searchable is a cataloguing project: hire archivists or assign staff to review every clip and add detailed metadata. For an archive of 20,000 hours, this is a multi-year effort costing hundreds of thousands of dollars. Most stations cannot justify the expense, so the archive remains under-catalogued indefinitely.
AI indexing inverts this approach. Instead of humans watching every clip and writing descriptions, AI models analyze the content and generate searchable metadata automatically. The coverage is comprehensive and consistent, regardless of the archive size.
A retroactive indexing pass applies four layers of analysis to every clip:
Transcription. Every word spoken in every clip is converted to searchable text with timestamps. Interviews, press conferences, reporter standups, live shots, and anchor reads all become searchable by what was said.
Speaker diarization. Each voice is identified and tagged, so searches can be filtered by who said something, not just what was said. This is especially valuable for broadcast archives where the same reporters, anchors, and public figures appear across hundreds of clips.
Scene description. AI-generated descriptions of the visual content in each segment. Aerials, establishing shots, close-ups, press conferences, courtroom footage, weather events. The descriptions capture what the camera shows, making B-roll and non-dialogue footage findable.
Face recognition. People are detected and clustered across the entire archive. Once a public figure, reporter, or anchor is identified, every appearance across every clip is linked. This runs on-device, keeping biometric data local.
The output is a search index where every clip is searchable by what was said, who said it, who appeared, and what was shown. No human reviewed any of it. The coverage is uniform across the entire archive, from yesterday's footage to clips from 15 years ago.
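The four layers can be pictured as one record per clip in the search index. A minimal sketch of such a record, with a naive keyword match across all four layers (the field names and `matches` helper are illustrative, not FrameQuery's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptSegment:
    start: float   # seconds from clip start
    end: float
    speaker: str   # diarization label, e.g. "anchor_01"
    text: str      # transcribed speech

@dataclass
class ClipIndexEntry:
    clip_id: str
    source_path: str
    transcript: list[TranscriptSegment] = field(default_factory=list)
    scene_descriptions: list[str] = field(default_factory=list)  # per-segment visual summaries
    faces: list[str] = field(default_factory=list)               # face clusters resolved to names

    def matches(self, query: str) -> bool:
        """Keyword match across what was said, who said it, what was shown, who appeared."""
        q = query.lower()
        return (
            any(q in seg.text.lower() for seg in self.transcript)
            or any(q in seg.speaker.lower() for seg in self.transcript)
            or any(q in d.lower() for d in self.scene_descriptions)
            or any(q in f.lower() for f in self.faces)
        )
```

A real index would use full-text and vector search rather than substring matching, but the shape of the record is the point: one clip, four independent ways to find it.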
Practical considerations for large archives
Processing a large broadcast archive is a significant but finite task. Here are the key factors.
Processing time. FrameQuery processes footage at roughly five minutes per hour of video. For reference:
- 5,000 hours (small station, five years): approximately 17 days of continuous processing
- 20,000 hours (mid-market station, 10-15 years): approximately 69 days
- 50,000 hours (major market station, decades): approximately 174 days
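The figures above are straightforward arithmetic from the five-minutes-per-hour rate, which can be checked or re-run for other archive sizes:

```python
def processing_days(archive_hours: float, minutes_per_hour: float = 5.0) -> float:
    """Continuous processing time in days, given a rate expressed as
    minutes of processing per hour of footage."""
    total_minutes = archive_hours * minutes_per_hour
    return total_minutes / (60 * 24)

for hours in (5_000, 20_000, 50_000):
    print(f"{hours:>6} hours -> {processing_days(hours):.0f} days")
```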
Processing runs in the background and can be paused and resumed. It does not need to complete before the index is useful. Clips are searchable as soon as they are individually processed, so the archive becomes incrementally more searchable over the processing period.
Format support. Broadcast archives contain a mix of formats depending on the era and the equipment used. Common formats include:
- MXF (XDCAM, P2, AVC-Intra) - the dominant broadcast acquisition format
- XAVC/XAVC-S - Sony's newer codec family
- ProRes (various flavors) - common in Apple-based edit environments
- DNxHR/DNxHD - Avid edit environments
- MPEG-2 - older broadcast and DVD-era footage
- H.264/H.265 MP4 - newer file-based acquisition and screen recordings
FrameQuery decodes all of these natively. No transcoding step is required. This matters for broadcast archives because transcoding tens of thousands of hours of footage would add weeks or months to the process and consume enormous storage.
Storage access. Archive footage typically lives on SAN storage, a NAS, or connected LTO libraries. FrameQuery reads from any mounted storage location. Point it at the archive volume, and it processes whatever it finds. Files do not need to be moved or copied.
For tape-based archives that have been migrated to file storage, the footage is ready to process as-is. For footage still on LTO, it needs to be restored to disk-accessible storage first. FrameQuery does not read directly from tape.
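Before pointing an indexer at a large mounted volume, it can help to survey what is actually there. A rough sketch using extension counts (a hypothetical helper, and only a proxy: an .mxf wrapper can hold XDCAM, P2, or AVC-Intra essence, so a real survey would inspect the contained codec):

```python
from collections import Counter
from pathlib import Path

def inventory(root: str) -> Counter:
    """Count files on a mounted archive volume by extension.

    Treat the result as a first survey of the archive's makeup,
    not a codec report.
    """
    counts: Counter = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            counts[p.suffix.lower()] += 1
    return counts
```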
Index size. The search index is compact relative to the source footage. A rough estimate is 1 to 2 MB of index data per hour of footage. A 20,000-hour archive produces an index of roughly 20 to 40 GB, which fits comfortably on any modern workstation.
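The index-size estimate is again simple arithmetic, sketched here assuming decimal gigabytes and taking the 1 to 2 MB per hour figure as a rough planning number rather than a guaranteed rate:

```python
def index_size_gb(archive_hours: float,
                  mb_per_hour: tuple[float, float] = (1.0, 2.0)) -> tuple[float, float]:
    """Low/high index size estimate in decimal GB (1 GB = 1000 MB)."""
    low, high = mb_per_hour
    return archive_hours * low / 1000, archive_hours * high / 1000
```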
Incremental indexing going forward
Once the backlog is processed, new footage is handled automatically. Source folder monitoring watches designated storage locations for new files. When today's footage is ingested to the archive, FrameQuery detects it and queues it for processing. The index stays current without manual intervention.
This means the cataloguing gap stops growing. Going forward, every clip is fully indexed with transcript, speaker, scene, and face data from the moment it enters the archive. The retroactive processing catches up the backlog. The automatic monitoring prevents a new backlog from forming.
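Conceptually, folder monitoring amounts to a repeated pass that compares the watched location against what has already been queued. A minimal stand-in sketch (FrameQuery's actual watcher is not documented here; a production version would also wait for a file to stop growing before queueing it, since broadcast files are often still copying when they first appear):

```python
from pathlib import Path

def new_arrivals(root: str, seen: set[str]) -> list[str]:
    """One polling pass over a watched folder.

    Returns paths not seen on previous passes and records them in
    `seen` so the next pass skips them. Call periodically (e.g. every
    30 seconds) and hand the results to the processing queue.
    """
    arrivals = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and str(p) not in seen:
            seen.add(str(p))
            arrivals.append(str(p))
    return arrivals
```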
Making the archive earn its keep
Broadcast archives represent millions of dollars of accumulated production investment. The footage was expensive to create. Maintaining the storage is an ongoing cost. The only way to justify that investment is to make the content accessible and useful for current production.
AI-powered search does not replace archivists or librarians. It gives them a tool that delivers content-level search at a scale manual cataloguing could never reach. And for stations that no longer have a dedicated archivist, it provides a search capability that would otherwise not exist at all.
Join the waitlist to make your broadcast archive searchable when FrameQuery launches.