Workflows

Dual-System Audio Sync: Auto-Aligning Boom and Camera Tracks

How FrameQuery automatically syncs boom and camera tracks using BWF timecode and acoustic fingerprinting, even when AGC mangles the camera signal.

FrameQuery Team · 7 May 2026 · 7 min read

Almost every professional shoot records sound twice. The boom or lavalier feeds a field recorder for the take you actually use. The camera records its own scratch track from the on-board mic so the editor has something to sync against. By the time the cards land on a drive, you have hundreds of clips on each side and a job nobody enjoys: matching them up.

This is dual-system sound, and editors have been paying a tax on it for as long as digital workflows have existed.

What is dual-system audio sync?

Dual-system audio sync is the process of aligning a separately recorded audio file (typically from a field recorder, boom, or lavalier) with the corresponding video clip from a camera, so the clean audio replaces the camera's on-board scratch track on the timeline. It can be done by hand, by timecode, by waveform analysis, or by a combination of all three. Automatic sync tools like PluralEyes, Resolve's audio sync, and FrameQuery aim to remove the manual step entirely.

The two ways to sync, and where each one breaks

There are really only two reliable signals for aligning a camera clip with a separate audio file: timecode and the audio waveform itself. Both are useful. Neither is sufficient on its own.

Timecode works when both devices were jammed from the same source and the recorder embedded BWF (Broadcast Wave Format) timecode in its files. When the chain holds, sync is exact. When it does not (different timecode islands, drift over a long day, a recorder set to free-run that nobody jammed at lunch, a camera that does not write timecode at all), you get clips that claim to share a clock but are seconds apart on the timeline.

Waveform sync sidesteps the clock entirely. It compares what the two devices actually heard and aligns them by acoustic content. This is what tools like PluralEyes popularised and what Resolve has built into its inspector. It works well when both mics captured similar audio. It struggles when they did not, and it really struggles when the camera mic was running automatic gain control (AGC).

Most of the time editors spend on sync goes to that gap: clips where timecode lies and the waveform looks too different to match cleanly.

How FrameQuery handles it

We built sync to assume the gap is the default case, not the edge case. The pipeline runs in two phases, and you do not have to start either of them. Both run in the background while your library indexes.

Phase one: timecode, but only if it is trustworthy

When a file lands in your library, FrameQuery reads any embedded BWF timecode and the camera clip's start timecode. If the two windows overlap within a one-second tolerance, the pair is treated as a high-confidence timecode match and linked automatically.

The one-second window is deliberate. Tighter than that and small drift on long takes kicks valid pairs out. Looser and you start linking clips that share a clock but were genuinely from different setups. One second is wide enough to absorb real-world drift, narrow enough that random collisions are rare.

If timecode is missing, ambiguous, or contradicted by other files in the same window, the match is not made. The pair drops to phase two.
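For readers who like to see the shape of it, here is a rough sketch of the overlap test in Python rather than the production code. The field names and the helper itself are illustrative; the tolerance is the one-second window described above.

```python
from dataclasses import dataclass

TOLERANCE_S = 1.0  # the one-second window described above

@dataclass
class Clip:
    start_s: float     # embedded timecode start, seconds since midnight
    duration_s: float  # clip length in seconds

def timecode_overlap(audio: Clip, video: Clip, tol: float = TOLERANCE_S) -> bool:
    """True if the two clips' timecode windows overlap within the tolerance."""
    a_start, a_end = audio.start_s, audio.start_s + audio.duration_s
    v_start, v_end = video.start_s, video.start_s + video.duration_s
    # Shared time on the common clock; negative means a gap between the windows.
    overlap = min(a_end, v_end) - max(a_start, v_start)
    return overlap >= -tol

# Pairs that pass become high-confidence timecode matches;
# everything else drops to phase two.
```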

Phase two: acoustic fingerprinting, then waveform

Phase two runs on every file regardless of whether phase one succeeded. It is the part that matters when timecode cannot be trusted.

The first step is an audio fingerprint. FrameQuery extracts a compact acoustic signature from every clip and indexes it in a local database. The algorithm is a custom Rust implementation of spectral landmark fingerprinting, the technique behind the original Shazam paper. We did not pull in an off-the-shelf library because the off-the-shelf options are tuned for the wrong problem. Chromaprint and its peers are designed for matching a song against a catalogue of millions of songs. We are matching a field recorder's clean signal against the same audio captured by a camera mic two metres away, with automatic gain riding the level the entire time.
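The production fingerprinter is Rust, but the landmark idea is small enough to sketch in Python: find spectrogram peaks, pair each peak with a handful of later peaks, and hash the (frequency, frequency, time-delta) triple. Everything below (window sizes, thresholds, fan-out) is illustrative, not FrameQuery's actual parameters.

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def landmark_hashes(audio: np.ndarray, sr: int = 48_000, fan_out: int = 5):
    """Illustrative spectral-landmark fingerprint: pick spectrogram peaks,
    pair each anchor peak with a few later peaks, hash (f1, f2, dt)."""
    # Magnitude spectrogram
    f, t, S = signal.stft(audio, fs=sr, nperseg=1024, noverlap=768)
    mag = np.abs(S)

    # Keep local maxima that stand out from their neighbourhood
    peaks = (mag == maximum_filter(mag, size=(15, 15))) & (mag > mag.mean() * 3)
    fi, ti = np.nonzero(peaks)
    order = np.argsort(ti)
    fi, ti = fi[order], ti[order]

    # Pair each anchor peak with the next few peaks (the "landmarks")
    hashes = []
    for i in range(len(ti)):
        for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
            dt = ti[j] - ti[i]
            if 0 < dt <= 64:  # only pair with peaks nearby in time
                h = (int(fi[i]) << 20) | (int(fi[j]) << 8) | int(dt)
                hashes.append((h, int(ti[i])))  # hash plus anchor frame
    return hashes
```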

Camera AGC is the silent killer of waveform sync. The camera is constantly adjusting its input level to keep things audible, which means the loud peaks on the field recorder do not look loud on the camera track. A naive correlation gives up. FrameQuery's fingerprinter applies PCEN (per-channel energy normalisation), spectral whitening, and pre-emphasis before extracting peaks, which means the fingerprints are derived from the parts of the signal that survive AGC and cheap-mic frequency response.
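As a sketch of what that preprocessing looks like, here is the same chain expressed with librosa's PCEN. The parameter choices are illustrative, not the production values.

```python
import numpy as np
import librosa

def agc_robust_spectrogram(y: np.ndarray, sr: int = 48_000) -> np.ndarray:
    """Sketch of AGC-robust preprocessing: pre-emphasis, mel spectrogram,
    PCEN, then per-band whitening. Parameters are illustrative."""
    # Pre-emphasis: boost high frequencies so consonant transients stand out
    y = librosa.effects.preemphasis(y, coef=0.97)

    # Mel power spectrogram
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256)

    # PCEN suppresses slow gain changes (the camera's AGC) while keeping onsets
    P = librosa.pcen(S * (2**31), sr=sr, hop_length=256)

    # Per-band whitening: remove level differences from cheap-mic response
    P = (P - P.mean(axis=1, keepdims=True)) / (P.std(axis=1, keepdims=True) + 1e-8)
    return P
```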

Once fingerprints exist on both sides, we look up each query file's hashes in an inverted index across your entire library, including RAW camera formats, and build a histogram of time offsets. Hashes that show up everywhere (room tone, hum, generic transients) are weighted down using inverse document frequency, the same trick search engines use to ignore common words. The result is a coarse offset between two files, accurate to about 200 milliseconds.
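A simplified sketch of that voting step, assuming the fingerprints from above and a plain Python dict standing in for the inverted index (the real index is a local database, and the scoring details here are illustrative):

```python
from collections import Counter, defaultdict
import math

def coarse_offsets(query_hashes, index, num_files, hop_s=0.005):
    """Vote on a coarse time offset between a query file and each library file.

    query_hashes: list of (hash, anchor_frame) from the query clip
    index:        dict hash -> list of (file_id, anchor_frame) across the library
    num_files:    total files indexed (for IDF weighting)
    hop_s:        seconds per spectrogram frame
    """
    votes = defaultdict(Counter)  # file_id -> Counter of frame offsets
    for h, t_query in query_hashes:
        postings = index.get(h, [])
        if not postings:
            continue
        # Inverse document frequency: hashes present in many files
        # (room tone, hum, generic transients) carry little weight
        df = len({fid for fid, _ in postings})
        idf = math.log((num_files + 1) / (df + 1))
        for fid, t_ref in postings:
            votes[fid][t_ref - t_query] += idf

    # Best candidate per file: the offset bin with the most weighted votes
    results = {}
    for fid, counter in votes.items():
        best_offset, score = counter.most_common(1)[0]
        results[fid] = (best_offset * hop_s, score)
    return results
```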

Two hundred milliseconds is not sync. It is a candidate.

The candidate is then refined with two passes of envelope cross-correlation. The first pass uses a spectral onset envelope, which fires sharply on transients like consonants, claps, or a door closing. This pins down the lag to within a few samples. The second pass uses a multi-scale amplitude envelope to score the alignment over the overlapping region, which is what tells us whether the match is real or coincidental. The peak position is interpolated parabolically between samples, so the final offset is sub-sample accurate. At 48 kHz that is well under a frame.
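Here is a condensed sketch of the refinement idea: cross-correlate onset envelopes, then interpolate the correlation peak parabolically for a fractional lag. The multi-scale amplitude-envelope scoring pass is omitted for brevity, and the score shown is a crude stand-in.

```python
import numpy as np
import librosa
from scipy import signal

def refine_offset(y_ref, y_cam, sr=48_000, hop=64):
    """Refine a coarse offset: cross-correlate spectral onset envelopes,
    then interpolate the peak parabolically for sub-hop accuracy."""
    # Onset envelopes fire sharply on transients (consonants, claps, doors)
    env_ref = librosa.onset.onset_strength(y=y_ref, sr=sr, hop_length=hop)
    env_cam = librosa.onset.onset_strength(y=y_cam, sr=sr, hop_length=hop)

    # Normalise so AGC level differences do not bias the correlation
    env_ref = (env_ref - env_ref.mean()) / (env_ref.std() + 1e-8)
    env_cam = (env_cam - env_cam.mean()) / (env_cam.std() + 1e-8)

    corr = signal.correlate(env_cam, env_ref, mode="full")
    lags = signal.correlation_lags(len(env_cam), len(env_ref), mode="full")
    k = int(np.argmax(corr))

    # Parabolic interpolation around the peak gives a fractional lag
    if 0 < k < len(corr) - 1:
        a, b, c = corr[k - 1], corr[k], corr[k + 1]
        frac = 0.5 * (a - c) / (a - 2 * b + c)
    else:
        frac = 0.0

    offset_s = (lags[k] + frac) * hop / sr
    score = float(corr[k] / len(env_ref))  # crude stand-in for the real score
    return offset_s, score
```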

Confidence tiers, and why we do not just auto-link everything

Most of the work in a sync tool is not finding matches. It is deciding which matches to trust.

FrameQuery sorts every candidate into one of three tiers based on the cross-correlation score and the number of supporting hash matches:

  • High confidence (correlation at least 0.85 with twenty or more matching hashes): linked automatically. You do not see anything; the audio is already paired with the video when you next open the clip.
  • Medium confidence (correlation at least 0.6 with ten or more matching hashes): staged for review in the audio match dialog. You see the pair, the offset, the score, and a waveform preview. You confirm or reject with a click.
  • Low confidence: discarded. Nothing surfaces, because a wrong auto-link is worse than no link at all.
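The tier decision itself is just a pair of thresholds. Sketched in Python using the numbers quoted above:

```python
from enum import Enum

class Tier(Enum):
    HIGH = "auto-link"
    MEDIUM = "stage for review"
    LOW = "discard"

def classify(correlation: float, matching_hashes: int) -> Tier:
    """Thresholds mirror the tiers above; the values came from testing, not a paper."""
    if correlation >= 0.85 and matching_hashes >= 20:
        return Tier.HIGH
    if correlation >= 0.6 and matching_hashes >= 10:
        return Tier.MEDIUM
    return Tier.LOW
```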

The thresholds were not picked from a paper. They came out of running the system across real production audio with known ground truth and finding the point where false positives effectively went to zero. The medium tier exists precisely because the right answer for ambiguous pairs is "ask the human", not "guess".

Where this still falls back to you

We are not going to pretend this replaces every sync workflow. A few honest limits:

  • The two files have to share audio. If the boom was off or the camera was rolling on something the field recorder never heard, there is nothing to align against. No algorithm fixes that.
  • Heavy continuous noise lowers confidence. Wind, traffic, ocean, generators. The fingerprints still work, but more pairs will land in the review tier rather than auto-linking.
  • Very short clips are harder. Below about three seconds of overlapping audio, the cross-correlation score has too few independent peaks to discriminate confidently.
  • A clean slate is still the right call on a chaotic set. Sync claps exist for a reason. They give the algorithm an unambiguous transient that survives any mic or AGC. If you have them, the system uses them; if you do not, it does its best with what was actually recorded.

What it looks like in practice

Drop your camera cards and your audio cards into the same FrameQuery library. Indexing runs in the background while you start logging. Within a few minutes, the obvious pairs (timecode-jammed, clean audio, strong transients) are already linked. The medium-confidence stack lands in the AudioMatchReviewDialog with the offsets, scores, and a quick preview, and you walk through it the way you would walk through a stack of selects: confirm, confirm, reject, confirm.

By the time you are ready to cut, sync is mostly already done. The only thing left in your queue is the handful of pairs the algorithm was not sure about, which are exactly the pairs you would want a human to look at anyway. From there, your synced selects can move straight onto a timeline using FrameQuery's FCPXML export, or be filtered down further with speaker-aware search once the clean audio is attached.

Frequently asked questions

Can I sync separate audio without timecode?

Yes. FrameQuery's second matching phase uses acoustic fingerprinting and waveform cross-correlation, which does not require timecode at all. As long as the camera scratch track and the field recorder share some overlapping audio, the system can recover an accurate offset, typically to sub-frame precision at 48 kHz.

How does FrameQuery handle camera AGC?

Most off-the-shelf waveform sync tools fail when the camera was running automatic gain, because peaks no longer line up. FrameQuery applies per-channel energy normalisation (PCEN), spectral whitening, and pre-emphasis before extracting fingerprint peaks, so the features being matched survive AGC and cheap-mic frequency response. In practice this is the single biggest reliability difference on run-and-gun footage.

What happens to pairs the algorithm is not sure about?

Low-confidence candidates are discarded. Medium-confidence candidates (correlation around 0.6 to 0.85) are surfaced in a review dialog with the offset, score, and a waveform preview, so you can confirm or reject them in a single click rather than re-syncing from scratch. High-confidence pairs are linked automatically.


Join the waitlist to try dual-system audio sync against your own production footage.