
How We Built People Matching: Face and Voice Recognition With Privacy That Actually Holds Up

FrameQuery can now match faces and voices across your entire video library. Here is how we built it with InsightFace, ECAPA-TDNN, and an encryption architecture designed for GDPR, CCPA, and BIPA compliance.

FrameQuery Team · 23 February 2026 · 6 min read

FrameQuery can now match people across your entire video library by face and voice. Assign a name to someone once, and every video they appear in becomes searchable by who is in it, not just what is in it.

Building this required integrating two neural networks, designing an encrypted local biometric database, and making the whole thing compliant with GDPR, CCPA, and BIPA from the start. The privacy constraints shaped nearly every technical decision, so this post covers both how the matching works and why the architecture looks the way it does.

The Problem

Video search by transcript and visual content gets you most of the way there. But queries like "CEO speaking" or "Alice B-roll" require knowing who is in the footage, not just what objects or scenes are present. Manual tagging does not scale. We needed automatic matching that works across a library of thousands of videos.

This breaks down into two distinct problems: recognizing faces across different videos, lighting conditions, and angles, and identifying speakers across different recordings, microphones, and acoustic environments.

Face Matching

Face recognition runs a two-model pipeline from the InsightFace buffalo_l suite. The detection model (det_10g.onnx) finds faces in frames and returns bounding boxes. The recognition model (w600k_r50.onnx) takes each cropped face and produces a 512-dimensional embedding vector.

The processing flow splits between cloud and desktop deliberately:

  1. The cloud-side face worker runs detection only, generating bounding boxes and thumbnails
  2. Thumbnails appear in the desktop app, where you assign names and give consent
  3. The desktop downloads relevant frames via signed URLs
  4. The desktop runs InsightFace recognition locally: crop the bounding box, generate the 512-d embedding
  5. The embedding is stored encrypted in a local SQLite database

The split matters for privacy. Detection (finding where faces are) happens in the cloud. Recognition (turning a face into a unique vector that could identify someone) happens exclusively on your machine.

Similarity Search

Matching uses brute-force cosine similarity across the embedding index. We chose brute-force over approximate nearest neighbors (ANN) to avoid SIMD compilation issues across Windows, macOS, and Linux. For the library sizes most editors work with (hundreds to low thousands of identified people), cosine distance on 512-dimensional vectors is fast enough that ANN indexing is not worth the portability cost.

The distance metric is 1.0 - cosine_similarity, where lower means more similar. Results come back sorted by distance, and the top matches get linked to the person record in your local database.
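
The brute-force scan described above can be sketched in a few lines of pure Python. The function and key names here are illustrative, not FrameQuery's actual API; a real index would hold 512-d vectors rather than these toy 3-d ones.

```python
import math

def cosine_distance(a, b):
    """1.0 - cosine_similarity; lower means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def search(query, index, top_k=5):
    """Brute-force scan of the whole index, results sorted by distance."""
    scored = [(cosine_distance(query, vec), key) for key, vec in index.items()]
    scored.sort()
    return scored[:top_k]

index = {
    "face:alice:0": [0.6, 0.8, 0.0],
    "face:bob:0":   [1.0, 0.0, 0.0],
}
results = search([0.6, 0.8, 0.0], index, top_k=1)  # alice is the closest match
```

For a few thousand 512-d vectors this is a handful of milliseconds even in a naive implementation, which is why the portability cost of an ANN library is hard to justify at this scale.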

The face index is serialized as a flat binary format: an entry count followed by packed records of key, dimension count, and float values. The entire file is encrypted at rest with AES-256-GCM.
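
A flat format like the one described might look as follows. The exact field widths and ordering are an assumption based on the description (entry count, then per record a key, a dimension count, and the float values); this sketch shows only the serialization, not the AES-256-GCM layer applied on top.

```python
import struct

def serialize_index(index):
    """Pack {key: [floats]} as: u32 entry count, then per entry a
    u16 key length, UTF-8 key bytes, u32 dim count, and f32 values."""
    out = [struct.pack("<I", len(index))]
    for key, vec in index.items():
        kb = key.encode("utf-8")
        out.append(struct.pack("<H", len(kb)) + kb)
        out.append(struct.pack("<I", len(vec)))
        out.append(struct.pack(f"<{len(vec)}f", *vec))
    return b"".join(out)

def deserialize_index(blob):
    index, off = {}, 0
    (count,) = struct.unpack_from("<I", blob, off); off += 4
    for _ in range(count):
        (klen,) = struct.unpack_from("<H", blob, off); off += 2
        key = blob[off:off + klen].decode("utf-8"); off += klen
        (dim,) = struct.unpack_from("<I", blob, off); off += 4
        vec = list(struct.unpack_from(f"<{dim}f", blob, off)); off += 4 * dim
        index[key] = vec
    return index
```

A flat format like this loads with a single sequential read and needs no schema migration machinery, which suits a file that is decrypted into memory in one piece anyway.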

Voice Matching

Voice identification uses ECAPA-TDNN, a speaker recognition model from SpeechBrain trained on VoxCeleb. It takes 16 kHz mono audio and outputs a 192-dimensional embedding.

The pipeline builds on top of existing transcript data (which already includes speaker diarization from cloud processing):

  1. For each speaker segment in the transcript, the desktop extracts the longest continuous audio chunk (minimum three seconds)
  2. Audio gets resampled to 16 kHz mono
  3. A mel spectrogram is computed: 80 filterbank channels, 25ms Hann windows, 10ms hops, covering 0-8 kHz, using a 512-point FFT
  4. The spectrogram feeds into the ECAPA-TDNN model, producing the 192-d voice embedding
  5. The embedding is stored encrypted alongside face embeddings

The mel spectrogram extraction is implemented in pure Rust using rustfft, with no dependency on Python audio libraries. This keeps the desktop app self-contained.
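
The numbers in the pipeline above fit together as sketched below: 25 ms windows and 10 ms hops at 16 kHz give 400-sample frames with a 160-sample stride, and the 80 mel bands are spaced evenly on the mel scale between 0 and 8 kHz. Whether the Rust implementation uses the HTK mel formula shown here or the Slaney variant is an assumption.

```python
import math

SAMPLE_RATE = 16_000
WIN = int(0.025 * SAMPLE_RATE)   # 25 ms Hann window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms hop -> 160 samples
N_FFT = 512                      # 512-point FFT -> 257 frequency bins
N_MELS = 80

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=N_MELS):
    """n_mels + 2 band-edge frequencies, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels + 1)) for i in range(n_mels + 2)]

def n_frames(n_samples):
    """Number of full analysis frames, without padding."""
    return 1 + (n_samples - WIN) // HOP

edges = mel_band_edges()
# A 3-second chunk (the minimum) yields 1 + (48000 - 400) // 160 = 298 frames.
```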

Like faces, voice embeddings are matched with cosine similarity. When you link a voice embedding to a person who also has face embeddings, you can find them by either modality: face in frame, or voice in the audio track.

What You Can Actually Search For

Once people are set up, the search capabilities are concrete:

  • "Alice" resolves the name to all linked biometric IDs (face and voice), then queries the occurrence table across your library
  • "Alice speaking" narrows to voice occurrences with timestamp ranges
  • People panel on a video returns all matched people with their appearance types and confidence scores

Name resolution is case-insensitive. The video_biometric_occurrences table logs every match with the video ID, biometric ID, match type (face or voice), and metadata including confidence and timestamps. You can also merge duplicate person records if you realize two entries are the same individual, and all biometric links transfer to the merged record.
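
The resolution path can be sketched against an in-memory SQLite database. The occurrence table's columns follow the description above, but the exact schema, the `people` and `biometric_links` table names, and the helper function are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE biometric_links (person_id INTEGER, biometric_id TEXT);
CREATE TABLE video_biometric_occurrences (
    video_id TEXT, biometric_id TEXT,
    match_type TEXT,          -- 'face' or 'voice'
    metadata TEXT             -- JSON: confidence, timestamps
);
""")
conn.execute("INSERT INTO people VALUES (1, 'Alice')")
conn.execute("INSERT INTO biometric_links VALUES (1, 'face-001')")
conn.execute("INSERT INTO biometric_links VALUES (1, 'voice-001')")
conn.execute(
    "INSERT INTO video_biometric_occurrences VALUES (?, ?, ?, ?)",
    ("vid-42", "voice-001", "voice",
     json.dumps({"confidence": 0.91, "start": 12.4, "end": 19.0})),
)

def find_occurrences(name, match_type=None):
    """Case-insensitive name -> linked biometric IDs -> occurrences."""
    sql = """
        SELECT o.video_id, o.match_type, o.metadata
        FROM people p
        JOIN biometric_links l ON l.person_id = p.id
        JOIN video_biometric_occurrences o ON o.biometric_id = l.biometric_id
        WHERE p.name = ? COLLATE NOCASE
    """
    args = [name]
    if match_type:
        sql += " AND o.match_type = ?"
        args.append(match_type)
    return conn.execute(sql, args).fetchall()

rows = find_occurrences("alice", "voice")   # matches despite the lowercase query
```

Merging duplicate person records then reduces to repointing `biometric_links.person_id`, after which every existing occurrence resolves to the surviving record.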

Privacy: Why the Architecture Looks This Way

Building biometric features that comply with GDPR, CCPA, and BIPA is not optional. These laws have real enforcement mechanisms and specific requirements around biometric data. Rather than treating compliance as an afterthought, we used it as a design constraint from the beginning.

Embeddings Never Leave Your Machine

Face and voice embeddings are stored in an encrypted SQLite database on your local disk. The encryption is AES-256-GCM with keys stored in your OS keychain (DPAPI on Windows, Keychain on macOS, the Secret Service API on Linux). Each encryption operation generates a fresh 12-byte random initialization vector, stored alongside the ciphertext.
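
The encrypt-with-fresh-IV pattern looks like this in miniature. The envelope layout (IV prepended to ciphertext plus GCM tag) follows the description above; the `cryptography` package is used here as a stand-in for whatever AES-GCM implementation the desktop app actually links, and the function names are illustrative.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(key: bytes, plaintext: bytes) -> bytes:
    """AES-256-GCM with a fresh 12-byte random IV per operation.
    The IV is stored alongside (here: prepended to) the ciphertext."""
    iv = os.urandom(12)
    return iv + AESGCM(key).encrypt(iv, plaintext, None)

def decrypt_record(key: bytes, blob: bytes) -> bytes:
    iv, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(iv, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)   # in practice: loaded from the OS keychain
blob = encrypt_record(key, b"512-d face embedding bytes")
```

Because the IV is random per operation, encrypting the same embedding twice yields different ciphertexts, and GCM's authentication tag means any tampering with the stored blob fails decryption outright.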

Person names are local-only and never synced to our servers. The cloud side never receives embeddings, appearance logs, confidence scores, or the names you have assigned. All biometric matching data stays entirely on your machine.

Consent That Satisfies BIPA

Illinois' Biometric Information Privacy Act is one of the strictest biometric privacy laws in the US. It requires informed consent before collection, a published retention schedule, and a plan for destruction. Our consent system tracks:

  • Whether consent was given, and when (ISO-8601 timestamp)
  • The consent version (currently "1.0"), so we can handle changes to terms
  • Whether the retention period was acknowledged
  • Separate consent states for sharing face data, voice data, and person names

Every consent action gets logged to a local audit table with the user ID, action type, JSON details, and timestamp. Actions include face_consent_given, voice_consent_given, retention_purge, and others. This creates the audit trail that BIPA requires.
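
An append-only audit table of that shape can be sketched as follows. The column set mirrors the description (user ID, action type, JSON details, timestamp); the table and function names are assumptions, not FrameQuery's actual schema.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE consent_audit_log (
    user_id TEXT, action TEXT, details TEXT, created_at TEXT
)""")

def log_consent_action(user_id, action, details):
    """Append one audit entry with an ISO-8601 UTC timestamp."""
    conn.execute(
        "INSERT INTO consent_audit_log VALUES (?, ?, ?, ?)",
        (user_id, action, json.dumps(details),
         datetime.now(timezone.utc).isoformat()),
    )

log_consent_action("user-1", "face_consent_given", {"consent_version": "1.0"})
log_consent_action("user-1", "retention_purge", {"records_purged": 3})
```

The table is only ever inserted into, never updated, which is what makes it usable as evidence of when consent was given and under which version of the terms.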

Before any biometric processing begins, the app presents a setup dialog with explicit checkboxes: "I understand biometric data is processed locally" and "I acknowledge the 1-year retention policy." No boxes are pre-checked.

Retention and Deletion

A retention enforcement daemon runs on app startup and then every 24 hours. It checks for records older than the retention period (365 days from first consent by default) and purges them automatically, logging every purge action.

The right-to-delete implementation goes further than removing database rows. When you delete your biometric data:

  1. Cascading deletes run across all tables: people, biometric links, embeddings, appearances
  2. The face database file gets overwritten with zeros before deletion
  3. The face index file and voice index file are deleted
  4. All face thumbnail images are removed
  5. The encryption key is deleted from the OS keychain

This is not a soft delete. The data is gone, and the encryption keys that could have decrypted it are gone too. There is no "undo" and no server-side backup to worry about.
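
The zero-overwrite step (item 2 above) amounts to the following. This is a minimal sketch of overwrite-before-delete; on SSDs and copy-on-write filesystems, zeroing in place is best-effort, which is one reason deleting the keychain key matters as the final backstop.

```python
import os
import tempfile

def shred_file(path: str) -> None:
    """Overwrite a file with zeros, flush to disk, then unlink it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(b"\x00" * size)
        f.flush()
        os.fsync(f.fileno())   # force the zeros to actually reach disk
    os.remove(path)

# Demo: create a throwaway file and destroy it.
fd, path = tempfile.mkstemp()
os.write(fd, b"sensitive embedding bytes")
os.close(fd)
shred_file(path)
```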

Share Controls

If you use FrameQuery's review feature to share videos that contain identified people, you get granular control over what biometric context travels with the review. The share consent state tracks three independent toggles: include face data, include voice data, include person names. All default to off. Each toggle is a separate consent action in the audit log.
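
The all-off default can be captured in a small state object; the field and class names here are hypothetical, and in the real app each toggle flip would also append a consent action to the audit log.

```python
from dataclasses import dataclass

@dataclass
class ShareConsent:
    """Three independent share toggles; everything defaults to off."""
    include_face_data: bool = False
    include_voice_data: bool = False
    include_person_names: bool = False

    def enabled_fields(self):
        return [name for name, on in vars(self).items() if on]

consent = ShareConsent()          # nothing is shared by default
consent.include_person_names = True
```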

What We Do Not Collect

FrameQuery does not collect video file contents, search queries, file paths, screenshots, location data, keystrokes, clipboard data, or video filenames. Analytics (via PostHog) are opt-in and limited to pseudonymized feature usage events and crash reports. The only third-party services with any data access are Clerk (authentication), PostHog (if you opt in), and Stripe (payments, PCI-DSS compliant).

Trade-offs Worth Acknowledging

Running recognition locally means the models ship with the app. The InsightFace suite is roughly 300 MB, and ECAPA-TDNN adds another 100 MB. These download on first use, not at install time, so you only pay the bandwidth cost if you enable people matching.

Brute-force cosine search will not scale to millions of embeddings. For production teams with genuinely massive libraries, we will need to revisit this with a portable ANN solution. For now, the priority is correctness and cross-platform reliability over raw throughput.

Local-only storage also means you are responsible for backups. If your machine dies and you did not back up the encrypted biometric database, those embeddings are gone. We are considering encrypted backup options, but nothing ships today.

Join the waitlist to try people matching when we open the next batch of invites.