Technology

How We Built People Matching: Face and Voice Recognition With Privacy That Actually Holds Up

FrameQuery can match faces and voices across your entire video library. We built the encryption architecture for GDPR, CCPA, and BIPA compliance from day one.

FrameQuery Team · 23 February 2026 · 5 min read

FrameQuery can now match people across your entire video library by face and voice. Assign a name to someone once, and every video they appear in becomes searchable by who is in it, not just what is in it.

Building this required integrating face and voice recognition models, designing an encrypted local biometric database, and making the whole thing compliant with GDPR, CCPA, and BIPA from the start. The privacy constraints shaped nearly every technical decision, so this post covers both how the matching works and why the architecture looks the way it does.

The Problem

Video search by transcript and visual content gets you most of the way there. But queries like "CEO speaking" or "Alice B-roll" require knowing who is in the footage, not just what objects or scenes are present. Manual tagging does not scale. We needed automatic matching that works across a library of thousands of videos.

This breaks down into two distinct problems: recognising faces across different videos, lighting conditions, and angles, and identifying speakers across different recordings, microphones, and acoustic environments.

Face Matching

Face recognition runs as a two-stage process: detection finds faces in frames and returns bounding boxes, then a recognition model takes each cropped face and produces a compact numerical representation (an embedding) that characterises that face closely enough to match it against other faces.

The processing flow splits between cloud and desktop deliberately:

  1. Cloud-side processing runs detection only, finding where faces are in frames and generating thumbnails
  2. Thumbnails appear in the desktop app, where you assign names and give consent
  3. The desktop downloads the relevant frame regions
  4. Recognition runs locally on your machine, generating the face embeddings
  5. Embeddings are stored encrypted on your local disk

The split matters for privacy. Detection (finding where faces are) happens in the cloud. Recognition (turning a face into a unique mathematical representation that could identify someone) happens exclusively on your machine.

How matching works

When you identify a person, FrameQuery compares their face embedding against all other detected faces in your library using similarity scoring. Results come back ranked by confidence, and the top matches get linked to the person record in your local database.
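As a rough sketch of the matching step (in Python for illustration; the post does not specify FrameQuery's actual metric, so cosine similarity and the 0.5 threshold here are assumptions, and `rank_matches` is a hypothetical helper):

```python
import math

def cosine_similarity(a, b):
    # Many embedding models emit unit-length vectors,
    # but normalise defensively anyway.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_matches(query_embedding, library, threshold=0.5):
    """Return (face_id, score) pairs above the threshold, best first."""
    scored = [
        (face_id, cosine_similarity(query_embedding, emb))
        for face_id, emb in library.items()
    ]
    return sorted(
        [(fid, s) for fid, s in scored if s >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )
```

The ranked, thresholded output maps directly onto the confidence-ordered results described above.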

The entire face index is encrypted at rest using AES-256-GCM, with encryption keys stored in your OS keychain.

Voice Matching

Voice identification uses a speaker recognition model that takes audio and outputs a compact mathematical representation of the speaker's voice. This works alongside the existing transcript data, which already includes speaker diarization from cloud processing.

The pipeline:

  1. For each speaker segment in the transcript, the desktop extracts the longest continuous audio chunk (minimum three seconds)
  2. Audio is preprocessed into a format suitable for the voice recognition model
  3. The model produces a voice embedding
  4. The embedding is stored encrypted alongside face embeddings

All voice processing runs locally on your machine, using a pure Rust audio processing pipeline with no external dependencies.
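The chunk-selection step above can be sketched like this (Python for illustration; the shipping pipeline is Rust, and `longest_chunk_per_speaker` is a hypothetical name, though the three-second minimum comes from the pipeline description):

```python
MIN_CHUNK_SECONDS = 3.0

def longest_chunk_per_speaker(segments):
    """segments: (speaker, start, end) tuples from diarization.

    Returns {speaker: (start, end)} for the longest continuous
    segment per speaker that meets the minimum duration.
    """
    best = {}
    for speaker, start, end in segments:
        duration = end - start
        if duration < MIN_CHUNK_SECONDS:
            continue  # too short to yield a reliable voice embedding
        current = best.get(speaker)
        if current is None or duration > current[1] - current[0]:
            best[speaker] = (start, end)
    return best
```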

When you link a voice embedding to a person who also has face embeddings, you can find them by either modality: face in frame, or voice in the audio track.

What You Can Actually Search For

Once people are set up, the search capabilities are concrete:

  • "Alice" resolves the name to all linked biometric data (face and voice), then queries across your library
  • "Alice speaking" narrows to voice matches with timestamp ranges
  • People panel on a video shows all matched people with their appearance types and confidence scores

You can also merge duplicate person records if you realise two entries are the same individual, and all biometric links transfer to the merged record.
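Conceptually, the merge is a re-pointing of biometric links followed by removal of the duplicate record, something like this (Python sketch; `merge_people` and the data shapes are illustrative, not FrameQuery's actual schema):

```python
def merge_people(people, links, keep_id, dup_id):
    """Merge dup_id into keep_id: biometric links transfer to the
    surviving record, then the duplicate is removed.

    people: {person_id: name}
    links:  list of {"person_id": ..., "embedding_id": ...}
    """
    for link in links:
        if link["person_id"] == dup_id:
            link["person_id"] = keep_id
    people.pop(dup_id, None)
    return people, links
```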

Privacy: Why the Architecture Looks This Way

Building biometric features that comply with GDPR, CCPA, and BIPA is not optional. These laws have real enforcement mechanisms and specific requirements around biometric data. Rather than treating compliance as an afterthought, we used it as a design constraint from the beginning.

Embeddings Never Leave Your Machine

Face and voice embeddings are stored in an encrypted database on your local disk. The encryption uses AES-256-GCM with keys stored in your OS keychain (DPAPI on Windows, Keychain on macOS, secretservice on Linux). Each encryption operation generates a fresh random initialisation vector.
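The fresh-IV requirement is the important detail here. A minimal sketch of the pattern, using Python's `cryptography` package for illustration (the real implementation is Rust, and the key would come from the OS keychain rather than being handled directly):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(key: bytes, plaintext: bytes) -> bytes:
    # Fresh random 12-byte nonce per operation, prepended to the
    # ciphertext so decryption can recover it. Reusing a nonce under
    # the same key breaks GCM's guarantees, which is why "fresh" matters.
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_record(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)
```

GCM also authenticates the ciphertext, so a tampered record fails to decrypt rather than silently yielding garbage.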

Person names are local-only and never synced to our servers. The cloud side never receives embeddings, appearance logs, confidence scores, or the names you have assigned.

Consent That Satisfies BIPA

Illinois' Biometric Information Privacy Act is one of the strictest biometric privacy laws in the US. It requires informed consent before collection, a published retention schedule, and a plan for destruction. Our consent system tracks:

  • Whether consent was given, and when (ISO-8601 timestamp)
  • The consent version, so we can handle changes to terms
  • Whether the retention period was acknowledged

Every consent action gets logged to a local audit table. This creates the audit trail that BIPA requires.
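A consent entry carrying the three tracked fields might look like this (Python sketch; `record_consent` and the field names are illustrative, not our actual schema):

```python
from datetime import datetime, timezone

def record_consent(audit_log, user_id, version, retention_acknowledged):
    """Append an append-only consent entry to the local audit table."""
    entry = {
        "user_id": user_id,
        "granted": True,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601
        "consent_version": version,
        "retention_acknowledged": retention_acknowledged,
    }
    audit_log.append(entry)
    return entry
```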

Before any biometric processing begins, the app presents a setup dialogue with explicit checkboxes. No boxes are pre-checked.

Retention and Deletion

A retention enforcement process runs regularly and checks for records older than the retention period (365 days from first consent by default), purging them automatically and logging every action.
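The enforcement pass is conceptually simple (Python sketch; `purge_expired` and the in-memory structures are illustrative, though the 365-day default is the real one):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)

def purge_expired(records, audit_log, now=None):
    """Drop biometric records older than the retention period,
    logging each purge. records: {record_id: first-consent datetime}."""
    now = now or datetime.now(timezone.utc)
    kept = {}
    for record_id, consented_at in records.items():
        if now - consented_at > RETENTION:
            audit_log.append({"action": "purged", "record_id": record_id})
        else:
            kept[record_id] = consented_at
    return kept
```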

The right-to-delete implementation goes beyond removing database rows. When you delete your biometric data:

  1. Cascading deletes run across all related tables
  2. Biometric data files are securely overwritten before deletion
  3. Index files are removed
  4. Thumbnail images are deleted
  5. Encryption keys are deleted from the OS keychain

This is not a soft delete. The data is gone, and the encryption keys that could have decrypted it are gone too.

Shared Indexes and Biometric Data

Biometric data is never included in shared or exported indexes. When you publish or share an index, all face embeddings, voice embeddings, and person labels are automatically stripped. Only non-biometric metadata (descriptions, transcripts, tags, and timestamps) travels with the shared index.
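The stripping step amounts to an allow-nothing filter on the biometric fields (Python sketch; the key names here mirror the categories above but are illustrative, not the real export schema):

```python
BIOMETRIC_KEYS = {"face_embeddings", "voice_embeddings", "person_labels"}

def strip_biometrics(index_entry):
    """Return a copy of an index entry with biometric fields removed,
    leaving only non-biometric metadata to travel with the share."""
    return {k: v for k, v in index_entry.items() if k not in BIOMETRIC_KEYS}
```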

If a recipient has access to the underlying video files, they can run their own face and voice recognition locally on their own machine and assign their own labels. That processing is entirely independent of your biometric data: it generates fresh embeddings on their device, governed by their own consent.

What We Do Not Collect

FrameQuery does not collect video file contents, search queries, file paths, screenshots, location data, keystrokes, clipboard data, or video filenames. Analytics are opt-in and limited to pseudonymised feature usage events and crash reports.

Trade-offs Worth Acknowledging

Running recognition locally means the models live on your machine. They download on first use, not at install time, so you only pay the bandwidth cost if you enable people matching.

Local-only storage also means you are responsible for backups. If your machine dies and you did not back up the encrypted biometric database, those embeddings are gone. We include encrypted backup and restore to help with this.

Join the waitlist to try people matching when we open the next batch of invites.