Technology
Zero-Copy GPU Decoding on Every Platform
CUDA pinned buffers on Windows, Metal unified memory on macOS, Vulkan compute shaders for ProRes RAW. How FrameQuery decodes professional video formats without unnecessary memory copies.
A single 4K RGB16 frame is roughly 50 megabytes. When you are decoding thousands of frames from professional cinema footage, the cost of copying that data between CPU and GPU memory adds up fast. On a typical PCIe bus, each unnecessary copy burns time that could be spent decoding the next frame.
We built FrameQuery's decode pipeline to minimise (and on Apple Silicon, eliminate) those copies. Each platform gets a different strategy because each platform has a different memory architecture.
The memory problem
The naive approach to GPU-accelerated decoding looks like this: allocate a regular CPU buffer, decode a frame into it, allocate a GPU buffer, copy the data over, run the GPU kernel, copy the result back. That is two full-frame copies per frame, each hitting the PCIe bus.
For a 4K RGB16 frame at 50MB, each copy takes roughly 1.5ms on PCIe 4.0, so the pair costs about 3ms per frame. At 24fps that is over 70ms per second spent just moving memory around. At 8K the numbers quadruple. The decode pipeline stalls waiting for transfers instead of doing useful work.
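The arithmetic is easy to check. A quick sketch, assuming ~32 GB/s of effective PCIe 4.0 x16 bandwidth (an assumed round number, not a measurement):

```rust
fn main() {
    // 4K RGB16: width * height * 3 channels * 2 bytes per channel
    let frame_bytes: f64 = 3840.0 * 2160.0 * 3.0 * 2.0;
    // Assumed effective PCIe 4.0 x16 bandwidth.
    let pcie_bytes_per_sec: f64 = 32.0e9;
    let copy_ms = frame_bytes / pcie_bytes_per_sec * 1e3;
    println!("frame size: {:.1} MB", frame_bytes / 1e6); // ~49.8 MB
    println!("one copy:   {:.2} ms", copy_ms);           // ~1.56 ms
}
```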
The fix is different on every platform. CUDA has pinned memory. Metal has shared memory. Vulkan has compute shaders that can keep everything on-device. We use all three.
CUDA pinned buffers on Windows
On machines with NVIDIA GPUs, we use CUDA's page-locked (pinned) memory to accelerate host-to-device transfers. Normal malloc memory can be paged out by the OS at any time, which forces the CUDA driver to stage it through an internal pinned buffer before transferring to the GPU. Pinning the memory ourselves skips that intermediate copy.
We wrote a PinnedBuffer<T> RAII wrapper around cudaMallocHost that releases the memory with cudaFreeHost on drop:

use std::ptr;

// Minimal FFI bindings to the CUDA runtime (link against cudart).
extern "C" {
    fn cudaMallocHost(ptr: *mut *mut std::ffi::c_void, size: usize) -> i32;
    fn cudaFreeHost(ptr: *mut std::ffi::c_void) -> i32;
}

struct PinnedBuffer<T> {
    ptr: *mut T,
    capacity: usize,
    len: usize,
}

impl<T> PinnedBuffer<T> {
    fn new(capacity: usize) -> Result<Self, String> {
        let size_bytes = capacity * std::mem::size_of::<T>();
        let mut ptr: *mut std::ffi::c_void = ptr::null_mut();
        unsafe {
            let err = cudaMallocHost(&mut ptr, size_bytes);
            if err != 0 {
                return Err(format!("cudaMallocHost failed: {}", err));
            }
        }
        Ok(Self { ptr: ptr as *mut T, capacity, len: capacity })
    }
}

impl<T> Drop for PinnedBuffer<T> {
    fn drop(&mut self) {
        // Page-locked memory must be released back through the CUDA runtime.
        unsafe { cudaFreeHost(self.ptr as *mut std::ffi::c_void); }
    }
}

unsafe impl<T> Send for PinnedBuffer<T> {}
unsafe impl<T> Sync for PinnedBuffer<T> {}
We mark PinnedBuffer as Send + Sync because CUDA pinned memory is accessible from any CPU thread. This lets us share buffers across the decode pipeline without locking.
The pipeline uses two buffer slots (double buffering). While one buffer is being filled by the CPU decoder, the other is being transferred to the GPU for processing. The stages look like this:
1. CPU decodes raw sensor data into pinned buffer A
2. Async H2D copy transfers buffer A to GPU memory
3. GPU processing kernel runs on the transferred data
4. D2H copy pulls the result back to pinned buffer B
There is a subtle optimisation in step 4. A cudaMemcpy with cudaMemcpyDeviceToHost implicitly synchronises with the GPU. We do not need an explicit cudaDeviceSynchronize during normal operation. The only place we call cudaDeviceSynchronize is at cleanup, to make sure everything has finished before freeing resources. This keeps the pipeline moving without unnecessary sync points.
We also use resolution-adaptive decoding. For proxy generation and search indexing, there is no reason to decode at full resolution. The vendor SDKs expose multiple decode resolutions (full, half, quarter, eighth, sixteenth). Our pipeline picks the smallest resolution that still provides enough pixels for the final output, then lets FFmpeg scale to the exact target size. Decoding at a quarter of the native resolution is dramatically faster and reduces bandwidth through every downstream stage.
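The selection logic is simple: walk the available fractions from coarsest to finest and take the first whose output still covers the target. A sketch (the function name and shape are ours, not the vendor SDK's):

```rust
/// Decode-resolution divisors typically exposed by RAW SDKs: 1/16 .. full.
const FRACTIONS: [u32; 5] = [16, 8, 4, 2, 1];

/// Pick the coarsest decode divisor whose output still covers the
/// requested target width; FFmpeg scales the rest of the way.
fn pick_decode_fraction(native_width: u32, target_width: u32) -> u32 {
    for &f in &FRACTIONS {
        if native_width / f >= target_width {
            return f;
        }
    }
    1 // fall back to full resolution
}

fn main() {
    assert_eq!(pick_decode_fraction(8192, 1920), 4); // 8K source, 1080p proxy
    assert_eq!(pick_decode_fraction(3840, 3840), 1); // full-res export
    println!("ok");
}
```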
Metal zero-copy on Apple Silicon
Apple Silicon has a unified memory architecture. The CPU and GPU share the same physical memory. Metal's storageModeShared lets us allocate a buffer that both processors can access directly, with no copy at all.
We wrap these shared buffers in an RAII struct:
struct MetalSharedBuffer {
    buffer: *mut std::ffi::c_void, // id<MTLBuffer> retained
    contents: *mut u8,             // .contents pointer (CPU-accessible)
    size: usize,
}
The buffer field holds a retained MTLBuffer for passing to GPU kernels. The contents field is the CPU-accessible pointer from .contents. Because the memory is shared, these point to the same physical pages. The CPU writes raw sensor data directly into the buffer, the GPU reads it for processing, the GPU writes output, and the CPU reads it back. Zero copies. Zero PCIe transfers. The data never moves.
This is a genuine architectural advantage of Apple Silicon for video workloads. On discrete GPU systems, even with pinned memory, you are still limited by PCIe bandwidth. On unified memory, the bottleneck shifts entirely to compute throughput.
The same RAII pattern applies here. When the buffer wrapper is dropped, the underlying MTLBuffer is released via the vendor SDK's deallocation function, which handles the Objective-C runtime calls behind a C interface that Rust can link against.
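The pattern has the same shape as PinnedBuffer: the wrapper owns an opaque handle plus a release function, and Drop guarantees the release runs exactly once. A runnable sketch with a stand-in release callback (the real code calls the vendor SDK's deallocator instead):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Owns an opaque GPU buffer handle; `release` stands in for the
/// vendor SDK call that releases the retained MTLBuffer.
struct GpuBufferHandle {
    raw: usize,
    release: fn(usize),
}

impl Drop for GpuBufferHandle {
    fn drop(&mut self) {
        (self.release)(self.raw);
    }
}

static RELEASED: AtomicUsize = AtomicUsize::new(0);

fn fake_release(_raw: usize) {
    RELEASED.fetch_add(1, Ordering::SeqCst);
}

fn main() {
    {
        let _buf = GpuBufferHandle { raw: 0xdead, release: fake_release };
        // buffer is used here...
    } // drop runs, the release function is called exactly once
    assert_eq!(RELEASED.load(Ordering::SeqCst), 1);
    println!("released once");
}
```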
GPU initialisation and fallback
At startup, FrameQuery probes for available GPU backends using compile-time platform detection:
#[cfg(not(target_os = "macos"))]
let gpu_backend = GpuBackend::Cuda;
#[cfg(target_os = "macos")]
let gpu_backend = GpuBackend::Metal;
The vendor SDKs accept a flag indicating which GPU backend to initialise. If the GPU is not available (no NVIDIA GPU on a Windows machine, or an older Mac without Metal support), the decoder falls back to CPU silently. The caller does not need to handle this. The same API produces the same frames regardless of the decode path, just at different speeds.
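A minimal sketch of that compile-time choice plus the silent CPU fallback (the boolean probe argument stands in for the real SDK capability check):

```rust
#[derive(Debug, PartialEq)]
enum GpuBackend {
    Cuda,
    Metal,
    Cpu,
}

/// Compile-time platform preference, demoted to CPU when the probe fails.
/// Callers never see the fallback; they just get a backend.
fn select_backend(gpu_available: bool) -> GpuBackend {
    #[cfg(target_os = "macos")]
    let preferred = GpuBackend::Metal;
    #[cfg(not(target_os = "macos"))]
    let preferred = GpuBackend::Cuda;

    if gpu_available { preferred } else { GpuBackend::Cpu }
}

fn main() {
    assert_eq!(select_backend(false), GpuBackend::Cpu);
    println!("{:?}", select_backend(true));
}
```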
ProRes RAW via Vulkan compute
ProRes RAW is a different problem. Apple designed it, and the only first-party decode path is through AVFoundation on macOS. That is fine on a Mac, but it leaves Windows without a native option.
We built a cross-platform ProRes RAW pipeline using FFmpeg 8.1 and Vulkan compute shaders. Vulkan runs on every major platform, so this gives us a single code path that works everywhere.
The pipeline uses a three-stage cascade with automatic fallback:
- Full GPU: -hwaccel vulkan with scale_vulkan and h264_vulkan. Everything stays on the GPU. Zero CPU transfer.
- Hybrid: -hwaccel vulkan with hwdownload and libx264. The GPU handles decode, the CPU handles encode. This kicks in when the full GPU path hits a driver limitation.
- Software: CPU decode with libx264. The last resort when Vulkan is not available at all.
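The cascade itself is just ordered attempts with fallthrough. A sketch using closures in place of the real FFmpeg invocations (the simulated failure is illustrative):

```rust
type DecodePath = (&'static str, fn() -> Result<(), String>);

/// Try each decode path in order of preference; first success wins.
fn run_cascade(paths: &[DecodePath]) -> Result<&'static str, String> {
    let mut last_err = String::from("no decode paths configured");
    for &(name, attempt) in paths {
        match attempt() {
            Ok(()) => return Ok(name),
            Err(e) => last_err = format!("{}: {}", name, e),
        }
    }
    Err(last_err)
}

// Stand-ins for spawning FFmpeg with each pipeline configuration.
fn full_gpu() -> Result<(), String> {
    Err("driver rejected the Vulkan video session".into()) // simulated
}
fn hybrid() -> Result<(), String> {
    Ok(()) // simulated: hwdownload + libx264 works
}
fn software() -> Result<(), String> {
    Ok(())
}

fn main() {
    let paths: &[DecodePath] =
        &[("full-gpu", full_gpu), ("hybrid", hybrid), ("software", software)];
    assert_eq!(run_cascade(paths), Ok("hybrid"));
    println!("selected: hybrid");
}
```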
The Vulkan compute shaders handle Bayer demosaicing directly. This matters because FFmpeg does not have a software demosaic path for ProRes RAW. Without Vulkan, you either use AVFoundation on macOS or you cannot decode the format at all.
ProRes RAW also requires HDR tone mapping to produce correct colours. The raw Bayer output has a wider dynamic range than SDR displays can show, and the RGGB Bayer pattern has a 2:1 ratio of green pixels to red and blue. Without proper tone mapping, the output has a visible green tint. Our tone mapping chain handles this:
zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=hable:desat=0,zscale=t=bt709:m=bt709:r=tv
This linearises the input, applies Hable tone mapping with no desaturation, then converts back to BT.709 and outputs in TV range. The intermediate gbrpf32le format gives us 32-bit float precision for the tone mapping maths, which prevents banding in gradients.
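The Hable operator FFmpeg applies here is John Hable's filmic curve from Uncharted 2. The maths is small; a sketch with the standard published constants:

```rust
/// John Hable's filmic tone-mapping curve, standard published constants.
fn hable_partial(x: f64) -> f64 {
    const A: f64 = 0.15; // shoulder strength
    const B: f64 = 0.50; // linear strength
    const C: f64 = 0.10; // linear angle
    const D: f64 = 0.20; // toe strength
    const E: f64 = 0.02; // toe numerator
    const F: f64 = 0.30; // toe denominator
    ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F
}

/// Normalise so the chosen white point maps to exactly 1.0.
fn hable(x: f64, white: f64) -> f64 {
    hable_partial(x) / hable_partial(white)
}

fn main() {
    let white = 11.2; // linear-light white point from Hable's presentation
    assert!(hable(0.0, white).abs() < 1e-9);            // black stays black
    assert!((hable(white, white) - 1.0).abs() < 1e-9);  // white maps to 1.0
    // Highlights compress: doubling the input far less than doubles the output.
    assert!(hable(2.0, white) < 2.0 * hable(1.0, white));
    println!("hable(1.0) = {:.3}", hable(1.0, white));
}
```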
FFmpeg 8.1 was a meaningful upgrade here. Earlier versions had format negotiation issues between hwdownload and Vulkan surfaces that caused the hybrid pipeline to fail on certain drivers. The fix landed upstream and saved us from maintaining a patched FFmpeg fork.
The same patterns, different SDKs
The BRAW decoder follows the same architecture. It uses the same PinnedBuffer type for CUDA pinned memory on Windows, and the same Metal shared buffers on macOS.
The BRAW pipeline adds frame-level parallelism through mpsc channels with a capacity of 4. Three pooled buffers rotate through the pipeline: one is being decoded, one is sitting in the channel waiting to be consumed, and one is being written to the output encoder. This keeps all three stages (decode, transfer, encode) busy simultaneously without allocating new buffers per frame.
Trade-offs
Pinned memory is not free. cudaMallocHost allocates page-locked memory that the OS cannot swap out. If you pin too much, you reduce the memory available to everything else on the system. We keep the pinned allocation to exactly what the double-buffer pipeline needs: two frame-sized buffers per active decoder.
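The budget is easy to reason about: two frame-sized buffers per active decoder. A quick check for 4K RGB16 (the four-decoder figure is purely illustrative):

```rust
fn main() {
    let frame_bytes = 3840usize * 2160 * 3 * 2; // 4K RGB16, ~49.8 MB
    let per_decoder = 2 * frame_bytes;          // double buffering
    let four_decoders = 4 * per_decoder;        // hypothetical parallel load
    println!("per decoder: {:.1} MB", per_decoder as f64 / 1e6);
    println!("4 decoders:  {:.1} MB", four_decoders as f64 / 1e6);
}
```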
The Vulkan ProRes RAW path adds FFmpeg as a dependency, which increases the binary size and introduces version sensitivity. We pin to FFmpeg 8.1 and ship it bundled rather than relying on system installs.
Metal shared memory on Apple Silicon is genuinely zero-copy, but it means the CPU and GPU contend for the same memory bus bandwidth. For our workload this is still a net win because the transfer cost we eliminate is larger than the contention we introduce. For workloads that are purely GPU-bound with minimal CPU interaction, dedicated GPU memory would be faster.
What you get
As a user, you do not see any of this. You drop R3D, BRAW, or ProRes RAW files into FrameQuery and they decode as fast as your hardware allows. The app picks the fastest path automatically. No settings to configure, no GPU toolkit to install.
The result is that proxy generation and search indexing on professional cinema footage runs at a speed that makes it practical for real libraries. Thousands of clips from a multi-day shoot can be indexed without dedicating an entire weekend to the task.
Join the waitlist to try FrameQuery with your own camera originals.