CUDA, Metal, and CPU: Cross-Platform GPU Acceleration in a Desktop Video App
How we built a single Rust codebase that uses NVIDIA CUDA on Windows and Linux, Apple Metal on macOS, and gracefully falls back to CPU when no GPU is available.
FrameQuery runs on Windows, macOS, and Linux. On each platform, it needs to decode professional video formats as fast as possible. That means GPU acceleration. But every platform has a different GPU API: NVIDIA CUDA on Windows and Linux, Apple Metal on macOS. And some machines have no suitable GPU at all.
We handle all three cases from a single Rust codebase.
The problem
Professional cinema RAW formats (RED R3D, Blackmagic BRAW) store raw sensor data that must be debayered before it becomes a viewable image. Debayering is computationally expensive. At full 8K resolution, a single frame can take hundreds of milliseconds on CPU. Multiply that by thousands of frames and proxy generation becomes painfully slow.
GPUs are perfect for this. Debayering is a per-pixel operation that parallelises beautifully. A CUDA or Metal kernel can process a full 8K frame in a fraction of the time.
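To make the per-pixel nature concrete, here is a minimal nearest-neighbour debayer for an RGGB Bayer pattern. This is illustrative only: the real SDK kernels use far more sophisticated interpolation, and the function name is our own.

```rust
// Minimal nearest-neighbour debayer for an RGGB Bayer pattern.
// Each output pixel depends only on a small, fixed neighbourhood of the
// input, which is why the operation maps so well onto a GPU kernel.
// Illustrative only: real SDK kernels use much better interpolation.
fn debayer_rggb(raw: &[u16], width: usize, height: usize) -> Vec<[u16; 3]> {
    let mut rgb = vec![[0u16; 3]; width * height];
    for y in 0..height {
        for x in 0..width {
            // Snap to the top-left of the 2x2 RGGB cell this pixel belongs to.
            let cx = x & !1;
            let cy = y & !1;
            let at = |px: usize, py: usize| raw[py * width + px];
            rgb[y * width + x] = [
                at(cx, cy),         // R
                at(cx + 1, cy),     // G (one of the two greens)
                at(cx + 1, cy + 1), // B
            ];
        }
    }
    rgb
}

fn main() {
    // A single 2x2 RGGB cell: R=100, G=200, G=150, B=50.
    let rgb = debayer_rggb(&[100u16, 200, 150, 50], 2, 2);
    // Every pixel in the cell resolves to the same nearest-neighbour colour.
    assert_eq!(rgb[0], [100, 200, 50]);
}
```

Because each pixel is computed independently, the same loop body becomes a GPU kernel launched once per pixel, with no synchronisation between threads.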
But supporting multiple GPU backends in a single application requires careful architecture.
Compile-time platform gating
Rust's cfg attributes let us include platform-specific code at compile time. Our CUDA code only compiles on Windows and Linux. Our Metal code only compiles on macOS.
use std::ffi::c_void;

#[cfg(any(target_os = "windows", target_os = "linux"))]
extern "C" {
    // Page-locked host allocation: the source of the pinned buffers below.
    fn cudaMallocHost(ptr: *mut *mut c_void, size: usize) -> i32;
    fn cudaFreeHost(ptr: *mut c_void) -> i32;
    // `kind` selects the transfer direction (host-to-device, etc.).
    fn cudaMemcpy(dst: *mut c_void, src: *const c_void, count: usize, kind: i32) -> i32;
}
#[cfg(target_os = "macos")]
mod metal_bridge; // (illustrative name) Metal shared buffer APIs via Objective-C++ bridge
The build system (build.rs) detects whether CUDA is available by checking environment variables and known install paths. If CUDA is not found, the CUDA code paths compile out entirely. The application still works, just without GPU acceleration.
Runtime pipeline selection
Even within a platform, GPU availability is not guaranteed. A Windows machine might not have an NVIDIA GPU. A Mac might have integrated graphics that does not support the Metal features we need.
So at startup, we probe the GPU before committing to a pipeline.
For BRAW, the SDK has built-in pipeline negotiation. We ask "is CUDA supported?" and "is Metal supported?" before setting the pipeline. If both fail, we fall back to CPU.
For R3D, we pass optional component flags during SDK initialization. The SDK tries to load the GPU components and silently falls back if they are not available. We log which pipeline was selected so debugging is straightforward.
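The selection logic boils down to a small, ordered probe. This is a sketch, not the real SDK API: the probe closures stand in for the "is CUDA supported?" / "is Metal supported?" negotiation described above.

```rust
// Decode pipeline chosen once at startup. The probe closures stand in
// for the SDK negotiation described above; they are illustrative, not
// the real SDK calls.
#[derive(Debug, PartialEq)]
enum Pipeline {
    Cuda,
    Metal,
    Cpu,
}

fn select_pipeline(cuda_ok: impl Fn() -> bool, metal_ok: impl Fn() -> bool) -> Pipeline {
    if cfg!(any(target_os = "windows", target_os = "linux")) && cuda_ok() {
        Pipeline::Cuda
    } else if cfg!(target_os = "macos") && metal_ok() {
        Pipeline::Metal
    } else {
        // Always available: both SDKs support CPU decode natively.
        Pipeline::Cpu
    }
}

fn main() {
    // With both probes failing, we always land on the CPU fallback.
    let p = select_pipeline(|| false, || false);
    println!("selected pipeline: {:?}", p);
    assert_eq!(p, Pipeline::Cpu);
}
```

The result is logged at startup, so a bug report that says "decode is slow" can be matched against which pipeline was actually selected.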
Double-buffered decode pipeline (CUDA)
On Windows and Linux with NVIDIA GPUs, we use a double-buffered pipeline to overlap CPU and GPU work:
- Allocate two sets of buffers: pinned host memory (for CPU-side decode output) and device memory (for GPU-side debayering)
- Decode frame N into buffer slot A using the CPU
- Copy slot A to GPU memory (async DMA transfer using pinned memory)
- Launch GPU debayering kernel on slot A
- While the GPU processes slot A, decode frame N+1 into buffer slot B on the CPU
- When the GPU finishes slot A, copy the result back to host memory and deliver it
- Repeat, alternating slots
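The alternation above can be sketched in plain Rust. To keep the sketch runnable anywhere, `decode_cpu` and `debayer_gpu` are ordinary closures standing in for the SDK decode call and the CUDA kernel launch, and a spawned thread stands in for the asynchronous GPU work:

```rust
use std::thread;

// Sketch of the ping-pong pipeline: while the "GPU" (a spawned thread
// here) processes one frame, the CPU decodes the next. In the real
// pipeline the two slots are pinned host buffers paired with device
// buffers, and the GPU work is an async CUDA kernel launch.
fn run_pipeline(frames: usize) -> Vec<String> {
    let decode_cpu = |n: usize| format!("frame{n}"); // stand-in for SDK decode
    let debayer_gpu = |data: String| format!("{data}:debayered"); // stand-in for kernel

    let mut results = Vec::new();
    let mut in_flight: Option<thread::JoinHandle<String>> = None;
    for n in 0..frames {
        // CPU decodes frame n while the previous frame's GPU work
        // (spawned at the end of the last iteration) is still running.
        let decoded = decode_cpu(n);
        // Collect the previous frame's finished result, preserving order.
        if let Some(h) = in_flight.take() {
            results.push(h.join().unwrap());
        }
        in_flight = Some(thread::spawn(move || debayer_gpu(decoded)));
    }
    if let Some(h) = in_flight.take() {
        results.push(h.join().unwrap());
    }
    results
}

fn main() {
    let out = run_pipeline(3);
    assert_eq!(out, ["frame0:debayered", "frame1:debayered", "frame2:debayered"]);
}
```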
The key insight is pinned memory. Normal heap allocations can be paged out by the OS. CUDA pinned memory (cudaMallocHost) is page-locked, which means DMA transfers between CPU and GPU happen at full PCIe bandwidth without the OS moving pages around. We wrote RAII wrappers around these allocations so they get freed properly even if an error occurs mid-pipeline.
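The RAII wrapper pattern looks roughly like this. To stay runnable without a CUDA toolchain, this sketch allocates through `std::alloc`; the comments mark where the real code calls `cudaMallocHost` and `cudaFreeHost` instead.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// RAII wrapper sketch for a pinned host buffer. Allocation goes through
// std::alloc so the example runs anywhere; in the real pipeline the
// alloc/free calls below are cudaMallocHost/cudaFreeHost, and a CUDA
// error status becomes an Err instead of a panic.
struct PinnedBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl PinnedBuffer {
    fn new(size: usize) -> Self {
        let layout = Layout::from_size_align(size, 4096).unwrap();
        // Real code: cudaMallocHost(&mut ptr, size), checking the status code.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        PinnedBuffer { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for PinnedBuffer {
    fn drop(&mut self) {
        // Real code: cudaFreeHost(self.ptr). Drop runs on early returns
        // and panics too, so the buffer is never leaked mid-pipeline.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}

fn main() {
    let mut buf = PinnedBuffer::new(4096);
    buf.as_mut_slice()[0] = 42;
    assert_eq!(buf.as_mut_slice()[0], 42);
    // `buf` is freed here when it goes out of scope.
}
```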
Shared memory model (Metal)
macOS takes a different approach. Metal supports shared memory buffers that are accessible to both CPU and GPU without explicit copies. The R3D SDK writes decoded data directly into a Metal-accessible buffer, and the GPU reads from the same physical memory.
This eliminates the copy step entirely. The CPU decodes into a shared buffer, the GPU debayers from that same buffer into an output buffer (also shared), and the CPU reads the result directly. The pipeline is simpler than CUDA because there is no explicit memory transfer stage.
The C++ wrapper uses Objective-C++ (the -x objective-c++ compiler flag) to bridge between our C API and Metal's Objective-C interfaces.
Graceful CPU fallback
When no GPU is available, or when GPU decode is not appropriate (like generating a single thumbnail), we fall back to CPU decode. Both SDKs support this natively. The output is the same: decoded RGB pixel data ready for further processing.
CPU decode is slower but completely reliable. For thumbnail generation in BRAW, we actually prefer the CPU pipeline because it handles single-frame decode more reliably than the GPU path, which is optimised for streaming many frames in sequence.
Performance telemetry
Both decoders include per-operation timing statistics to track where time goes:
BRAW decode stats: decoded 150 frames
pool wait avg 0.15ms, decode avg 8.32ms, send wait avg 0.08ms
This tells us immediately whether the bottleneck is buffer allocation (pool wait), the SDK itself (decode), or the downstream consumer (send wait). During development, these logs helped us find a case where the frame buffer pool was too small and the decoder was stalling waiting for a free buffer.
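A per-stage accumulator for this kind of log line is small. This is a minimal sketch in the spirit of the stats above (pool wait / decode / send wait); the type and field names are illustrative, not the actual telemetry code.

```rust
use std::time::{Duration, Instant};

// Minimal per-operation timing accumulator: one instance per pipeline
// stage (pool wait, decode, send wait). Names are illustrative.
#[derive(Default)]
struct StageStats {
    total: Duration,
    count: u64,
}

impl StageStats {
    fn record(&mut self, d: Duration) {
        self.total += d;
        self.count += 1;
    }

    fn avg_ms(&self) -> f64 {
        if self.count == 0 {
            0.0
        } else {
            self.total.as_secs_f64() * 1000.0 / self.count as f64
        }
    }
}

fn main() {
    let mut decode = StageStats::default();
    for _ in 0..3 {
        let t = Instant::now();
        // ... decode work would happen here ...
        decode.record(t.elapsed());
    }
    println!("decode avg {:.2}ms over {} frames", decode.avg_ms(), decode.count);
}
```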
The build system glue
All of this requires careful build system integration. Our build.rs script:
- Detects CUDA by checking CUDA_PATH, CUDA_HOME, and known installation directories
- Runs bindgen to generate Rust FFI bindings from C/C++ headers
- Compiles C++ wrapper code with platform-appropriate flags
- Links against platform-specific libraries (cudart on Windows/Linux, Metal.framework on macOS, COM libraries on Windows)
- On macOS, passes -x objective-c++ and -std=c++11 to the C++ compiler
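The detection step is a handful of lines. This is a sketch of the shape of that build.rs logic, assuming the environment variables listed above; the fallback directories and the cuda_enabled cfg name are illustrative examples, not our exact configuration.

```rust
use std::env;
use std::path::PathBuf;

// Sketch of CUDA detection in build.rs. The environment variables match
// the ones listed above; the fallback directories and cfg name are
// illustrative examples of "known installation paths", not exhaustive.
fn find_cuda() -> Option<PathBuf> {
    for var in ["CUDA_PATH", "CUDA_HOME"] {
        if let Ok(dir) = env::var(var) {
            let path = PathBuf::from(dir);
            if path.exists() {
                return Some(path);
            }
        }
    }
    // Well-known default install locations (examples only).
    ["/usr/local/cuda", "/opt/cuda"]
        .iter()
        .map(PathBuf::from)
        .find(|p| p.exists())
}

fn main() {
    match find_cuda() {
        Some(path) => {
            // Tell Cargo where to find cudart and gate the CUDA code paths.
            println!("cargo:rustc-link-search=native={}", path.join("lib64").display());
            println!("cargo:rustc-link-lib=cudart");
            println!("cargo:rustc-cfg=cuda_enabled");
        }
        // No CUDA found: the CUDA paths compile out, CPU decode remains.
        None => println!("cargo:warning=CUDA not found, building CPU-only"),
    }
}
```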
CI runs Rust tests on both macOS and Windows with CUDA_ENABLED=0 to verify that the CPU fallback path works correctly without a GPU present.
The result
A single cargo build produces a binary that automatically uses the best available GPU backend. NVIDIA users get CUDA. Mac users get Metal. Everyone else gets CPU decode that still works. No runtime configuration needed, no user-facing GPU settings to manage.
This is one of the advantages of building a desktop app in Rust rather than wrapping everything in a web service. You can reach as deep into the hardware as you need while keeping a single, maintainable codebase.
Join the waitlist to try FrameQuery on your own hardware.