CUDA, Metal, and CPU: Cross-Platform GPU Acceleration in a Desktop Video App
How we built a single Rust codebase that uses NVIDIA CUDA on Windows and Linux, Apple Metal on macOS, and gracefully falls back to CPU when no GPU is available.
FrameQuery runs on Windows, macOS, and Linux. On each platform, it needs to decode professional video formats as fast as possible. That means GPU acceleration. But every platform has a different GPU API: NVIDIA CUDA on Windows and Linux, Apple Metal on macOS. And some machines have no suitable GPU at all.
We handle all three cases from a single Rust codebase.
The problem
Professional cinema RAW formats (RED R3D, Blackmagic BRAW) store raw sensor data that must be debayered before it becomes a viewable image. Debayering is computationally expensive. At full 8K resolution, a single frame can take hundreds of milliseconds on CPU. Multiply that by thousands of frames and proxy generation becomes painfully slow.
GPUs are perfect for this. Debayering is a per-pixel operation that parallelises beautifully. A CUDA or Metal kernel can process a full 8K frame in a fraction of the time.
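To make the per-pixel nature concrete, here is a minimal nearest-neighbour debayer for an RGGB Bayer pattern. This is illustrative only: the real SDK kernels use far more sophisticated interpolation, and the function name is our own.

```rust
// Minimal nearest-neighbour debayer for an RGGB Bayer pattern.
// Each output pixel depends only on a small, fixed neighbourhood of the
// input, which is why the operation maps so well onto a GPU kernel.
// Illustrative only: real SDK kernels use much better interpolation.
fn debayer_rggb(raw: &[u16], width: usize, height: usize) -> Vec<[u16; 3]> {
    let mut rgb = vec![[0u16; 3]; width * height];
    for y in 0..height {
        for x in 0..width {
            // Snap to the top-left of the 2x2 RGGB cell this pixel belongs to.
            let cx = x & !1;
            let cy = y & !1;
            let at = |px: usize, py: usize| raw[py * width + px];
            rgb[y * width + x] = [
                at(cx, cy),         // R
                at(cx + 1, cy),     // G (one of the two greens)
                at(cx + 1, cy + 1), // B
            ];
        }
    }
    rgb
}

fn main() {
    // A single 2x2 RGGB cell: R=100, G=200, G=150, B=50.
    let rgb = debayer_rggb(&[100u16, 200, 150, 50], 2, 2);
    // Every pixel in the cell resolves to the same nearest-neighbour colour.
    assert_eq!(rgb[0], [100, 200, 50]);
}
```

Because each pixel is computed independently, the same loop body becomes a GPU kernel launched once per pixel, with no synchronisation between threads.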
But supporting multiple GPU backends in a single application requires careful architecture.
Compile-time platform gating
Rust's cfg attributes let us include platform-specific code at compile time. Our CUDA code only compiles on Windows and Linux. Our Metal code only compiles on macOS.
use std::ffi::c_void;

#[cfg(any(target_os = "windows", target_os = "linux"))]
extern "C" {
    // Page-locked host allocation: the source of the pinned buffers below.
    fn cudaMallocHost(ptr: *mut *mut c_void, size: usize) -> i32;
    fn cudaFreeHost(ptr: *mut c_void) -> i32;
    // `kind` selects the transfer direction (host-to-device, etc.).
    fn cudaMemcpy(dst: *mut c_void, src: *const c_void, count: usize, kind: i32) -> i32;
}
#[cfg(target_os = "macos")]
mod metal_bridge; // (illustrative name) Metal shared buffer APIs via Objective-C++ bridge
The build system (build.rs) detects whether CUDA is available by checking environment variables and known install paths. If CUDA is not found, the CUDA code paths compile out entirely. The application still works, just without GPU acceleration.
Runtime pipeline selection
Even within a platform, GPU availability is not guaranteed. A Windows machine might not have an NVIDIA GPU. A Mac might have integrated graphics that does not support the Metal features we need.
So at startup, we probe the GPU before committing to a pipeline.
For BRAW, the SDK has built-in pipeline negotiation. We ask "is CUDA supported?" and "is Metal supported?" before setting the pipeline. If both fail, we fall back to CPU.
For R3D, we pass optional component flags during SDK initialization. The SDK tries to load the GPU components and silently falls back if they are not available. We log which pipeline was selected so debugging is straightforward.
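The selection logic boils down to a small, ordered probe. This is a sketch, not the real SDK API: the probe closures stand in for the "is CUDA supported?" / "is Metal supported?" negotiation described above.

```rust
// Decode pipeline chosen once at startup. The probe closures stand in
// for the SDK negotiation described above; they are illustrative, not
// the real SDK calls.
#[derive(Debug, PartialEq)]
enum Pipeline {
    Cuda,
    Metal,
    Cpu,
}

fn select_pipeline(cuda_ok: impl Fn() -> bool, metal_ok: impl Fn() -> bool) -> Pipeline {
    if cfg!(any(target_os = "windows", target_os = "linux")) && cuda_ok() {
        Pipeline::Cuda
    } else if cfg!(target_os = "macos") && metal_ok() {
        Pipeline::Metal
    } else {
        // Always available: both SDKs support CPU decode natively.
        Pipeline::Cpu
    }
}

fn main() {
    // With both probes failing, we always land on the CPU fallback.
    let p = select_pipeline(|| false, || false);
    println!("selected pipeline: {:?}", p);
    assert_eq!(p, Pipeline::Cpu);
}
```

The result is logged at startup, so a bug report that says "decode is slow" can be matched against which pipeline was actually selected.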
Double-buffered decode pipeline (CUDA)
On Windows and Linux with NVIDIA GPUs, we use a double-buffered pipeline to overlap CPU and GPU work:
- Allocate two sets of buffers: pinned host memory (for CPU-side decode output) and device memory (for GPU-side debayering)
- Decode frame N into buffer slot A using the CPU
- Copy slot A to GPU memory (async DMA transfer using pinned memory)
- Launch GPU debayering kernel on slot A
- While the GPU processes slot A, decode frame N+1 into buffer slot B on the CPU
- When the GPU finishes slot A, copy the result back to host memory and deliver it
- Repeat, alternating slots
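The alternation above can be sketched in plain Rust. To keep the sketch runnable anywhere, `decode_cpu` and `debayer_gpu` are ordinary closures standing in for the SDK decode call and the CUDA kernel launch, and a spawned thread stands in for the asynchronous GPU work:

```rust
use std::thread;

// Sketch of the ping-pong pipeline: while the "GPU" (a spawned thread
// here) processes one frame, the CPU decodes the next. In the real
// pipeline the two slots are pinned host buffers paired with device
// buffers, and the GPU work is an async CUDA kernel launch.
fn run_pipeline(frames: usize) -> Vec<String> {
    let decode_cpu = |n: usize| format!("frame{n}"); // stand-in for SDK decode
    let debayer_gpu = |data: String| format!("{data}:debayered"); // stand-in for kernel

    let mut results = Vec::new();
    let mut in_flight: Option<thread::JoinHandle<String>> = None;
    for n in 0..frames {
        // CPU decodes frame n while the previous frame's GPU work
        // (spawned at the end of the last iteration) is still running.
        let decoded = decode_cpu(n);
        // Collect the previous frame's finished result, preserving order.
        if let Some(h) = in_flight.take() {
            results.push(h.join().unwrap());
        }
        in_flight = Some(thread::spawn(move || debayer_gpu(decoded)));
    }
    if let Some(h) = in_flight.take() {
        results.push(h.join().unwrap());
    }
    results
}

fn main() {
    let out = run_pipeline(3);
    assert_eq!(out, ["frame0:debayered", "frame1:debayered", "frame2:debayered"]);
}
```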
The key insight is pinned memory. Normal heap allocations can be paged out by the OS. CUDA pinned memory (cudaMallocHost) is page-locked, which means DMA transfers between CPU and GPU happen at full PCIe bandwidth without the OS moving pages around. We wrote RAII wrappers around these allocations so they get freed properly even if an error occurs mid-pipeline.
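The RAII wrapper pattern looks roughly like this. To stay runnable without a CUDA toolchain, this sketch allocates through `std::alloc`; the comments mark where the real code calls `cudaMallocHost` and `cudaFreeHost` instead.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

// RAII wrapper sketch for a pinned host buffer. Allocation goes through
// std::alloc so the example runs anywhere; in the real pipeline the
// alloc/free calls below are cudaMallocHost/cudaFreeHost, and a CUDA
// error status becomes an Err instead of a panic.
struct PinnedBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl PinnedBuffer {
    fn new(size: usize) -> Self {
        let layout = Layout::from_size_align(size, 4096).unwrap();
        // Real code: cudaMallocHost(&mut ptr, size), checking the status code.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        PinnedBuffer { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for PinnedBuffer {
    fn drop(&mut self) {
        // Real code: cudaFreeHost(self.ptr). Drop runs on early returns
        // and panics too, so the buffer is never leaked mid-pipeline.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}

fn main() {
    let mut buf = PinnedBuffer::new(4096);
    buf.as_mut_slice()[0] = 42;
    assert_eq!(buf.as_mut_slice()[0], 42);
    // `buf` is freed here when it goes out of scope.
}
```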
Shared memory model (Metal)
macOS takes a different approach. Metal supports shared memory buffers that are accessible to both CPU and GPU without explicit copies. The R3D SDK writes decoded data directly into a Metal-accessible buffer, and the GPU reads from the same physical memory.
This eliminates the copy step entirely. The CPU decodes into a shared buffer, the GPU debayers from that same buffer into an output buffer (also shared), and the CPU reads the result directly. The pipeline is simpler than CUDA because there is no explicit memory transfer stage.
The C++ wrapper uses Objective-C++ (the -x objective-c++ compiler flag) to bridge between our C API and Metal's Objective-C interfaces.
Graceful CPU fallback
When no GPU is available, or when GPU decode is not appropriate (like generating a single thumbnail), we fall back to CPU decode. Both SDKs support this natively. The output is the same: decoded RGB pixel data ready for further processing.
CPU decode is slower but completely reliable. For thumbnail generation in BRAW, we actually prefer the CPU pipeline because it handles single-frame decode more reliably than the GPU path, which is optimised for streaming many frames in sequence.
Performance telemetry
Both decoders include per-operation timing statistics to track where time goes:
BRAW decode stats: decoded 150 frames
pool wait avg 0.15ms, decode avg 8.32ms, send wait avg 0.08ms
This tells us immediately whether the bottleneck is buffer allocation (pool wait), the SDK itself (decode), or the downstream consumer (send wait). During development, these logs helped us find a case where the frame buffer pool was too small and the decoder was stalling waiting for a free buffer.
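A per-stage accumulator for this kind of log line is small. This is a minimal sketch in the spirit of the stats above (pool wait / decode / send wait); the type and field names are illustrative, not the actual telemetry code.

```rust
use std::time::{Duration, Instant};

// Minimal per-operation timing accumulator: one instance per pipeline
// stage (pool wait, decode, send wait). Names are illustrative.
#[derive(Default)]
struct StageStats {
    total: Duration,
    count: u64,
}

impl StageStats {
    fn record(&mut self, d: Duration) {
        self.total += d;
        self.count += 1;
    }

    fn avg_ms(&self) -> f64 {
        if self.count == 0 {
            0.0
        } else {
            self.total.as_secs_f64() * 1000.0 / self.count as f64
        }
    }
}

fn main() {
    let mut decode = StageStats::default();
    for _ in 0..3 {
        let t = Instant::now();
        // ... decode work would happen here ...
        decode.record(t.elapsed());
    }
    println!("decode avg {:.2}ms over {} frames", decode.avg_ms(), decode.count);
}
```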
The build system glue
All of this requires careful build system integration. Our build.rs script:
- Detects CUDA by checking CUDA_PATH, CUDA_HOME, and known installation directories
- Runs bindgen to generate Rust FFI bindings from C/C++ headers
- Compiles C++ wrapper code with platform-appropriate flags
- Links against platform-specific libraries (cudart on Windows/Linux, Metal.framework on macOS, COM libraries on Windows)
- On macOS, passes -x objective-c++ and -std=c++11 to the C++ compiler
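The detection step is a handful of lines. This is a sketch of the shape of that build.rs logic, assuming the environment variables listed above; the fallback directories and the cuda_enabled cfg name are illustrative examples, not our exact configuration.

```rust
use std::env;
use std::path::PathBuf;

// Sketch of CUDA detection in build.rs. The environment variables match
// the ones listed above; the fallback directories and cfg name are
// illustrative examples of "known installation paths", not exhaustive.
fn find_cuda() -> Option<PathBuf> {
    for var in ["CUDA_PATH", "CUDA_HOME"] {
        if let Ok(dir) = env::var(var) {
            let path = PathBuf::from(dir);
            if path.exists() {
                return Some(path);
            }
        }
    }
    // Well-known default install locations (examples only).
    ["/usr/local/cuda", "/opt/cuda"]
        .iter()
        .map(PathBuf::from)
        .find(|p| p.exists())
}

fn main() {
    match find_cuda() {
        Some(path) => {
            // Tell Cargo where to find cudart and gate the CUDA code paths.
            println!("cargo:rustc-link-search=native={}", path.join("lib64").display());
            println!("cargo:rustc-link-lib=cudart");
            println!("cargo:rustc-cfg=cuda_enabled");
        }
        // No CUDA found: the CUDA paths compile out, CPU decode remains.
        None => println!("cargo:warning=CUDA not found, building CPU-only"),
    }
}
```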
CI runs Rust tests on both macOS and Windows with CUDA_ENABLED=0 to verify that the CPU fallback path works correctly without a GPU present.
The result
A single cargo build produces a binary that automatically uses the best available GPU backend. NVIDIA users get CUDA. Mac users get Metal. Everyone else gets CPU decode that still works. No runtime configuration needed, no user-facing GPU settings to manage.
This is one of the advantages of building a desktop app in Rust rather than wrapping everything in a web service. You can reach as deep into the hardware as you need while keeping a single, maintainable codebase.
Join the waitlist to try FrameQuery on your own hardware.