
5 Minutes, 2 Lines of Rust, Measurably Less Memory: Our mimalloc Migration

Swapping the global allocator to mimalloc took two lines of Rust. Combined with release profile tuning and AVX2 targeting, the gains were substantial for almost no effort.

FrameQuery Team · 26 March 2026 · 5 min read

Some of the best performance wins come from changes you can make in an afternoon. Swapping the default memory allocator in FrameQuery's Rust backend to mimalloc was one of those changes. Two lines of code, five minutes of work, and a measurable drop in resident memory during heavy video processing workloads.

Why the default allocator was not enough

FrameQuery's desktop app is a Rust/Tauri application that decodes video frames, runs image processing pipelines, and manages search indexes. During batch processing, the app allocates and frees millions of small buffers: pixel data, temporary image buffers, intermediate format conversions. The default system allocator on each platform (glibc malloc on Linux, the MSVC CRT heap on Windows, libmalloc on macOS) is general-purpose. It is not optimised for this kind of allocation pattern.
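A minimal sketch of that pattern (illustrative only, not FrameQuery's actual pipeline) shows why the small-allocation path matters so much:

```rust
// Illustrative only: each frame spawns short-lived buffers that die almost
// immediately, which hammers the allocator's small-allocation path.
fn process_frame(width: usize, height: usize) -> Vec<u8> {
    // Decoded pixel data for one frame (temporary).
    let rgba = vec![128u8; width * height * 4];
    // Intermediate conversion buffer (also temporary).
    let gray: Vec<u8> = rgba
        .chunks_exact(4)
        .map(|p| ((p[0] as u16 + p[1] as u16 + p[2] as u16) / 3) as u8)
        .collect();
    gray
    // `rgba` is freed here; over a batch this allocate/free cycle
    // repeats millions of times.
}
```

Every call allocates two buffers and frees one immediately; multiply by frames per second and hours of batch work, and allocator behaviour dominates.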

The symptoms were predictable. Memory usage climbed over time. RSS stayed high even after batch jobs completed. On Windows, where FrameQuery runs on MSVC, the situation was particularly noticeable. mimalloc's own benchmarks report up to 5.3x throughput improvement over glibc in multithreaded scenarios and significant RSS reduction. Our workload is allocation-heavy enough that we saw clear improvements, though the exact numbers depend on the specific mix of operations.

mimalloc in two lines

mimalloc is Microsoft's general-purpose allocator, designed for performance in multithreaded workloads. It uses thread-local heaps to reduce contention and has excellent small-allocation performance. The rust-analyzer project migrated to it in 2024 for similar reasons.

Swapping the global allocator in Rust is trivial:

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

That is the entire change for the production path. Add mimalloc = { version = "0.1", default-features = false } to your Cargo.toml and you are done.

Why not jemalloc

The more common recommendation in the Rust ecosystem is jemalloc. We tried it first. jemalloc does not compile on Windows MSVC. Since FrameQuery is a desktop application that ships on Windows, macOS, and Linux, an allocator that does not build on our primary platform is not an option.

mimalloc compiles everywhere we need it. Version 3, released in January 2026, added Windows TLS optimisations that further reduce overhead on the platform where we needed it most.

Conditional compilation for profiling

We do not always want mimalloc. When profiling allocations, we need dhat, a counting allocator that records every allocation and produces a JSON report for analysis. Rust's cfg attributes make the swap clean:

#[cfg(not(feature = "profiling"))]
use mimalloc::MiMalloc;

#[cfg(not(feature = "profiling"))]
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

#[cfg(feature = "profiling")]
#[global_allocator]
static GLOBAL: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "profiling")]
    let _profiler = dhat::Profiler::new_heap();
    app_lib::run();
}

In production builds, mimalloc is the allocator. For profiling we build with cargo build --profile profiling --features profiling: the feature flag swaps dhat in, and the dedicated Cargo profile keeps the debug information dhat needs. On exit, dhat writes dhat-heap.json, which shows exactly where allocations happen and how long they live. The profiling workflow is: dhat first to find allocation hotspots, Tracy for decode latency, then cargo-flamegraph for CPU profiling.
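The feature wiring in Cargo.toml is small. A sketch of what it might look like (version numbers are ours; check crates.io for current releases):

```toml
[dependencies]
mimalloc = { version = "0.1", default-features = false }
dhat = { version = "0.3", optional = true }

[features]
# Building with --features profiling swaps dhat in for mimalloc
# via the cfg attributes shown above.
profiling = ["dep:dhat"]
```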

The profiling Cargo profile inherits from release but keeps what dhat needs:

[profile.profiling]
inherits = "release"
debug = 1
strip = false
panic = "unwind"

dhat requires debug symbols to produce useful backtraces, and it needs unwinding enabled so it can write its report when the profiler drops on exit.

Release profile tuning

The allocator swap was the biggest single win, but the release profile contributes too. Here is what we ship with:

[profile.release]
lto = "thin"
codegen-units = 1
panic = "abort"
strip = true
opt-level = 3

Each setting has a reason. lto = "thin" enables thin link-time optimisation, which gives roughly 90% of full LTO's performance benefit at a fraction of the compile time. codegen-units = 1 forces the compiler to process the entire crate as a single unit, enabling cross-module inlining and optimisation that would otherwise be impossible. panic = "abort" removes unwinding tables and landing pads, shrinking the binary and removing a small runtime cost. strip = true removes debug symbols from the final binary. opt-level = 3 enables maximum optimisation.

The combined effect is a smaller, faster binary. The trade-off is longer compile times, but that only matters for release builds.

Targeting x86-64-v3 for AVX2

FrameQuery does a lot of pixel format conversion. Converting 16-bit RGB to 8-bit, swapping BGRA to RGBA, resizing frames for face detection. These are tight loops over large arrays of pixel data. LLVM can auto-vectorise them, but only if you tell it what instructions are available.

In .cargo/config.toml:

[target.x86_64-pc-windows-msvc]
rustflags = ["-C", "target-cpu=x86-64-v3"]

x86-64-v3 unlocks AVX, AVX2, BMI1/BMI2, FMA, F16C, LZCNT, MOVBE, and OSXSAVE. That covers every mainstream Intel CPU from Haswell (2013) forward and every AMD CPU from Zen (2017) forward; the notable exceptions are Atom-derived Pentium and Celeron parts, which lack AVX2. We are comfortable dropping older hardware for a desktop video application.

The payoff is that simple Rust code auto-vectorises to SIMD instructions without intrinsics:

pub fn rgb16_to_rgb8(src: &[u16]) -> Vec<u8> {
    src.iter().map(|&v| (v >> 8) as u8).collect()
}

With x86-64-v3 targeting, LLVM compiles this to vpsrlw (shift) and vpackuswb (pack) instructions, processing 16 values per instruction using 256-bit YMM registers. Without the flag, it generates scalar code that processes one element at a time.

Similarly for BGRA-to-RGBA byte swapping:

pub fn bgra_rgba_swap(data: &mut [u8]) {
    for pixel in data.chunks_exact_mut(4) {
        pixel.swap(0, 2);
    }
}

This auto-vectorises to byte shuffle instructions. No hand-written SIMD, no unsafe blocks, no architecture-specific code paths. Just idiomatic Rust that happens to compile to fast vector instructions because we told the compiler what the CPU supports.
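One consequence worth guarding against: a binary compiled for x86-64-v3 hits an illegal instruction on an unsupported CPU before it can print anything useful. A hedged sketch of a startup check (our illustration, not FrameQuery's shipped code), with one caveat: std::is_x86_feature_detected! constant-folds to true when the feature is already enabled at compile time, so the check must live in code built for baseline x86-64, such as a small launcher:

```rust
// Detect AVX2 at runtime so we can fail with a clear message instead of
// SIGILL deep inside an auto-vectorised loop. NOTE: this only performs a
// real CPUID check when compiled WITHOUT target-cpu=x86-64-v3 (the macro
// short-circuits to `true` if AVX2 is statically enabled), so it belongs
// in a baseline-targeted launcher or pre-flight binary.
#[cfg(target_arch = "x86_64")]
fn cpu_is_supported() -> bool {
    std::is_x86_feature_detected!("avx2")
}

#[cfg(not(target_arch = "x86_64"))]
fn cpu_is_supported() -> bool {
    true // non-x86 builds (e.g. Apple silicon) do not use this flag
}
```

Call this before handing off to the v3-compiled application code, and exit with an explanatory message if it returns false.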

Arena allocation for batch processing

The allocator swap reduced general allocation overhead, but batch frame processing has a specific pattern: allocate many temporary buffers for one frame, then throw them all away before the next frame. Per-object deallocation is wasted work when you know everything dies together.

We use bumpalo for arena allocation in these hot paths:

pub fn rgb16_to_rgb8_arena<'a>(arena: &'a bumpalo::Bump, src: &[u16]) -> &'a [u8] {
    // Allocate the destination from the arena; it is freed wholesale
    // when the arena is reset or dropped, never individually.
    let dst = arena.alloc_slice_fill_default(src.len());
    for (d, &s) in dst.iter_mut().zip(src.iter()) {
        *d = (s >> 8) as u8; // keep the high byte, as in the heap version above
    }
    dst
}

Each batch gets a fresh arena. All pixel format conversions during that batch allocate from the arena. When the batch finishes, the arena drops and frees everything in one shot. Zero per-frame deallocation overhead, zero fragmentation, and the allocator barely notices.

fast_image_resize

One more win worth mentioning. FrameQuery resizes frames for face detection (640x640) and thumbnail generation. The image crate's built-in resize is correct but slow. We switched to fast_image_resize, which is roughly 14x faster for the same Lanczos3 filter quality.

fast_image_resize does runtime dispatch to AVX2 on x86-64, NEON on ARM, or scalar as a fallback. Combined with our x86-64-v3 targeting, the AVX2 path activates automatically on every machine we support.

The total picture

None of these changes required rearchitecting anything. The allocator swap was two lines. The release profile is a few lines of TOML. The AVX2 targeting is one line of config. Arena allocation was a targeted change to batch processing code. Each change is small. Together, they cut memory usage roughly in half and measurably improved throughput across the video processing pipeline.

The lesson is worth repeating: before reaching for complex architectural changes, check whether your allocator, your compiler flags, and your build profile are working for you. In Rust, the defaults are good. With a few lines of configuration, they can be much better.


We are building a video search app where performance like this matters on every frame. Join the waitlist to try FrameQuery.