
From 500 Threads to Sanity: Taming Concurrency in a Rust Video Pipeline

Tokio's spawn_blocking pool can grow to 500 threads. For CPU-bound video decoding, that is a problem. Here is how we replaced it with bounded concurrency, structured cancellation, and lock-free primitives.

FrameQuery Team · 26 March 2026 · 7 min read

FrameQuery decodes professional video formats (R3D, BRAW, ProRes RAW) on the desktop. These are CPU-heavy operations that run alongside an async Tokio runtime handling UI events, database queries, and network I/O. Getting concurrency right in this setup took several iterations. Here is what we learned.

The spawn_blocking trap

Tokio's spawn_blocking is the standard way to run blocking work without stalling the async runtime. You hand it a closure, and Tokio runs it on a dedicated thread pool separate from the async worker threads. For blocking I/O like file reads or synchronous database calls, this works well.

The problem is that Tokio's blocking pool can grow to 512 threads by default (the max_blocking_threads setting). Each call to spawn_blocking that cannot be served by an idle thread spawns a new one. For I/O-bound work where threads spend most of their time waiting, this is fine. For CPU-bound video decoding, it is a disaster.

When we first wired up our R3D and BRAW decoders through spawn_blocking, the app would happily spin up dozens of decode threads during a large library scan. Each thread was doing real CPU work: debayering raw sensor data, colour space conversions, pixel format transforms. The result was memory thrashing from per-thread decode buffers, CPU contention from context switching between too many compute-bound threads, and unpredictable latency in the async runtime as the system struggled under load.

The fix: a bounded Rayon decode pool

We replaced spawn_blocking with a dedicated Rayon thread pool sized to the machine's core count. The entire module is 48 lines.

static DECODE_POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();

fn decode_pool() -> &'static rayon::ThreadPool {
    DECODE_POOL.get_or_init(|| {
        let threads = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(4)
            .max(2);
        rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .thread_name(|i| format!("fq-decode-{i}"))
            .build()
            .expect("failed to create decode thread pool")
    })
}

The pool is lazily initialised via OnceLock and lives for the lifetime of the process. Thread count matches available_parallelism (which respects cgroup limits and processor affinity), with a floor of 2. Named threads (fq-decode-0, fq-decode-1, ...) make profiling and debugging straightforward. When you see a thread in a flamegraph or a log, you know exactly what it is doing.

The key design choice is that this pool has a fixed size. On an 8-core machine, you get 8 decode threads. If 20 decode tasks are submitted, 8 run concurrently and 12 wait in the queue. No runaway thread creation. Predictable memory usage. The CPU stays saturated without being oversubscribed.

Bridging Rayon back to Tokio

The decode pool lives outside Tokio's world. We need a way for async code to submit work and await the result. A oneshot channel handles this cleanly.

pub fn spawn_decode<F, T>(f: F) -> tokio::task::JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = tokio::sync::oneshot::channel();
    decode_pool().spawn(move || {
        let _ = tx.send(f());
    });
    tokio::task::spawn(async move {
        rx.await.expect("decode pool task panicked or was cancelled")
    })
}

The function returns a JoinHandle<T>, making it a drop-in replacement for spawn_blocking. Callers do not need to know that work is running on Rayon instead of Tokio's blocking pool. The Rayon thread executes the closure and sends the result over the oneshot channel. A lightweight Tokio task awaits the result and surfaces it to the caller.

We use this in r3d_decoder.rs, braw_decoder.rs, prores_raw_decoder.rs, and prores_raw_vulkan.rs. Swapping in spawn_decode was a one-line change in each file.

There is a trade-off worth noting. If the Rayon pool is saturated, new decode tasks queue up rather than getting a fresh thread. This is intentional. For CPU-bound work, queuing is better than oversubscription. But it means decode latency for any individual frame depends on queue depth. In practice, this is not a problem because the pipeline naturally applies backpressure through bounded channels (more on that below).

Structured shutdown with TaskTracker and CancellationToken

Desktop applications need clean shutdown. When the user closes FrameQuery, we cannot just call std::process::exit and hope for the best. There might be database writes in flight, encode pipelines flushing buffers, or temporary files that need cleanup.

We built a lightweight task manager around tokio_util::task::TaskTracker and tokio_util::sync::CancellationToken.

pub struct AppTasks {
    tracker: TaskTracker,
    token: CancellationToken,
}

This is a global singleton initialised at app startup. Every background task that should participate in structured shutdown is spawned through it. The spawn_cancellable method gives each task a child cancellation token.

pub fn spawn_cancellable<F, Fut>(&self, f: F)
where
    F: FnOnce(CancellationToken) -> Fut,
    Fut: Future<Output = ()> + Send + 'static,
{
    let token = self.token.child_token();
    self.tracker.spawn(f(token));
}

Inside a task, the pattern is always the same: use tokio::select! to race work against cancellation.

tokio::select! {
    _ = token.cancelled() => break,
    result = do_work() => {
        // handle result
    }
}

When shutdown is requested, we cancel the root token and wait for all tracked tasks to complete.

pub async fn shutdown(&self) {
    self.token.cancel();
    self.tracker.close();
    self.tracker.wait().await;
}

Cancellation propagates through child tokens automatically. A scan task with ten subtasks cancels all of them when the parent token fires. No manual bookkeeping.

Bounded concurrency for file scanning

Library scanning is a different concurrency challenge. FrameQuery walks source directories that can contain thousands of video files. Each file needs metadata extraction: duration, resolution, codec, creation date. This is I/O-bound (reading file headers) rather than CPU-bound, so the decode pool is the wrong tool.

We use a Tokio semaphore to bound concurrency.

let semaphore = Arc::new(Semaphore::new(opts.max_threads));
let total_queued = Arc::new(AtomicU64::new(0));
let discovery_complete = Arc::new(AtomicBool::new(false));

The semaphore limits how many files are being processed simultaneously. The atomic counters track progress without locks. An AtomicBool signals when directory discovery is finished so the progress UI can show a determinate progress indicator instead of an indeterminate spinner.

A heartbeat task fires every 200 milliseconds to push progress updates to the frontend. This avoids flooding the UI with per-file events while keeping the progress display responsive.

The reason we did not use Rayon here is that file metadata extraction involves actual I/O (disk reads, sometimes network for NAS-mounted media). Rayon's work-stealing is designed for CPU-bound parallelism. Tokio's semaphore with async tasks is a better fit when threads spend time waiting on I/O.

The concurrency primitive cheat sheet

After a few rounds of refactoring, we settled on a consistent set of primitives across the codebase.

OnceLock<T> for lazy singletons. The database connection pool, HTTP client, and decode pool are all initialised once on first access and live for the process lifetime. No lazy_static macro needed since OnceLock is in the standard library as of Rust 1.70.

Arc<Mutex<T>> for shared mutable state that needs synchronisation. The database write connection and app configuration use this. We use Tokio's async Mutex only when a lock must be held across an .await point. Otherwise, std::sync::Mutex is faster: an uncontended lock is essentially an atomic operation, with no scheduler involvement.

Arc<AtomicBool> and Arc<AtomicU64> for lock-free flags and counters. Scan cancellation, online/offline state detection, and progress counters all use atomics. They are cheaper than a mutex for simple values and cannot deadlock.

tokio::sync::mpsc for streaming pipelines. The frame decode pipeline uses a bounded channel with capacity 4, which is enough to keep the encoder fed without buffering too many decoded frames in memory. Scan results flow through a channel with capacity 64. Bounded channels provide natural backpressure: if the consumer is slow, the producer blocks instead of filling memory.

tokio::sync::oneshot for the Rayon-to-Tokio bridge described above. One result, one consumer, zero overhead after the send.

CancellationToken and TaskTracker from tokio-util for structured task lifecycle. Every long-running background operation uses these.

Handling child processes on Windows

There is one more concurrency concern that is easy to overlook on desktop: child processes. FrameQuery spawns FFmpeg for certain encode and transcode operations. If the app crashes or is killed, those FFmpeg processes can become orphans consuming CPU and disk I/O indefinitely.

On Windows, we assign child processes to a Job Object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE. When the FrameQuery process exits for any reason (clean shutdown, crash, task manager kill), the OS automatically terminates all processes in the job. We also set BELOW_NORMAL_PRIORITY_CLASS on these child processes so background transcoding does not compete with the user's foreground work.

This is a platform-specific detail, but an important one. Users should not discover orphaned FFmpeg processes eating 100% CPU after closing the app.

What changed in practice

After these changes, FrameQuery's thread count during a large library scan went from "whatever Tokio felt like spawning" to a predictable number: core count for decode, a bounded set for file scanning, and the small fixed Tokio worker pool for async I/O. Memory usage became stable and predictable. Decode latency became consistent rather than dependent on how many other operations happened to be running.

The total code for all of this is modest. The decode pool module is 48 lines. The task manager is 94 lines. The semaphore-based scanner is a small addition to an existing module. Rust's type system enforces most of the invariants at compile time: you cannot accidentally send a non-Send type across threads, you cannot forget to await a JoinHandle, and the borrow checker prevents data races on shared state.

None of these patterns are novel. Bounded thread pools, structured cancellation, and lock-free counters are well-established concurrency techniques. The insight is that Tokio's defaults are optimised for network services, not desktop applications doing heavy compute. Recognising where those defaults do not fit and replacing them with the right primitive for each job made a measurable difference in how FrameQuery behaves under load.


We are building FrameQuery to handle professional video libraries without melting your hardware. Join the waitlist to try it.