22.7 Data Parallelism with Rayon
Manually spawning and coordinating threads to parallelize computations across data collections (like vectors or arrays) is tedious and error-prone: partitioning the data correctly, balancing the load across cores, and managing synchronization are all non-trivial problems. The Rayon crate provides a high-level framework for data parallelism that abstracts away much of this complexity. It leverages a work-stealing thread pool to efficiently distribute computations across available CPU cores.
22.7.1 Using Parallel Iterators
Rayon’s most prominent feature is its parallel iterators. Often, converting sequential iterator-based code to run in parallel requires minimal changes.
First, add Rayon as a dependency in your Cargo.toml:
[dependencies]
rayon = "1.8" # Check for the latest version
Then, bring the parallel iterator traits into scope:
use rayon::prelude::*;
You can then replace standard iterator methods like .iter(), .iter_mut(), or .into_iter() with their parallel counterparts: .par_iter(), .par_iter_mut(), or .into_par_iter(). Most standard iterator adaptors (like map, filter, fold, sum, and for_each) have parallel equivalents provided by Rayon.
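As a minimal illustration before the larger example below, here is the same reduction written sequentially and then in parallel; only the iterator method changes. (The nums vector and the squaring closure are placeholders chosen for this sketch.)

use rayon::prelude::*;

fn main() {
    let nums: Vec<u64> = (1..=100).collect();

    // Sequential version
    let seq_sum: u64 = nums.iter().map(|&n| n * n).sum();

    // Parallel version: .iter() becomes .par_iter(); nothing else changes
    let par_sum: u64 = nums.par_iter().map(|&n| n * n).sum();

    assert_eq!(seq_sum, par_sum);
    println!("Sum of squares: {}", par_sum);
}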
use rayon::prelude::*; // Import the parallel iterator traits

fn main() {
    let mut data: Vec<u64> = (0..1_000_000).collect();

    // Sequential computation (example: modify in place)
    // data.iter_mut().for_each(|x| *x = (*x * *x) % 1000);

    // Parallel computation using Rayon
    println!("Starting parallel computation...");
    data.par_iter_mut() // Get a parallel mutable iterator
        .enumerate() // Get index along with element
        .for_each(|(i, x)| {
            // This closure potentially runs in parallel for different chunks of data.
            // Perform some computation (e.g., simulate work based on index)
            let computed_value = (i as u64 * i as u64) % 1000;
            *x = computed_value;
        });
    println!("Parallel modification finished.");

    // Example: Parallel sum after modification
    let sum: u64 = data.par_iter() // Parallel immutable iterator
        .map(|&x| x * 2) // Map operation runs in parallel
        .sum(); // Reduction (sum) is performed efficiently in parallel

    println!("Parallel sum of doubled values: {}", sum);

    // Verify a few values (optional, computation is deterministic)
    // println!("Data[0]={}, Data[1]={}, Data[last]={}", data[0], data[1], data[data.len()-1]);
}
Rayon automatically manages a global thread pool (sized to the number of logical CPU cores by default). It intelligently splits the data (the data vector in the example) into smaller chunks and assigns them to worker threads. If one thread finishes its chunk early, it can “steal” work from another, busier thread, ensuring good load balancing.
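If the default pool size is not what you want (for example, to leave some cores free for other work), the global pool can be configured before its first use with rayon::ThreadPoolBuilder. A minimal sketch; the thread count of 4 is an arbitrary example value:

use rayon::prelude::*;

fn main() {
    // Configure the global pool; this must happen before the pool is first used,
    // otherwise build_global() returns an error.
    rayon::ThreadPoolBuilder::new()
        .num_threads(4) // arbitrary example value
        .build_global()
        .expect("global thread pool already initialized");

    let total: u64 = (0..1_000u64).into_par_iter().sum();
    println!("Sum computed on a 4-thread pool: {}", total);
}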
22.7.2 The rayon::join Function
For parallelizing distinct, independent tasks that don’t naturally fit the iterator model, Rayon provides rayon::join. It takes two closures and executes them, potentially in parallel on different threads from the pool, returning only when both closures have completed.
fn compute_task_a() -> String {
    // Simulate some independent work
    println!("Task A starting on thread {:?}", std::thread::current().id());
    std::thread::sleep(std::time::Duration::from_millis(150));
    println!("Task A finished.");
    String::from("Result A")
}

fn compute_task_b() -> String {
    // Simulate other independent work
    println!("Task B starting on thread {:?}", std::thread::current().id());
    std::thread::sleep(std::time::Duration::from_millis(100));
    println!("Task B finished.");
    String::from("Result B")
}

fn main() {
    println!("Starting rayon::join...");

    let (result_a, result_b) = rayon::join(
        compute_task_a, // Closure 1
        compute_task_b, // Closure 2
    );
    // rayon::join blocks until both compute_task_a and compute_task_b return.
    // They may run sequentially or in parallel depending on thread availability.

    println!("rayon::join completed.");
    println!("Joined results: A='{}', B='{}'", result_a, result_b);
}
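rayon::join is particularly useful for recursive divide-and-conquer algorithms, where each half of the problem is handed to one of the two closures. The sketch below sums a slice this way; the CHUNK threshold of 4096 is an arbitrary cutoff below which the sequential path is used:

// Recursive divide-and-conquer sum using rayon::join.
fn parallel_sum(slice: &[u64]) -> u64 {
    const CHUNK: usize = 4096; // arbitrary cutoff for falling back to sequential code
    if slice.len() <= CHUNK {
        return slice.iter().sum();
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    // Each half may be processed on a different worker thread.
    let (left_sum, right_sum) = rayon::join(|| parallel_sum(left), || parallel_sum(right));
    left_sum + right_sum
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    println!("Total: {}", parallel_sum(&data));
}

For a plain sum, data.par_iter().sum() achieves the same result with less code; join earns its keep when the recursion itself is the algorithm, as in parallel sorts or tree traversals.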
22.7.3 Performance Considerations
Rayon makes parallelism easy, but it’s not a magic bullet for performance.
- Overhead: There is overhead associated with coordinating threads, splitting work, and potentially stealing tasks. For very small datasets or extremely simple computations per element, this overhead might outweigh the benefits of parallel execution, potentially making the parallel version slower than the sequential one.
- Amdahl’s Law: The maximum speedup achievable through parallelism is limited by the portion of the code that must remain sequential.
- Work Granularity: The amount of work done per parallel task matters. If tasks are too small, overhead dominates; if they are too large, load balancing suffers. Rayon’s work stealing helps, but performance can still depend on the nature of the computation (the sketch after this list shows one way to influence how finely Rayon splits the work).
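When the per-element work is very cheap, one way to coarsen the granularity is the with_min_len adaptor on indexed parallel iterators, which tells Rayon not to split below a given number of elements per task. A minimal sketch; the minimum length of 1,024 is just an illustrative value:

use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();

    // For very cheap per-element operations, hinting a minimum chunk size
    // keeps coordination overhead from dominating the actual work.
    let sum: u64 = data
        .par_iter()
        .with_min_len(1024) // illustrative: keep at least 1,024 elements per task
        .map(|&x| x + 1)
        .sum();

    println!("Sum: {}", sum);
}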
Always benchmark and profile your code (e.g., using cargo bench and profiling tools like perf on Linux or Instruments on macOS) to verify that using Rayon provides a tangible performance improvement for your specific workload and target hardware.