Rust For Scala Developers: Concurrency

As a longtime Scala developer learning Rust, I recently explored various mechanisms for concurrency, parallelism, and asynchrony in Rust code.


We’ll take a look here, and also compare Rust’s offerings with another specific, powerful concurrency mechanism: the ZIO functional effects framework for Scala.

 

---


Every Internet piece on these topics begins with the same clarifications, so let’s follow suit:


Concurrency: More than one operation executing over time such that their effects and results may intermingle.


Parallelism: More than one thread of execution running simultaneously (on multiple cores) in the same process.


Asynchrony: Code that suspends the calling thread (fiber, stack, etc.) and runs on another, joining with the caller via a callback or suspension return.


CPU-bound operations call for a relatively small number of actual (OS) threads doing work simultaneously. (It is not possible to achieve higher throughput with a number of threads greater than the number of cores available.) Relatively simple thread pools, or even simpler manual threading mechanisms, are suited to such tasks, as are tools like parallel iterators/combinators that encapsulate parallel processing.


Async mechanisms in languages such as Rust, Node.js, C#, etc., generally involve shifting the contents of a state machine to the heap, which sidesteps all the issues with stack growth for threads, and makes possible a very large number of concurrent operations. This motif is ideal for systems involving a large number of IO-bound operations, such as web servers, in which the number of concurrent operations must generally greatly exceed the number of cores as well as OS threads.


Rust


Rust has rich built-in and/or library support for all of these variants:


  • The std library has support for threads and channels, from which can be built reliable parallel systems

  • Rayon offers support for parallelism in the form of parallel iterators and thread pools

  • The async/await mechanism, together with a runtime such as Tokio (the most popular), brings concurrent asynchronous programming to Rust


ZIO


The ZIO model for concurrency is a runtime using lightweight, asynchronous fibers (green threads) that feature structured concurrency (parent-child fiber relationships), interruptibility, and built-in resource safety (a guaranteed finalizer mechanism).


ZIO's fiber-based concurrency model has some things in common with Rust's async/await system (and similar models in other languages):


- There is a multiplexing of fibers/tasks to OS threads (giving great scalability potential)

- Fibers are asynchronous and interruptible (a single effect is the step, i.e., the unit of work)

- Fibers are very lightweight compared to OS threads


In a sense, fibers (and the associated runtime) are a mechanism for user-space cooperative multitasking.


The ZIO concurrency model also provides lock-free concurrency (https://zio.dev/reference/concurrency/) and allows arbitrary composition of effects (parallel, sequential, or any mixture of the two) due to the separation of definition from execution inherent in the functional effect model.
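A minimal sketch of what that composability looks like in practice (the effects here are trivial placeholders; only the combinator names are real ZIO API):

```scala
import zio._

// Definition is separate from execution: the same two effects can be
// composed sequentially or in parallel without changing them.
val a: UIO[Int] = ZIO.succeed(1)
val b: UIO[Int] = ZIO.succeed(2)

val sequential: UIO[(Int, Int)] = a.zip(b)    // run a, then b
val parallel:   UIO[(Int, Int)] = a.zipPar(b) // run a and b concurrently
```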


One aspect of ZIO's concurrency I find interesting is that the same system is equally suited to both CPU-bound and IO-bound parallelism/concurrency.


---


On to the examples. I have coded three Rust samples and a single one in ZIO:


  • Parallel Word Count (CPU-Bound) With Rayon (Rust)

  • Parallel Word Count (CPU-Bound) With Channels (Rust)

  • Simple Async/Await Example (IO-Bound) (Rust)

  • Parallel Word Count (CPU-Bound) With ZIO


The word-count samples operate on a 100MB text file I created by concatenating a single 10MB file ten times. On my 2020 M1 MacBook Air with 8 cores (4 performance and 4 efficiency), that much data was necessary to get timings into a reasonable range.


Parallel Word Count (CPU-Bound) With Rayon (Rust)


This first sample uses parallel iterators - into_par_iter() - from the popular Rayon crate. This approach yielded worse performance than the channels approach, which is discussed below.


The code:




Output:

paulfolbrecht@Pauls-MacBook-Air rust-concurrency % cargo run --bin rayon   

    Finished dev [unoptimized + debuginfo] target(s) in 0.15s

     Running `target/debug/rayon`

the: 850887

of: 491499

to: 485172

a: 386523

and: 378621

in: 260856

I: 199611

his: 187335

that: 182907

he: 178371

Time for run: 7,525ms



Parallel Word Count (CPU-Bound) With Channels (Rust)



Spawning threads manually and using a simple channel to send & collect results is slightly more complex than the first method, but produced better performance, and also scaled relatively linearly up to the number of physical cores, as expected.

The code:
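A sketch of the channels approach using only the standard library (the function name and chunking details are my own; the real program reads the 100MB file and times the run):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Split the lines into one chunk per worker, count words in each chunk on
// its own thread, and merge the partial maps sent back over the channel.
fn word_count_parallel(lines: Vec<String>, workers: usize) -> HashMap<String, usize> {
    let chunk_size = (lines.len() / workers.max(1)).max(1);
    let (tx, rx) = mpsc::channel();

    let mut spawned = 0;
    for chunk in lines.chunks(chunk_size) {
        let chunk = chunk.to_vec();
        let tx = tx.clone();
        spawned += 1;
        thread::spawn(move || {
            let mut map: HashMap<String, usize> = HashMap::new();
            for line in &chunk {
                for word in line.split_whitespace() {
                    *map.entry(word.to_string()).or_insert(0) += 1;
                }
            }
            tx.send(map).expect("main thread hung up");
        });
    }
    drop(tx); // close our copy of the sender

    // Collect exactly one partial map per spawned worker and merge.
    let mut totals: HashMap<String, usize> = HashMap::new();
    for map in rx.iter().take(spawned) {
        for (word, count) in map {
            *totals.entry(word).or_insert(0) += count;
        }
    }
    totals
}

fn main() {
    let lines = vec!["the cat and the hat".to_string(), "the cat".to_string()];
    let mut sorted: Vec<_> = word_count_parallel(lines, 4).into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, count) in sorted.iter().take(10) {
        println!("{}: {}", word, count);
    }
}
```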




Output:

paulfolbrecht@Pauls-MacBook-Air rust-concurrency % cargo run --bin channels

    Finished dev [unoptimized + debuginfo] target(s) in 0.06s

     Running `target/debug/channels`

the: 850887

of: 491499

to: 485172

a: 386523

and: 378621

in: 260856

I: 199611

his: 187335

that: 182907

he: 178371

Time for run: 2,877ms




Simple Async/Await Example (IO-Bound) (Rust)




Switching gears to the I/O use-case, the use of async/await results in very simple and clean code.




This sample makes concurrent HTTP requests (using the popular reqwest crate) and collects the responses.




Note that the sample, simple as it is, demonstrates moving data into an async closure (the URL) as well as out of it (the responses).

The code:






Output:


paulfolbrecht@Pauls-MacBook-Air rust-concurrency % cargo run --bin async

    Finished dev [unoptimized + debuginfo] target(s) in 0.06s

     Running `target/debug/async`

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK

Result: 200 OK





Parallel Word Count (CPU-Bound) With ZIO





ZIO has a full suite of parallel combinators - collectPar, foreachPar, mergeAllPar - but I elected to manually fork/join instead.





The code:
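A sketch of what the ZIO program might look like (file path, names, and chunking details are assumptions; timing and error handling are elided):

```scala
import zio._
import scala.io.Source

object WordCount extends ZIOAppDefault {

  // Read the file on the blocking pool.
  def readFile(path: String): Task[Vector[String]] =
    ZIO.attemptBlocking(Source.fromFile(path).getLines().toVector)

  // The CPU-bound work effect, also created with attemptBlocking.
  def countChunk(lines: Vector[String]): Task[Map[String, Int]] =
    ZIO.attemptBlocking {
      lines.flatMap(_.split("\\s+")).groupMapReduce(identity)(_ => 1)(_ + _)
    }

  def run =
    for {
      lines  <- readFile("data/words.txt")
      cores   = java.lang.Runtime.getRuntime.availableProcessors()
      chunks  = lines.grouped((lines.size / cores).max(1)).toVector
      fibers <- ZIO.foreach(chunks)(chunk => countChunk(chunk).fork) // fork one fiber per chunk
      maps   <- ZIO.foreach(fibers)(_.join)                          // join and merge
      totals  = maps.foldLeft(Map.empty[String, Int]) { (acc, m) =>
                  m.foldLeft(acc) { case (a, (w, c)) => a.updated(w, a.getOrElse(w, 0) + c) }
                }
      _      <- ZIO.foreach(totals.toVector.sortBy(-_._2).take(10))(Console.printLine(_))
    } yield ()
}
```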





Output:


paulfolbrecht@Pauls-MacBook-Air scala-concurrency % ./mill cpu_bound run            

[46/46] cpu_bound.run 

Time for readFile: 554ms

(the,850887)

(of,491499)

(to,485172)

(a,386523)

(and,378621)

(in,260856)

(I,199611)

(his,187335)

(that,182907)

(he,178371)

Time for app: 3908ms

Time for readFile: 260ms

(the,850887)

(of,491499)

(to,485172)

(a,386523)

(and,378621)

(in,260856)

(I,199611)

(his,187335)

(that,182907)

(he,178371)

Time for app: 3056ms

Time for readFile: 212ms

(the,850887)

(of,491499)

(to,485172)

(a,386523)

(and,378621)

(in,260856)

(I,199611)

(his,187335)

(that,182907)

(he,178371)

Time for app: 2945ms






(The reason for three successive runs instead of a single one is discussed below.)






---






An I/O-bound example with ZIO is left as an exercise. Since ZIO handles both types of use-case, it could reasonably be similar in structure, with two differences:






  • Rather than creating our work effect with ZIO.attemptBlocking, we would use ZIO.succeed (or ZIO.attempt for a fallible effect), so the work runs on the default thread pool.

  • Relatedly, one should not restrict the number of concurrent I/O tasks to the number of cores or anything similar. It would be best to simply use one of the ZIO parallel combinators here.
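A hypothetical sketch of those two points (fetch is a stand-in for a real non-blocking HTTP call; the URLs are invented):

```scala
import zio._

// Stand-in for a real non-blocking HTTP client call.
def fetch(url: String): Task[String] = ZIO.attempt(s"response from $url")

val urls: List[String] = List.tabulate(100)(i => s"https://example.com/$i")

// foreachPar does not cap parallelism at the core count; withParallelism
// can bound it if the remote service requires that.
val responses: Task[List[String]] = ZIO.foreachPar(urls)(fetch)
```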






(The number one rule of async programming is “do not block the event loop thread/main threads.” As noted, ZIO functions equally well for both CPU-bound and IO-bound tasks, but this detail is one that cannot be ignored.)






Timing






The Rust & Scala code samples are timed in similar fashion, with a utility wrapper/combinator that allows “add-on” timing to any operation.






The ZIO timer is implemented as an extension method that creates a new combinator on the ZIO effect:
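A sketch of such an extension method (the name timedLog and the label format are my own; ZIO.timed itself is real API):

```scala
import zio._

object TimingOps {
  implicit final class TimedOps[R, E, A](private val zio: ZIO[R, E, A]) extends AnyVal {
    // Run the effect, then print the elapsed wall-clock time with a label.
    def timedLog(label: String): ZIO[R, E, A] =
      zio.timed.flatMap { case (duration, a) =>
        Console.printLine(s"Time for $label: ${duration.toMillis}ms").orDie.as(a)
      }
  }
}
```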







(ZIO.timed has pluggable implementations, with the default using Java’s nanoTime().)







Because monadic programming is not really natural or idiomatic in Rust, we do not have an effect or other monad to attach combinators to. But we can still create a wrapper function, which gives us ad-hoc timing ability:
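A minimal version of such a wrapper (the names here are illustrative):

```rust
use std::time::Instant;

// Wrap any closure with wall-clock timing, returning the closure's result.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("Time for {}: {}ms", label, start.elapsed().as_millis());
    result
}

fn main() {
    let sum = timed("sum", || (1..=1000u64).sum::<u64>());
    println!("{}", sum);
}
```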






It must be called with a closure, however, making it not terribly ergonomic.







Performance







Discounting the constant file reading & parsing time, increasing the number of CPUs for the Rust Channels and ZIO samples up to the number of physical performance cores resulted in almost perfectly linear scaling.







This did not occur for the Rayon sample. That sample differs significantly from the other two: it does not manually break the lines into chunks to be handed to fibers/threads. Using a higher-level mechanism like parallel iterators doesn’t make much sense if you’re going to do that. Thus, we are relying on the Rayon library - into_par_iter - to chunk this large number (677,422) of lines for us.







(Note that Rayon will spawn threads according to the number of cores, as expected. Configuring the thread count manually resulted in the same performance.)







An important matter regarding Scala performance is the “warmup” process endemic to the Java Virtual Machine. The Scala sample was coded to run the program three times in sequence to demonstrate this effect. The sample output is representative of the norm: Performance increases by better than 25% as the VM warms up, stabilizing after ~three iterations.







Notes







The fact that all three word-count implementations give the exact same output is encouraging regarding correctness. It is conceivable that some difference in, for example, how whitespace is defined, could result in small variations in results, but that is not the case, at least in the observed output.







It’s impressive (but not surprising) that Scala is as performant as it is, being a relatively “high-level” language.







And it is impressive (but not surprising) that the Rust code is as succinct as it is, Rust being a bare-metal language.







In both Rust and Scala, I chose to have workers produce individual results, which are then merged, rather than having them operate on a shared, mutable data structure (even one wrapped in a monad). I did also write the channels Rust sample using a single, mutable map as an experiment - the code was more complex and performed no better.





Next

Strong AI Is A Theoretical Impossibility