Data and Programming with Will Jones - Let’s have more Rust in Python’s Arrow revolution

The upcoming release of Pandas 2.0 embraces Apache Arrow even more. If you haven’t yet, read Marc Garcia’s “pandas 2.0 and the Arrow revolution”. Pandas will increasingly use Arrow arrays to back data frames. This move towards Arrow, a standard multi-language data format, presents an interesting opportunity for the data science ecosystem: to embrace Rust as a programming language for writing native extensions for Python packages.

In most cases, developers have reached for C++ or Cython to write this code. But these languages and their ecosystems have some very rough edges. As a maintainer of PyArrow and Arrow C++ I’ve seen them first hand. At the same time, I’ve also experienced the pleasure of developing in Rust, as I’ve been maintaining the Rust-based Python deltalake library (built on delta-rs).

What does Rust bring to the table? A few key things:

A helpful compiler that can identify memory safety bugs. If you’re more of a Python developer than a Rust/C++ developer, this is key. But I suspect even if you are a seasoned C++ programmer, these kinds of bugs can slide past you. And they are haunting to debug. (I’ve worked through a number of them myself in the Parquet C++ codebase.)
A package manager and a deep catalog of useful packages. Every time I’ve spent an hour debugging a broken CMake script, I can’t help but wonder about the productive things I could have been doing instead if the project was written in Rust.
A built-in linter and formatter. In C++, you have to install these separately, so you can’t take for granted that a new contributor has set them up. Whereas Rust’s linter (clippy) only takes a few seconds and is near universally used, clang-tidy, the main C++ linter, is quite slow (sometimes taking tens of minutes) and not universal. In fact, many C++ projects I’ve seen disable most of the lints, whereas in Rust it’s typical for projects to keep all lints enabled and update their code as new lints are released. Cython isn’t officially supported by any major linter/formatter (flake8 used to work by accident).

These apply to the language overall. But what about writing Python packages? Rust has a few excellent tools for this:

A crate called PyO3 for binding to CPython. This is similar to pybind11.
A PEP 517 build backend for Rust projects called maturin. It provides an easy CLI for building and managing a mixed Rust / Python project.
PyArrow integration in the arrow-rs crate. Using Arrow’s C Data Interface, it provides zero-copy conversion between Arrow data in Rust and PyArrow. This allows writing Rust functions that take or return PyArrow types, arrays, record batches (Arrow’s main tabular structure), or even streams of record batches.

Admittedly, Rust can be an intimidating language. But what amazes me about Rust’s community is that they are committed to making it as approachable as possible, without diluting its power. “A language empowering everyone to build reliable and efficient software” is its current tagline. To the extent Rust is hard, it aims to be hard only because the underlying concepts (memory management, low-level concurrency) are themselves hard, not because it is poorly explained or its tools are obtuse. To illustrate this, I’ll show an example of creating a Python package that integrates with Arrow (and through Arrow, Pandas). You’ll see just how easy it is to set up a new package, download and build dependencies, and write Python modules in Rust.

An example Rust Python package

Before we get into it, this won’t be a full tutorial for maturin or PyO3 or even Rust. You should be able to follow along, and I encourage you to. But if you want to learn more in depth, you should look at the list of resources at the end of this post.

For our example, we’ll be writing a Python module that exports one function with the signature:

def bernoulli(size: int, prob: float) -> pyarrow.BooleanArray:
   """Create a randomly sampled BooleanArray.
   
   :param size: the number of elements to sample.
   :param prob: the probability any given element should be true. Must be between
       zero and one.
   """
   ...

To start, I’ll assume you already have cargo, the main Rust CLI, installed. If not, the recommended way to install it is with rustup.

Next, install maturin, PyO3’s build tool.

pip install maturin

To start a new project, use maturin:

maturin new -b pyo3 pyo3-bernoulli
cd pyo3-bernoulli

This creates the basic project structure. Cargo.toml is the Rust project configuration file, while pyproject.toml is the one for Python and maturin. src/lib.rs is our Rust source file.

$ ls
Cargo.toml  pyproject.toml  src
$ ls src
lib.rs

Before we write our code, we’ll need to add a few dependencies. First, we need to generate random booleans from a Bernoulli distribution; we’ll use the rand crate for that. Second, we need to construct Arrow arrays in Rust, so we’ll need the arrow crate. Third, we want those Rust arrays converted into PyArrow arrays, so we’ll enable the pyarrow feature of the arrow crate.

cargo add rand
cargo add arrow --features pyarrow

Now we can implement the function. Rewrite the whole src/lib.rs file with:

use arrow::{
    array::{Array, ArrayData, BooleanArray},
    pyarrow::PyArrowType,
};
use pyo3::{exceptions::PyValueError, prelude::*};
use rand::{distributions::Bernoulli, prelude::Distribution};

/// Generate a random boolean array from a Bernoulli distribution.
#[pyfunction]
fn bernoulli(size: usize, prob: f64) -> PyResult<PyArrowType<ArrayData>> {
    // Configure the distribution, and handle errors from an invalid `prob` parameter.
    let dist = Bernoulli::new(prob)
      .map_err(|err| PyValueError::new_err(err.to_string()))?;

    let array = BooleanArray::from_iter(
        // Iterator samples random values
        dist.sample_iter(&mut rand::thread_rng())
            // Take `size` values from infinite iterator
            .take(size)
            // Wrap them in Some(), since BooleanArray::from_iter expects an Option<bool>
            .map(|val| Some(val)),
    );

    // Wraps the Rust Arrow array for conversion to a PyArrow array
    Ok(PyArrowType(array.into_data()))
}

/// A Python module implemented in Rust.
#[pymodule]
fn pyo3_bernoulli(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(bernoulli, m)?)?;
    Ok(())
}

Since this isn’t a Rust or PyO3 tutorial, I won’t explain every line. But I will point out a few points of interest:

The parameters and return value of the function bernoulli are just Rust types; PyO3 takes care of validating the arguments passed in from Python and converting the return value back into a Python object.
The return type is wrapped in a PyArrowType. This is a marker defined by the arrow crate to indicate ArrayData should be converted into a PyArrow array. PyArrow might not be the only Python Arrow-based array it could convert to in the future; for example, it might make sense to also implement a conversion to a Polars series.
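The array-building expression in bernoulli chains three iterator steps: an infinite lazy source, take, and map. As a rough illustration of that shape using only the standard library, here is a sketch that assumes a hypothetical XorShift64 generator in place of rand::thread_rng() and collects into a Vec<Option<bool>> instead of a BooleanArray. The names XorShift64 and sample_bools are made up for this example, and the generator is not a substitute for the rand crate.

```rust
/// Tiny xorshift64 generator standing in for `rand::thread_rng()`.
/// Hypothetical helper for illustration only; not suitable for real sampling.
struct XorShift64(u64);

impl XorShift64 {
    /// Return the next value as a float in the half-open range [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Keep the top 53 bits so the value is exactly representable as an f64.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draw `size` booleans, each true when a uniform draw falls below `prob`.
/// Unlike the real function, this sketch does not validate `prob`.
fn sample_bools(size: usize, prob: f64, seed: u64) -> Vec<Option<bool>> {
    // The xorshift state must be non-zero, so bump a zero seed to one.
    let mut rng = XorShift64(seed.max(1));
    std::iter::repeat_with(move || rng.next_f64() < prob) // infinite and lazy
        .take(size) // stop after `size` draws
        .map(Some) // every slot is non-null
        .collect()
}
```

Nothing is sampled until collect pulls values through the chain, which is why an infinite source is safe here: take bounds the work to exactly size draws. Calling sample_bools(5, 0.0, 42) yields five Some(false) values and a prob of 1.0 yields five Some(true) values.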

Now we should consider our dependencies in Python. Since we are converting to PyArrow arrays, we need to mark pyarrow as a dependency. We can add this to the [project] table of our pyproject.toml:

dependencies = [
    "pyarrow>=7"
]

With that in place, we can set up the Python environment and build the package:

python -m venv .venv
source .venv/bin/activate
maturin develop

When we ran maturin develop, it downloaded all of our dependencies and built a development version of our module.

Now we can try out the function in a Python interpreter.

>>> from pyo3_bernoulli import bernoulli
>>> bernoulli(10, 0.5)
<pyarrow.lib.BooleanArray object at 0x117060520>
[
  true,
  false,
  true,
  true,
  true,
  true,
  false,
  true,
  false,
  true
]
>>> bernoulli(10, 100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: p is outside [0, 1] in Bernoulli distribution

As expected, our function produces PyArrow arrays and will reject invalid input with a ValueError.
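The ValueError comes from the map_err(...)? pair in bernoulli: the library’s error is converted into the caller-facing error type, and ? returns early if there is one. To show that pattern in isolation, here is a sketch using only the standard library. ProbError, check_prob, and expected_trues are hypothetical names for this example; in the real function the source error comes from Bernoulli::new and the target is PyValueError rather than a String.

```rust
use std::fmt;

/// Hypothetical stand-in for the error type returned by `Bernoulli::new`.
#[derive(Debug, PartialEq)]
struct ProbError;

impl fmt::Display for ProbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message copied from the traceback shown above.
        write!(f, "p is outside [0, 1] in Bernoulli distribution")
    }
}

/// Accept only probabilities in the closed interval [0, 1] (NaN is rejected).
fn check_prob(prob: f64) -> Result<f64, ProbError> {
    if (0.0..=1.0).contains(&prob) {
        Ok(prob)
    } else {
        Err(ProbError)
    }
}

/// Same shape as `bernoulli`: convert the library error into the
/// caller-facing error (a String here), then let `?` return early on failure.
fn expected_trues(size: usize, prob: f64) -> Result<f64, String> {
    let p = check_prob(prob).map_err(|err| err.to_string())?;
    Ok(size as f64 * p)
}
```

With this sketch, expected_trues(10, 0.5) returns Ok(5.0), while expected_trues(10, 100.0) returns an Err carrying the message text, in the same way the Python call above surfaces it as a ValueError.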

In addition to building with maturin, we can also use the tools from cargo. All the usual Rust commands work in our Python package, including:

cargo fmt    # format the code
cargo check  # type check and borrow check
cargo clippy # lint the code

The command cargo clippy is particularly useful for newcomers to the language. For example, if we run this on the code I showed above, it provides the warning:

warning: redundant closure
  --> src/lib.rs:17:18
   |
17 |             .map(|val| Some(val)), // Wrap them in Some(), since BooleanArray::from_iter expects an Option<bool>
   |                  ^^^^^^^^^^^^^^^ help: replace the closure with the function itself: `Some`
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#redundant_closure
   = note: `#[warn(clippy::redundant_closure)]` on by default

This is a great example of Rust’s lints: it tells us exactly where in the line the issue is and even suggests a fix. Conveniently, clippy can also automatically fix these for you with cargo clippy --fix. But if you are learning the language, it’s well worth reading through the lints, as they often give you tips for writing more performant code. For example, they will point out places where you are making redundant copies of data.
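To see why the suggestion is safe to accept, here is a small standard-library sketch comparing the flagged form with the suggested one. The function names are hypothetical, and a plain slice of booleans stands in for the sampling iterator.

```rust
/// The form clippy flags: a closure that only forwards its argument.
fn wrap_with_closure(values: &[bool]) -> Vec<Option<bool>> {
    values.iter().copied().map(|val| Some(val)).collect()
}

/// The suggested fix: `Some` is itself a function from `bool` to
/// `Option<bool>`, so it can be passed to `map` directly.
fn wrap_with_constructor(values: &[bool]) -> Vec<Option<bool>> {
    values.iter().copied().map(Some).collect()
}
```

Both functions return the same vector for any input, so applying the lint in src/lib.rs changes only how the code reads, not what it does.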

Some closing words

I have been helping maintain the Python deltalake package (delta-rs) with maturin and PyO3. As a maintainer, one of my favorite things about Rust is that it comes with such excellent tooling built in. I never have to instruct new contributors on how to install dependencies, configure their build, or set up a linter. I know that when the CI is green on their PRs, my reviews don’t need formatting nits (thanks cargo fmt!), style nits (thanks cargo clippy!), or careful checks for memory safety issues (thanks cargo check!). I can focus on reviewing their business logic and test cases, which are the parts I actually care to read. I am quite happy maintaining Rust projects.

I hope I’ve given a good picture of the power and ease of Rust. There’s a lot of good work going on in the Rust ecosystem. A few projects worth a closer look are arrow, parquet, DataFusion, and Polars.

There are a few caveats I should mention:

If you’re a data scientist looking to accelerate a critical bit of code, you’re probably better off using Numba than getting into Rust.
There are still use cases where C++ or Cython makes more sense. For one, if you already have a substantial code base in those languages that you need to integrate with, it’s better to stay there. (Although there are projects that have had some success converting from C++ to Rust piece-by-piece.) And if you want to expose your native code as a shared library, that’s much easier to do in C++ than in Rust, unless you are exposing it through a C ABI.
I don’t want to sound hopeless about C++. I do and will continue maintaining C++ code. Sometimes I might even enjoy it. (I can’t deny that templates are pretty nifty.) But I hope the C++ community and the Python extension developer community are able to take some inspiration from what Rust and PyO3 have done to make their toolchain such a pleasure to use. If there are projects towards that goal, I’d love to hear about them.

Where to learn more about Rust and PyO3

About Rust:

Talk: “Type-driven API Design in Rust” by Will Crichton. This starts with an API in Python and shows how one might approach it in Rust. It doesn’t assume prior Rust knowledge.
The Rust Book. It’s a great introduction to the language overall.
Rust’s iterators (one of my favorite features!):
- Reference docs. This trait defines all the things you can do with an iterator in Rust.
- Rust iterators training by Rainer Stropek.
Rust traits training, also by Rainer Stropek.

About PyO3:

PyO3’s user guide
Take inspiration from the codebase of other projects using PyO3 and Arrow:
- Polars. Note: Polars uses the arrow2 crate, which for now is a separate implementation from the arrow crate.
- DataFusion Python
- Lance
- delta-rs