How to run polars-tpch benchmark with FireDucks

We have recently updated the results of the polars-tpch benchmark on a 4th generation Xeon processor. The latest results can be found here and also below in this article, along with instructions for reproducing them.

For reproducibility, we used AWS EC2 for this evaluation: an m7i.8xlarge instance running an Ubuntu 24.04 image with a 128GB EBS SSD volume. This instance provides:

  • 4th generation Xeon processor: Intel(R) Xeon(R) Platinum 8488C (32 cores)
  • 128GB memory

Benchmark Result

The graph shown below compares the performance of four dataframe libraries (pandas, polars, modin, and fireducks) as speedups over pandas. Averaged over the 22 queries:

  • fireducks is 125x faster than pandas,
  • polars is 57x faster than pandas, and
  • modin is on par with pandas (1.0x)
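To make the aggregation concrete, here is a minimal sketch of how such an average speedup can be computed from per-query timings. The numbers are made up, and taking the arithmetic mean of per-query speedups is an assumption here, not necessarily the exact aggregation used for the graph:

```python
# Illustrative only: per-query execution times in seconds (made-up numbers,
# not our measured results).
pandas_t = {"q1": 10.0, "q2": 4.0, "q3": 8.0}
fireducks_t = {"q1": 0.08, "q2": 0.05, "q3": 0.06}

# The speedup for each query is the pandas time divided by the library's
# time; the reported figure is the average of these per-query speedups.
speedups = [pandas_t[q] / fireducks_t[q] for q in pandas_t]
avg_speedup = sum(speedups) / len(speedups)
print(f"average speedup over {len(speedups)} queries: {avg_speedup:.1f}x")
```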

Note that our benchmark settings are as follows:

  • Scale factor 10 (about a 10GB dataset): SCALE_FACTOR=10.0
  • Timings exclude IO: RUN_IO_TYPE=skip

polars-tpch benchmark result

How to run the benchmark with FireDucks

After launching the instance and logging into it, install Python and the build tools:

$ sudo apt update
$ sudo apt install python3.10-venv make gcc

Next, clone the benchmark code from our repository, which includes the queries for FireDucks:

$ git clone https://github.com/fireducks-dev/polars-tpch
$ cd polars-tpch

Then run the script, which first creates the dataset with SCALE_FACTOR=10.0 and then runs all 22 queries once with FireDucks using the settings above:

$ ./run-fireducks.sh

The timings for all queries are now available in output/run/timings.csv.
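You can inspect those timings with pandas itself. The sketch below uses an inline stand-in for the CSV, since the actual column names in timings.csv may differ from the ones assumed here; on a real run, replace the stand-in with `pd.read_csv("output/run/timings.csv")`:

```python
import io

import pandas as pd

# Stand-in for output/run/timings.csv (made-up rows; the column names are an
# assumption about the real file's header, not its guaranteed layout).
csv_text = """solution,query_number,duration
fireducks,1,0.12
fireducks,2,0.08
"""

# On a real run: df = pd.read_csv("output/run/timings.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print("total time [s]:", df["duration"].sum())
```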

How to produce timings for the graph

The graph shown above was created using the minimum execution time among three runs for each query.

To do this:

$ SCALE_FACTOR=10.0 make tables                  # create the dataset with polars
$ .venv/bin/pip install -U pandas polars modin   # install the latest libraries
$ ./run-fppm3.sh                                 # run the four libraries three times each
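The aggregation itself is straightforward: group the timings by library and query, then keep the minimum over the runs. Here is a sketch in pandas, with made-up data and assumed column names:

```python
import pandas as pd

# Made-up timings: each (solution, query) pair appears once per run.
df = pd.DataFrame({
    "solution": ["fireducks"] * 3 + ["polars"] * 3,
    "query":    [1, 1, 1, 1, 1, 1],
    "duration": [0.12, 0.10, 0.11, 0.25, 0.22, 0.24],
})

# Keep the fastest of the three runs for each library/query pair.
best = df.groupby(["solution", "query"], as_index=False)["duration"].min()
print(best)
```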

How Queries for FireDucks Are Implemented

When we started using the TPC-H benchmark to evaluate FireDucks, we implemented pandas versions of the queries so that they produce correct results. Later, when we learned about the polars-tpch benchmark, we updated our queries to match the polars versions as closely as possible: the same number of operations, the same order of operations, and so on.

Since the two libraries have different APIs, the implementations are not 100% identical, but to our surprise they look very similar, thanks to the flexible pandas API. Here is query 2 as an example. Can you figure out which one is polars and which one is pandas?
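To give a flavor of that similarity without spoiling the quiz, here is an illustrative toy pipeline (not the actual query 2, and the data is invented) written in pandas, with the equivalent polars chain shown in a comment:

```python
import pandas as pd

# Toy data standing in for a TPC-H table (not the real benchmark data).
df = pd.DataFrame({
    "region": ["EUROPE", "ASIA", "EUROPE", "ASIA"],
    "price":  [10.0, 20.0, 30.0, 40.0],
})

# pandas version: filter -> group -> aggregate -> sort, written as a chain.
result = (
    df[df["region"] == "EUROPE"]
    .groupby("region", as_index=False)
    .agg(total=("price", "sum"))
    .sort_values("total", ascending=False)
)

# The polars version keeps the same number and order of operations:
#   (df.filter(pl.col("region") == "EUROPE")
#      .group_by("region")
#      .agg(pl.col("price").sum().alias("total"))
#      .sort("total", descending=True))
print(result)
```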

q02

In this post, we have described how to reproduce our results for the TPC-H benchmark. Please give it a try and take a look at both implementations of the queries. If you find anything that could be improved for a fairer comparison among the libraries, let us know.