How to run polars-tpch benchmark with FireDucks
We recently updated our polars-tpch benchmark results on a 4th generation Xeon processor. The latest results can be found here and are also shown below in this article, along with an explanation of how to reproduce them.
For reproducibility, we ran this evaluation on AWS EC2, using an m7i.8xlarge instance with an Ubuntu 24.04 image and a 128GB EBS SSD.
This instance includes:
- 4th generation Xeon processor: Intel(R) Xeon(R) Platinum 8488C (32 cores)
- 128GB memory
Benchmark Result
The graph below compares the performance of four dataframe libraries (pandas, polars, modin, and fireducks), expressed as speedup over pandas. Averaged over the 22 queries:
- fireducks is 125x faster than pandas, whereas
- polars is 57x faster than pandas,
- modin is 1.0x as fast as pandas (that is, no speedup)
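The post does not state how the per-query speedups are combined into a single average, so the following is only a minimal sketch of two common choices, the arithmetic and the geometric mean. The helper names and the timing values are hypothetical and serve only to show the arithmetic; they are not measurements from this benchmark.

```python
from math import prod


def speedups(baseline, candidate):
    """Per-query speedup of candidate over baseline (baseline time / candidate time)."""
    return {query: baseline[query] / candidate[query] for query in baseline}


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    values = list(values)
    return prod(values) ** (1.0 / len(values))


# Hypothetical timings in seconds for three queries (NOT real results).
pandas_times = {1: 8.0, 2: 4.0, 3: 9.0}
other_times = {1: 2.0, 2: 2.0, 3: 1.0}

per_query = speedups(pandas_times, other_times)
print(per_query)
print("arithmetic mean:", arithmetic_mean(per_query.values()))
print("geometric mean:", geometric_mean(per_query.values()))
```

The two means can differ noticeably when a few queries have very large speedups, so it is worth checking which one a published figure uses before comparing numbers.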
Note that our benchmark settings are as follows:
- Scale factor is 10 (about 10GB dataset): SCALE_FACTOR=10.0
- Timings exclude IO: RUN_IO_TYPE=skip
How to run the benchmark with FireDucks
After launching the instance and logging into it, install Python and the build tools:
$ sudo apt update
$ sudo apt install python3.10-venv make gcc
Then clone the benchmark code from our repository, which includes the queries for FireDucks:
$ git clone https://github.com/fireducks-dev/polars-tpch
$ cd polars-tpch
Then run the script, which first creates the dataset with SCALE_FACTOR=10.0 and then runs all 22 queries once with FireDucks using the settings above:
$ ./run-fireducks.sh
You can now find the timings for all queries in output/run/timings.csv.
How to produce timings for the graph
The graph shown above uses the minimum execution time among three runs for each query.
To reproduce it:
$ SCALE_FACTOR=10.0 make tables # create dataset by polars
$ .venv/bin/pip install -U pandas polars modin # install latest libraries
$ ./run-fppm3.sh # run four libraries three times
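Taking the minimum over the three runs can be sketched as below. This is an illustration, not the script used for the graph: the column names (`solution`, `query_number`, `duration[s]`) are an assumption about the layout of the timings file, and the sample rows are hypothetical values rather than measured results.

```python
import csv
import io
from collections import defaultdict

# Hypothetical timings content; column names are assumed, values are made up.
SAMPLE = """solution,query_number,duration[s]
fireducks,1,0.30
fireducks,1,0.25
fireducks,1,0.28
pandas,1,9.10
pandas,1,8.90
pandas,1,9.00
"""


def min_timings(csv_file):
    """Return {(solution, query_number): fastest duration in seconds}."""
    best = defaultdict(lambda: float("inf"))
    for row in csv.DictReader(csv_file):
        key = (row["solution"], int(row["query_number"]))
        best[key] = min(best[key], float(row["duration[s]"]))
    return dict(best)


best = min_timings(io.StringIO(SAMPLE))
for (solution, query), seconds in sorted(best.items()):
    print(f"{solution} q{query}: {seconds:.2f}s")
```

To apply this to real output, pass an opened output/run/timings.csv to `min_timings` instead of the in-memory sample, after adjusting the column names to match the file's actual header.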
How Queries for FireDucks Are Implemented
When we started using the TPC-H benchmark to evaluate FireDucks, we implemented pandas versions of the queries that produce the correct results. Later, when we learned about the polars-tpch benchmark, we updated our queries to match the polars versions as closely as possible, for example by using the same number of operations in the same order.
Since the two libraries have different APIs, the implementations are not 100% identical, but to my surprise they look very similar, thanks to the flexible pandas API. Here is query 2 as an example. Can you figure out which one is polars and which one is pandas?
In this post, we described how to reproduce our results for the TPC-H benchmark. Give it a try and take a look at both versions of the queries. If you find something that could be improved to make the comparison among the libraries fairer, let us know.