Benchmarks

Evaluation of a few benchmarks using FireDucks

This section presents FireDucks performance results on a few popular benchmarks.

(1) Database-like ops benchmark

We evaluated FireDucks with db-benchmark, which runs fundamental data science operations on datasets of several sizes. As of September 10, 2024, FireDucks appears to be the fastest dataframe library for groupby and join operations on large data.

You can check the complete details of the evaluation result here.

DB-Benchmark

FireDucks shows promise on frequently used database-like operations such as join and groupby, on large data of varying complexity.

The FireDucks version used in the measurements is as follows:

  • fireducks-1.0.4

Server specification (conforms to db-benchmark measurement conditions):

  • CPU model: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
  • CPU cores: 128
  • Storage: NVMe SSD
  • Main memory: 256 GB

(2) TPC-H benchmark

Server specification (AWS EC2 m7i.8xlarge):

  • CPU: Intel(R) Xeon(R) Platinum 8488C (32 cores)
  • Main memory: 128 GB

Source code of the benchmark

The following graph compares four dataframe libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; a value greater than 1 means the library is faster than pandas. The scale factor, which determines the data size, is 10 (a dataset of about 10 GB). The time for each query was measured excluding I/O (RUN_IO_TYPE=skip), and the input data was generated with pyarrow (instead of Polars) for a fair comparison. To reproduce the results, follow the procedure in the README.

The average speedup over pandas across the 22 queries was 1.0x for Modin, 57x for Polars, and 125x for FireDucks.

polars-tpch-sf10
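The speedup values plotted here can be derived from per-query timings in a few lines of Python. The sketch below uses made-up timings for three queries (not measured values) and hypothetical variable names. Because the text does not state which kind of average was used, it computes both the arithmetic and the geometric mean.

```python
import math
import statistics

# Hypothetical per-query times in seconds (illustrative only, not measured).
pandas_times = [10.0, 4.0, 8.0]
other_times = [1.0, 2.0, 0.5]

# Speedup over pandas: a value above 1 means faster than pandas.
speedups = [p / o for p, o in zip(pandas_times, other_times)]

# Position of each value on a base-10 logarithmic axis (0 corresponds to 1x).
log_speedups = [math.log10(s) for s in speedups]

# Two common ways to summarize speedups across queries.
arithmetic_mean = statistics.mean(speedups)
geometric_mean = statistics.geometric_mean(speedups)

print(speedups)
print(arithmetic_mean, geometric_mean)
```

The geometric mean is less dominated by a few very large speedups, so the two averages can differ noticeably on the same data.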

The library versions used were as follows (the latest versions at the time of the measurements):

  • pandas: 2.2.3
  • Modin: 0.32.0
  • Polars: 1.6.0
  • FireDucks: 1.1.2

About the benchmark code

This benchmark originates from polars/tpch. Because that repository includes all 22 queries for Polars but not for pandas, we implemented all 22 queries in pandas and ran them with FireDucks through its import hook. The pandas queries follow the Polars implementations so that the two libraries can be compared apples-to-apples. Since Polars and pandas have different APIs, the implementations differ in minor ways, but we kept the number of time-consuming operations such as join/merge, filter, and groupby close to that of the Polars implementation.
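As a rough illustration of what a pandas-side query looks like, here is a minimal sketch in the style of TPC-H Q6 (a filter followed by an aggregation). It is not the benchmark's actual code: the function name `q6_revenue` and the four-row `lineitem` table are hypothetical, and only the column names and filter constants follow the TPC-H definition of Q6. The script imports plain pandas; the assumption is that it could be run unchanged under the FireDucks import hook, which is understood here to be invoked as `python3 -m fireducks.pandas <script>`.

```python
import pandas as pd


def q6_revenue(lineitem: pd.DataFrame) -> float:
    """Q6-style query: sum(extendedprice * discount) over filtered rows."""
    start = pd.Timestamp("1994-01-01")
    end = pd.Timestamp("1995-01-01")
    mask = (
        (lineitem["l_shipdate"] >= start)
        & (lineitem["l_shipdate"] < end)
        & lineitem["l_discount"].between(0.05, 0.07)  # inclusive bounds
        & (lineitem["l_quantity"] < 24)
    )
    selected = lineitem.loc[mask]
    return float((selected["l_extendedprice"] * selected["l_discount"]).sum())


# Hypothetical four-row table; only the first row passes every filter.
lineitem = pd.DataFrame(
    {
        "l_shipdate": pd.to_datetime(
            ["1994-03-01", "1995-02-01", "1994-07-15", "1994-09-01"]
        ),
        "l_discount": [0.06, 0.06, 0.02, 0.05],
        "l_quantity": [10, 10, 10, 30],
        "l_extendedprice": [100.0, 200.0, 300.0, 400.0],
    }
)

revenue = q6_revenue(lineitem)
print(revenue)
```

Writing the query against the plain pandas API is what allows the same source file to be measured with both pandas and FireDucks.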

(3) TPCx-BB benchmark

Server specification:

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256 GB

This section compares pandas and FireDucks using TPCx-BB, which includes queries for data analysis with machine learning and its preprocessing. For this evaluation, we measured both pandas and FireDucks with a pandas implementation of TPCx-BB written by the FireDucks development team. File I/O is included in the measured time.
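Whether file I/O falls inside the measured range changes what a timing means: the TPC-H runs exclude it, while the TPCx-BB runs include it. The sketch below shows the two ranges using only the Python standard library. The CSV contents, the helper names `load_rows` and `total_price`, and the "query" itself are hypothetical and are not taken from either benchmark.

```python
import csv
import os
import tempfile
import time


def load_rows(path):
    """I/O step: read the CSV file into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def total_price(rows):
    """Stand-in for a query: sum one numeric column."""
    return sum(float(r["price"]) for r in rows)


# Hypothetical input data written to a temporary CSV file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.writer(f)
    writer.writerow(["item", "price"])
    writer.writerows([[i, i * 0.5] for i in range(1000)])
    path = f.name

t0 = time.perf_counter()
rows = load_rows(path)        # file I/O
t1 = time.perf_counter()
result = total_price(rows)    # query only
t2 = time.perf_counter()
os.remove(path)

query_only = t2 - t1          # measured range that excludes I/O
with_io = t2 - t0             # measured range that includes I/O

print(f"query only: {query_only:.6f}s, including I/O: {with_io:.6f}s")
```

Speedups measured under the two ranges are not directly comparable, since a library can be fast at computation but spend a similar amount of time reading files.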

With TPCx-BB, FireDucks is up to 17 times faster than pandas and 6.7 times faster on average.

TPCx-BB

The versions used in the measurements are as follows:

  • pandas-2.1.4
  • fireducks-0.9.3

Benchmark Archive

Older benchmarks