Benchmark Archive

Older benchmarks

2024-06-05

Server Specs

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256GB

Comparison of DataFrame libraries using TPC-H

Source code of the benchmark

The following graph compares four DataFrame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup relative to pandas on a logarithmic scale; any value greater than 1 means the library is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and only the time spent on non-file IO was measured.

The average speedup over pandas for 22 queries was 1.2x for Modin, 16x for Polars, and 27x for FireDucks.
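
The phrase "time spent on non-file IO" means the data is loaded before the clock starts, so file reading is excluded from the measurement. A minimal sketch of that pattern (illustrative only; the actual harness is in the linked benchmark repository, and the tiny in-memory table here stands in for the TPC-H data):

```python
import time

import pandas as pd

# Data is materialized first, so loading is outside the timed region.
df = pd.DataFrame({"key": ["a", "b"] * 50_000, "val": range(100_000)})

start = time.perf_counter()
# Only the query itself is timed.
result = df.groupby("key", as_index=False)["val"].sum()
elapsed = time.perf_counter() - start

print(f"query time: {elapsed:.4f}s")
```

Speedup factors like those in the graph are then the ratio of the pandas time to each library's time for the same query.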

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.2
  • Modin: 0.30.0
  • Polars: 0.20.29
  • FireDucks: 0.11.4

The following chart shows the comparison between Polars and FireDucks with larger datasets, at Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is compared to Polars. On average, FireDucks is 1.7 times (sf=10), 1.7 times (sf=20), and 1.8 times (sf=50) faster than Polars.

polars-tpch

About the benchmark code

This benchmark is originally from polars/tpch. Because that repository includes all 22 queries for Polars but not for pandas and Modin, we implemented all 22 queries in pandas and ran them with FireDucks via its import hook. The same queries were also used for the pandas and Modin measurements. All code for the queries is available at fireducks-dev/polars-tpch.

NOTE: Our pandas/Modin/FireDucks versions of the queries were implemented with two simple rules in mind:

  1. The outcome of each query should match the expected result.
  2. The code should be as clean as possible, making the best use of method chaining, since there are several ways to implement the same thing in pandas.
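
The "chained methods" style described above can be sketched as follows. This is not one of the actual benchmark queries; it is a toy aggregation (loosely echoing the shape of TPC-H Q1) showing how filtering, derived columns, aggregation, and sorting compose into a single chain:

```python
import pandas as pd

# A tiny stand-in for the TPC-H lineitem table.
lineitem = pd.DataFrame({
    "l_returnflag": ["N", "N", "R", "A", "R"],
    "l_quantity": [10, 20, 5, 7, 3],
    "l_extendedprice": [100.0, 200.0, 50.0, 70.0, 30.0],
    "l_discount": [0.1, 0.0, 0.2, 0.05, 0.0],
})

result = (
    lineitem
    # Derive discounted revenue as a new column.
    .assign(revenue=lambda df: df.l_extendedprice * (1 - df.l_discount))
    # Aggregate per return flag with named aggregations.
    .groupby("l_returnflag", as_index=False)
    .agg(sum_qty=("l_quantity", "sum"), sum_revenue=("revenue", "sum"))
    .sort_values("l_returnflag")
    .reset_index(drop=True)
)
print(result)
```

Writing each query this way keeps the pandas, Modin, and FireDucks versions identical and readable, since only the import differs between the three.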

The queries for Polars appear to be implemented under a different rule, as described here. Thus, for the performance comparison with Polars, one might argue the evaluation is not apples-to-apples (since Polars, like DuckDB, has APIs closer to SQL). We consider the benchmark fair as long as each query produces the correct result and is written in the most understandable form, making the best use of each library's APIs.

2024-02-06

Server Specs

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256GB

Comparison of DataFrame libraries using TPC-H

Source code of the benchmark

The following graph compares four DataFrame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup relative to pandas on a logarithmic scale; any value greater than 1 means the library is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and only the time spent on non-file IO was measured.

The average speedup over pandas for 22 queries was 1.3x for Modin, 13x for Polars, and 18x for FireDucks.

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.0
  • Modin: 0.26.1
  • Polars: 0.20.7
  • FireDucks: 0.9.8

The following chart shows the comparison results between Polars and FireDucks with larger datasets, at Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is compared to Polars. On average, FireDucks is 1.3 times (sf=10), 1.3 times (sf=20), and 1.7 times (sf=50) faster than Polars.

polars-tpch

About the benchmark code

This benchmark is originally from polars/tpch. Because that repository includes all 22 queries for Polars but not for pandas, we implemented all 22 queries in pandas and ran them with FireDucks via its import hook. Those queries were also used for pandas and Modin where polars/tpch does not provide them. All code for the queries is available at fireducks-dev/polars-tpch.
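
The drop-in pattern mentioned above can be sketched as follows. Per each project's documentation, Modin and FireDucks both expose a pandas-compatible module, so the same query code targets any of the three by changing only the import line; FireDucks can also run an unmodified pandas script through its import hook (e.g. `python -m fireducks.imhook query.py`). The query below is a toy stand-in, not one of the benchmark queries:

```python
import pandas as pd               # baseline
# import modin.pandas as pd      # Modin drop-in replacement
# import fireducks.pandas as pd  # FireDucks drop-in replacement

def q_minimal(orders: pd.DataFrame) -> pd.DataFrame:
    # Count orders per priority, sorted by priority.
    return (
        orders.groupby("o_orderpriority", as_index=False)
              .agg(order_count=("o_orderkey", "count"))
              .sort_values("o_orderpriority")
              .reset_index(drop=True)
    )

orders = pd.DataFrame({
    "o_orderkey": [1, 2, 3, 4],
    "o_orderpriority": ["1-URGENT", "2-HIGH", "1-URGENT", "2-HIGH"],
})
print(q_minimal(orders))
```

Because `q_minimal` only uses the pandas API, the function body needs no changes when the import is swapped.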

2024-01-12

Server Specs

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256GB

Comparison of DataFrame libraries using TPC-H

The following graph compares four DataFrame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup relative to pandas on a logarithmic scale; any value greater than 1 means the library is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and only the time spent on non-file IO was measured.

The average speedup over pandas for 22 queries was 1.8x for Modin, 12x for Polars, and 17x for FireDucks.

polars-tpch1

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.1.4
  • Modin: 0.26.0
  • Polars: 0.20.2
  • FireDucks: 0.9.3

For each query, the FireDucks development team implemented the program in pandas and ran it with Modin and FireDucks by changing only the import statement. polars/tpch was used for Polars.

The following chart shows the comparison results between Polars and FireDucks with larger datasets, at Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is compared to Polars. On average, FireDucks is 1.4 times (sf=10), 1.4 times (sf=20), and 1.6 times (sf=50) faster than Polars.

polars-tpch2