Benchmark Archive

Older benchmarks

2025/02/04

Server specification (AWS EC2 m7i.8xlarge):

  • cpu: Intel(R) Xeon(R) Gold 6526Y (32 cores)
  • main memory: 512GB
  • cpufreq-governor: powersave

source code of the benchmark

The following graph compares four data frame libraries (pandas, DuckDB, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB). The time spent on each query was measured both excluding I/O (RUN_IO_TYPE=skip, the upper graph) and including I/O (RUN_IO_TYPE=parquet, the lower graph), and the input data was generated using pyarrow (instead of polars) for better performance for all libraries. Please refer to the procedure in the README to reproduce the evaluation results.
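The difference between the two RUN_IO_TYPE modes can be illustrated with a minimal timing sketch (hypothetical helper names and toy data, not the actual harness, which is in the linked repository): with `skip`, loading happens outside the timed region, while with `parquet` it counts toward the measured time. For brevity this sketch reads CSV from memory instead of Parquet from disk:

```python
import io
import time
import pandas as pd

CSV = "key,value\na,1\nb,2\na,3\n"  # toy stand-in for the TPC-H tables

def load():
    # stand-in for reading the input files (Parquet in the real benchmark)
    return pd.read_csv(io.StringIO(CSV))

def query(df):
    # toy stand-in for one of the 22 TPC-H queries
    return df.groupby("key", as_index=False)["value"].sum()

# RUN_IO_TYPE=skip: load outside the timed region, measure the query only
df = load()
t0 = time.perf_counter()
result = query(df)
t_excluding_io = time.perf_counter() - t0

# RUN_IO_TYPE=parquet: loading is part of the measured time
t0 = time.perf_counter()
result = query(load())
t_including_io = time.perf_counter() - t0
```

For lazily-evaluated libraries the timed region must also force execution of the query, which is why the result is materialized inside it.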

The average speedup over pandas for 22 queries:

              Excluding I/O   Including I/O
  DuckDB      63x             43x
  Polars      39x             32x
  FireDucks   78x             38x

Excluding I/O

polars-tpch-sf10-skip

Including I/O

polars-tpch-sf10-parquet

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.3
  • DuckDB: 1.1.3
  • Polars: 1.21.0
  • FireDucks: 1.2.0

About the benchmark code

This benchmark originally comes from polars/tpch. Because that repository includes all 22 queries for polars but not all of them for pandas, we implemented all 22 queries in pandas and then executed them with FireDucks via its import hook. The queries follow the structure of the polars implementations so that an apples-to-apples performance comparison can be made between the two libraries. Since polars has different APIs from pandas, there are minor differences in the implementation, but we kept the number of time-consuming operations such as join/merge, filter, and groupby as close as possible to the polars implementation.
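As a sketch of what keeping "the same set of heavy operations" means, here is a toy pandas query (hypothetical data and column names, not one of the actual 22 queries) built from the same filter/join/groupby sequence a polars version would use:

```python
import pandas as pd

# Toy stand-ins for two TPC-H-like tables (hypothetical data)
orders = pd.DataFrame({"o_orderkey": [1, 2, 3],
                       "o_custkey": [10, 10, 20],
                       "o_totalprice": [100.0, 250.0, 50.0]})
customer = pd.DataFrame({"c_custkey": [10, 20],
                         "c_name": ["Alice", "Bob"]})

# Same operation sequence a polars implementation would use:
# filter -> join/merge -> groupby + aggregate
result = (
    orders[orders["o_totalprice"] > 60.0]                # filter
    .merge(customer, left_on="o_custkey",
           right_on="c_custkey")                         # join/merge
    .groupby("c_name", as_index=False)["o_totalprice"]
    .sum()                                               # groupby + aggregate
)
```

Keeping the count of such operations matched means both libraries do comparable amounts of heavy work, even though the surface APIs differ.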

2024-12-06

Server specification (AWS EC2 m7i.8xlarge):

  • cpu: Intel(R) Xeon(R) Platinum 8488C (32 cores)
  • main memory: 128GB

source code of the benchmark

The following graph compares four data frame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB). The time spent on each query was measured excluding I/O (RUN_IO_TYPE=skip), and the input data was generated using pyarrow (instead of polars) for a fair comparison. Please refer to the procedure in the README to reproduce the evaluation results.

The average speedup over pandas for 22 queries was 1.0x for Modin, 57x for Polars, and 125x for FireDucks.
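Note that "average speedup" can mean either an arithmetic or a geometric mean, and the two diverge strongly when per-query speedups span orders of magnitude, as they do here; check the linked benchmark source for which one the harness reports. A minimal illustration with made-up numbers:

```python
import math

# Hypothetical per-query speedups over pandas (illustrative, not measured data)
speedups = [2.0, 8.0, 32.0]

# Arithmetic mean is dominated by the largest speedups
arithmetic_mean = sum(speedups) / len(speedups)               # 14.0

# Geometric mean weights each query's ratio equally
geometric_mean = math.prod(speedups) ** (1 / len(speedups))   # 8.0
```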

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.3
  • Modin: 0.32.0
  • Polars: 1.6.0
  • FireDucks: 1.1.2

About the benchmark code

This benchmark originally comes from polars/tpch. Because that repository includes all 22 queries for polars but not all of them for pandas, we implemented all 22 queries in pandas and then executed them with FireDucks via its import hook. The queries follow the structure of the polars implementations so that an apples-to-apples performance comparison can be made between the two libraries. Since polars has different APIs from pandas, there are minor differences in the implementation, but we kept the number of time-consuming operations such as join/merge, filter, and groupby as close as possible to the polars implementation.

2024-09-09

Server specification:

  • cpu: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • main memory: 256GB

source code of the benchmark

The following graph compares four data frame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB). The time spent on each query was measured excluding I/O (RUN_IO_TYPE=skip), and the input data was generated using pyarrow (instead of polars) for a fair comparison. Please refer to the procedure in the README to reproduce the evaluation results.

The average speedup over pandas for 22 queries was 0.89x for Modin, 39x for Polars, and 50x for FireDucks.

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.2
  • Modin: 0.31.0
  • Polars: 1.6.0
  • FireDucks: 1.0.3

The following chart compares Polars and FireDucks on larger datasets, with Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is than Polars. On average, FireDucks is 1.3 times (sf=10), 1.3 times (sf=20), and 1.5 times (sf=50) faster than Polars.

polars-tpch

2024-06-05

Server specs

  • cpu: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • main memory: 256GB

Comparison of DataFrame libraries using TPC-H

source code of the benchmark

The following graph compares four data frame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and the time was measured excluding file I/O.

The average speedup over pandas for 22 queries was 1.2x for Modin, 16x for Polars, and 27x for FireDucks.

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.2
  • Modin: 0.30.0
  • Polars: 0.20.29
  • FireDucks: 0.11.4

The following chart compares Polars and FireDucks on larger datasets, with Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is than Polars. On average, FireDucks is 1.7 times (sf=10), 1.7 times (sf=20), and 1.8 times (sf=50) faster than Polars.

polars-tpch

About the benchmark code

This benchmark originally comes from polars/tpch. Because that repository includes all 22 queries for polars but not all of them for pandas and Modin, we implemented all 22 queries in pandas and then ran them with FireDucks via its import hook. Those pandas queries were also used with pandas and Modin for the queries polars/tpch does not provide. All code for the queries is available at fireducks-dev/polars-tpch.

NOTE: Our pandas/Modin/FireDucks versions of the queries are implemented keeping two simple things in mind:

  1. the outcome of each query should match the expected result.
  2. the code should look clean, making the best possible use of method chaining, since there are several ways of implementing the same thing in pandas.

The queries for polars appear to be implemented under a different rule, as described here. Thus, for the performance comparison with polars, one might argue the evaluation is not apples-to-apples (DuckDB and polars have APIs more similar to SQL). As long as the result is the same and the code is written in the most understandable form (making the best use of each library's APIs), we consider the benchmark acceptable.

2024-02-06

Server Specs

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256GB

Comparison of DataFrame libraries using TPC-H

Source code of the benchmark

The following graph compares four data frame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and the time was measured excluding file I/O.

The average speedup over pandas for 22 queries was 1.3x for Modin, 13x for Polars, and 18x for FireDucks.

polars-tpch-sf10

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.2.0
  • Modin: 0.26.1
  • Polars: 0.20.7
  • FireDucks: 0.9.8

The following chart compares Polars and FireDucks on larger datasets, with Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is than Polars. On average, FireDucks is 1.3 times (sf=10), 1.3 times (sf=20), and 1.7 times (sf=50) faster than Polars.

polars-tpch

About the benchmark code

This benchmark originally comes from polars/tpch. Because that repository includes all 22 queries for polars but not all of them for pandas, we implemented all 22 queries in pandas and then ran them with FireDucks via its import hook. Those queries were also used with pandas and Modin for the queries polars/tpch does not provide. All code for the queries is available at fireducks-dev/polars-tpch.

2024-01-12

Server Specs

  • CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz x 2 sockets (48 HW threads total)
  • Main memory: 256GB

Comparison of DataFrame libraries using TPC-H

The following graph compares four data frame libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; anything greater than 1 is faster than pandas. The Scale Factor, which represents the data size, is 10 (a dataset of about 10 GB), and the time was measured excluding file I/O.

The average speedup over pandas for 22 queries was 1.8x for Modin, 12x for Polars, and 17x for FireDucks.

polars-tpch1

The versions of the libraries used were as follows (the latest versions at the time of the measurements).

  • pandas: 2.1.4
  • Modin: 0.26.0
  • Polars: 0.20.2
  • FireDucks: 0.9.3

For each query, the FireDucks development team implemented the program in pandas and ran it with pandas, Modin, and FireDucks by changing only the import statement. polars/tpch was used for Polars.
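The "changing the import statement" approach means the same query code runs against each library through a single import line. A hedged sketch of the pattern (falling back to plain pandas when the accelerated libraries are not installed; `fireducks.pandas` and `modin.pandas` are the drop-in module names those projects document):

```python
# Pick the backend by swapping one import; the query code below is unchanged.
try:
    import fireducks.pandas as pd  # FireDucks drop-in replacement
except ImportError:
    try:
        import modin.pandas as pd  # Modin drop-in replacement
    except ImportError:
        import pandas as pd        # plain pandas fallback

# The benchmark query code is identical for all three backends.
df = pd.DataFrame({"k": ["x", "x", "y"], "v": [1, 2, 3]})
out = df.groupby("k", as_index=False)["v"].sum()
```

This works because Modin and FireDucks aim for pandas API compatibility, so the query bodies need no per-library changes.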

The following chart compares Polars and FireDucks on larger datasets, with Scale Factors 10, 20, and 50. The vertical axis shows how many times faster FireDucks is than Polars. On average, FireDucks is 1.4 times (sf=10), 1.4 times (sf=20), and 1.6 times (sf=50) faster than Polars.

polars-tpch2