Benchmark Archive
Older benchmarks
This section presents FireDucks performance results on several popular benchmarks.
We evaluated FireDucks on db-benchmark, which includes scenarios that execute fundamental data science operations on datasets of several sizes. As of September 10, 2024, FireDucks appears to be the fastest dataframe library for groupby and join operations on big data.
You can check the complete details of the evaluation result here.
FireDucks shows promise on frequently used database-like operations, such as join and groupby, on large data of varying complexity.
The FireDucks version used in the measurements is as follows:
Server specification (conforms to db-benchmark measurement conditions):
Server specification (AWS EC2 m7i.8xlarge):
The following graph compares four dataframe libraries (pandas, Modin, Polars, and FireDucks) on the 22 queries included in the benchmark. The vertical axis shows the speedup over pandas on a logarithmic scale; any value greater than 1 means the library is faster than pandas. The scale factor, which represents the data size, is 10 (a dataset of about 10 GB). The time spent on each query was measured excluding I/O (RUN_IO_TYPE=skip), and the input data was generated using pyarrow (instead of polars) for a fair comparison. To reproduce the results, please follow the procedure described in the README.
The average speedup over pandas for 22 queries was 1.0x for Modin, 57x for Polars, and 125x for FireDucks.
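As a rough illustration of how a per-query speedup and its average can be computed, the sketch below divides the pandas time by the library time for each query. The timings are hypothetical placeholders, not measured results, and the function names are ours. The text does not state which kind of average was used, so both the arithmetic and the geometric mean are shown as possibilities.

```python
import math

# Hypothetical per-query wall-clock times in seconds (NOT measured results).
pandas_times = {"q1": 8.0, "q2": 2.0, "q3": 6.0}
library_times = {"q1": 0.5, "q2": 1.0, "q3": 0.125}


def speedups(baseline, candidate):
    """Return baseline_time / candidate_time for every query in baseline."""
    return {q: baseline[q] / candidate[q] for q in baseline}


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speedups(pandas_times, library_times)
for query, ratio in ratios.items():
    # A ratio above 1 (log10 above 0) means the library is faster than pandas.
    print(f"{query}: {ratio:.1f}x (log10 = {math.log10(ratio):+.2f})")

print(f"arithmetic mean: {arithmetic_mean(ratios.values()):.2f}x")
print(f"geometric mean:  {geometric_mean(ratios.values()):.2f}x")
```

The arithmetic mean is dominated by the queries with the largest speedups, while the geometric mean is less sensitive to such outliers, so the two can differ considerably on the same data.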
The versions of the libraries used were as follows (the latest versions at the time of the measurements).
This benchmark originates from polars/tpch. Because that repository includes all 22 queries for polars but not for pandas, we implemented all 22 queries in pandas and then executed them with FireDucks through its import hook. The pandas queries follow the polars implementations so that an apples-to-apples performance comparison can be made between the two libraries. Since polars and pandas have different APIs, there are minor differences in the implementations, but we kept the number of time-consuming operations such as join/merge, filter, and groupby close to that of the polars implementations.
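To give a feel for the kind of pandas code involved, here is a minimal sketch of a filter, join, and groupby pipeline. It is not one of the 22 benchmark queries: the tables, column names, and the function `revenue_by_segment` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Tiny hypothetical tables (NOT the real TPC-H schema or data).
customers = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["A", "B", "A"]})
orders = pd.DataFrame(
    {
        "order_id": [10, 11, 12, 13],
        "cust_id": [1, 1, 2, 3],
        "price": [100.0, 50.0, 70.0, 30.0],
    }
)


def revenue_by_segment(customers, orders, min_price):
    """Filter orders, join them to customers, and sum the price per segment."""
    filtered = orders[orders["price"] >= min_price]   # filter
    joined = filtered.merge(customers, on="cust_id")  # join/merge
    return (
        joined.groupby("segment", as_index=False)["price"]  # groupby
        .sum()
        .sort_values("segment")
        .reset_index(drop=True)
    )


result = revenue_by_segment(customers, orders, min_price=40.0)
print(result)
```

Because the script uses only the pandas API, the same file can in principle be run unchanged under FireDucks. To our understanding of the FireDucks documentation, the import hook is invoked as `python3 -m fireducks.pandas script.py`, where `script.py` is a placeholder file name.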
Server specification:
This section compares pandas and FireDucks using TPCx-BB, which includes queries for machine-learning-based data analysis and its preprocessing. For this evaluation, we ran the pandas implementation of TPCx-BB written by the FireDucks development team on both pandas and FireDucks. File I/O is included in the measured time.
With TPCx-BB, FireDucks is up to 17 times faster than pandas and 6.7 times faster on average.
The versions used in the measurements are as follows: