Exploring performance benefits of FireDucks over cuDF

Research suggests that data scientists spend about 45% of their time on data preparation tasks, including loading (19%) and cleaning (26%) the data. Pandas is one of the most popular Python libraries for tabular data processing because of its diverse utilities and large community support. However, because of its performance issues with large-scale data processing, there is a strong need in the community for high-performance DataFrame libraries. Although many alternatives are available at this moment, compatibility gaps with pandas mean that some of them either compel a user to learn completely new APIs (incurring migration cost) or require switching to more powerful computational systems such as GPUs (incurring hardware cost).

In this article we will discuss two high-performance pandas alternatives that can help a pandas programmer smoothly migrate an existing application while offering a promising speedup. They are:

  • cuDF: a GPU-accelerated DataFrame library with highly compatible pandas APIs
  • FireDucks: a compiler-accelerated DataFrame library with highly compatible pandas APIs, offering speedups even on CPU-only systems

FireDucks vs. cuDF

Both FireDucks and cuDF offer the following:

  • zero code changes with a promising speedup
  • highly compatible pandas APIs for seamless integration with an existing pandas application
  • an import-hook feature for seamless integration with third-party libraries that use pandas (see the sketch after this list)
  • parallel implementations of the kernel algorithms (join, groupby, etc.) to leverage all the available cores
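
For illustration, here is a minimal sketch of both integration routes. The explicit-import form works for either library; the exact hook commands shown in the comments can vary across versions, so treat them as assumptions to verify against each project's documentation:

# Option 1: change only the import line of an existing script
import fireducks.pandas as pd   # instead of: import pandas as pd
# cuDF equivalent: import cudf.pandas; cudf.pandas.install() -- before importing pandas

df = pd.DataFrame({"key": [1, 2, 2], "val": [10, 20, 30]})
print(df.groupby("key")["val"].sum())

# Option 2: import hook, with no source change at all:
#   python -m fireducks.imhook your_script.py   # FireDucks
#   python -m cudf.pandas your_script.py        # cuDF
# In notebooks: %load_ext fireducks.pandas  or  %load_ext cudf.pandas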

However, the key differences are:

  • FireDucks can speed up an existing pandas application even on CPU-only systems, whereas one needs to prepare a GPU environment before trying cuDF.
  • FireDucks supports a lazy execution model aimed at JIT query optimization, whereas cuDF supports only an eager execution model (similar to pandas). Therefore, if a program is not written carefully with the right data flow, cuDF may suffer performance issues, while FireDucks can outperform cuDF even on CPU-only systems thanks to its efficient query optimization (see the sketch below).
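
Here is a minimal sketch of why the execution model matters; the toy DataFrames are ours for illustration, but the point carries over to real workloads:

import pandas as pd  # swap in fireducks.pandas (or enable cudf.pandas) as shown earlier

a = pd.DataFrame({"key": [1, 2, 3], "flag": [0, 1, 1]})
b = pd.DataFrame({"key": [1, 2, 3], "val": [10, 20, 30]})

# Written "carelessly": merge the full tables first, filter afterwards.
out = a.merge(b, on="key")
out = out[out["flag"] == 1]
print(out)

# Eager engines (pandas, cuDF) execute the full merge before the filter runs.
# FireDucks only records both statements; when print() forces evaluation, its
# JIT compiler can push the filter below the merge, shrinking the join input.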

Evaluation

Multi-threaded Benefit

Here is an article explaining the key features of cuDF along with its performance. We used the notebook provided in that article to evaluate pandas, fireducks.pandas, and cudf.pandas.

Here are some details related to the evaluation environment:

  • CPU model: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
  • CPU cores: 48
  • Main memory: 256 GB
  • GPU model: NVIDIA Tesla V100

It can be noted that, by simply enabling the extension %load_ext fireducks.pandas or %load_ext cudf.pandas, one can successfully speed up the operations in an existing pandas notebook using FireDucks or cuDF. For this experiment, we disabled FireDucks’ lazy-execution mode as follows for a fair comparison among the three libraries:

from fireducks.core import get_fireducks_options

# benchmark mode executes each operation immediately (lazy execution disabled)
get_fireducks_options().set_benchmark_mode(True)

The table below summarizes the per-query execution time for these libraries:

query          pandas (sec)   FireDucks (sec)   cuDF (sec)   FireDucks speedup over pandas   cuDF speedup over pandas
data_loading   1.85           0.53              0.42         3.49x                           4.4x
query_1        2.4            0.08              0.35         30.0x                           6.86x
query_2        0.75           0.03              0.01         25.0x                           75.0x
query_3        6.38           0.15              0.08         42.53x                          79.75x

Due to the difference in the underlying hardware, cuDF operations (which ran on the GPU) understandably performed much better than pandas, but the performance gain of FireDucks over pandas even on the CPU is quite promising. In fact, the overall speedup is ~13x (11.37s -> 0.87s) when using cuDF, whereas it is ~14x (11.37s -> 0.79s) when using FireDucks for the same pandas program.

JIT Optimization Benefit

The above case shows how efficiently FireDucks can leverage the available CPU cores to speed up an existing pandas program.

Let’s now understand how FireDucks’ JIT query optimization can make it even better!

We used a sample query from the TPC-H benchmark (Q3) that joins three tables of different sizes, generated at scale factor 10.

👉 Purpose: To retrieve the 10 unshipped orders with the highest value.

Here is the pandas implementation for this query:

import os
from datetime import datetime

import pandas as pd

# datapath points to the directory holding the TPC-H parquet files
(
    pd.read_parquet(os.path.join(datapath, "customer.parquet"))
      .merge(pd.read_parquet(os.path.join(datapath, "orders.parquet")),
             left_on="c_custkey", right_on="o_custkey")
      .merge(pd.read_parquet(os.path.join(datapath, "lineitem.parquet")),
             left_on="o_orderkey", right_on="l_orderkey")
      .pipe(lambda df: df[df["c_mktsegment"] == "BUILDING"])
      .pipe(lambda df: df[df["o_orderdate"] < datetime(1995, 3, 15)])
      .pipe(lambda df: df[df["l_shipdate"] > datetime(1995, 3, 15)])
      .assign(revenue=lambda df: df["l_extendedprice"] * (1 - df["l_discount"]))
      .groupby(["l_orderkey", "o_orderdate", "o_shippriority"], as_index=False)
      .agg({"revenue": "sum"})[["l_orderkey", "revenue", "o_orderdate", "o_shippriority"]]
      .sort_values(["revenue", "o_orderdate"], ascending=[False, True])
      .reset_index(drop=True)
      .head(10)
      .to_parquet(os.path.join(datapath, "q3_result.parquet"))
)

This time we used the default lazy-execution mode in FireDucks to demonstrate its true strength. The execution time of this query for each DataFrame library is as follows:

  • native pandas: 215.47 sec
  • fireducks.pandas: 1.69 sec
  • cudf.pandas: 26.79 sec
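
If you want to take such measurements yourself, a minimal sketch follows; run_q3 is a hypothetical wrapper around the pipeline shown above:

import time

start = time.time()
run_q3()  # hypothetical wrapper around the pipeline above, ending in to_parquet
print(f"elapsed: {time.time() - start:.2f} sec")

# Since FireDucks is lazy, the timed region must include a terminal method
# (here, the to_parquet inside run_q3); otherwise only plan construction is
# measured, not the actual execution.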

🚀🚀 FireDucks outperformed pandas by up to 127x (215.47s -> 1.69s) and cuDF by up to 15x (26.79s -> 1.69s) for the above query.

🤔 You might be wondering how a CPU-based implementation in FireDucks can be faster than a GPU-based implementation in cuDF!

This speedup comes from the efficient query planning and optimization performed by FireDucks’ internal JIT compiler. Instead of executing the input query as written, it attempts to optimize it by reducing the scope of the input data for the time-consuming join, groupby, etc. operations, mainly through the following steps:

  • loading only the required columns from the input parquet files to reduce the data horizontally
  • performing early filtering to reduce the data vertically

📓 In FireDucks’ lazy-execution mode, calling a method like to_parquet, plot, or print signals the compiler to start optimizing the accumulated data flow. Once the optimization phase is complete, the plan is executed by a multi-threaded CPU kernel backed by Arrow memory, giving you superfast data processing along with a remarkable reduction in computational memory.
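
The sketch below illustrates this trigger behavior; the parquet path is a placeholder, and the column names follow the TPC-H lineitem schema used above:

import fireducks.pandas as pd

df = pd.read_parquet("lineitem.parquet")                 # recorded, not executed
df = df[df["l_shipdate"] > "1995-03-15"]                 # recorded, not executed
agg = df.groupby("l_orderkey")["l_extendedprice"].sum()  # still only a plan

# A terminal method such as print (or to_parquet, plot) hands the accumulated
# data flow to the JIT compiler, which optimizes it and then executes it on
# the multi-threaded, Arrow-backed CPU kernel.
print(agg.head())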

The optimized implementation for the same query could be as follows:

req_customer_cols = ["c_custkey", "c_mktsegment"]  # selecting (2/8) columns
req_lineitem_cols = ["l_orderkey", "l_shipdate", "l_extendedprice", "l_discount"]  # selecting (4/16) columns
req_orders_cols = ["o_custkey", "o_orderkey", "o_orderdate", "o_shippriority"]  # selecting (4/9) columns

customer = pd.read_parquet(os.path.join(datapath, "customer.parquet"), columns=req_customer_cols)
lineitem = pd.read_parquet(os.path.join(datapath, "lineitem.parquet"), columns=req_lineitem_cols)
orders = pd.read_parquet(os.path.join(datapath, "orders.parquet"), columns=req_orders_cols)

# early filter: reduces the scope of the "customer" table to be processed
f_cust = customer[customer["c_mktsegment"] == "BUILDING"]

# early filter: reduces the scope of the "orders" table to be processed
f_ord = orders[orders["o_orderdate"] < datetime(1995, 3, 15)]

# early filter: reduces the scope of the "lineitem" table to be processed
f_litem = lineitem[lineitem["l_shipdate"] > datetime(1995, 3, 15)]

(
    f_cust.merge(f_ord, left_on="c_custkey", right_on="o_custkey")
          .merge(f_litem, left_on="o_orderkey", right_on="l_orderkey")
          .assign(revenue=lambda df: df["l_extendedprice"] * (1 - df["l_discount"]))
          .groupby(["l_orderkey", "o_orderdate", "o_shippriority"], as_index=False)
          .agg({"revenue": "sum"})[["l_orderkey", "revenue", "o_orderdate", "o_shippriority"]]
          .sort_values(["revenue", "o_orderdate"], ascending=[False, True])
          .reset_index(drop=True)
          .head(10)
          .to_parquet(os.path.join(datapath, "opt_q3_result.parquet"))
)

The execution time of this optimized implementation for each DataFrame library is as follows:

  • native pandas: 11.13 sec
  • fireducks.pandas: 1.72 sec
  • cudf.pandas: 0.76 sec

It can be noted that:

  • native pandas itself could be optimized by up to ~19x (215.47 sec -> 11.13 sec)
  • there is no visible change in the execution time of FireDucks (since its compiler performs the same optimization automatically in the earlier case)
  • cudf.pandas could be optimized by up to ~35x (26.79 sec -> 0.76 sec)

Most importantly, the optimization performed has no impact on the final result. You can reproduce the same using this notebook at your end.

Wrapping up

Thank you for taking the time to read this article. We have discussed the performance benefits of FireDucks over cuDF. While cuDF shows a significant speedup without modifying an existing pandas program, its performance relies on the underlying GPU specification and on how well the program is written. FireDucks, on the other hand, can optimize an existing pandas program efficiently like an expert programmer and execute it without any extra overhead, even on CPU-only systems.

That said, a GPU version of FireDucks is under development. It internally uses cudf.pandas for the kernel operations (groupby, join, etc.) while adding the JIT optimization explained in this article for further acceleration. For example, even if you write the query as in the first implementation, it will be auto-optimized by the FireDucks compiler into something similar to the optimized implementation and then passed to the cuDF kernel for execution on the GPU side (letting you experience the query finishing in ~0.76 sec). We will talk about the GPU version of FireDucks in detail in a future article.

We look forward to your feedback to make FireDucks even better. Please feel free to get in touch with us through any of the channels mentioned below: