Unveiling the Optimization Benefit of FireDucks Lazy Execution: Part #2

In the previous article, we talked about how FireDucks takes care of projection-pushdown optimizations for read_parquet(), read_csv(), etc. In today’s article, we will focus on the efficient caching mechanism of its JIT compiler.

Let’s consider the sample query below, operating on the same data used in the previous article:

import pandas as pd

df = pd.read_parquet("sample_data.parquet")
f_df = df.loc[df["a"] > 3, ["x", "y", "z"]]  # filter rows where "a" > 3; select "x", "y", "z"
r1 = f_df.groupby("x")["z"].sum()            # sum of "z" per group of "x"
print(r1)
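
If you want to follow along, you will need a sample_data.parquet with columns “a”, “x”, “y”, and “z”. Here is a minimal way to generate one (a sketch; the actual data used in the previous article may differ):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the sample data used in this series.
sample = pd.DataFrame({
    "a": np.random.randint(0, 10, 1000),  # filter key
    "x": np.random.randint(0, 5, 1000),   # groupby key for r1
    "y": np.random.randint(0, 5, 1000),   # groupby key for r2
    "z": np.random.rand(1000),            # aggregation target
})
sample.to_parquet("sample_data.parquet")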

When executing the above program (saved as sample.py) as follows:

$ FIRE_LOG_LEVEL=3 python -mfireducks.pandas sample.py

You can find the generated IR before and after optimization:

2024-12-05 12:37:21.012481: 958259 fireducks/lib/fireducks_core.cc:64] Input IR:
func @main() {
  %t0 = read_parquet('sample_data.parquet', [])
  %t1 = project(%t0, ['x', 'y', 'z'])
  %t2 = project(%t0, 'a')
  %t3 = gt.vector.scalar(%t2, 3)
  %t4 = filter(%t1, %t3)
  %t5 = groupby_select_agg(%t4, ['x'], ['sum'], [], [], 'z')
  %v6 = get_shape(%t5)
  return(%t5, %v6)
}

2024-12-05 12:37:21.013462: 958259 fireducks/lib/fireducks_core.cc:73] Optimized IR:
func @main() {
  %t0 = read_parquet('sample_data.parquet', ['x', 'a', 'z'])
  %t1 = project(%t0, ['z', 'x'])
  %t2 = project(%t0, 'a')
  %t3 = gt.vector.scalar(%t2, 3)
  %t4 = filter(%t1, %t3)
  %t5 = groupby_select_agg(%t4, ['x'], ['sum'], [], [], 'z')
  %v6 = get_shape(%t5)
  return(%t5, %v6)
}

It can be noted that the compiler correctly identified the projection targets for read_parquet() as the “x”, “a”, and “z” columns. Although the “y” column is requested in the loc indexer, it is never used anywhere else in the program. Hence, it is not even loaded during the read_parquet stage.
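
For reference, the optimized read is roughly equivalent to the following plain-pandas call, which FireDucks derives automatically (a sketch of what the optimizer performs internally):

# Load only the columns the program actually uses; "y" is skipped.
df = pd.read_parquet("sample_data.parquet", columns=["x", "a", "z"])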

Could lazy execution be expensive?

Now, the question is: what will happen if we perform another groupby-aggregation on the same filtered dataframe, one that requires the “y” column, as follows?

df = pd.read_parquet("sample_data.parquet")
f_df = df.loc[df["a"] > 3, ["x", "y", "z"]]
r1 = f_df.groupby("x")["z"].sum()
print(r1)

r2 = f_df.groupby("y")["z"].sum() # newly added groupby-sum
print(r2)

Since FireDucks performs lazy execution,

  1. Will it process two expensive calls, as follows?

r1 = (
    pd.read_parquet("sample_data.parquet", columns=["x", "z", "a"])
      .loc[lambda d: d["a"] > 3, ["x", "z"]]
      .groupby("x")["z"].sum()
)

r2 = (
    pd.read_parquet("sample_data.parquet", columns=["y", "z", "a"])
      .loc[lambda d: d["a"] > 3, ["y", "z"]]
      .groupby("y")["z"].sum()
)

  2. Or, will it keep the intermediate filtered result (f_df) alive when processing r1, since it will be used later in the program when processing r2?

👉 The answer is (2). It effectively keeps alive any intermediate results that will be required at a later stage.
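
As an aside: because execution is lazy, none of this work actually happens until a result is needed (the print() calls in our program). When benchmarking, you can force the pending operations to run with the _evaluate() method mentioned in the FireDucks documentation (treat the exact usage below as a sketch):

# Force the pending lazy operations (read + filter) to execute now,
# rather than when print(r1) first needs the result.
f_df = df.loc[df["a"] > 3, ["x", "y", "z"]]._evaluate()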

Let’s look at the generated IR before and after optimization for the modified program:

2024-12-05 13:26:41.691496: 959435 fireducks/lib/fireducks_core.cc:64] Input IR:
func @main() {
  %t0 = read_parquet('sample_data.parquet', [])
  %t1 = project(%t0, ['x', 'y', 'z'])
  %t2 = project(%t0, 'a')
  %t3 = gt.vector.scalar(%t2, 3)
  %t4 = filter(%t1, %t3)
  %t5 = groupby_select_agg(%t4, ['x'], ['sum'], [], [], 'z')
  %v6 = get_shape(%t5)
  return(%t5, %t4, %v6)
}

2024-12-05 13:26:41.692423: 959435 fireducks/lib/fireducks_core.cc:73] Optimized IR:
func @main() {
  %t0 = read_parquet('sample_data.parquet', ['z', 'x', 'a', 'y'])    <- this time it also loads "y" column (as needed for r2)
  %t1 = project(%t0, ['x', 'y', 'z'])
  %t2 = project(%t0, 'a')
  %t3 = gt.vector.scalar(%t2, 3)
  %t4 = filter(%t1, %t3)
  %t5 = groupby_select_agg(%t4, ['x'], ['sum'], [], [], 'z')
  %v6 = get_shape(%t5)
  return(%t5, %t4, %v6)                                              <- this time it also returns filtered dataframe (%t4)
}

2024-12-05 13:26:41.706225: 959435 fireducks/lib/fireducks_core.cc:64] Input IR:
func @main(%arg0: !table) {
  %t1 = groupby_select_agg(%arg0, ['y'], ['sum'], [], [], 'z')
  %v2 = get_shape(%t1)
  return(%t1, %v2)
}

2024-12-05 13:26:41.706721: 959435 fireducks/lib/fireducks_core.cc:73] Optimized IR:
func @main(%arg0: !table) {
  %t1 = groupby_select_agg(%arg0, ['y'], ['sum'], [], [], 'z')
  %v2 = get_shape(%t1)
  return(%t1, %v2)
}

The first “Optimized IR” is generated when processing r1. This time the compiler identifies that the “y” column and the filtered dataframe (f_df) will be used at a later stage when computing r2. Hence, it also loads the “y” column and keeps the intermediate filtered dataframe alive (in other words, caches it) by returning it (%t4) along with the result of r1 (%t5), so that no recomputation is needed later.

👉 Notice that the previous IR returned only (%t5, %v6), since there was no computation related to r2 in the input program.

The second “Optimized IR” is generated when processing r2. Its input %arg0 is the filtered dataframe (%t4) that the compiler kept alive. Hence, only the groupby-sum is performed when processing r2.
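
Putting the two executions together, the optimizer effectively behaves like the following plain-pandas program (a conceptual sketch; FireDucks derives this automatically):

df = pd.read_parquet("sample_data.parquet", columns=["z", "x", "a", "y"])
f_df = df.loc[df["a"] > 3, ["x", "y", "z"]]  # computed once and kept alive (%t4)
r1 = f_df.groupby("x")["z"].sum()            # first execution (%t5)
r2 = f_df.groupby("y")["z"].sum()            # second execution reuses the cached f_df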

How to profile?

You can also check kernel-wise execution times, the number of calls, etc. by executing the program as follows:

$ FIREDUCKS_FLAGS="--trace=3 --trace-file=-" python -mfireducks.pandas sample.py

It will produce profiling output like the following, where each row shows the kernel name, its execution time, its share of the total time, and the number of calls:

== kernel ==
fireducks.gt.vector.scalar                           0.004    8.26%          1
fireducks.read_parquet                               0.003    6.02%          1
fireducks.groupby_select_agg                         0.002    3.06%          2
fireducks.to_pandas.frame.metadata                   0.001    1.89%          2
fireducks.filter                                     0.001    1.34%          1
fireducks.project                                    0.000    0.03%          2

It can clearly be seen that the kernels related to read_parquet, filter, etc. are called only once. To produce similar profiling output in a Jupyter notebook, you can use the cell magic %%fireducks.profile.
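
For example, a notebook cell using the magic might look like this (a sketch; the profiled body is the groupby from our example):

%%fireducks.profile
r1 = f_df.groupby("x")["z"].sum()
print(r1)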

Wrapping up

Thank you for taking the time to read this article. If you have any queries or an issue to report, please feel free to get in touch with us through any of your preferred channels mentioned below: