Cache or Eliminate? How FireDucks Increases Optimization Opportunities

As described here, FireDucks uses a lazy execution model with define-by-run IR generation. Since FireDucks uses the MLIR compiler framework to optimize and execute the IR, the first step of execution is creating an MLIR function which holds the operations to be evaluated. This article describes how important this function-creation step is for optimization, and thus for performance.

In the simple example below, execution of the IR is triggered by the print statement, which calls df2.__repr__().

df0 = pd.read_parquet("data.parquet")
df1 = df0.sort_values("a")
df2 = df1[["b", "c"]]
print(df2)
  :

When execution starts, FireDucks collects the operations required to compute df2 and puts them into an MLIR function. The simplest such function would be as below.

// MLIR function (simplified)
func main() {
  %t0 = read_parquet("data.parquet", [])
  %t1 = sort_values(%t0, ["a"], [True])
  %t2 = project(%t1, ["b", "c"])
  return %t0, %t1, %t2
}
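The collection step can be sketched in plain Python: in a define-by-run model, each method call only records a node, and kicking execution walks the graph from the requested value to emit its ops in order. All class and function names below are invented for illustration; they are not FireDucks internals.

```python
# Minimal define-by-run IR sketch (hypothetical names, not FireDucks internals).
class LazyFrame:
    def __init__(self, op, args):
        self.op, self.args = op, args          # each node records its op and inputs

    def sort_values(self, by):
        return LazyFrame("sort_values", [self, by])

    def __getitem__(self, cols):
        return LazyFrame("project", [self, cols])

def read_parquet(path):
    return LazyFrame("read_parquet", [path])

def collect_ops(node, ops=None):
    """Walk the graph from the requested value and emit ops in order."""
    if ops is None:
        ops = []
    for a in node.args:
        if isinstance(a, LazyFrame):
            collect_ops(a, ops)
    ops.append(node.op)
    return ops

df2 = read_parquet("data.parquet").sort_values("a")[["b", "c"]]
print(collect_ops(df2))  # ['read_parquet', 'sort_values', 'project']
```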

In the MLIR function above, the return op at the end returns three values: %t0, %t1, and %t2. After execution, the returned values are bound to the Python variables df0, df1, and df2. If those variables are used after the print statement, the already-computed results are reused. One can say that the results of the operations are cached to avoid re-execution.
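This caching behavior can also be sketched in a few lines of Python (again with invented names, not FireDucks internals): each node stores its result on first evaluation, so a later use of the same variable does not re-run the upstream ops.

```python
# Toy cached lazy node (hypothetical sketch, not FireDucks internals).
class Node:
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs
        self.result = None              # filled in on first evaluation

    def evaluate(self):
        if self.result is None:         # cached: each op runs at most once
            args = [i.evaluate() if isinstance(i, Node) else i
                    for i in self.inputs]
            self.result = self.fn(*args)
        return self.result

calls = []
src = Node(lambda: calls.append("read") or [3, 1, 2])  # stands in for read_parquet
srt = Node(lambda xs: sorted(xs), src)                 # stands in for sort_values

print(srt.evaluate())   # [1, 2, 3]
print(srt.evaluate())   # [1, 2, 3] again, from the cache
print(calls)            # ['read']: the read op ran only once
```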

This function is, however, very restrictive for IR optimization. Because all output values of the ops (%t0, %t1, and %t2) are returned, an optimizer has to preserve all of them. If, on the other hand, the return op returned only %t2, an optimizer could rewrite the IR as below. In the rewritten IR, only the three columns "a", "b", and "c" are read from the file by the read_parquet op, minimizing read time and memory footprint, because its result %t0 is used only to compute %t2. As this simple example shows, returning values from a function restricts optimization opportunities, so the returned values have to be chosen carefully.

// MLIR function (simplified)
func main() {
  %t0 = read_parquet("data.parquet", ["a", "b", "c"]) // only three columns are read from a parquet file.
  %t1 = sort_values(%t0, ["a"], [True])
  %t2 = project(%t1, ["b", "c"])
  return %t2
}
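The rewrite above is an instance of projection pushdown. As a rough illustration of the idea (the op-list format and the optimize function below are invented for this sketch; they are neither FireDucks internals nor actual MLIR), an optimizer can narrow the read only when no intermediate result escapes through the return op:

```python
# Toy projection pushdown over a flat op list (hypothetical format).
def optimize(ops, returned):
    """ops: list of (result, op_name, args); returned: set of results the
    function must return. Narrow read_parquet to the columns actually used,
    but only when no intermediate result escapes through the return op."""
    needed = set()
    for res, name, args in ops:
        if name in ("sort_values", "project"):
            needed.update(args[1])          # sort keys / projected columns
    last = ops[-1][0]                       # result of the final op
    out = []
    for res, name, args in ops:
        if name == "read_parquet" and returned <= {last}:
            out.append((res, name, [args[0], sorted(needed)]))  # prune columns
        else:
            out.append((res, name, args))
    return out

ops = [
    ("%t0", "read_parquet", ["data.parquet", []]),
    ("%t1", "sort_values", ["%t0", ["a"]]),
    ("%t2", "project", ["%t1", ["b", "c"]]),
]
print(optimize(ops, {"%t2"})[0])
# -> ('%t0', 'read_parquet', ['data.parquet', ['a', 'b', 'c']])
```

When %t0 or %t1 is also in the returned set, the read is left untouched, mirroring the restriction described above.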

Liveness analysis

To this end, FireDucks performs liveness analysis on Python variables when creating a function. For example, if the print statement is the last line of a Python script, the liveness analysis can conclude that df0 and df1 are dead at the print statement because no use of those variables appears later in the script. With this analysis, FireDucks can eliminate %t0 and %t1 from the return op. As another example, if df1 is used after the print statement, the liveness analysis reports it as live, and %t1 is not eliminated from the return op.
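Liveness here is the standard backward dataflow notion: a variable is live at a point if some later statement uses it. A toy version over a straight-line script might look like this (the statement representation is invented for illustration):

```python
# Toy backward liveness over a straight-line script (hypothetical sketch).
def live_at(stmts, point):
    """stmts: list of (defined_var, used_vars). A variable is live at
    statement `point` if some statement after it uses the variable."""
    used_later = set()
    for _defined, uses in stmts[point + 1:]:
        used_later.update(uses)
    return used_later

# df0 = read_parquet(...); df1 = df0.sort_values(...); df2 = df1[[...]]; print(df2)
stmts = [
    ("df0", []),
    ("df1", ["df0"]),
    ("df2", ["df1"]),
    (None, ["df2"]),    # print(df2) as the last statement of the script
]
print(live_at(stmts, 3))  # set(): df0 and df1 (and df2) are dead after print
print(live_at(stmts, 2))  # {'df2'}: only df2 is needed by the print
```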

In a lazy execution system, re-execution can drastically decrease performance. By using liveness analysis, FireDucks detects whether variables are used after execution, increasing optimization opportunities while still avoiding re-execution. As far as I know, this is why FireDucks is much faster than Polars in this article. You can reproduce it on Colab.

How does it work in a notebook?

You may wonder how liveness analysis works when you are writing code in a notebook or IPython. In such an environment, because any Python variable might be used in a future cell, the liveness analysis has to assume that all variables are live. This limits optimization opportunities.

One workaround is to use a chained style:

pd.read_parquet("data.parquet").sort_values("a")[["b", "c"]]

In this style, because the results of the intermediate operations read_parquet and sort_values are not bound to any Python variable, there is no chance they will be used in future cells. The liveness analysis can therefore conclude that they are dead when evaluating the result of the last operation.