Cache or Eliminate? How FireDucks Increases Optimization Opportunities
As described here, FireDucks uses a lazy execution model with define-by-run IR generation. Since FireDucks uses the MLIR compiler framework to optimize and execute the IR, the first step of execution is creating an MLIR function that holds the operations to be evaluated. This article describes how important this function creation step is for optimization, and thus for performance.
In the simple example below, execution of the IR is kicked off by the print statement, which calls df2.__repr__().
df0 = pd.read_parquet("data.parquet")
df1 = df0.sort_values("a")
df2 = df1[["b", "c"]]
print(df2)
:
When execution starts, FireDucks collects the operations required to compute df2 and puts them into an MLIR function. The simplest such function would look like the one below.
// MLIR function (simplified)
func main() {
%t0 = read_parquet("data.parquet", [])
%t1 = sort_values(%t0, ["a"], [True])
%t2 = project(%t1, ["b", "c"])
return %t0, %t1, %t2
}
In this function, the return op at the end returns three values: %t0, %t1, and %t2. After execution, the returned values are bound to the Python variables df0, df1, and df2. If those variables are used after the print statement, the already-computed results are reused; one can say that the results of the operations are cached to avoid re-execution.
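To make this caching idea concrete, here is a minimal, hypothetical sketch of define-by-run lazy evaluation with result caching. These classes are not FireDucks APIs; they only illustrate how a node can keep its result so a second evaluation does no work:

```python
# Hypothetical sketch of define-by-run lazy evaluation with result caching.
# None of these classes are FireDucks internals.

class LazyOp:
    def __init__(self, name, fn, *deps):
        self.name = name        # operation name, e.g. "read_parquet"
        self.fn = fn            # callable computing this node from its inputs
        self.deps = deps        # upstream LazyOp nodes
        self.cache = None       # result stored here after first evaluation
        self.evaluated = False

    def evaluate(self):
        # Return the cached result if this node was already executed.
        if not self.evaluated:
            inputs = [d.evaluate() for d in self.deps]
            self.cache = self.fn(*inputs)
            self.evaluated = True
        return self.cache

calls = []  # track how many times each op actually runs

t0 = LazyOp("read", lambda: calls.append("read") or [3, 1, 2])
t1 = LazyOp("sort", lambda xs: calls.append("sort") or sorted(xs), t0)

print(t1.evaluate())  # triggers read, then sort
print(t1.evaluate())  # served from the cache; nothing re-executes
```

After the two calls, the calls list still contains only one "read" and one "sort", which is exactly the re-execution avoidance described above.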
This function is, however, very restrictive for IR optimization. Because all output values of the ops, %t0, %t1, and %t2, are returned, an optimizer has to preserve them. If, on the other hand, the return op returned only %t2, an optimizer could transform the IR as below. In that IR, only the three columns "a", "b", and "c" are read from the file by the read_parquet op, minimizing read time and memory footprint, because its result, %t0, is used only to compute %t2. As this simple example shows, returning values from a function restricts optimization opportunities, so the returned values have to be chosen carefully.
// MLIR function (simplified)
func main() {
%t0 = read_parquet("data.parquet", ["a", "b", "c"]) // only three columns are read from a parquet file.
%t1 = sort_values(%t0, ["a"], [True])
%t2 = project(%t1, ["b", "c"])
return %t2
}
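The column-pruning rewrite above can be mimicked with a toy pass over a list-based IR. This is a hypothetical illustration, not FireDucks's actual optimizer: given which values the function returns, it narrows the columns requested by read_parquet to those needed by the projection and the sort keys:

```python
# Toy IR: a list of (result, op, args) tuples mirroring the MLIR above.
# A hypothetical optimization pass, not FireDucks's optimizer.

ir = [
    ("%t0", "read_parquet", {"file": "data.parquet", "columns": []}),
    ("%t1", "sort_values", {"input": "%t0", "by": ["a"]}),
    ("%t2", "project", {"input": "%t1", "columns": ["b", "c"]}),
]

def prune_columns(ir, returned):
    """If only the final projection escapes the function, restrict
    read_parquet to the columns needed downstream."""
    if returned != ["%t2"]:
        return ir  # %t0 or %t1 escapes: every column may be needed
    needed = set()
    for _, op, args in ir:
        if op == "project":
            needed |= set(args["columns"])
        if op == "sort_values":
            needed |= set(args["by"])
    out = []
    for res, op, args in ir:
        if op == "read_parquet":
            args = {**args, "columns": sorted(needed)}
        out.append((res, op, args))
    return out

optimized = prune_columns(ir, returned=["%t2"])
print(optimized[0])  # read_parquet now requests only ["a", "b", "c"]
```

With returned=["%t0", "%t1", "%t2"], the pass leaves the IR untouched, which is the restriction the three-value return imposes.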
Liveness analysis
To this end, FireDucks performs liveness analysis on Python variables when creating a function. For example, if the print statement is the last line of a Python script, liveness analysis can determine that df0 and df1 are dead at the print statement, because no use of those variables appears later in the script. With this analysis, FireDucks can eliminate %t0 and %t1 from the return op. As another example, if df1 is used after the print statement, liveness analysis determines that it is live, and %t1 is not eliminated from the return op.
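The notion of liveness used here can be sketched with Python's standard ast module. This hypothetical helper (not FireDucks's internal analysis, which would need to handle scopes, rebinding, and control flow) simply reports whether a name is read anywhere after a given line of a script:

```python
import ast

def is_live_after(source, name, lineno):
    """Return True if `name` is read after line `lineno` in `source`.
    A simplified, hypothetical stand-in for real liveness analysis:
    it ignores scopes, loops, and rebinding."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Name)
                and isinstance(node.ctx, ast.Load)
                and node.id == name
                and node.lineno > lineno):
            return True
    return False

script = """df0 = read_parquet("data.parquet")
df1 = df0.sort_values("a")
df2 = df1[["b", "c"]]
print(df2)
print(df1.shape)
"""

print(is_live_after(script, "df0", 4))  # False: df0 is never read after print(df2)
print(is_live_after(script, "df1", 4))  # True: df1.shape is read on line 5
```

In this script df0 is dead at line 4, so its value could be dropped from the return op, while df1 is live and must be preserved.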
In a lazy execution system, re-execution can drastically decrease performance. By using liveness analysis, FireDucks detects whether variables are used after execution, increasing optimization opportunities while still avoiding re-execution. As far as I know, this is why FireDucks is much faster than Polars in this article. You can reproduce it on Colab.
How does it work in a notebook?
You may wonder how liveness analysis works when you are writing code in a notebook or IPython. In such an environment, because any Python variable might be used in a future cell, liveness analysis has to assume that all variables are live. This limits optimization opportunities.
One workaround is to use a chained style:
pd.read_parquet("data.parquet").sort_values("a")[["b", "c"]]
In this style, because the results of the intermediate operations, read_parquet and sort_values, are not bound to any Python variables, there is no chance for them to be used in future cells. FireDucks's liveness analysis can therefore conclude that they are dead when evaluating the result of the last operation.
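The difference between the two styles can be observed directly in CPython with weakref. This is a hypothetical toy class, not FireDucks: in the bound style the intermediate stays reachable through its variable, while in the chained style it becomes unreachable as soon as the next method returns:

```python
import weakref

class Frame:
    """Toy stand-in for a lazy DataFrame (hypothetical, not FireDucks)."""
    def sort_values(self, by):
        return Frame()
    def project(self, cols):
        return Frame()

# Bound style: the intermediate stays reachable via a variable.
f0 = Frame()
f1 = f0.sort_values("a")
w_bound = weakref.ref(f1)

# Chained style: the intermediate becomes unreachable once the
# next method returns, so it cannot be used in a future cell.
w_chain = None
def chained():
    global w_chain
    tmp = Frame().sort_values("a")
    w_chain = weakref.ref(tmp)
    return tmp.project(["b", "c"])

result = chained()
print(w_bound() is not None)  # True: f1 is still alive
print(w_chain() is None)      # True: the intermediate was collected (CPython)
```

This is the same fact the liveness analysis exploits: an unbound intermediate can never be referenced again, so its result need not be preserved.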