Tips
Performance know-how from pandas, such as avoiding loops and apply, carries over to FireDucks. Here are some tips for improving performance with FireDucks.
Avoid loops
Extracting data from a DataFrame element by element in a loop is slow, so it is better to use the DataFrame API as much as possible (this is also true for pandas).
For example, the following loop processes the elements of a Series one by one.
s = 0
for i in range(len(df)):
    if df["A"][i] > 2:
        s += df["B"][i]
Using the API, you can write the following.
s = df[df["A"] > 2]["B"].sum()
The same applies to DataFrame.iterrows and other ways of iterating over rows one at a time.
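As a minimal sketch of the same idea for DataFrame.iterrows, the example below computes one total twice: once row by row and once with a single filtered aggregation. The data is made up for illustration, and plain pandas is imported so the snippet runs anywhere; with FireDucks installed, the import would be `import fireducks.pandas as pd` instead.

```python
import pandas as pd  # with FireDucks: import fireducks.pandas as pd

# Illustrative data; the column names follow the example above.
df = pd.DataFrame({"A": [1, 3, 5, 2], "B": [10, 20, 30, 40]})

# Row-by-row version: every iteration materializes one row as a Series.
total_loop = 0
for _, row in df.iterrows():
    if row["A"] > 2:
        total_loop += row["B"]

# Equivalent single expression: filter with a boolean mask, then aggregate.
total_vec = df.loc[df["A"] > 2, "B"].sum()

print(total_loop, total_vec)
```

Both variables hold the same value; the second form keeps the whole computation inside the DataFrame API.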
Do not use apply
Methods that take user-defined functions, such as DataFrame.apply, are not currently supported by FireDucks’ optimizer, which generates an intermediate language and compiles it.
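Many uses of apply can be rewritten as column expressions. The following is a minimal sketch with made-up data and a hypothetical helper named `combine`; it shows only that the two forms produce the same values in pandas, and makes no claim about how FireDucks handles either form internally.

```python
import pandas as pd  # with FireDucks: import fireducks.pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 2], "B": [10, 20, 30, 40]})


def combine(row):
    # Hypothetical user-defined function, called once per row.
    return row["A"] * 2 + row["B"]


# Row-wise apply with a user-defined function.
via_apply = df.apply(combine, axis=1)

# The same arithmetic written directly on whole columns.
via_columns = df["A"] * 2 + df["B"]

print(via_apply.tolist() == via_columns.tolist())
```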
Do not use attribute-style column references
Column references can be written as df["A"] or df.A, but the latter may conflict with existing attributes of the DataFrame, so it is better to use the bracket form df["A"].
In FireDucks, the df.A form requires extra processing to determine whether A is a column name, which may prevent compiler optimizations.
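To illustrate the attribute conflict, here is a small sketch with a column deliberately named "size", which is also the name of a built-in DataFrame attribute. The data is made up for illustration.

```python
import pandas as pd  # with FireDucks: import fireducks.pandas as pd

# A column whose name collides with the DataFrame.size attribute.
df = pd.DataFrame({"size": [1, 2, 3], "A": [4, 5, 6]})

# Attribute access resolves to the built-in attribute (the number of elements),
# not to the column.
attr_value = df.size

# Bracket access always refers to the column.
column = df["size"]

print(attr_value)
print(column.tolist())
```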
Do not use undefined behavior of pandas
In the following case, whether df["A"] returns a reference or a copy is undefined in pandas. If it returns a copy, df is not updated, which in many cases is not the intended behavior.
df["A"][1] = 2
Such undefined behavior may work differently in FireDucks than in pandas, so what works in pandas may not work in FireDucks. It is better not to use undefined behavior.
In this example, it is safe to write the following. However, as mentioned above, element-by-element access is inefficient, so if another implementation is possible, it is better to use it.
df.iloc[1, 0] = 2 # if A is the first column
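If the column position is not known, a label-based assignment with loc is another well-defined option in pandas. The sketch below uses made-up data and assumes the default integer index; the second statement shows how a conditional update over many rows can be written as a single call rather than element by element.

```python
import pandas as pd  # with FireDucks: import fireducks.pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 2], "B": [10, 20, 30, 40]})

# Single-element assignment by row label 1 and column label "A".
df.loc[1, "A"] = 2

# Conditional assignment over all matching rows in one call.
df.loc[df["A"] > 2, "B"] = 0

print(df)
```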
Avoid fallback
FireDucks has a feature called fallback that calls pandas internally. It improves pandas compatibility by using pandas to perform functions that FireDucks does not currently support. However, fallback is costly in execution time and memory usage, since it involves converting the FireDucks data structure to pandas, executing the pandas method, and then converting the result back to the FireDucks data structure.
We are continuously working to reduce fallbacks, but avoiding fallbacks in user programs is also an effective way to improve performance. The environment variable FIREDUCKS_FLAGS="-Wfallback" can be used to log when fallbacks occur.
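One way to set the variable without changing the shell configuration is to launch the program in a child process with a modified environment. This is only a sketch: the helper names `fireducks_env` and `run_script` and the script name `my_analysis.py` are hypothetical, and only the variable name and flag value come from the text above.

```python
import os
import subprocess
import sys


def fireducks_env(flags):
    """Return a copy of the current environment with FIREDUCKS_FLAGS set."""
    env = dict(os.environ)
    env["FIREDUCKS_FLAGS"] = flags
    return env


def run_script(script_path, flags="-Wfallback"):
    """Run a Python script in a child process with the given FireDucks flags."""
    return subprocess.run(
        [sys.executable, script_path],
        env=fireducks_env(flags),
        check=False,
    )


# Example call; "my_analysis.py" is a hypothetical script name.
# run_script("my_analysis.py")
```

Because the environment is set up before the child process starts, the flag is in place however early the library reads it.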
Profiling kernel-wise performance when using a notebook
Since FireDucks 0.10.9, we provide magic methods for profiling the execution time of each DataFrame/Series-related method when running in Jupyter Notebook.
Use %load_ext fireducks.ipyext to load the IPython extension module, and then apply the cell magic %%fireducks.profile to the cell you want to trace.
The following figure shows an example of the newly introduced magic-methods:
📢 From FireDucks 0.12.2, the fireducks.pandas module internally loads fireducks.ipyext. Therefore, if fireducks.pandas is already loaded with %load_ext fireducks.pandas, loading fireducks.ipyext can be skipped.
By default, the profile shows the top 10 methods by execution time.
You can change this number as follows: pd.options.styler.render.max_rows = <desired-number>
⚠ Please do not use any import statement within a cell that is profiled with the cell magic %%fireducks.profile.
Import the required libraries/modules in a different cell instead.
👉 Measurement for expensive fallback
If you find that a method takes a long time because of fallback, you can either look for an alternative that avoids the fallback, or report it to us with a reproducible example so that we can add support for the method.
Also, when executing a Python program, you can set the environment variable FIREDUCKS_FLAGS="--trace=3".
This produces a file named trace.json in the current working directory, containing several traces related to your program's performance.
If you share that JSON file with us (it doesn’t contain any confidential information related to your data),
we will look into the areas for improvement and may suggest corrective measures.
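Before sharing the file, it can be useful to confirm that it was written and is readable. The sketch below assumes only that trace.json is valid JSON; its internal layout is not described here, so the hypothetical helper `summarize_trace` reports nothing beyond the top-level type, entry count, and file size.

```python
import json
from pathlib import Path


def summarize_trace(path="trace.json"):
    """Load a trace file and report its top-level JSON type and size.

    Returns None if the file does not exist.
    """
    trace_path = Path(path)
    if not trace_path.exists():
        return None
    with trace_path.open(encoding="utf-8") as f:
        data = json.load(f)
    # Only lists and dicts have a meaningful entry count at the top level.
    entries = len(data) if isinstance(data, (list, dict)) else None
    return {
        "type": type(data).__name__,
        "entries": entries,
        "bytes": trace_path.stat().st_size,
    }


print(summarize_trace())
```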