Execution Model
The execution model of FireDucks differs from that of pandas. pandas uses an eager execution model, in which processing is performed immediately when a method is invoked, while FireDucks uses a lazy execution model, in which processing is deferred and executed in batches when the results are needed.
Lazy execution model
The following figure shows how pandas and FireDucks execute the same code.
In pandas, for example, calling the read_csv method immediately reads data from a CSV file. FireDucks, on the other hand, only generates intermediate language equivalent to read_csv and does not read the data. Therefore, the line df = pd.read_csv("data.csv") returns immediately in FireDucks.
Thus, most FireDucks methods do not actually process the data frame; they only generate intermediate language. Each method call adds more intermediate language, and when the result is needed (e.g., when writing to a CSV file), all of the accumulated intermediate language is executed at once.
FireDucks performs actual execution at the following points:
- Saving to a file (DataFrame.to_csv, DataFrame.to_parquet, etc.)
- Displaying a data frame (print(df), etc.)
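As a rough illustration, the sketch below only builds up intermediate language with read_csv and sort_values, and the data is materialized when to_csv is called. It assumes FireDucks is imported through its pandas-compatible module fireducks.pandas and reuses the file names from the example above.

import fireducks.pandas as pd

df = pd.read_csv("data.csv")   # returns immediately; only intermediate language is generated
df = df.sort_values("a")       # still no computation; more intermediate language is added
df.to_csv("sorted.csv")        # saving to a file triggers execution of everything above
print(df)                      # displaying a data frame is another execution point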
Because of this difference in execution models, users familiar with pandas may find that normally time-consuming methods finish in a fraction of a second, while saving to a CSV file takes longer than expected.
About time measurement
Because FireDucks uses lazy execution, measuring the actual processing time of each method requires a little extra effort.
For example, if you insert timers as below, the last interval (t3 - t2) will include the time of all the preceding operations.
import time
import fireducks.pandas as pd

t0 = time.time()
df = pd.read_csv("data.csv")
t1 = time.time()
df = df.sort_values("a")
t2 = time.time()
df.to_csv("sorted.csv")   # actual execution of read_csv, sort_values and to_csv happens here
t3 = time.time()
From version 0.9.1, benchmark mode has been introduced for this purpose. When benchmark mode is enabled, FireDucks actually executes each method immediately after it is called. Since this disables some of FireDucks' optimizations, it should be used only when you want to measure individual methods.
To enable benchmark mode, set the following environment variable:
FIREDUCKS_FLAGS="--benchmark-mode"
Alternatively, it can be enabled in code as:
from fireducks.core import get_fireducks_options
get_fireducks_options().set_benchmark_mode(True)
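With benchmark mode enabled, the timer example above measures each step individually instead of lumping everything into the last interval. The following is a minimal sketch under that assumption, reusing the file names from the earlier examples.

import time
import fireducks.pandas as pd
from fireducks.core import get_fireducks_options

get_fireducks_options().set_benchmark_mode(True)   # every method now executes immediately

t0 = time.time()
df = pd.read_csv("data.csv")
t1 = time.time()                  # t1 - t0: time of read_csv
df = df.sort_values("a")
t2 = time.time()                  # t2 - t1: time of sort_values
df.to_csv("sorted.csv")
t3 = time.time()                  # t3 - t2: time of to_csv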
Another workaround is to call FireDucks' own method _evaluate to force immediate execution.
t0 = time.time()
df = pd.read_csv("data.csv")._evaluate()   # forces execution of read_csv here
t1 = time.time()
df = df.sort_values("a")._evaluate()       # forces execution of sort_values here
t2 = time.time()
df.to_csv("sorted.csv")                    # saving to a file already triggers execution
t3 = time.time()
Point to be noted
When using FireDucks' default lazy-execution mode, it is advised to refrain from the conventional way of checking for a KeyError using a try-except block. For example, consider the following:
def project(df, cname):
    try:
        s = df[cname]
    except KeyError:
        print(f"Column {cname} doesn't exist in input dataframe")
    else:
        return s
Due to lazy execution, s = df[cname] will not actually be executed right after it is called, so the try-except block may not work as expected in this case. If you cannot avoid the above code structure due to some restriction, it is advised to write it as s = df[cname]._evaluate() to force the statement to be executed at that very point.
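A minimal sketch of that workaround, assuming (as described above) that forcing evaluation with _evaluate() makes the missing-column error surface inside the try block:

def project(df, cname):
    try:
        s = df[cname]._evaluate()   # force execution here so a missing column is detected immediately
    except KeyError:
        print(f"Column {cname} doesn't exist in input dataframe")
    else:
        return s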
Otherwise, it is advised to use a more standard way to implement the above case, as follows:
def project(df, cname):
    if cname not in df:
        print(f"Column {cname} doesn't exist in input dataframe")
    else:
        s = df[cname]
        return s
The in operator invokes DataFrame.__contains__(), which is a non-lazy method: it must return either True or False, so it is evaluated immediately when called.