Acceleration in FireDucks

FireDucks accelerates programs through two mechanisms. The first is compiler optimization on the intermediate representation (IR), and the second is multithreading in the backend.

Compiler Optimization

FireDucks uses a runtime compiler to convert Python programs into an intermediate language before execution. Rather than executing the program as written, the compiler rewrites this intermediate form into an equivalent one that runs faster, without changing the meaning of the program. This amounts to automatically performing the kind of tuning that a skilled programmer would apply by hand.

The FireDucks intermediate language is designed specifically for DataFrames. Each instruction is highly abstract and information-rich, and represents an operation on a DataFrame. As a result, the FireDucks compiler can understand the meaning of a program without complicated program analysis and can apply DataFrame-specific optimizations.
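To make the idea concrete, here is a minimal sketch of what a DataFrame-level intermediate language could look like. The node classes (Source, FilterGreater, Project) and the helper columns_needed are hypothetical names invented for this illustration; they are not FireDucks' actual instructions. The point is only that when each instruction already describes a whole DataFrame operation, a question such as "which columns does this program read?" can be answered by a short walk over the instructions.

```python
from dataclasses import dataclass

# Hypothetical IR nodes for illustration only; not FireDucks' real instruction set.
@dataclass(frozen=True)
class Source:
    name: str                # the input DataFrame

@dataclass(frozen=True)
class FilterGreater:
    child: object            # plan that produces the input
    column: str              # column compared against `value`
    value: float

@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple           # columns kept by this step


def columns_needed(node, wanted=None):
    """Return the source columns a plan reads, or None if it reads all of them.

    `wanted` is the set of columns required by the steps above `node`
    (None means every column is required).
    """
    if isinstance(node, Source):
        return wanted
    if isinstance(node, Project):
        kept = frozenset(node.columns)
        needed = kept if wanted is None else kept & wanted
        return columns_needed(node.child, needed)
    if isinstance(node, FilterGreater):
        # The filter also reads the column it compares.
        needed = None if wanted is None else wanted | {node.column}
        return columns_needed(node.child, needed)
    raise TypeError(f"unknown node: {node!r}")


# A filter on column a followed by a projection onto column b.
plan = Project(FilterGreater(Source("df"), "a", 10), ("b",))
print(sorted(columns_needed(plan)))
```

A compiler working on plain Python bytecode would have to infer the same fact from indexing calls and temporaries; here it falls out of the instruction definitions.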

Examples of Optimizations

Here is an example of such an optimization. The following program extracts the rows of a DataFrame in which column a is greater than 10, and then extracts column b from those rows.

selected = df[df["a"] > 10]["b"]

This is a common operation, and the code looks straightforward. However, the first step, row extraction, is applied to every column, which is inefficient if df has columns other than a and b. Because DataFrames generally use column-oriented data structures, extracting specific rows is much more time-consuming than extracting a column, and doing so for all columns can add non-negligible overhead.

The FireDucks optimizer uses this kind of domain knowledge to transform the intermediate language so that column extraction is performed first. Written in Python, the transformed process looks like the following code.

tmp = df[["a", "b"]]
selected = tmp[df["a"] > 10]["b"]

Skilled users who are aware of a DataFrame’s internal data structures may write code like this by hand, but FireDucks performs the transformation automatically as an optimization on the intermediate language.
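The effect of the reordering can be seen in a small sketch that models a column-oriented table as a dict of Python lists and counts how many values the row filter has to copy. The helper filter_rows and the sample data are assumptions made for this illustration; this is not how FireDucks or Apache Arrow store data, only a toy model of the cost argument above.

```python
def filter_rows(columns, mask):
    """Keep the rows where mask is True.

    Returns the filtered columns and the number of values copied,
    since every column present must be filtered separately.
    """
    out = {
        name: [v for v, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }
    copied = sum(len(values) for values in out.values())
    return out, copied


# Toy column-oriented table with four columns (sample data for illustration).
df = {
    "a": [5, 20, 30, 1],
    "b": [1, 2, 3, 4],
    "c": [0.1, 0.2, 0.3, 0.4],
    "d": ["w", "x", "y", "z"],
}
mask = [v > 10 for v in df["a"]]

# Original order: filter every column, then pick b.
filtered, copied_naive = filter_rows(df, mask)
selected_naive = filtered["b"]

# Transformed order: pick a and b first (no data copied), then filter.
tmp = {name: df[name] for name in ("a", "b")}
filtered, copied_pruned = filter_rows(tmp, mask)
selected_pruned = filtered["b"]

print(selected_naive == selected_pruned, copied_naive, copied_pruned)
```

Both orders produce the same result, but the pruned version filters two columns instead of four. The gap grows with the number of columns that the final result never uses.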

Multithreading

In FireDucks, the user API and its execution are completely decoupled by the intermediate language. The backend executes the instructions of the intermediate language and supplies the concrete data structures and operations for DataFrames.

FireDucks allows the backend to be swapped to match the target environment, for example a backend tuned for multi-core CPUs or a backend that uses accelerators such as GPUs, thereby increasing speed. The backend is selected through environment variables, so the user can switch backends without changing the user program at all.

FireDucks includes a multithreaded backend for CPUs. This backend uses Apache Arrow as its data structure and adds its own parallelization on top of the DataFrame operations provided by Apache Arrow.
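As a rough picture of what parallelizing a DataFrame operation involves, the sketch below splits a column into chunks, evaluates a comparison on each chunk in a worker thread, and concatenates the partial results in order. The function parallel_mask is a hypothetical helper written for this illustration with Python's standard library; it does not reflect the internals of the FireDucks backend or of Apache Arrow. In CPython, threads running pure Python code do not execute in parallel, so this shows only the partitioning structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_mask(values, threshold, workers=4):
    """Compute [v > threshold for v in values] chunk by chunk on a thread pool."""
    # Ceiling division so that at most `workers` chunks are created.
    chunk = max(1, -(-len(values) // workers))
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so row order is preserved.
        parts = list(pool.map(lambda c: [v > threshold for v in c], chunks))
    return [flag for part in parts for flag in part]


column_a = [5, 20, 30, 1, 11, 10]
print(parallel_mask(column_a, 10, workers=2))
```

A native backend can run such chunks truly concurrently across cores because the per-chunk work happens outside the Python interpreter.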