FireDucks – Posts

Posts: Workshop at Bangalore, India

Thu, 19 Sep 2024 00:00:00 +0900

We had a workshop on FireDucks with faculties from universities around Bangalore. Thank you for joining and discussion.

Posts: Have you ever thought of speeding up your data analysis in pandas with a compiler?

Mon, 01 Jul 2024 00:00:00 +0000

In general, a Data Scientist spends significant efforts in transforming the raw data into a more digestible format before training an AI model or creating visualizations. Traditional tools such as pandas have long been the linchpin in this process, offering powerful capabilities but not without limitations.

With the pitfall of its single-core implementation and inefficient data structures, often we face performance issues when dealing with pandas for relatively larger data, but its performance is also highly impacted by the choice of APIs, their parameters and execution orders. Sometime an efficient writing of a pandas application can itself make it 10-20x faster.

In this article, I will discuss a few commonly used examples picked up from some pandas applications.

:notebook: Example 01

df = pd.DataFrame()
s = pd.Series(["2020-10-10", "2021-11-20", "2022-08-03", "2023-07-04"])
df["year"] = pd.to_datetime(s).dt.year
df["month"] = pd.to_datetime(s).dt.month
df["day"] = pd.to_datetime(s).dt.day

We can clearly identify the major issue with the above code is using the expression, pd.to_datetime(s) three times. The purpose is to extract the year, month and day fields from the series data (s) of type “datetime64[ns]”. The method to_datetime() attempts to parse the input string column in order to convert it to a datetime column (along with inferring the format if not specified). Such parsing itself is very expensive when performed on a large data and if the same expression on the same input is written more than one time, its very obvious that the program will experience a significant performance issue.

As you might have clearly figured out, an optimized solution to this problem could be:

df = pd.DataFrame()
s = pd.Series(["2020-10-10", "2021-11-20", "2022-08-03", "2023-07-04"])
dt_s = pd.to_datetime(s)
df["year"] = dt_s.dt.year
df["month"] = dt_s.dt.month
df["day"] = dt_s.dt.day

This is a very basic example which can be manually optimized, if reviewed carefully after an application is developed. Let’s take another example which is very common when performing an extensive data analysis.

:notebook: Example 02

res = pd.DataFrame()

# To find industry-wise average salary of an Indian employee
res["industry_wise_avg_sal"] = (
 employee[employee["country"] == "India"]
 .groupby("industry")["salary"]
 .mean()
)

# To find industry-wise average salary of an Indian employee who are above 30
res["industry_wise_avg_sal_for_specific_age_group"] = (
 employee[(employee["country"] == "India") & (employee["age"] >= 30)]
 .groupby("industry")["salary"]
 .mean()
)

print(res)

Both of these queries have a common expression to be evaluated during the filter operation, i.e., checking whether the employee belongs to India (comparison on String column is itself very costly than comparison on numeric columns). Now, imagine if we are dealing with an extensively large employee database, then evaluating employee["country"] == "India" for two times can be quite an expensive operation that can easily be optimized with the same strategy as follows:

# To generate the required filtration masks in advance
cond1 = (employee["country"] == "India")
cond2 = (employee["age"] >= 30)

res = pd.DataFrame()
# To find industry-wise average salary of an Indian employee
res["industry_wise_avg_sal"] = (
 employee[cond1]
 .groupby("industry")["salary"]
 .mean()
)

# To find industry-wise average salary of an Indian employee who are above 30
res["industry_wise_avg_sal_for_specific_age_group"] = (
 employee[cond1 & cond2]
 .groupby("industry")["salary"]
 .mean()
)

print(res)

In compiler technology, such optimization is called as common sub-expression elimination (CSE) and can essentially be performed by a smart compiler in programming language like C/C++.

:notebook: Example 03

Let’s take another example of a compiler optimization technique:

def func(x: pd.DataFrame, y: pd.DataFrame):
 merged = x.merge(y, on="key")
 sorted = merged.sort_values(by="key") # is never used
 return merged.groupby("key").max()

The method is trying to merge two input tables, x and y followed by a groupby-aggregate operation. The “sort” operation is also a part of the method after merging the tables, but it is actually not required from the method context (since the sorted result is never used within the method). Sometime when we focus on very detail exploration of the input data, it is very common that such piece of unwanted code remains in our application resulting a significant performance cost.

Hence, an optimized solution to the above method could be:

def func(x: pd.DataFrame, y: pd.DataFrame):
 merged = x.merge(y, on="key")
 return merged.groupby("key").max()

In compiler technology, such optimization is called as dead code elimination and is very important from the overall performance point of view of any application.

An expert programmer might take care of such cases while developing his data analysis solutions, but as a data analyst our primary focus is to explore more about the data from different verticals and horizontals to find out meaningful insights while solving some business problems or creating important features for our training data. Hence we often miss to take care of such issues. Nowadays, there are many good linters that will point out such issues, so a manual optimization is definitely possible if you include the linter execution step as part of your TODO action when developing an application.

:point_right: New Information alert!

Let’s now talk about an ideal scenario!

How would it be if such optimization is automatically taken care by the python data manipulation library we use?

It can save us from taking care of such issues by ourselves and can significantly improve the performance of our application, right?

Well, the wait is over! Let me introduce a high-performance DataFrame library, named FireDucks with highly compatible pandas APIs, powered by MLIR（Multi-Level Intermediate Representation) framework to have such powerful compiler optimization abilities.

The library is carefully developed at NEC R&D laboratory over last 3 years and is freely available to be installed using “pip” under BSD licence since October, 2023.

:fire: :bird: What does it offer?

FireDucks is developed with the following points in mind:

Automatic query optimization ability with lazy-execution model.
- A data analyst can focus more on exploring data, while the inbuilt compiler of FireDucks can take care of the following:
  1. basic compiler optimizations, like common sub-expression elimination, dead code elimination etc.
  2. domain specific optimizations, like execution reordering, dropping unwanted columns in advance etc.
  3. pandas specific optimizations, like choice of right APIs when executing a query (by careful analysis of the application objective), choice of right parameters by avoiding unwanted operations (like sorting of result etc.)
A pandas user can flexibly adapt to this library without any new learning cost.
- FireDucks is highly compatible with pandas, so any pandas application can be optimized without any manual code changes. Doesn’t it sound quite promising?
The program can leverage all the available cores in the execution environment:
- The single-core execution issue in pandas is solved!

:computer: Demonstration

Here is a link for a test drive with a sample walkthrough notebook on Google Colab which shows how easy it is to start with FireDucks and its performance gain over pandas for a sample CSE use-case (Example 01 of this article).

With a low-spec execution environment having 2 cores, where pandas takes 1.3 seconds to execute a sample query, FireDucks (with multhreaded + compiler enabled) takes only 166 milli-seconds. Such speedup (~8x) without incurring any migration cost (no cost involves in rewritting the application from pandas to FireDucks) or any special hardware cost seems quite beneficial from the overall production cost point of view.

In case you are interested to see how it performs for popular benchmarks (like TPC-H, db-benchmark etc.) in comparison to other high-performance pandas alternatives, you may like to check this out.

:point_right: How does FireDucks work?

FireDucks comes with three powerful layers:

a python frontend highly compatible with pandas APIs
an in-built compiler to auto-detect and optimize the exsting performance issues in a user program
a multithreaded C++ backend with efficient parallel implementation of all the dataframe related operations like join, filter, groupby sort etc.

Unlike pandas that executes the library functions right after they are called, FireDucks creates some special instructions and keep them accumulated until there is some explicit request for the result (print result, do some reduction, to_csv etc.). Before the execution, the in-built compiler inspects all the accumulated instructions related to the result to be processed and performs some automatic optimization (as explained above) and then the optimized instructions are executed at the multithreaded kernel backend helping us to focus more on our analytical work and be more productive.

✍️ Conclusion

FireDucks shows promise by taking care of all the drawbacks associated with pandas and its compiler optimization technology makes it one of its kind. This article introduces FireDucks basic compilation optimization abilities as mentioned in (a). In upcoming articles, I will demonstrate other powerful optimization areas (b and c) automatically taken care of by FireDucks. If you like this article, please be with me to check them out as well. You may like to try FireDucks and share your feedback. I would love to answer to whatever queries you may have in mind with the best of my knowledge. The development team is very active and there is a new release almost every week with performance improvements, new features based on user requests, bug fixes etc. You may like to get in touch with them directly through the slack channel.

You may also like to checkout one of my previous articles on FireDucks salient features.

Posts: Introduction to FireDucks: Get performance beyond pandas with zero learning cost!

Wed, 08 May 2024 00:00:00 +0000

Posts: Backtesting Trading Strategies with Ease: An excursion with FireDucks

Fri, 22 Mar 2024 00:00:00 +0900

Posts: FireDucks: Diving into API Compatibility with Pandas

Thu, 21 Mar 2024 00:00:00 +0900

Posts: Choosing Your Data Champion: A Side-by-Side Look at FireDucks and Polars

Wed, 20 Mar 2024 00:00:00 +0900

Posts: Boosting Data Analysis with FireDucks

Tue, 05 Mar 2024 00:00:00 +0900

Posts: A hidden fact you must know when working with pandas

Sat, 16 Dec 2023 00:00:00 +0000

Posts: Tricks to improve computational performance of JOIN operation more than 10x

Mon, 11 Dec 2023 00:00:00 +0000

Posts: FireDucks - An economical and environment-friendly high-performance solution for your complex Data Analysis

Thu, 07 Dec 2023 00:00:00 +0000

Posts: One thing you might be doing wrong in pandas!!

Wed, 06 Dec 2023 00:00:00 +0000

Posts: Acceleration technology inside FireDucks

Tue, 05 Dec 2023 00:00:00 +0900

SWITCHING GROUPBY

In this article, we introduce the acceleration techniques of “groupby” used in FireDucks.

The groupby operation is one of the most fundamental and important operations in tabular data analysis. We can use the groupby operation to obtain important statistical properties such as the mean and variance of the data. We can also combine it with other operations to obtain new features.

FireDucks optimizes based on data characteristics for fast groupby operations. One such optimization is the automatic selection of groupby algorithms based on the number of groups. FireDucks’ groupby algorithm focuses on the number of groups of data and switches between an algorithm that is fast for data with a small number of groups (Algorithm A) and an algorithm that is fast for data with a large number of groups (Algorithm B). The number of groups indicates the type of data that make up the target column.

Consider the following tabular data.

	food	category
0	apple	fruit
1	carrot	vegetable
2	peach	fruit
3	onion	vegetable
4	grape	fruit

There are five elements that make up the “food” column: “apple”, “carrot”, “peach”, “onion”, and “grape”. Therefore, the number of groups in the “food” column is 5. Next, we see that there are two types of elements that make up the “category” column: fruit and vegetable. Therefore, the number of groups in the “category” column is 2.

Calculating the number of groups for large data sets is time-consuming. Therefore, FireDucks uses a statistical method to estimate the number of groups without obtaining an exact value for the number of groups¹. The estimation is performed as follows.

Extracts a random piece of data from a sequence of data of interest.
Record that data.
Again, take out one data at random from the column of data of interest.
That data is checked to see if it matches the recorded data and record.
Repeat operations 3 and 4 multiple times.

After repeating operations 3 and 4 multiple times, FireDucks estimates the number of groups of data in the group key of interest from the number of times the data was retrieved and the number of matches.

Evaluation

We measured the calculation speed using TPC-H, a benchmark that includes many processes related to data analysis.

“A" uses only Algorithm A for groupby operations, and “B” uses only Algorithm B for groupby operations. “auto" estimates the number of groups and automatically selects between Algorithm A and B for the groupby operation.

Processes for which there is little difference in execution time between Algorithm A and Algorithm B are omitted from the above graph. The graph shows that, with the exception of q10, the automatic selection algorithm is the faster of Algorithm A and Algorithm B.

The total shows the computation time for the entire TPC-H, including the processes excluded from the above graph for Algorithm A, Algorithm B, and the automatic selection algorithm. This indicates that the automatic selection algorithm is about 3 times faster than Algorithm A, and about 1.2 times faster than Algorithm B for TPC-H as a whole.

Reference

M. Bressan, E. Peserico, and L. Pretto. Simple set cardinality estimation through random sampling. CoRR, abs/1512.07901, 2015. ↩︎

Posts: Application example: Spicy MINT at Toyota Technical Development Corporation

Thu, 19 Oct 2023 00:00:00 +0000