Have you ever thought of speeding up your data analysis in pandas with a compiler?

In general, a Data Scientist spends significant effort transforming raw data into a more digestible format before training an AI model or creating visualizations. Traditional tools such as pandas have long been the linchpin of this process, offering powerful capabilities, but not without limitations.

Because of its single-core implementation and inefficient data structures, we often face performance issues when using pandas on relatively large data. Its performance is also highly impacted by the choice of APIs, their parameters, and their execution order. Sometimes simply writing a pandas application efficiently can by itself make it 10-20x faster.

In this article, I will discuss a few commonly occurring patterns picked up from typical pandas applications.

:notebook: Example 01

import pandas as pd

df = pd.DataFrame()
s = pd.Series(["2020-10-10", "2021-11-20", "2022-08-03", "2023-07-04"])
df["year"] = pd.to_datetime(s).dt.year
df["month"] = pd.to_datetime(s).dt.month
df["day"] = pd.to_datetime(s).dt.day

The major issue with the above code is clear: the expression pd.to_datetime(s) is evaluated three times. The purpose is to extract the year, month, and day fields from the series s after converting it to type “datetime64[ns]”. The method to_datetime() parses the input string column to convert it to a datetime column (inferring the format if not specified). Such parsing is very expensive on large data, and if the same expression on the same input is written more than once, it is very obvious that the program will suffer a significant performance hit.
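To see the cost of the repeated parsing concretely, here is a small timing sketch using synthetic data (the dataset size and absolute numbers are illustrative only and will vary by machine):

```python
import time
import pandas as pd

# Synthetic data: 100,000 date strings (illustrative only)
s = pd.Series(["2020-10-10", "2021-11-20", "2022-08-03", "2023-07-04"] * 25_000)

# Parse three times, as in the original code
t0 = time.perf_counter()
for _ in range(3):
    pd.to_datetime(s)
t_three = time.perf_counter() - t0

# Parse once and reuse the result
t0 = time.perf_counter()
dt_s = pd.to_datetime(s)
t_once = time.perf_counter() - t0

print(f"parse x3: {t_three:.3f}s, parse x1: {t_once:.3f}s")
```

On any machine, the three-parse version spends roughly three times as long in parsing as the single-parse version, because the work is identical each time.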

As you might have clearly figured out, an optimized solution to this problem could be:

df = pd.DataFrame()
s = pd.Series(["2020-10-10", "2021-11-20", "2022-08-03", "2023-07-04"])
dt_s = pd.to_datetime(s)
df["year"] = dt_s.dt.year
df["month"] = dt_s.dt.month
df["day"] = dt_s.dt.day

This is a very basic example that can be optimized manually if the code is reviewed carefully after the application is developed. Let’s take another example that is very common in extensive data analysis.

:notebook: Example 02

res = pd.DataFrame()

# To find industry-wise average salary of an Indian employee
res["industry_wise_avg_sal"] = (
  employee[employee["country"] == "India"]
   .groupby("industry")["salary"]
   .mean()
)

# To find industry-wise average salary of Indian employees aged 30 or above
res["industry_wise_avg_sal_for_specific_age_group"] = (
  employee[(employee["country"] == "India") & (employee["age"] >= 30)]
   .groupby("industry")["salary"]
   .mean()
)

print(res)

Both of these queries evaluate a common expression in their filter operations, i.e., checking whether the employee belongs to India (a comparison on a string column is itself much costlier than a comparison on a numeric column). Now, imagine we are dealing with an extremely large employee database: evaluating employee["country"] == "India" twice can be quite an expensive operation, and it can easily be optimized with the same strategy as follows:

# To generate the required filtration masks in advance
cond1 = (employee["country"] == "India")
cond2 = (employee["age"] >= 30)

res = pd.DataFrame()
# To find industry-wise average salary of an Indian employee
res["industry_wise_avg_sal"] = (
  employee[cond1]
   .groupby("industry")["salary"]
   .mean()
)

# To find industry-wise average salary of Indian employees aged 30 or above
res["industry_wise_avg_sal_for_specific_age_group"] = (
  employee[cond1 & cond2]
   .groupby("industry")["salary"]
   .mean()
)

print(res)

In compiler technology, such an optimization is called common subexpression elimination (CSE), and it can be performed automatically by a smart compiler for programming languages like C/C++.
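For readers who want to run Example 02 end-to-end, here is a minimal self-contained version. The employee table below is hypothetical sample data invented purely for illustration:

```python
import pandas as pd

# Hypothetical sample data for illustration
employee = pd.DataFrame({
    "country":  ["India", "India", "India", "Japan", "India"],
    "age":      [25, 34, 41, 38, 29],
    "industry": ["IT", "IT", "Finance", "IT", "Finance"],
    "salary":   [50_000, 80_000, 90_000, 70_000, 60_000],
})

# Evaluate each filter mask once and reuse it (manual CSE)
cond1 = employee["country"] == "India"
cond2 = employee["age"] >= 30

res = pd.DataFrame()
res["industry_wise_avg_sal"] = (
    employee[cond1].groupby("industry")["salary"].mean()
)
res["industry_wise_avg_sal_for_specific_age_group"] = (
    employee[cond1 & cond2].groupby("industry")["salary"].mean()
)
print(res)
```

With this toy data, the India-only averages are 65,000 (IT) and 75,000 (Finance), while the 30-and-above subset yields 80,000 (IT) and 90,000 (Finance).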

:notebook: Example 03

Let’s take another example of a compiler optimization technique:

def func(x: pd.DataFrame, y: pd.DataFrame):
    merged = x.merge(y, on="key")
    sorted_df = merged.sort_values(by="key")  # is never used
    return merged.groupby("key").max()

The method merges two input tables, x and y, followed by a groupby-aggregate operation. The “sort” operation after the merge is also part of the method, but it is actually not required in this context (since the sorted result is never used within the method). When we focus on a very detailed exploration of the input data, it is very common for such pieces of unwanted code to remain in our application, resulting in a significant performance cost.

Hence, an optimized solution to the above method could be:

def func(x: pd.DataFrame, y: pd.DataFrame):
    merged = x.merge(y, on="key")
    return merged.groupby("key").max()

In compiler technology, such an optimization is called dead code elimination, and it is very important for the overall performance of any application.

An expert programmer might take care of such cases while developing a data analysis solution, but as data analysts our primary focus is to explore the data from different verticals and horizontals to find meaningful insights while solving business problems or creating important features for our training data. Hence, we often overlook such issues. Nowadays there are many good linters that will point out such issues, so manual optimization is definitely possible if you include a linter run as part of your TODO list when developing an application.

:point_right: New Information alert!

Let’s now talk about an ideal scenario!

What if such optimizations were automatically taken care of by the Python data manipulation library we use?

That would save us from handling these issues ourselves and could significantly improve the performance of our applications, right?

Well, the wait is over! Let me introduce FireDucks, a high-performance DataFrame library with highly pandas-compatible APIs, powered by the MLIR (Multi-Level Intermediate Representation) framework to enable such powerful compiler optimizations.

The library has been carefully developed at the NEC R&D laboratory over the last three years and has been freely available for installation via “pip” under the BSD license since October 2023.
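Getting started is meant to require no code changes beyond the import. The sketch below uses FireDucks’ documented drop-in import, with a fallback to plain pandas (so the snippet runs even in an environment where FireDucks is not installed):

```python
# FireDucks documents a drop-in import that lets existing pandas code
# run unchanged. Fall back to plain pandas if FireDucks is unavailable.
try:
    import fireducks.pandas as pd
except ImportError:
    import pandas as pd

# The same code as Example 01 works with either library
s = pd.Series(["2020-10-10", "2021-11-20"])
df = pd.DataFrame()
dt_s = pd.to_datetime(s)
df["year"] = dt_s.dt.year
print(df)
```

With FireDucks, the statements above are not necessarily executed eagerly; evaluation can be deferred until the result is requested (here, by print).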

:fire: :bird: What does it offer?

FireDucks is developed with the following points in mind:

  • Automatic query optimization with a lazy-execution model.

    • A data analyst can focus more on exploring data, while the inbuilt compiler of FireDucks can take care of the following:
      1. basic compiler optimizations, like common sub-expression elimination, dead code elimination etc.
      2. domain specific optimizations, like execution reordering, dropping unwanted columns in advance etc.
      3. pandas-specific optimizations, like choosing the right APIs when executing a query (by careful analysis of the application’s objective) and choosing the right parameters to avoid unwanted operations (like sorting of results, etc.)
  • A pandas user can flexibly adapt to this library without any new learning cost.

    • FireDucks is highly compatible with pandas, so any pandas application can be optimized without any manual code changes. Doesn’t it sound quite promising?
  • The program can leverage all the available cores in the execution environment:

    • The single-core execution issue in pandas is solved!

:computer: Demonstration

Here is a link for a test drive with a sample walkthrough notebook on Google Colab which shows how easy it is to start with FireDucks and its performance gain over pandas for a sample CSE use-case (Example 01 of this article).

In a low-spec execution environment with 2 cores, where pandas takes 1.3 seconds to execute a sample query, FireDucks (multithreaded, with the compiler enabled) takes only 166 milliseconds. Such a speedup (~8x) without incurring any migration cost (no effort involved in rewriting the application from pandas to FireDucks) or any special hardware cost seems quite beneficial from an overall production-cost point of view.

If you are interested in seeing how it performs on popular benchmarks (like TPC-H, db-benchmark, etc.) in comparison with other high-performance pandas alternatives, you may like to check this out.

:point_right: How does FireDucks work?

FireDucks comes with three powerful layers:

  • a Python frontend highly compatible with the pandas APIs
  • an in-built compiler to auto-detect and optimize existing performance issues in a user program
  • a multithreaded C++ backend with efficient parallel implementations of all the DataFrame operations, like join, filter, groupby, sort, etc.

Unlike pandas, which executes library functions as soon as they are called, FireDucks creates special instructions and keeps them accumulated until there is an explicit request for the result (printing it, performing a reduction, calling to_csv, etc.). Before execution, the in-built compiler inspects all the accumulated instructions related to the requested result and performs automatic optimizations (as explained above); the optimized instructions are then executed by the multithreaded kernel backend, helping us focus more on our analytical work and be more productive.
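The deferred-execution idea can be sketched with a toy class. This is purely a conceptual illustration of “accumulate instructions, optimize, then execute”, not FireDucks’ actual implementation:

```python
import pandas as pd

class LazyFrame:
    """Toy illustration of deferred execution: operations are recorded
    as instructions and only run when a result is explicitly requested."""

    def __init__(self, df, ops=None):
        self._df = df
        self._ops = ops or []  # accumulated instructions, not yet executed

    def filter(self, col, value):
        return LazyFrame(self._df, self._ops + [("filter", col, value)])

    def groupby_mean(self, key, target):
        return LazyFrame(self._df, self._ops + [("groupby_mean", key, target)])

    def collect(self):
        # A real compiler could inspect self._ops here before executing
        # (e.g. deduplicate repeated filters, drop unused columns).
        df = self._df
        for op, *args in self._ops:
            if op == "filter":
                col, value = args
                df = df[df[col] == value]
            elif op == "groupby_mean":
                key, target = args
                df = df.groupby(key)[target].mean()
        return df

# Hypothetical sample data for illustration
employee = pd.DataFrame({
    "country": ["India", "India", "Japan"],
    "industry": ["IT", "IT", "IT"],
    "salary": [50_000, 70_000, 90_000],
})

lazy = LazyFrame(employee).filter("country", "India").groupby_mean("industry", "salary")
result = lazy.collect()  # execution happens only here
print(result)
```

Because nothing runs until collect(), the whole pipeline is visible to an optimizer as one unit, which is exactly what enables optimizations like CSE and dead code elimination across statements.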

[Figure: pandas vs. FireDucks execution model (pandas_fireducks_exec_model.png)]

✍️ Conclusion

FireDucks shows promise by addressing the major drawbacks associated with pandas, and its compiler-optimization technology makes it one of a kind. This article introduced FireDucks’ basic compiler-optimization abilities (point 1 above). In upcoming articles, I will demonstrate the other powerful optimization areas (points 2 and 3) automatically taken care of by FireDucks. If you liked this article, please stay with me to check those out as well. You may like to try FireDucks and share your feedback; I would love to answer any queries you may have to the best of my knowledge. The development team is very active, and there is a new release almost every week with performance improvements, new features based on user requests, bug fixes, etc. You may like to get in touch with them directly through the Slack channel.

You may also like to check out one of my previous articles on FireDucks’ salient features.