Ensuring compatibility with pandas in the GPU version of FireDucks

We are currently developing a GPU version of FireDucks.

FireDucks is built on an architecture that translates programs into an intermediate representation at runtime, optimizes that intermediate representation, and then compiles and executes it for a backend. The currently released CPU version of FireDucks has a backend for CPUs. For the GPU version, we replace that backend with one that targets GPUs. This allows us to reuse, unchanged, the translation and optimization passes developed for the CPU version.
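To make the idea of a swappable backend concrete, here is a toy sketch. It is not FireDucks' actual intermediate representation: the operation names, the optimize function, and the PandasBackend class are all invented for this illustration. The point is only that the optimization pass does not depend on which backend runs the result.

```python
import pandas as pd

# Hypothetical IR: a list of (operation, argument) pairs recorded at runtime.
# The operation names are invented for this sketch.
program = [("select", ["a", "b"]), ("select", ["a"]), ("head", 2)]


def optimize(ir):
    """Backend-independent pass: of consecutive selects, keep only the last one.

    This assumes the later select asks for a subset of the earlier one.
    """
    out = []
    for op, arg in ir:
        if out and op == "select" and out[-1][0] == "select":
            out[-1] = (op, arg)
        else:
            out.append((op, arg))
    return out


class PandasBackend:
    """Stand-in CPU backend; a GPU backend would expose the same run() method."""

    def run(self, ir, data):
        df = pd.DataFrame(data)
        for op, arg in ir:
            if op == "select":
                df = df[arg]
            elif op == "head":
                df = df.head(arg)
            else:
                raise ValueError(f"unknown op: {op}")
        return df


optimized = optimize(program)
result = PandasBackend().run(optimized, {"a": [1, 2, 3], "b": [4, 5, 6]})
print(optimized)
print(result)
```

In this sketch, a GPU backend would be another class with the same run method, and optimize would be reused as is.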

In developing the GPU backend, we are making use of NVIDIA’s cuDF library. Our intermediate representation is roughly compatible with the pandas API, and cuDF provides an API similar to that of pandas, so at first glance the backend development seems straightforward. However, the behavior of the functions provided by cuDF differs slightly from that of pandas, so some ingenuity is required to maintain compatibility with pandas.

In this blog post, I would like to briefly introduce some of the issues that need to be addressed in order to maintain compatibility with pandas.

Different result types

When working with dates in pandas, if you convert them to the datetime64 type, you can use the dt accessor to extract the year, month, and day.

import pandas as pd

df = pd.DataFrame({"a": ["2017-11-01 12:24:00"]})
dfa = pd.to_datetime(df["a"])  # convert the strings to datetime64
print(dfa.dt.year)

This prints 2017. By the way, what is the type of this value? If you run print(dfa.dt.year.dtype), pandas returns int32 and cuDF returns int16. This difference seems to be intentional, in order to reduce the amount of GPU memory used.

A year fits comfortably in 16 bits, so this is not a problem in itself, but if you use the value in calculations, the result can change due to overflow. For example, if you estimate the number of hours since year 0 with dfa.dt.year * 365 * 24, the true value for 2017 is 17,668,920, which is far above the int16 maximum of 32,767, so the int16 calculation overflows and gives a different result.

This is not the only such case: in cuDF, the type of a result often differs slightly from that of pandas. In FireDucks, we convert the result to the type pandas would return in order to maintain compatibility.
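The overflow and the type alignment can be reproduced without a GPU. The following is a minimal sketch under stated assumptions: it emulates the int16 result of cuDF with an explicit cast in pandas, and align_dtype is a hypothetical helper name used only for illustration, not a FireDucks function.

```python
import pandas as pd

df = pd.DataFrame({"a": ["2017-11-01 12:24:00"]})
dfa = pd.to_datetime(df["a"])

# Assumption: emulate the two result types described above with explicit
# casts, so the snippet does not depend on the library version or on cuDF.
year_pandas = dfa.dt.year.astype("int32")  # type reported for pandas
year_cudf = dfa.dt.year.astype("int16")    # type reported for cuDF

hours_pandas = year_pandas * 365 * 24  # fits in int32
hours_cudf = year_cudf * 365 * 24      # wraps around in int16


def align_dtype(result, pandas_dtype):
    """Hypothetical helper: cast a backend result to the dtype pandas returns."""
    return result.astype(pandas_dtype)


# Casting before the arithmetic restores the pandas result.
hours_fixed = align_dtype(year_cudf, "int32") * 365 * 24

print(hours_pandas.iloc[0], hours_cudf.iloc[0], hours_fixed.iloc[0])
```

Note that the cast has to happen before the multiplication; widening an already overflowed int16 value would not recover the correct number.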

Missing values are handled differently in calculations

In pandas, missing values (what a relational database, or RDB, calls NULL) are basically expressed as NaN. In cuDF, on the other hand, missing values are represented by a special value (NA). As a result, calculation results may differ.

For example,

import numpy as np

s = pd.Series([1.0, 3.0, np.nan])
mask = s < 2.0  # element-wise comparison, including the missing value
print(mask)

will print

0     True
1    False
2    False

in the case of pandas. The position of the missing value holds False, because a comparison such as < against np.nan always evaluates to False.

On the other hand, in the case of cuDF,

0     True
1    False
2     <NA>

is printed. The result of the operation is itself a missing value (NA). In an RDB, comparisons and arithmetic with NULL return NULL, so this result is consistent with an RDB, but not with pandas. I think this is also an intentional feature of cuDF.

In pandas, a special value (pd.NA) can be used experimentally as a missing value. When it is used, the results match those of cuDF and an RDB; with the default NaN representation, they do not. In FireDucks, the results are adjusted so that missing values are handled the same way as in pandas.
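Both behaviors can be seen in pandas alone. The sketch below assumes the pandas nullable Float64 type as a stand-in for cuDF's NA semantics (it does not run cuDF), and the fillna step is only one illustrative way to map the result back, not necessarily what FireDucks does internally.

```python
import numpy as np
import pandas as pd

# Default pandas behavior: the missing value is NaN.
s_nan = pd.Series([1.0, 3.0, np.nan])
mask_nan = s_nan < 2.0  # plain bool dtype, False at the missing position

# Nullable dtype: the missing value is pd.NA and propagates through "<".
s_na = pd.Series([1.0, 3.0, pd.NA], dtype="Float64")
mask_na = s_na < 2.0  # nullable boolean dtype, <NA> at the missing position

# Illustrative adjustment: turn <NA> into False and use the plain bool dtype.
adjusted = mask_na.fillna(False).astype(bool)

print(mask_nan)
print(mask_na)
print(adjusted)
```

Filling with False matches the NaN behavior for comparisons such as < and ==. For !=, a comparison against NaN evaluates to True, so the fill value would have to be chosen per operation.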

The results of merge are different

Pandas’ merge has a complex specification, and cuDF does not necessarily follow that specification. This may change in the future, but for now there are differences, such as the following.

In pandas merge, you can use left_on and right_on to specify the columns to be used for merging. Normally, you specify the column names here, but if the index has a name, you can also specify the index name. You can also specify the index by specifying left_index=True or right_index=True. Let’s try merging using these functions.

First, create a DataFrame for merging. The left dataframe is created as:

idx1 = pd.Index([1, 2, 3, 4], name="p")
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["a", "b"], index=idx1)

The result is:

   a  b
p
1  1  2
2  3  4
3  5  6
4  7  8

The right dataframe is created as:

idx2 = pd.Index([1, 2, 5, 6], name="q")
df2 = pd.DataFrame([[3, 4], [5, 6], [1, 2], [3, 4]], columns=["c", "d"], index=idx2)

The result is:

   c  d
q
1  3  4
2  5  6
5  1  2
6  3  4

Now let’s try merging these. On the left, we specify the index by name with left_on=["p"], and on the right, we specify that the index should be used with right_index=True. We specify how="outer" to perform an outer join, as in an RDB.

df1.merge(df2, left_on=["p"], right_index=True, how="outer")

In the case of pandas, the result is:

     p    a    b    c    d
1.0  1  1.0  2.0  3.0  4.0
2.0  2  3.0  4.0  5.0  6.0
3.0  3  5.0  6.0  NaN  NaN
4.0  4  7.0  8.0  NaN  NaN
NaN  5  NaN  NaN  1.0  2.0
NaN  6  NaN  NaN  3.0  4.0

In the case of cuDF,

         a     b     c     d
p
1        1     2     3     4
2        3     4     5     6
3        5     6  <NA>  <NA>
4        7     8  <NA>  <NA>
<NA>  <NA>  <NA>     1     2
<NA>  <NA>  <NA>     3     4

The types of the columns with missing values differ, because pandas uses NaN for missing values, which forces those columns to the float64 type. The bigger difference is that the pandas result has a column called p, but the cuDF result does not. The name of the resulting index also differs.
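The dtype part of this difference can be observed in pandas alone. The following sketch assumes the pandas nullable Int64 type as a stand-in for cuDF's NA-based columns; it does not run cuDF and does not reproduce the differences in the p column or the index name.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4], name="p")
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["a", "b"], index=idx1)
idx2 = pd.Index([1, 2, 5, 6], name="q")
df2 = pd.DataFrame([[3, 4], [5, 6], [1, 2], [3, 4]], columns=["c", "d"], index=idx2)

# Default int64 columns: unmatched rows are filled with NaN, forcing float64.
default_result = df1.merge(df2, left_on=["p"], right_index=True, how="outer")

# Nullable Int64 columns: unmatched rows are filled with <NA>, integers are kept.
nullable_result = df1.astype("Int64").merge(
    df2.astype("Int64"), left_on=["p"], right_index=True, how="outer"
)

print(default_result.dtypes)
print(nullable_result.dtypes)
```

Printing default_result.columns and default_result.index.name in the same session is a quick way to check the other two differences against your own pandas version.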

The column p comes from the index of the left DataFrame, and it seems a strange specification that it is created as a column of the resulting DataFrame (a column like p is created only for the special parameter combination described above). RDB joins do not create such columns. Even so, the results are clearly different, so this could cause compatibility problems for user programs.

In FireDucks, we have adjusted the code so that the results are as close as possible to those of pandas even in such cases.

As described above, there are cases where the results differ between pandas and cuDF. The GPU version of FireDucks is implemented to absorb such differences and produce results as close as possible to those of pandas. It is still under development and is not yet ready for general use, but we hope that you will try it out when it is completed.