pandas compatibility

FireDucks provides the same API (class names, method names, and attribute names) as pandas, and aims to be usable as a replacement simply by changing import statements.

Compatibility Concept

We do not aim for compatibility in the following aspects.

  • Complete consistency of class names
    • FireDucks provides a pandas-compatible API in the module fireducks.pandas. The complete class names, including module names, are different from those of pandas.
    • For example, the data frame type is pandas.DataFrame in pandas but fireducks.pandas.DataFrame in FireDucks, and the two are distinct classes. An explicit test for a pandas DataFrame, such as isinstance(df, pandas.DataFrame), therefore returns False for a FireDucks DataFrame. After import fireducks.pandas as pd, the test isinstance(df, pd.DataFrame) returns True.
  • Full compatibility of errors and warnings
    • Because of delayed execution, the timing of errors and warnings in FireDucks is different from that in pandas.
    • It is not the goal to match error messages (although the goal is to match Exception classes).
    • In addition, since Warnings may not be necessary in FireDucks, whether or not a Warning is generated may not match pandas.
  • Complete reproduction of undefined behavior and bugs in pandas
    • We do not aim to reproduce the implementation-dependent behavior of pandas.
    • In particular, it may be undefined whether pandas returns a copy or a reference, and we do not recommend writing code that depends on either (reference).
  • pandas internal API and Experimental features
    • We do not aim to provide methods that begin with an underscore (_) or features that are marked as Experimental in the pandas documentation.
  • Extending pandas
    • Functions that extend pandas as described in Extending pandas are currently not targeted.
    • These include the ability to create subclasses of DataFrames and Series, and the ability to define your own data types.
  • Consistency of the order of merge/join result rows with pandas
    • The order of rows in merge/join results may not match the order in pandas. After sorting, the results will match.
  • copy(deep=False) might not work when changes made to the data values of the ‘copied’ instance are expected to be reflected in the data values of the ‘source’ instance. Changes made to the metadata work as expected. See the copy(deep=False) section for more details.
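Regarding the class-name point above, code that must accept data frames from either library can test against both classes. The following is only a minimal sketch: the helper names dataframe_types and is_dataframe are hypothetical and not part of pandas or FireDucks. It uses whichever of the two libraries can be imported in the current environment.

```python
import importlib


def dataframe_types():
    """Return the DataFrame classes of whichever of pandas / FireDucks can be imported."""
    found = []
    for module_name in ("pandas", "fireducks.pandas"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # library not installed; skip it
        found.append(module.DataFrame)
    return tuple(found)


def is_dataframe(obj):
    # isinstance accepts a tuple of classes; an empty tuple always yields False.
    return isinstance(obj, dataframe_types())


print(is_dataframe([1, 2, 3]))  # False: a list is not a DataFrame of either library
```

Checking against a tuple of classes keeps the test explicit while avoiding a hard dependency on either library being installed.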

copy(deep=False)

In pandas, df.copy(deep=False) is mostly used for the following two purposes:

  1. When we just want to modify the metadata of a table (such as column names) but not the actual content. [Figure: change_metadata]

  2. When we want modifications made to the ‘copied’ instance (df2) to be reflected in the ‘source’ instance (df1). This is more likely a side effect of the shallow copy than something a user intends. [Figure: change_data_values]

FireDucks does not support in-place operations in the true sense, because it treats all data instances as immutable (to improve performance). Whenever an in-place operation is performed on a column, e.g. df["A"] += 2, it allocates new memory for that column, writes the result of the operation there, and updates the data pointer so that the operation looks in-place. [Figure: in_place_addition]

Hence, (1) works as expected in FireDucks (because it does not change the actual data values), but (2) does not: the change to the copied instance is written to newly allocated memory that only the ‘copied’ instance (df2) points to, while the ‘source’ instance (df1) still points to the initial memory. [Figure: shallow_copy_issue]
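The pointer behavior described in this section can be imitated with a small pure-Python toy model. This is only an illustrative sketch and not FireDucks' actual implementation: the class ToyFrame and its methods are hypothetical, and immutable tuples stand in for column memory.

```python
class ToyFrame:
    """Toy stand-in for a data frame: column labels plus references to immutable buffers."""

    def __init__(self, labels, buffers):
        self.labels = list(labels)    # metadata, owned by this instance
        self.buffers = list(buffers)  # references to immutable tuples

    def shallow_copy(self):
        # New label list and new reference list, but the same tuple objects.
        return ToyFrame(self.labels, self.buffers)

    def add_inplace(self, label, value):
        # Emulated "in-place" add: build a new buffer, then repoint this instance only.
        i = self.labels.index(label)
        self.buffers[i] = tuple(x + value for x in self.buffers[i])

    def column(self, label):
        return self.buffers[self.labels.index(label)]


df1 = ToyFrame(["A"], [(1, 2, 3)])
df2 = df1.shallow_copy()

# Case (1): a metadata-only change stays local to the copy, and no data is duplicated.
df2.labels[0] = "B"
shares_buffer = df2.buffers[0] is df1.buffers[0]

# Case (2): an "in-place" update on the copy repoints df2 only; df1 is unchanged.
df2.add_inplace("B", 2)

print(df1.labels, df1.column("A"))  # ['A'] (1, 2, 3)
print(df2.labels, df2.column("B"))  # ['B'] (3, 4, 5)
```

In this model the rename in case (1) costs nothing and behaves as expected, while the update in case (2) never reaches df1, because df1 keeps referring to the original buffer.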

Usage similar to (1) can be found in the internal implementation of libraries such as seaborn and skimpy. You can therefore safely ignore the warning message when using these libraries with the FireDucks import hook enabled.

Use with pandas

The internal API and isinstance checks are used frequently within pandas. Therefore, mixing FireDucks with pandas will not work in most cases. We recommend rewriting all import statements, or using automatic conversion by the import hook, so that FireDucks is used instead.

If for some reason you wish to use FireDucks together with pandas, or to pass a FireDucks DataFrame or Series to a library that accepts pandas objects, please convert it with the to_pandas() conversion method.

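import somelib  # placeholder for a library that expects pandas objects
import fireducks.pandas as pd

df = pd.read_csv(...)
# Convert the FireDucks DataFrame to a pandas DataFrame before handing it over.
somelib.process_pandas_dataframe(df.to_pandas())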
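When the same call site may receive either FireDucks or pandas objects, the conversion can be wrapped in a small helper. This is a sketch under assumptions: as_pandas is a hypothetical helper name, and FakeFrame is a stand-in class used only so the example runs without either library installed. The only library method relied on is to_pandas(), as used in the example above.

```python
def as_pandas(obj):
    """Return obj.to_pandas() when that method exists, otherwise obj unchanged."""
    convert = getattr(obj, "to_pandas", None)
    return convert() if callable(convert) else obj


class FakeFrame:
    """Hypothetical stand-in for a FireDucks object, used only to exercise the helper."""

    def to_pandas(self):
        return "converted"


print(as_pandas(FakeFrame()))  # converted
print(as_pandas([1, 2]))       # [1, 2]
```

Objects without a to_pandas() method pass through untouched, so the helper can be applied unconditionally at the boundary to a pandas-only library.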