FireDucks Release Note
1.1.8 (Jan 22, 2025)
- Bug Fixes:
- Fixed bug in left join with mask
- Removing Fallbacks:
- Remove redundant fallbacks to check non-existing attributes for array protocol
- Others:
- Support python3.13
1.1.7 (Jan 15, 2025)
- Optimization:
- optimize read_parquet
- optimize sort_values and groupby
1.1.6 (Jan 07, 2025)
- Bug Fixes:
- fix getitem from multiindex dataframe. GH#32
- fixed metadata reading issue with read_parquet
- Removing Fallbacks:
- supported read_feather
1.1.5 (Dec 25, 2024)
- Optimization:
- optimized a pattern df.where(cond).groupby(key, dropna=True).agg(...) as df[cond].groupby(key, dropna=True).agg(...)
- pushdown optimization supports moving projection over concat.
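The rewrite above relies on the two expressions producing the same groups. A minimal sketch in plain pandas (the frame, column names and condition are made up for illustration; FireDucks is assumed to behave the same way through its pandas-compatible API):

```python
import pandas as pd

# Illustrative data; "key" and "val" are hypothetical column names.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b"],
    "val": [1.0, 2.0, 3.0, 4.0, 5.0],
})
cond = df["val"] > 1.5

# where() nulls out every column (including the key) in rows that fail cond;
# dropna=True then discards those rows because their group key is null.
masked = df.where(cond).groupby("key", dropna=True).agg({"val": "sum"})

# Filtering first removes the same rows up front.
filtered = df[cond].groupby("key", dropna=True).agg({"val": "sum"})

# Both forms yield the same aggregated frame.
pd.testing.assert_frame_equal(masked, filtered)
print(filtered)
```

The filtered form avoids materializing a fully masked copy of the frame, which is presumably why it is the preferred target of the rewrite.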
1.1.4 (Dec 17, 2024)
- Bug Fixes:
- fixed issue in groupby with multi-keys
- fixed some issues in modifying callable-method to method-name when falling back to pandas
- fixed issue with return type for to_csv() with filename
- fixed inplace update issue with series delitem
- Removing Fallbacks:
- supported fallback on DataFrame/Series column-aggregate, groupby-aggregate, when input is a callable method from numpy or Series modules, like np.sum, pd.Series.sum etc.
1.1.3 (Dec 10, 2024)
- Bug Fixes:
- Fix read_csv when csv file includes newlines in values.
- Fix the issue in LeftHashJoin
- Fix astype from float to timestamp
- Removing Fallbacks:
- supported a few fallback cases for DataFrame/Series loc getter
- Optimization:
- Improve project pushdown for read_parquet beyond join/merge.
- Others:
- Upgrade dependent pyarrow to 18.1.0.
1.1.2 (Dec 03, 2024)
- Bug Fixes:
- fixed issue in median calculation. GH#31
- fixed issue in empty or null Series aggregation
- fixed groupby projection when it includes reordering and duplication.
1.1.1 (Nov 26, 2024)
- Performance Improvement:
- Improve performance of left join, for example 1.6x for TPC-H Q13.
- Improve join/merge with timestamp keys.
- Removing Fallbacks:
- Supported DataFrame.fillna() with dictionary-like input
- Supported DataFrame.round() with a dictionary or any integer-like value
- Supported series.values for columns of string or temporal types
- Supported astype with numpy datetime64 type
- Bug Fixes:
- Fix sort order for category data.
- Fixed issue in Series.where with named Series. GH#29
- Fixed issue in DataFrame/Series.take with list of booleans
- Optimization:
- Support project pushdown for read_csv/parquet.
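As an illustration of the dictionary-style inputs to fillna() and round() listed above, here is a small sketch written against plain pandas; the column names and values are invented for the example, and the same calls are assumed to run natively under FireDucks from this release on:

```python
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({"a": [1.234, None], "b": [None, 2.567]})

# Per-column fill values supplied as a dictionary.
filled = df.fillna({"a": 0.0, "b": -1.0})

# Per-column number of decimals supplied as a dictionary.
rounded = filled.round({"a": 1, "b": 2})
print(rounded)
```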
1.1.0 (Nov 19, 2024)
- Removing Fallbacks:
- supported “expand” parameter in str.split() method.
- Bug Fixes:
- fixed issue in dtypes for DataFrame with multi-level columns
- Optimization:
- Improve performance of join/merge. About 1.5x at max in our experiments.
- Others:
- Upgrade dependent pyarrow to 18.0.0. As of pyarrow 18, python3.8 is no longer supported.
1.0.11 (Nov 12, 2024)
- Removing Fallbacks:
- supported fallback on DataFrame.dtypes in presence of column of types list, date32, large_string
- supported DataFrame.loc with columns slicing e.g.,
df.loc[:, A: C]
- Bug Fixes:
- fixed issue in setting index to an empty DataFrame/Series
- Optimization:
- improved sort_values() with key of temporal types
- supported projection-pushdown when projection target is empty e.g.,
df.sort_values("C")[[]]
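The two access patterns named in this release can be sketched as follows in plain pandas. The column labels are written as quoted strings here, which is an assumption about the intended form of the slicing example; the data is illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [9, 7, 8],
    "D": [0, 0, 0],
})

# Label-based column slicing; both endpoints are included.
sliced = df.loc[:, "A":"C"]

# Empty projection after a sort: no columns remain, but the sorted
# row order is still visible through the index.
empty = df.sort_values("C")[[]]

print(list(sliced.columns))
print(empty.shape, list(empty.index))
```

With an empty projection only the index has to be carried through the sort, so no data column needs to be read at all.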
1.0.10 (Nov 05, 2024)
- Bug Fixes:
- fixed a conditional bug with negative indices as input to DataFrame/Series take()
- fixed issue in sampling empty DataFrame/Series
- fixed a bug in calculation of length of a column of type list containing Nulls.
- Removing Fallback:
- supported DataFrame.groupby() with input key of Series type.
1.0.9 (Oct 28, 2024)
- Bug Fixes:
- update Join: support list-type payload (GH#20)
- fix: isin() to support CategoricalDtype
- fix: supported to_csv on DataFrame/Series having list or struct-like columns.
- Removing Fallback:
- supported fallback on getitem with numeric index for StringMethods (string or list-like columns): e.g.,
s.str[2]
- Performance Improvement:
- optimized calculation (~3x) of length for list-like columns
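For reference, numeric getitem through the string accessor works on both string and list-like columns. A short sketch in plain pandas with made-up values (FireDucks is assumed to return the same results without falling back):

```python
import pandas as pd

words = pd.Series(["apple", "kiwi", "fig"])
lists = pd.Series([[1, 2, 3], [4, 5, 6]])

# On strings, .str[2] picks the character at position 2.
third_char = words.str[2]

# On list-like values, .str[2] picks the element at position 2.
third_item = lists.str[2]

# Length of each list, the operation whose calculation was optimized.
lengths = lists.str.len()

print(third_char.tolist(), third_item.tolist(), lengths.tolist())
```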
1.0.8 (Oct 22, 2024)
- Bug Fixes:
- Some unsupported rolling functions are implemented.
- fixed issue in slicing a list-like column
- fixed RuntimeError on
DataFrame.astype(category).head()
- Removing Fallbacks:
- dtype_backend=“pyarrow” parameter of read_csv
- column parameter of to_csv
1.0.7 (Oct 16, 2024)
- Removing Fallback:
- remove fallback: read_csv with encoding=utf8
- supported binop comparison with a ‘date’ instance as the scalar value
1.0.6 (Oct 07, 2024)
- Bug Fixes:
- fixed the in operator with Series (GH#26).
- fix issue where index setter, df.index = ..., does not work with fallback.
1.0.5 (Sep 20, 2024)
- Bug Fixes:
- fixed dump to and read from pickle
- fixed groupby with selector for key like
df.groupby("a")["a"]
- Removing Fallback:
- supported ignore_index parameter for drop_duplicates
- Performance Improvement:
- added IR optimization df.drop_duplicates(...).reset_index(drop=True) -> df.drop_duplicates(..., ignore_index=True)
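The IR optimization above replaces a two-step expression with an equivalent single call. A minimal check of that equivalence in plain pandas, using an invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3],
    "b": ["x", "x", "y", "z", "z"],
})

# Two-step form: deduplicate, then rebuild a fresh 0..n-1 index.
two_step = df.drop_duplicates(subset="a").reset_index(drop=True)

# One-step form: let drop_duplicates produce the fresh index itself.
one_step = df.drop_duplicates(subset="a", ignore_index=True)

pd.testing.assert_frame_equal(two_step, one_step)
print(one_step)
```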
1.0.4 (Sep 10, 2024)
- Bug Fixes:
- fixed issue on groupby-select-aggregate with kwargs e.g.,
df.groupby("a")["b"].agg(Sum="sum")
- fixed melt() issue with non-string “value_vars”.
- fixed issue with groupby(…).size() on empty data.
- Removing Fallback:
- supported dtype=“string” as input to astype(), read_csv() etc.
- supported iloc by row-index e.g.,
df.iloc[0]
- supported “ignore_index” parameter for Series/DataFrame dropna().
- Performance Improvement:
- improved overhead of computation on index column, when reset_index(drop=True) is performed followed by dropna, concat, melt, explode.
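Two of the call shapes from this release, shown in plain pandas with illustrative data. The keyword form of agg() names the output column after the keyword, and iloc with a single integer returns one row as a Series:

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"], "b": [1, 2, 3]})

# groupby-select-aggregate with kwargs: the result column is named "Sum".
named = df.groupby("a")["b"].agg(Sum="sum")

# iloc by row index: the first row as a Series keyed by column name.
first_row = df.iloc[0]

print(named)
print(first_row)
```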
1.0.3 (Sep 02, 2024)
- Bug Fixes:
- fixed join with categorical columns.
1.0.2 (Aug 30, 2024)
- Bug Fixes:
- fixed a bug in reading the parquet file when index columns are stored at the beginning.
- Removing Fallback:
- supported datetime properties dt.date, dt.time
- supported aggregate method “last” for GroupBy.
1.0.1 (Aug 28, 2024)
- Bug Fixes:
- benchmark-mode with inplace method
- Performance Improvement:
- Groupby.nunique with numeric column
- DataFrame.merge for some cases
- added optimization pattern sort_values(...).reset_index(drop=True) -> sort_values(..., ignore_index=True)
- Others:
- print optimized IR when FIRE_LOG_LEVEL=3
1.0.0 (Aug 23, 2024)
- Bug Fixes:
- fixed issue with dictionary sort
- fixed issue in filling null with null
- Removing Fallback:
- supported sort_index on DataFrame and Series
- Performance Improvement:
- add JoinWithMaskPat optimization
- add predicate pushdown optimization
- Others:
- add test on rockylinux9.2 with python3.11
- fireducks.pandas.__version__ returns version of pandas. Use fireducks.__version__ when version of fireducks is required.
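A small sketch of how the two version attributes described above might be read. The try/except guard is an addition for this example so that the snippet also runs where FireDucks is not installed; it is not part of FireDucks itself:

```python
# Fall back to plain pandas when FireDucks is unavailable (illustrative guard).
try:
    import fireducks
    import fireducks.pandas as pd
    fireducks_version = fireducks.__version__
except ImportError:
    import pandas as pd
    fireducks_version = None

# Under FireDucks, pd.__version__ reports the version of pandas.
print("pandas version:", pd.__version__)
print("FireDucks version:", fireducks_version)
```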
0.13.1 (Aug 14, 2024)
- Bug Fixes:
- fixed issue in casting a datetime column from one unit to another (e.g., datetime64[ns] -> datetime64[ms])
- fixed issue in handling range index with step != 1
- Removing Fallback:
- supported pd.read_json() with lines=True case
- supported Series.reset_index() with name parameter
- supported DataFrame setitem/getitem with numpy array of dimension Nx1
- supported DataFrame.setitem with non-string key, e.g., df[1] = ... (key is integer)
- Performance Improvement:
- improved dropna(axis=0) for input without any nulls
0.13.0 (Jul 30, 2024)
- Bug Fixes:
- Fixed filter bug when the input mask has a different alignment than the input table.
- Fixed a bug related to an importhook under a FireDucks profiler.
- Fixed merge with on=key for different key types.
- Fixed merge with left_index, right_index for different key types.
- Fixed issue in unit handling for TimeDelta columns.
- Others:
- Upgrade dependent pyarrow to 17.0.0.
0.12.6 (Jul 23, 2024)
- Removing Fallback:
- supported getter and setter on Series.name
- supported loc-assignment, scalar-assignment related cases with pd.NaT, e.g.,
df["c"] = pd.NaT
- supported setitem on DataFrame with numeric arrays having None, e.g.,
df["c"] = [1, None, 3]
- Bug Fixes:
- Fixed issue in putting null using np.nan on non-numeric columns (string, timedelta etc.)
- Fixed get_dummies() issue with default dtype for pandas 2x
- Fixed strftime issue with format having “%%S” like escape
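The pd.NaT and None assignments listed under Removing Fallback can be tried as below. This is a sketch in plain pandas with an invented frame; the column names "c" and "d" are chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Scalar assignment of pd.NaT: every row of the new column is missing.
df["c"] = pd.NaT

# Numeric list containing None: the gap is stored as a missing value.
df["d"] = [1, None, 3]

print(df.dtypes)
print(df)
```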
0.12.5 (Jul 12, 2024)
- Performance Improvement:
- optimized days_in_month
- optimized implementation of microsecond (> 2x)
- improved performance of sample, by avoiding unnecessary checks for negative index
- Bug Fixes:
- groupby with timestamp and timedelta column.
- fixed issue with is_leap_year
0.12.4 (Jul 09, 2024)
- Performance Improvement:
- improved performance of take(axis=0) when input frame has default range index.
- improved performance of sum, mean, count etc. for boolean column.
- Removing Fallback:
- supported Datetime Accessor methods: is_leap_year, days_in_month, microsecond
- supported Series.between.
- supported DataFrame filter with numpy-array as mask vector.
- supported following iloc-getter cases:
- iloc with arraylike of integers: e.g., df.iloc[[0,2,4]]
- iloc with range or slice objects: e.g., df.iloc[:3]
- iloc for projection-filter: df.iloc[:2, :3], df.iloc[[0,3,5], [0,1]] etc.
- Bug Fixes:
- fixed issue in groupby-aggregator with duration column as key/non-key.
- fixed issue in boolean casting for column of types: timestamp, timedelta.
- fixed type issue in count() result for column of type: timedelta.
- fixed iloc bug when input frame has duplicate columns.
- fixed issue with strftime("%S") when non-fractional second part is to be formatted.
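The iloc-getter cases, the numpy mask filter and Series.between from this release can be exercised as follows. The sketch uses plain pandas and made-up data; under FireDucks the same expressions are assumed to run without fallback:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": range(6),
    "b": list("uvwxyz"),
    "c": [0.5 * i for i in range(6)],
})

by_positions = df.iloc[[0, 2, 4]]       # array-like of integers
first_three = df.iloc[:3]               # slice object
block = df.iloc[:2, :3]                 # rows and columns together
picked = df.iloc[[0, 3, 5], [0, 1]]     # explicit row and column positions

# Filter with a numpy boolean array as the mask vector.
mask = np.array([True, False, True, False, True, False])
by_mask = df[mask]

# Series.between is inclusive on both ends by default.
in_range = df["a"].between(1, 3)

print(block)
print(in_range.tolist())
```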
0.12.3 (Jul 02, 2024)
- Performance Improvement:
- read_csv with many columns
- merge with many columns
- Removing Fallback:
- supported header parameter for read_csv
- supported list-of-integers to be specified as index_col in read_csv()
- Bug Fixes:
- fixed issue with aggregation on unsigned numeric columns by supporting unsigned scalars in FireDucks
- fix: reindexing column order after performing arithmetic operation
- fixed read_csv() bug when ‘index_col’ is of boolean-type or contains negative integers
- Others:
- remove: dependency on numpy<2.0
0.12.2 (Jun 24, 2024)
- Removing Fallback:
- Supported to_datetime() with given format (fixed fallback issue at backend).
- Supported astype() with input as Series: e.g., s.astype(s2.dtype)
- Supported DateTime accessor method total_seconds() on TimeDelta columns.
- Supported Datetime accessor method strftime() on DateTime columns. Large speed-up over the pandas implementation of strftime.
- Bug Fixes:
- Fixed read_csv() issue when the length of “names” parameter is different than number of fields in the input file.
- Fixed issue in concatting String with Category, String with LargeString columns.
- Fixed issue in to_csv() when input data has multi-level columns and “header” parameter is not True.
- Fixed issue in isin() operation on string column with non-string lookup targets.
- Others:
- The “strftime(format) + astype(numeric)” pattern is optimized when format can be treated as a numeric datetime field extractor.
- Loading the fireducks.ipyext module is no longer required when the fireducks.pandas module is already loaded.
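To make the “strftime(format) + astype(numeric)” pattern concrete: when the format is a single numeric field such as %Y, the string round trip gives the same numbers as reading the field directly. The sketch below uses plain pandas and invented timestamps; it shows the equivalence only and makes no claim about how FireDucks implements the optimization:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2024-06-24 10:30:00", "2023-01-05 08:00:00"]))

# Pattern from the note: format to text, then cast the text to a number.
year_via_text = stamps.dt.strftime("%Y").astype(int)

# Direct field extraction yields the same values.
year_direct = stamps.dt.year

# total_seconds() on a timedelta column, also supported in this release.
durations = pd.Series([pd.Timedelta(days=1), pd.Timedelta(minutes=90)])
seconds = durations.dt.total_seconds()

print(year_via_text.tolist(), year_direct.tolist(), seconds.tolist())
```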
0.12.1 (Jun 17, 2024)
- Removing Fallback:
- supported sep, na_rep, quoting_style, header etc. parameters for DataFrame/Series to_csv()
- Bug Fixes:
- fixed issue in to_csv when columns are multi-level and header=False; when column names are single-level non-strings, they are saved as strings (unlike pandas)
- fixed groupby.shift ignoring the dropna parameter.
- Others:
- add dependency on numpy<2.0.
- support python 3.12
- support older glibc with python 3.9-3.12
0.12.0 (Jun 10, 2024)
- Removing Fallback:
- supported min_periods parameter in rolling()
- supported dictionary of parquet files to be loaded using read_parquet()
- supported Datetime Accessor methods day_name(), month_name()
- supported pd.to_datetime() for Series input
- Bug Fixes:
- fixed merge bug for multi-index as key
- fixed Series.map(pd.Timestamp.timestamp)
- fixed string to datetime conversion when input timestamp contains microseconds, nanoseconds parts.
- fixed fillna() on string columns with numeric scalar, e.g., df.fillna(0).
- fixed concat to support a mix of single-level and multi-level column names.
- fixed lazy-execution issue in setting Series attributes
- Others:
- removed the dependency pin on pandas 1.5.3. FireDucks is now compatible with both pandas 1.5 and 2.2.
0.11.5 (Jun 04, 2024)
- Bug Fixes:
- Fixed bug of column dtypes on reading an empty CSV file.
- Fixed issue on calling StringMethods (s.str.upper etc.) on a “category” column with key as string.
- Fixed issue on calling where/mask on empty DataFrame.
- Fixed to return pd.NaT instead of np.nan on calling aggregate methods on empty Series of timedelta, timestamp types.
- Fixed issue on calling any(), all() on a String column.
- Fixed issue on comparing string and datetime columns, string and numeric columns.
- Fixed read_parquet() to support non-string column name
- Fixed issue on calling where/mask on column of type float16
- Removing Fallback:
- supported DataFrame.contains
- supported “min_periods” parameter for DataFrame/Series rolling()
- Performance Improvement:
- improved groupby when key is of “category” type (up to 2 times).
- improved displaying a DataFrame instance on jupyter-like notebook platforms.
- Others:
- importhook now supports -m option to run a library module. e.g., python -m fireducks.imhook -m <other_python_module> …
0.11.4 (May 27, 2024)
- Bug Fixes:
- fixed result type of groupby-sum on boolean column from uint64 -> int64 according to pandas
- fixed None check in DataFrame.rename
- fixed read_csv issue with parsing bad-csv files by falling back to pandas
- fixed: DataFrame.drop() issue when string-value is specified as index to be dropped from a datetime column
- fixed: DataFrame.drop() when target column to be dropped is specified as scalar with axis=1 [e.g., df.drop(“c”, axis=1)]
- fixed: logical func of DataFrame/Series unexpected kwargs
- fixed: unnecessary upcast of floating dtypes in to_numpy on a column with nulls.
- fixed bug when groupby results in data columns being empty.
- fixed dictionary mapping on Series, when type of input Series and the type of dictionary-keys do not match
- fixed errors when unsupported aggregate methods (e.g., corr, describe etc.) are provided to groupby-agg
- update: DataFrame/Series.where handles other=nan as null.
- Removing Fallback:
- fixed fallback when head/tail/shift etc. is provided as single-value-list to groupby-aggregate [e.g., agg([“head”])];
- supported inplace parameter for Series/DataFrame drop_duplicates
- supported Series.drop
- Performance Improvement:
- Improved groupby-aggregate for sum, mean, median, stddev on boolean column
- Improved dictionary mapping on Series
0.11.3 (May 20, 2024)
- Performance Improvement:
- Add new optimization to remove unnecessary sort in groupby.
- Bug Fixes:
- fix: Series.dtype where Series.name is non-0 integer
- Removing Fallback:
- fixed fallback of sort_values with kind=None
- DataFrame/Series.diff support more integer-like periods
0.11.2 (May 16, 2024)
- Bug Fixes:
- Fix dependency on pyarrow.
- Fix dtype of index when merge result is empty.
- Removing Fallback:
- Supported aggregate on timedelta columns.
0.11.1 (May 13, 2024)
- Performance Improvement:
- Add new IR pattern rewrite optimization pass.
- DataFrame.merge/join with date32/64 payload column.
- Bug Fixes:
- Fixed bug in iloc-getter when there are duplicates in column names
- Removing Fallback:
- Supported aggregate methods (max, min, mean etc.) to be performed on timestamp columns.
- Supported iloc-getter with integer or list-likes column indicator: e.g., df.iloc[:, 0], df.iloc[:, [2,4]] etc.
- Supported take() with slice object as input.
- Supported squeeze() For DataFrame and Series.
- Supported dictionary or casting-methods to be mapped on a Series.
- New pandas incompatibility:
- observed parameter of groupby is always true for better performance.
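The effect of the observed parameter shows up only with categorical keys that have unused categories. The following plain-pandas sketch (data and names invented) contrasts the two settings; per the note above, FireDucks always behaves like observed=True:

```python
import pandas as pd

# Category "c" is declared but never appears in the data.
keys = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"k": keys, "v": [1, 2, 3]})

# observed=True: only categories present in the data form groups.
observed = df.groupby("k", observed=True)["v"].sum()

# observed=False: unused categories appear too, with an empty-group result.
unobserved = df.groupby("k", observed=False)["v"].sum()

print(observed)
print(unobserved)
```

Code that depends on rows for unused categories would need to add them explicitly, for example by reindexing the result on the full category list.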
0.11.0 (May 07, 2024)
- Performance Improvement:
- groupby.median() and median now return the exact (non-approximate) median.
- Removing Fallback:
- read_parquet with columns parameter.
- DataFrame.rename with columns parameter.
- Others:
- Upgrade dependent pyarrow to 16.0.0.
- the importhook feature can now be activated by fireducks.pandas
0.10.9 (Apr 23, 2024)
- Performance Improvement:
- groupby.std()
- Removing Fallback:
- Supported astype(“datetime64”)
- Supported DataFrame.dropna(axis=1)
- Bug Fixes:
- Fix df.merge returning incorrect result when how is left and key has nulls.
- Fix an error when “head”, “tail” or “shift” is used in groupby.agg. If any of these is provided as a single aggregator [e.g., df.groupby(...).agg("head")], you can experience speed-up from FireDucks, but when these are provided in combination with another aggregator [e.g., df.groupby(...).agg(["head", "mean"])], the call is executed through the pandas fallback.
- Fix issues in accessing methods from pd.api.types module.
- Others:
- Remove version from dependency on numpy.
- Add experimental profiler for jupyter/ipython. Use %load_ext fireducks.ipyext and the %%fireducks.profile cell magic.
0.10.8 (Apr 16, 2024)
- Performance Improvement:
- groupby two keys with nulls
- left join with single key
- left and inner join with single key of category type
- Removing Fallback:
- groupby.corrwith among two columns
0.10.7 (Apr 10, 2024)
- Performance Improvement:
- Printing dataframe with large dictionary.
- Removing Fallback:
- DataFrame/Series astype() with dtype=“category”
- Bug Fixes:
- Fixed Join issue with dictionary-typed key columns.
- Fixed filter issue of a table having multiple index columns with duplicate values
- Others:
- Upgrade dependent pyarrow to 15.0.2.
0.10.6 (Apr 02, 2024)
- Performance Improvement:
- Series.unique()
- DataFrame/Series nunique()
- read_csv with category type
- Removing Fallback:
- DataFrame/Series astype() with bool, uint8 etc.
- supported following parameters for pd.get_dummies(): columns, prefix, prefix_sep, dtype
0.10.5 (Mar 26, 2024)
- Performance Improvement:
- groupby.head/tail
- groupby.size
- Removing Fallback:
- dropna=False with groupby
- groupby.first
- DataFrame.value_counts
- supported “normalize” parameter of Series.value_counts
- Bug Fixes:
- Fix incorrect fallback of Series.apply
- Fix str.split issue when expand parameter is specified
- Fix null assignment issue, e.g., df.mask[cond, “a”] = np.nan
0.10.4 (Mar 13, 2024)
- Performance Improvement:
- Groupby, merge/join, sort_values with string key
- Bug Fixes:
- Fixed fallback issue with loc/iloc setitem
0.10.3 (Mar 06, 2024)
- Performance Improvement:
- Optimized construction of a Series from another Series.
- Removing Fallback:
- Supported replace with regex=True
- Supported loc-assignment for non-numeric index, e.g.,
df.loc[["a", "c", "d"], "col1"] = 5
- Bug Fixes:
- Fixed bug when loc assignment is performed with non-series data (like list etc.) and target frame does not have default index.
- Fixed NotImplementedError cases related to datetime-string comparison.
0.10.2 (Feb 26, 2024)
- Performance Improvement:
- improved index-getter (df.index) by avoiding fallback of data columns
- sort with uint32/uint64 key
- Removing Fallback:
- Supported groupby.shift() for DataFrame and Series
- Supported take() for DataFrame and Series
- Supported sample() for DataFrame and Series
- Supported loc-assignment with positions (e.g., df.loc[[5,2,4], “a”] = 100) for DataFrame and Series
0.10.1 (Feb 19, 2024)
- Performance improvement:
- DataFrame.merge
- DataFrame/Series.sort_values when including null
- Bug Fix:
- fixed DataFrame/Series.sort_values with string key and ascending=False
0.10.0 (Feb 13, 2024)
- Performance improvement:
- DataFrame/Series.drop_duplicates
- DataFrame/Series.dropna
- Removing Fallback:
- supported astype with numpy types (np.int32, np.int64, np.float32, np.float64)
- supported conditional loc setter for DataFrame and Series: e.g.,
df.loc[cond, "a"] = 2; s.loc[cond] = 2
- Bug Fixes:
- fixed int-float binop division issue
- fixed calling issue of StringMethods on LARGE_STRING typed columns
- Others:
- update to arrow15
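The conditional loc setter mentioned under Removing Fallback, sketched in plain pandas with made-up values. The mask is computed once and reused for both the DataFrame and the Series:

```python
import pandas as pd

df = pd.DataFrame({"a": [5, -1, 7, -3]})
s = pd.Series([5, -1, 7, -3])

# Boolean mask selecting the rows to overwrite.
cond = df["a"] < 0

# Overwrite only the matching rows of column "a".
df.loc[cond, "a"] = 2

# Same idea on a Series that shares the index of the mask.
s.loc[cond] = 2

print(df["a"].tolist(), s.tolist())
```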
0.9.8 (Feb 5, 2024)
- Performance improvement:
- DataFrame.groupby
- Removing fallback:
- DataFrame/Series.reset_index with allow_duplicates
0.9.7 (Jan 29, 2024)
- Removing fallback:
- Setting index of DataFrame/Series like
df.index = ...
- Index.set_names
0.9.6 (Jan 22, 2024)
- Performance improvement:
- move projection optimization: support copy and drop_duplicates.
- Removing fallback:
- DataFrame/Series._repr_html_ to drastically improve speed for displaying on Jupyter notebook.
- DataFrame/Series.set_axis
- DataFrame/Series.__setitem__ with array-like
- DataFrame/Series.set_index with ndarray, drop=True, append=True and verify_integrity=True
- DataFrame/Series.sort_values with ignore_index=True
- Bug fix:
- read_csv with fsspec parameter such as “s3://”
0.9.5 (Jan 15, 2024)
- Removing fallback:
- DataFrame/Series.shift
- DataFrame/Series.pipe
0.9.4 (Dec 28, 2023)
- Performance improvement
- DataFrame.copy
- Removing fallback:
- DataFrame/Series.iloc setter
- DataFrame.__array__
0.9.3 (Dec 25, 2023)
- Performance improvement
- DataFrame.merge
- Binary operations
- Bug fix:
- Series.__repr__
0.9.2 (Dec 18, 2023)
- Performance improvement
- DataFrame.groupby, DataFrame.where
- IR building
- Removing fallback:
- DataFrame.iloc, DataFrame.__repr__
- Bug fix:
- read_csv with URL
0.9.1 (Dec 11, 2023)
- Performance improvement
- DataFrame.corr
- Others
0.9.0 (Dec 4, 2023)
- Update to arrow-14.0.1
0.8.8 (Nov 27, 2023)
- Bug Fix
- remove unexpected print in read_csv
0.8.7 (Nov 27, 2023)
- Performance improvement
- DataFrame.corr
- DataFrame.dropna
- Removing fallback:
- read_csv with default arguments
- DataFrame.to_csv with encoding=utf8
- DataFrame.groupby with dropna=True
0.8.6 (Nov 20, 2023)
- Performance Improvement
- DataFrame.groupby using cardinality estimation.
- DataFrame.corr for DataFrames with few rows.
- Removing fallback
- DataFrame/Series.mask
- DataFrame/Series.where
- Bug Fix:
- concat for corner cases
0.8.5 (Nov 9, 2023)
- Improve performance of DataFrame.corr
- Remove fallback of DataFrame.get_dummies for simple case
0.8.4 (Nov 9, 2023)
- Performance improvement
- DataFrame.corr
- Performance improvement by removing fallback (depending on parameters)
- Series.rolling
- DataFrame.drop
- DataFrame/Series.describe
- DataFrame/Series.skew
- DataFrame/Series.kurt
- DataFrame/Series.values
- Bug Fix
- Series.__float__/__int__
- fallback reason of to_csv
0.8.3 (Oct 26, 2023)
- Add wheel package for python3.11 (tested with python-3.11.4 on ubuntu23.04).
- Improve performance of merge/join when both frames have default index.
- Improve pandas compatibility of methods which return a scalar value like Series.aggregate.
- Remove fallback: DataFrame.columns, DataFrame.pop, fireducks.pandas.join
- Add kernel tracing (enabled by FIREDUCKS_FLAGS=--trace=3)
- Add reason to fallback log (enabled by FIREDUCKS_FLAGS=-Wfallback).
0.8.2 (Oct 19, 2023)
- First public beta release