Using Python's fast data frame library FireDucks

pandas is a library that supports data analysis in the Python programming language. NEC Research Laboratories has developed FireDucks, a faster, pandas-compatible library.

Data Preparation

The analysis is performed on taxi trip history data for New York City. The source of the data is as follows:

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

To analyze a large data set, we downloaded and merged the “Yellow Taxi Trip Records” data from January 2022 to June 2023 from the above link. The data is provided in Parquet format, but we converted it to CSV format for testing. A script for preparing the data is included for reference.

import pandas as pd
import os

dir = "xxx"
df_list = []
for year in [2022, 2023]:
    for i in range(12):
        month = str(i + 1).zfill(2)
        fn = f"yellow_tripdata_{year}-{month}.parquet"
        file = os.path.join(dir, fn)
        if not os.path.exists(file):
            continue
        df = pd.read_parquet(file)
        df_list.append(df)

all_df = pd.concat(df_list)
all_df.to_csv("taxi_all.csv")

The data contains the following values (excerpt of columns).

Column name | Data type | Description
--- | --- | ---
passenger_count | int | The number of passengers.
PULocationID | string | The TLC cab zone where the cab meter was engaged.
DOLocationID | string | The TLC cab zone where the cab meter was disengaged.
tpep_dropoff_datetime | string | The date and time when the meter was disengaged.
tpep_pickup_datetime | string | The date and time when the meter was engaged.
trip_distance | double | The trip distance (in miles) reported by the cab meter.
total_amount | double | The total amount charged to the passenger, not including cash tips.
extra | double | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
fare_amount | double | The time-and-distance fare calculated by the meter.

Actual preprocessing

A series of preprocessing steps often used in data analysis, such as type conversion, column addition, and outlier removal, is performed on the prepared data.

First, prepare a wrapper for speed measurement. _evaluate() is described later.

from time import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wp(*args, **kwargs):
        t = time()
        ret = func(*args, **kwargs)
        print(f"{func.__name__} : {(time() - t):.5g} [sec]")
        return ret
    return wp

def evaluate(df):
    if hasattr(df, "_evaluate"):
        df._evaluate()
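As a quick sanity check of the decorator, it can be applied to any function. The snippet below reproduces the timer so it is self-contained; slow_sum is a hypothetical workload, not part of the article's pipeline:

```python
from time import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wp(*args, **kwargs):
        t = time()
        ret = func(*args, **kwargs)
        print(f"{func.__name__} : {(time() - t):.5g} [sec]")
        return ret
    return wp

# Hypothetical workload just to exercise the decorator
@timer
def slow_sum(n):
    return sum(range(n))

result = slow_sum(1_000_000)  # prints e.g. "slow_sum : 0.012 [sec]"
```

Thanks to functools.wraps, the wrapped function keeps its original name, which is what makes the printed labels in the timing tables below readable.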

Loading data

First, read the data: import pandas and use read_csv. We define each step as a function and call it later so that its execution time can be measured.

import pandas as pd

@timer
def file_read(fn, args={}):
    df = pd.read_csv(fn, **args)
    evaluate(df)
    print(df.shape)
    return df

Data processing

Remove data with missing values.

@timer
def drop_na(df):
    df.dropna(how="all", inplace=True)
    evaluate(df)
    return df
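Note that how="all" only drops rows in which every column is missing; rows with some missing values survive. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

# how="all" drops a row only when *every* column is NaN
df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan]})
df.dropna(how="all", inplace=True)
print(len(df))  # 2: only the all-NaN third row is removed
```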

The pickup and drop-off dates and times are read as strings, so they need to be converted to datetimes.

@timer
def txt_to_date(df, col):
    df[col] = pd.to_datetime(df[col])
    evaluate(df)
    return df
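For illustration, this is what the conversion does on a couple of sample strings in the same format as the CSV's datetime columns (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"tpep_pickup_datetime":
                   ["2022-01-01 00:35:40", "2023-06-30 23:59:59"]})
# Strings become datetime64 values, enabling the .dt accessor used below
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
print(df["tpep_pickup_datetime"].dt.year.tolist())  # [2022, 2023]
```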

Let’s look at the distribution grouped by passenger count (the print is omitted since it is not included in the performance evaluation). Since there are records with zero passengers, we remove them.

# At least one passenger
@timer
def check_passenger_c(df):
    df_ = df.groupby("passenger_count").size()
    evaluate(df_)
    df = df[df["passenger_count"] > 0]
    evaluate(df)
    return df
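The groupby-then-filter pattern can be sketched on toy data (values are hypothetical, just to show the shape of the result):

```python
import pandas as pd

# Toy data: one zero-passenger record that should be dropped
df = pd.DataFrame({"passenger_count": [0, 1, 1, 2]})
counts = df.groupby("passenger_count").size().to_dict()
print(counts)  # {0: 1, 1: 2, 2: 1}

df = df[df["passenger_count"] > 0]
print(len(df))  # 3
```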

Extract the year, month, day, and hour from the pickup datetime and add them as columns. The distribution of pickup years and months contains invalid values, so we remove them.

# correct ride year/month
@timer
def check_pu_date(df):
    df['year'] = df['tpep_pickup_datetime'].dt.year
    df['month'] = df['tpep_pickup_datetime'].dt.month
    df['date'] = df['tpep_pickup_datetime'].dt.day
    df['hour'] = df['tpep_pickup_datetime'].dt.hour

    df_ = df.groupby("year").size()
    evaluate(df_)
    df = df[(df['year'] == 2022) | (df['year'] == 2023)]

    df_ = df.groupby("month").size()
    evaluate(df_)
    df = df[(df['month'] >= 1) & (df['month'] <= 12)]

    df_ = df.groupby(["year", "month"]).size()
    evaluate(df_)
    evaluate(df)
    return df

Convert the difference between the drop-off time and the pickup time to minutes and add it as a column. Remove non-positive or implausibly long ride times.

# realistic ride time in minutes
@timer
def check_ride_time(df):
    df["ride_time"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.seconds / 60
    df = df[(df["ride_time"] > 0) & (df["ride_time"] <= 180)]
    evaluate(df)
    return df
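As an aside, .dt.seconds and .dt.total_seconds() are not the same thing: .dt.seconds reports only the within-day component of a timedelta, silently ignoring whole days. It is harmless here because the filter keeps only rides up to 180 minutes, but the difference is worth knowing:

```python
import pandas as pd

td = pd.Series([pd.Timedelta(minutes=30),
                pd.Timedelta(days=1, minutes=30)])
# .dt.seconds ignores the day component, so a 24h30m trip looks like 30 minutes
print((td.dt.seconds / 60).tolist())          # [30.0, 30.0]
# .dt.total_seconds() gives the true duration
print((td.dt.total_seconds() / 60).tolist())  # [30.0, 1470.0]
```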

Remove non-positive or implausibly large values for trip distance and fare.

# realistic distances
@timer
def check_trip_distance(df):
    df = df[(df["trip_distance"] > 0) & (df["trip_distance"] <= 250)]
    evaluate(df)
    return df

# realistic fares
@timer
def check_total_amount(df):
    df = df[(df["total_amount"] > 0) & (df["total_amount"] <= 1000)]
    evaluate(df)
    return df

Calculate the latitude and longitude from the IDs of the pickup and drop-off locations. The mapping between IDs and locations can be checked here:

https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv

The columns are added by merging a conversion table built from the latitudes and longitudes calculated from:

https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip

Using the latitude and longitude information, remove data located outside New York City.

# Find latitude and longitude from IDs
@timer
def add_coordinate(df, ID_df):
    df = df.merge(ID_df.rename(columns={"longitude": "start_lon", "latitude": "start_lat"}),
                  left_on="PULocationID", right_on="LocationID", how="left").drop("LocationID", axis=1)
    df = df.merge(ID_df.rename(columns={"longitude": "end_lon", "latitude": "end_lat"}),
                  left_on="DOLocationID", right_on="LocationID", how="left").drop("LocationID", axis=1)
    evaluate(df)
    return df

# Check if it is in NY
@timer
def in_NY(df):
    df = df[(df["start_lon"] <= -71.47) & (df["start_lon"] >= -79.45)]
    df = df[(df["start_lat"] >= 40.29) & (df["start_lat"] <= 45)]
    df = df[(df["end_lon"] <= -71.47) & (df["end_lon"] >= -79.45)]
    df = df[(df["end_lat"] >= 40.29) & (df["end_lat"] <= 45)]
    evaluate(df)
    return df

def check_in_NY(df, ID_df):
    df = add_coordinate(df, ID_df)
    df = in_NY(df)
    return df
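The double merge in add_coordinate can be sketched on a hypothetical miniature version of the ID-to-coordinate table (the IDs and coordinates below are made up):

```python
import pandas as pd

# Hypothetical miniature lookup table: LocationID -> coordinates
ID_df = pd.DataFrame({"LocationID": [1, 2],
                      "latitude": [40.77, 40.75],
                      "longitude": [-73.97, -73.99]})
trips = pd.DataFrame({"PULocationID": [1, 2], "DOLocationID": [2, 1]})

# First merge attaches pickup coordinates, second attaches drop-off coordinates;
# renaming before each merge avoids column-name collisions.
df = trips.merge(ID_df.rename(columns={"longitude": "start_lon", "latitude": "start_lat"}),
                 left_on="PULocationID", right_on="LocationID", how="left").drop("LocationID", axis=1)
df = df.merge(ID_df.rename(columns={"longitude": "end_lon", "latitude": "end_lat"}),
              left_on="DOLocationID", right_on="LocationID", how="left").drop("LocationID", axis=1)
print(sorted(df.columns))
```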

As described above, a series of steps is prepared to read the data, perform type conversion, add columns, and remove outliers:

  1. Read the file
  2. Convert date data from string to date data
  3. Preprocessing
  4. Remove missing values
  5. Check the number of passengers
  6. Check distribution by groupby
  7. Select at least 1 passenger
  8. Check the time of boarding
  9. Get the year, month, date, and hour from the date data and add columns
  10. Group by year and check -> Select only the relevant year
  11. Group by month and check -> select only Jan-Dec
  12. Group by year and month and check the distribution
  13. Check the boarding time
  14. Take the difference between the time of disembarkation and the time of embarkation, convert to minutes, and add a column (dt.total_seconds is not supported by FireDucks)
  15. Select realistic ride time data
  16. Check the ride distance
  17. Select realistic distance data
  18. Check fare
  19. Select realistic fare data
  20. Select NY City data
  21. Merge the passenger ID with the latitude and longitude table
  22. Select the data where the longitude and latitude of the boarding and alighting is in NY city

Execution time in pandas

First, let’s check the execution time on pandas. A 24-core Xeon server (Intel(R) Xeon(R) Gold 6226 CPU x 2, 256GB main memory) was used for the measurement.

import os
data_path = "data_sets"
fn = os.path.join(data_path, "taxi_all.csv")
ID_file = os.path.join(data_path, "ID_to_coordinate.csv")

need_cols = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance',
             'PULocationID', 'DOLocationID', 'total_amount', 'improvement_surcharge', 'extra',
             'fare_amount', 'RatecodeID']

@timer
def Preprocessing():
    df = file_read(fn, {"usecols": need_cols})
    ID_df = file_read(ID_file)

    df = txt_to_date(df, "tpep_pickup_datetime")
    df = txt_to_date(df, "tpep_dropoff_datetime")

    df = drop_na(df)
    df = check_passenger_c(df)
    df = check_pu_date(df)
    df = check_ride_time(df)
    df = check_trip_distance(df)
    df = check_total_amount(df)
    df = check_in_NY(df, ID_df)
    evaluate(df)
    return df

df = Preprocessing()

The execution times in pandas are shown in the table below. file_read appears twice because both the taxi data and the location data used for merging are read, and txt_to_date appears twice because both the pickup time and the drop-off time are converted. The results show that reading the file alone took more than one minute. In addition, check_pu_date and add_coordinate, which involve column additions and merge processing, take more than 30 seconds, and the preprocessing as a whole takes 186 seconds to complete.

Execution time with FireDucks

Next, measure the execution time when the imported library is switched from pandas to FireDucks.

pip install fireducks

The above scripts for pandas preprocessing can be used as-is by importing FireDucks, since FireDucks is compatible with pandas.

import fireducks.pandas as pd

Note that FireDucks does not execute methods immediately when they are called; execution is deferred. Therefore, _evaluate() must be run to measure the execution time of each method.
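To illustrate the pattern, here is a minimal sketch that falls back to plain pandas when FireDucks is not installed (in that case the evaluate() helper is simply a no-op, since only FireDucks frames expose _evaluate()):

```python
# Works with either backend: FireDucks if installed, plain pandas otherwise.
try:
    import fireducks.pandas as pd  # lazy: operations build an execution plan
except ImportError:
    import pandas as pd            # eager fallback for illustration

def evaluate(df):
    # FireDucks frames expose _evaluate(); plain pandas frames do not.
    if hasattr(df, "_evaluate"):
        df._evaluate()

df = pd.DataFrame({"x": [1, 2, 3]})
df = df[df["x"] > 1]  # with FireDucks, not necessarily executed yet
evaluate(df)          # forces execution so timings are meaningful
print(len(df))        # 2
```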

The following table compares the execution time of FireDucks with that of pandas after importing FireDucks and performing the same preprocessing calculations.

Function | pandas [sec] | FireDucks [sec] | Speed-up ratio
--- | --- | --- | ---
file_read | 72.19 | 3.52 | 20.49
file_read | 0.003 | 0.01 | 0.38
txt_to_date | 9.07 | 19.10 | 0.48
txt_to_date | 8.57 | 20.57 | 0.42
drop_na | 3.13 | 0.70 | 4.47
check_passenger_c | 3.21 | 1.80 | 1.79
check_pu_date | 27.37 | 0.99 | 27.64
check_ride_time | 7.02 | 2.00 | 3.51
check_trip_distance | 3.24 | 0.91 | 3.55
check_total_amount | 3.11 | 0.93 | 3.59
in_NY | 20.75 | 2.71 | 7.65
Preprocessing | 186.02 | 54.92 | 3.39

The file_read of the taxi data took more than 70 seconds in pandas but only about 3.5 seconds with FireDucks, more than 20 times faster. Other time-consuming processes (check_pu_date, add_coordinate) were also significantly reduced.

The computation time for txt_to_date actually increased with FireDucks. This is because the to_datetime() function was not yet accelerated by FireDucks at the time of writing. However, even when a function like to_datetime() that is not accelerated is called, FireDucks does not raise an error: it falls back to the corresponding pandas function to perform the calculation.

The total computation time for the preprocessing calculations in this article was 55 seconds with FireDucks, compared to 186 seconds with pandas, which is about 3.4 times faster. Of the 55 seconds, about 40 seconds was for processing that had not yet been accelerated, and further acceleration is expected in the future.