Pitfalls of Time Measurement for FireDucks with %%time in Notebooks

We will explore the pitfalls of using the %%time magic command in Jupyter and other IPython Notebooks to measure the execution time of FireDucks processes.

By Osamu Daido | Thursday, December 26, 2024

This is Osamu Daido from the FireDucks development team. In today's developers' blog, I would like to present a subtle pitfall in time measurement.

Quick Overview

When measuring the execution time of FireDucks with the %%time magic command in IPython Notebooks, always call the _evaluate() method on the DataFrame or Series inside the timed cell, so that the actual processing is included in the measurement!

%%time
df = pd.read_csv("input.csv")
df._evaluate()

Time Measurement in Notebooks

Jupyter and other IPython Notebooks provide magic commands to measure the execution time of code. With a single percent sign, %time measures the execution time of only one line of code, while with double percent signs, %%time measures the execution time of the entire cell. Because FireDucks can process data faster while offering the same API as pandas, you may well be curious to measure its execution time.

However... hmm? Do you think the following time measurement is correct?

import fireducks.pandas as pd

%%time
df = pd.read_csv("sample-dataset-tips.csv")
df.head()

CPU times: user 3.37 ms, sys: 4.06 ms, total: 7.43 ms
Wall time: 6.87 ms
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Time Measurement for FireDucks

As explained on the execution model page, FireDucks uses a lazy execution model. In simple terms, FireDucks DataFrames do not begin actual processing until they are displayed on the screen with functions like print() or display(), or until the _evaluate() method is called. By deferring the work, FireDucks can optimize the accumulated operations before executing them, which lets it process data more quickly. This execution of accumulated operations is referred to as “evaluation.”
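
As a rough illustration of the lazy model described above, here is a toy sketch in plain Python. The LazyFrame class and everything inside it are hypothetical and are not how FireDucks is implemented; the sketch only mimics the behavior of queuing operations and running them when the object is displayed or when _evaluate() is called.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not the FireDucks implementation)."""

    def __init__(self, rows, pending=None):
        self._rows = rows
        self._pending = list(pending or [])  # queued operations, not yet run
        self.evaluated = False

    def sort_values(self):
        # Only record the operation; no sorting happens here.
        return LazyFrame(self._rows, self._pending + [sorted])

    def _evaluate(self):
        # Run every queued operation once, then remember that we did.
        if not self.evaluated:
            for op in self._pending:
                self._rows = op(self._rows)
            self._pending = []
            self.evaluated = True
        return self

    def __repr__(self):
        # Displaying the object forces evaluation, as print() or display() would.
        return repr(self._evaluate()._rows)


frame = LazyFrame([3, 1, 2]).sort_values()
print(frame.evaluated)  # False: sort_values() only queued the work
print(frame)            # [1, 2, 3]: printing triggers evaluation
print(frame.evaluated)  # True
```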

In IPython Notebooks, if the last line of a cell is not an assignment statement but simply a value, it is automatically displayed on the screen, similar to the Python interpreter's REPL. In IPython terms, it's as if display() is automatically called. This means that if you place a FireDucks DataFrame at the end of a cell, it will also be automatically evaluated.

df = pd.read_csv("sample-dataset-tips.csv")
df = df.sort_values(by="tip")
df  # THIS!

However, there is a subtle pitfall when you want to measure execution time using the %%time magic command. The fact that placing a FireDucks DataFrame at the end of a cell triggers automatic evaluation does not mean that the execution time is measured correctly.

Example of Incorrect Time Measurement

I prepared a CSV file of about 10 GB for this experiment. When the following cell is executed, something strange happens.

%%time
df = pd.read_csv("sample-dataset-tips10gb.csv")
df = df.sort_values(by="tip")
df

CPU times: user 18.2 ms, sys: 4.31 ms, total: 22.5 ms
Wall time: 15.2 ms
total_bill tip sex smoker day time size
67 3.07 1.0 Female Yes Sat Dinner 1
92 5.75 1.0 Female Yes Fri Dinner 2
… … … … … … … …
319815362 50.81 10.0 Male Yes Sat Dinner 3
319815606 50.81 10.0 Male Yes Sat Dinner 3
319815680 rows x 7 columns

Let's look at the line labeled “Wall time.” If it really took only 15 milliseconds to read and sort data with 300 million rows, wouldn't that be incredible? In reality, it took about 10 seconds from the start of the cell's execution until the results were displayed. You might wonder, “The results are displayed on the screen, so shouldn't they be properly evaluated?” That's half true and half false.

In fact, with this approach, the evaluation of the DataFrame begins only after the %%time timer has stopped. In other words, because the order is timer stops → evaluation → display, the actual processing of the DataFrame is outside the measurement range of %%time.
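
The ordering can be mimicked in plain Python. In the sketch below, SlowLazyResult and timed_cell are hypothetical names, and timed_cell is only a simplified model of the order described above (run the cell body, stop the timer, then display the last value); it is not the actual %%time implementation.

```python
import time


class SlowLazyResult:
    """Hypothetical lazy object whose real work takes about 0.2 seconds."""

    def __init__(self):
        self.evaluated = False

    def _evaluate(self):
        if not self.evaluated:
            time.sleep(0.2)  # stands in for reading and sorting the data
            self.evaluated = True
        return self

    def __repr__(self):
        self._evaluate()  # displaying the object forces evaluation
        return "<result>"


def timed_cell(cell_body):
    """Simplified model: run the body, stop the timer, then display the last value."""
    start = time.perf_counter()
    last_value = cell_body()
    elapsed = time.perf_counter() - start  # timer stops here
    shown = repr(last_value)               # display happens afterwards
    return elapsed, shown


# Last line of the cell is the lazy object itself: the work falls outside the timer.
lazy_elapsed, _ = timed_cell(lambda: SlowLazyResult())
# Last line of the cell calls _evaluate(): the work falls inside the timer.
eager_elapsed, _ = timed_cell(lambda: SlowLazyResult()._evaluate())

print(f"last line is the object:  {lazy_elapsed:.3f} s")
print(f"last line is _evaluate(): {eager_elapsed:.3f} s")
```

In this model, the first measurement stays close to zero even though the result still takes about 0.2 seconds to appear, which is the same effect as the 15 ms wall time reported for the 10 GB file.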

Example of Correct Time Measurement

Therefore, even in IPython Notebooks, when you want to measure execution time, explicitly call the _evaluate() method of the DataFrame so that it is evaluated inside the timed cell. A cell written as shown below executes in the order evaluation → timer stops → display. This way, the actual processing of the DataFrame falls within the measurement range of %%time.

%%time
df = pd.read_csv("sample-dataset-tips10gb.csv")
df = df.sort_values(by="tip")
df._evaluate()

CPU times: user 3min 58s, sys: 1min 2s, total: 5min
Wall time: 11.1 s
total_bill tip sex smoker day time size
67 3.07 1.0 Female Yes Sat Dinner 1
92 5.75 1.0 Female Yes Fri Dinner 2
… … … … … … … …
319815362 50.81 10.0 Male Yes Sat Dinner 3
319815606 50.81 10.0 Male Yes Sat Dinner 3
319815680 rows x 7 columns
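
The same idea carries over to timing code outside a notebook. The helper below is a minimal sketch under stated assumptions: time_evaluated and FakeLazyFrame are hypothetical names introduced here for illustration, and the only FireDucks-specific detail relied on is the _evaluate() method described in this post. Objects that have no _evaluate() method are simply timed as they are.

```python
import time


def time_evaluated(build):
    """Time build() and force evaluation before the timer stops.

    build is a hypothetical zero-argument callable that returns a
    DataFrame-like object. If the result has an _evaluate() method, it is
    called inside the timed region; other objects are returned unchanged.
    """
    start = time.perf_counter()
    result = build()
    evaluate = getattr(result, "_evaluate", None)
    if callable(evaluate):
        evaluate()  # evaluation happens before the timer stops
    elapsed = time.perf_counter() - start
    return result, elapsed


class FakeLazyFrame:
    """Stand-in for a lazy DataFrame so the sketch runs without FireDucks."""

    def __init__(self):
        self.evaluated = False

    def _evaluate(self):
        time.sleep(0.1)  # pretend this is the deferred processing
        self.evaluated = True
        return self


frame, seconds = time_evaluated(FakeLazyFrame)
print(frame.evaluated, f"{seconds:.2f} s")

# An eager object without _evaluate() passes through untouched.
eager, _ = time_evaluated(lambda: [1, 2, 3])

# With FireDucks installed, usage might look like this (not run here):
# import fireducks.pandas as pd
# df, seconds = time_evaluated(lambda: pd.read_csv("input.csv").sort_values(by="tip"))
```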

Slightly Different Solution

If you write it as shown below, the order will be evaluation → display → timer stops. In this case, the time spent displaying the DataFrame on the screen also becomes part of the measurement. Do you notice any other difference? The order of the DataFrame output and the timing result output is reversed.

%%time
df = pd.read_csv("sample-dataset-tips10gb.csv")
df = df.sort_values(by="tip")
display(df)

total_bill tip sex smoker day time size
67 3.07 1.0 Female Yes Sat Dinner 1
92 5.75 1.0 Female Yes Fri Dinner 2
… … … … … … … …
319815362 50.81 10.0 Male Yes Sat Dinner 3
319815606 50.81 10.0 Male Yes Sat Dinner 3
319815680 rows x 7 columns
CPU times: user 3min 58s, sys: 55.4 s, total: 4min 53s
Wall time: 10.6 s

Wrap-up

FireDucks has the same API as pandas while adopting a lazy evaluation mechanism, so it is important to pay attention to such subtle details when measuring processing time. That said, FireDucks allows you to speed up processing while using nearly the same code as you would with pandas, making it a powerful ally for data science.

Recently, many people have shown interest in FireDucks, and we are receiving more feedback about its improved speed as well as bug reports. As the development team, we are committed to making sure FireDucks gets widely used and continues to be valuable over the long term. Please stay tuned for future updates!

May the Acceleration be with you, FireDucks Development Team