How to take traces in FireDucks
FireDucks has a trace function that records how long each operation, such as read_csv, groupby, or sort, takes. This article explains how to use it.
How to output and display trace files
You do not need to modify your program to use the trace function. Set the FIREDUCKS_FLAGS environment variable as shown below and run the program.
$ FIREDUCKS_FLAGS="--trace=3" python -mfireducks.pandas your_program.py
After you run the program with this environment variable set, a trace file named trace.json is created in the directory where the program was executed.
To view a trace file, use Microsoft Edge or Google Chrome, both of which include a trace viewer. Start the trace viewer by typing edge://tracing (Microsoft Edge) or chrome://tracing (Google Chrome) in the address bar.
The following image shows the Trace Viewer running in Microsoft Edge.
Click the Load button to open the trace file. The execution trace of the program is displayed graphically. The following image shows the execution trace of one query of the polars-tpch benchmark introduced in this [article](https://fireducks-dev.github.io/posts/20241206_update_polars-tpch/).
The top event shows the time of the whole program (more precisely, the time between import fireducks.pandas and the end of the program). Below that, fireducks.core.evaluate appears as two major blocks, because the polars-tpch benchmark explicitly separates reading the Parquet file from executing the query.
In the first block, the Parquet reading step, fireducks.read_parquet_with_metadata, accounts for essentially all of the execution time. You can also zoom in with the mouse to get a more detailed breakdown of the execution time for the second block, the query itself, as shown below.
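The same per-operation breakdown can also be pulled out of the trace file with a short script. The following is a minimal sketch, not part of FireDucks. It assumes that trace.json follows the Chrome Trace Event format that chrome://tracing loads, and that each operation is recorded as a complete event ("ph": "X") with a "dur" field in microseconds. The exact event layout FireDucks emits is not described in this article, so treat those field names as assumptions. The function name summarize_trace is hypothetical.

```python
import json
from collections import defaultdict


def summarize_trace(path):
    """Sum the duration of complete ("X") events per event name, in seconds.

    Assumption: the file uses the Chrome Trace Event format, where "dur"
    is expressed in microseconds.
    """
    with open(path) as f:
        data = json.load(f)

    # The format allows either a bare list of events or an object
    # holding the list under "traceEvents".
    events = data["traceEvents"] if isinstance(data, dict) else data

    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev["name"]] += ev["dur"] / 1e6  # microseconds to seconds

    # Longest-running names first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Calling summarize_trace("trace.json") returns a list of (name, seconds) pairs. Because events nest (the top event contains everything else), the per-name totals overlap and should not be added together.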
How to change the trace file name
The default trace file name is trace.json, but you can choose another name with the --trace-file option, for example --trace-file=foo.json.
$ FIREDUCKS_FLAGS="--trace=3 --trace-file=foo.json" python -mfireducks.pandas your_program.py
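When several programs or queries are traced in one session, it can be convenient to launch them from a script so that each run gets its own trace file. The sketch below only builds the command line and environment shown above. The helper names build_traced_run and run_traced are hypothetical, and the only flags used are the two that appear in this article. Note that it overwrites any FIREDUCKS_FLAGS value already set in the environment.

```python
import os
import subprocess
import sys


def build_traced_run(script, trace_file="trace.json"):
    """Return (command, environment) for running `script` with tracing enabled."""
    env = dict(os.environ)
    # Same flags as in the shell example; replaces any existing value.
    env["FIREDUCKS_FLAGS"] = f"--trace=3 --trace-file={trace_file}"
    cmd = [sys.executable, "-m", "fireducks.pandas", script]
    return cmd, env


def run_traced(script, trace_file="trace.json"):
    """Run `script` under FireDucks and write its trace to `trace_file`."""
    cmd, env = build_traced_run(script, trace_file)
    # Requires FireDucks to be installed for the current interpreter.
    return subprocess.run(cmd, env=env, check=True)
```

For example, run_traced("your_program.py", "foo.json") is equivalent to the shell command above.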
How to output trace summary to standard error
If you only want a breakdown of the time spent in each operation, you can print a summary to standard error instead of writing a trace file.
Use the same options as above, but pass --trace-file=- in place of a file name.
$ FIREDUCKS_FLAGS="--trace=3 --trace-file=-" python -mfireducks.pandas your_program.py
The following is the summary for the polars-tpch benchmark query used in the previous example. Details such as the order in which the operations ran are not available, but the total execution time of each operation can be seen.
elapsed      6.071 sec
kernels      5.963 sec   98.22%    101
fallbacks    0.000 sec    0.00%      0
                                      duration sec    ratio  count
== kernel ==
fireducks.read_parquet_with_metadata         5.453   89.83%      1
fireducks.filter                             0.293    4.83%      1
fireducks.groupby_agg                        0.089    1.46%      1
fireducks.le.vector.scalar                   0.051    0.83%      1
fireducks.mul.vector.vector                  0.042    0.69%      2
fireducks.rsub.vector.scalar                 0.023    0.38%      1
fireducks.radd.vector.scalar                 0.009    0.15%      1
fireducks.sort_values                        0.002    0.03%      1
fireducks.read_parquet_metadata              0.001    0.02%      1
fireducks.project                            0.000    0.00%      8
== fallback ==
== other ==
top                                          6.071  100.00%      1
create_mlir_func                             0.001    0.02%      3
import pandas                                0.000    0.00%      2
fire.get_string                              0.000    0.00%     22
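If the summary is captured to a file (for example by redirecting standard error with 2> summary.txt), it can be turned into structured rows for further analysis. The following sketch assumes the layout of the example output above: section markers such as == kernel ==, followed by rows of name, duration in seconds, ratio with a percent sign, and count. The format may differ between FireDucks versions, and parse_summary is a hypothetical helper name.

```python
import re

# A data row: name, duration in seconds, ratio with "%", and call count.
ROW = re.compile(
    r"^(?P<name>\S.*?)\s+(?P<sec>\d+\.\d+)\s+(?P<ratio>\d+\.\d+)%\s+(?P<count>\d+)$"
)
# A section marker such as "== kernel ==".
SECTION = re.compile(r"^==\s*(\w+)\s*==$")


def parse_summary(text):
    """Parse a trace summary into a list of dicts, one per row."""
    rows = []
    section = None
    for line in text.splitlines():
        line = line.strip()
        marker = SECTION.match(line)
        if marker:
            section = marker.group(1)
            continue
        row = ROW.match(line)
        # Lines before the first section marker (totals, header) are skipped.
        if row and section is not None:
            rows.append({
                "section": section,
                "name": row.group("name"),
                "sec": float(row.group("sec")),
                "ratio": float(row.group("ratio")),
                "count": int(row.group("count")),
            })
    return rows
```

Filtering the result on section == "kernel" and sorting by sec gives the operations that dominate the run, which for the example above is fireducks.read_parquet_with_metadata.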
Conclusion
This article has introduced how to use the trace function in FireDucks.
If you notice a slowdown while using FireDucks, tracing as described in this article may help you find the operation responsible.
We hope that you will make full use of FireDucks by using the trace function.