<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>FireDucks – Posts</title><link>https://fireducks-dev.github.io/posts/</link><description>Recent content in Posts on FireDucks</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Fri, 31 Jan 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://fireducks-dev.github.io/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Posts: How to take traces in FireDucks</title><link>https://fireducks-dev.github.io/posts/2024-12-20-trace/</link><pubDate>Fri, 20 Dec 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/2024-12-20-trace/</guid><description>
&lt;p>FireDucks has a trace function that records how long each operation, such as read_csv, groupby, or sort, takes.
This article introduces how to use it.&lt;/p>
&lt;h2 id="how-to-output-and-display-trace-files">How to output and display trace files&lt;/h2>
&lt;p>To use the trace function, you do not need to modify your program.
Simply set the environment variable as shown below and run the program.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ FIREDUCKS_FLAGS&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;--trace=3&amp;#34;&lt;/span> python -mfireducks.pandas your_program.py
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>After running the program with this environment variable set, a trace file named &lt;code>trace.json&lt;/code> is created in the directory where the program was executed.&lt;/p>
&lt;p>To view a trace file, use a web browser with built-in trace viewer functionality, such as Microsoft Edge or Google Chrome.
You can start the trace viewer by typing &lt;code>edge://tracing&lt;/code> (Microsoft Edge) or &lt;code>chrome://tracing&lt;/code> (Google Chrome) in the address bar.&lt;/p>
&lt;p>The following image shows the Trace Viewer running in Microsoft Edge.&lt;/p>
&lt;p>&lt;img src="trace01.png" alt="Edge TraceViewer">&lt;/p>
&lt;p>Click the Load button to open the trace file. The execution trace of the program will be displayed graphically.
The following image shows the execution trace of one query of the polars-tpch benchmark introduced in this &lt;a href="https://fireducks-dev.github.io/posts/20241206_update_polars-tpch/">article&lt;/a>.&lt;/p>
&lt;p>&lt;img src="trace02.png" alt="TPCH Q01 Trace">&lt;/p>
&lt;p>The &lt;code>top&lt;/code> row shows the time of the whole program (more precisely, the time between &lt;code>import fireducks.pandas&lt;/code> and the end of the program). Below that, &lt;code>fireducks.core.evaluate&lt;/code> is divided into two major blocks: the polars-tpch benchmark explicitly separates reading the parquet file from executing the query, so the evaluation is split in two.&lt;/p>
&lt;p>In the first half of the evaluation, you can see that the &lt;code>fireducks.read_parquet_with_metadata&lt;/code> parquet-reading operation accounts for nearly all of the execution time. You can also zoom in with the mouse to get a more detailed breakdown of the execution time of the query in the second half, as shown below.&lt;/p>
&lt;p>&lt;img src="trace03.png" alt="TPCH Q01 Trace Query Detail">&lt;/p>
&lt;h2 id="how-to-change-the-trace-file-name">How to change the trace file name&lt;/h2>
&lt;p>The default trace file name is &lt;code>trace.json&lt;/code>, but you can specify an arbitrary file name with the &lt;code>--trace-file&lt;/code> option, for example &lt;code>--trace-file=foo.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ FIREDUCKS_FLAGS&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;--trace=3 --trace-file=foo.json&amp;#34;&lt;/span> python -mfireducks.pandas your_program.py
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="how-to-output-trace-summary-to-standard-error">How to output trace summary to standard error&lt;/h2>
&lt;p>If you only want a breakdown of the time spent on each operation, you can instead print summary information to standard error.&lt;/p>
&lt;p>The summary uses the same options described above; just pass &lt;code>--trace-file=-&lt;/code> instead of a file name.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ FIREDUCKS_FLAGS&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;--trace=3 --trace-file=-&amp;#34;&lt;/span> python -mfireducks.pandas your_program.py
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Below is the summary output for the polars-tpch benchmark query used in the previous example.
Although details such as the execution order of each operation are not shown, it gives a quick breakdown of the execution time.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>elapsed 6.071 sec
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>kernels 5.963 sec 98.22% &lt;span style="color:#ae81ff">101&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fallbacks 0.000 sec 0.00% &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> duration sec ratio count
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">==&lt;/span> kernel &lt;span style="color:#f92672">==&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.read_parquet_with_metadata 5.453 89.83% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.filter 0.293 4.83% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.groupby_agg 0.089 1.46% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.le.vector.scalar 0.051 0.83% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.mul.vector.vector 0.042 0.69% &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.rsub.vector.scalar 0.023 0.38% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.radd.vector.scalar 0.009 0.15% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.sort_values 0.002 0.03% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.read_parquet_metadata 0.001 0.02% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fireducks.project 0.000 0.00% 8
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">==&lt;/span> fallback &lt;span style="color:#f92672">==&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">==&lt;/span> other &lt;span style="color:#f92672">==&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>top 6.071 100.00% &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>create_mlir_func 0.001 0.02% &lt;span style="color:#ae81ff">3&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>import pandas 0.000 0.00% &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>fire.get_string 0.000 0.00% &lt;span style="color:#ae81ff">22&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This article has introduced how to use the trace function in FireDucks.&lt;/p>
&lt;p>When using FireDucks, you may occasionally notice a slowdown.
In that case, tracing as described in this article can help you identify the operation that caused it.&lt;/p>
&lt;p>We hope that you will make full use of FireDucks by using the trace function.&lt;/p></description></item><item><title>Posts: Ensuring compatibility with pandas in the GPU version of FireDucks</title><link>https://fireducks-dev.github.io/posts/2024-12-19-araki-en/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/2024-12-19-araki-en/</guid><description>
&lt;p>We are currently developing a GPU version of FireDucks.&lt;/p>
&lt;p>FireDucks is built on an architecture that translates programs into an intermediate representation at runtime, optimizes that intermediate representation, and then compiles and executes it for the backend. The currently released version of FireDucks has a CPU backend; for the GPU version, only the backend is swapped out. This allows us to reuse, as is, the translation into and optimization of the intermediate representation developed for the CPU version.&lt;/p>
&lt;p>In developing the GPU backend, we are making use of NVIDIA&amp;rsquo;s cuDF library. Our intermediate representation is roughly compatible with the pandas API, and cuDF provides a similar API to pandas, so at first glance the backend development seems straightforward. However, the functions provided by cuDF behave slightly differently from those provided by pandas, so some ingenuity is required to maintain compatibility with pandas.&lt;/p>
&lt;p>In this blog post, I would like to briefly introduce some of the issues that need to be addressed in order to maintain compatibility with pandas.&lt;/p>
&lt;h2 id="different-result-types">Different result types&lt;/h2>
&lt;p>When working with dates in pandas, if you convert them to the datetime64 type, you can use the dt accessor to extract the year, month, and day.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame({&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>: [&lt;span style="color:#e6db74">&amp;#34;2017-11-01 12:24:00&amp;#34;&lt;/span>]})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>dfa &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>to_datetime(df[&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(dfa&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>year)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>will print &lt;code>2017&lt;/code>. By the way, what is the dtype of this value?
If you run &lt;code>print(dfa.dt.year.dtype)&lt;/code>, pandas returns &lt;code>int32&lt;/code> while cuDF returns &lt;code>int16&lt;/code>. This difference appears to be intentional, in order to reduce GPU memory usage.&lt;/p>
&lt;p>A year value is unlikely to exceed 16 bits, so at first glance this seems harmless, but calculations on it can overflow and change the result. For example, if you approximate the number of hours since year 0 with &lt;code>dfa.dt.year * 365 * 24&lt;/code>, the int16 computation overflows and you get a different result.&lt;/p>
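&lt;p>The overflow can be reproduced with plain NumPy (a minimal sketch; the two dtypes mirror what pandas and cuDF report for &lt;code>dt.year&lt;/code>):&lt;/p>

```python
import numpy as np

# int32 is what pandas returns for dt.year; int16 is what cuDF returns.
year_i32 = np.array([2017], dtype=np.int32)
year_i16 = np.array([2017], dtype=np.int16)

hours_i32 = year_i32 * 365 * 24  # 17,668,920 fits comfortably in int32
hours_i16 = year_i16 * 365 * 24  # silently wraps around in int16

print(int(hours_i32[0]))  # 17668920
print(int(hours_i16[0]))  # a different (wrapped) value
```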
&lt;p>Beyond this example, cuDF result types often differ slightly from those of pandas. In FireDucks, we convert results to the pandas type to maintain compatibility.&lt;/p>
&lt;h2 id="missing-values-are-handled-differently-in-calculations">Missing values are handled differently in calculations&lt;/h2>
&lt;p>In pandas, missing values (what an RDB calls NULL) are basically represented as NaN. In cuDF, on the other hand, missing values are treated as a special value (NA). Calculation results may therefore differ.&lt;/p>
&lt;p>For example,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Series([&lt;span style="color:#ae81ff">1.0&lt;/span>, &lt;span style="color:#ae81ff">3.0&lt;/span>, np&lt;span style="color:#f92672">.&lt;/span>nan])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>mask &lt;span style="color:#f92672">=&lt;/span> df &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print (mask)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>will print&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>in the case of pandas. The missing-value position becomes False, because the comparison is performed against &lt;code>np.nan&lt;/code>.&lt;/p>
&lt;p>On the other hand, in the case of cuDF,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>is printed. The result of the comparison at the missing-value position is itself a missing value (NA). In an RDB, operations involving NULL always return NULL, so this is consistent with RDB behavior, but not with pandas. I think this is also an intentional feature of cuDF.&lt;/p>
&lt;p>pandas also offers an experimental missing value, &lt;code>pd.NA&lt;/code>. When it is used, the results match cuDF and RDB behavior, but otherwise they differ. In FireDucks, the results are adjusted so that missing values are handled exactly as in pandas.&lt;/p>
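&lt;p>You can observe the same contrast within pandas itself by comparing the default NaN representation with the nullable &lt;code>Float64&lt;/code> dtype, which uses &lt;code>pd.NA&lt;/code>. A minimal sketch (using &lt;code>Series.lt&lt;/code>, which is equivalent to the less-than operator):&lt;/p>

```python
import numpy as np
import pandas as pd

# Default float64 Series: the missing value is NaN.
s_nan = pd.Series([1.0, 3.0, np.nan])
mask_nan = s_nan.lt(2.0)  # same as comparing with the less-than operator
print(mask_nan.tolist())  # [True, False, False]: NaN compares as False

# Nullable Float64 Series: the missing value is pd.NA, as in cuDF.
s_na = pd.Series([1.0, 3.0, pd.NA], dtype="Float64")
mask_na = s_na.lt(2.0)
print(mask_na)  # the last entry stays NA instead of becoming False
```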
&lt;h2 id="the-results-of-merge-are-different">The results of merge are different&lt;/h2>
&lt;p>Pandas&amp;rsquo; merge has a complex specification, and cuDF does not necessarily follow that specification. This may change in the future, but for now there are differences, such as the following.&lt;/p>
&lt;p>In pandas merge, you can use &lt;code>left_on&lt;/code> and &lt;code>right_on&lt;/code> to specify the columns to be used for merging. Normally, you specify the column names here, but if the index has a name, you can also specify the index name. You can also specify the index by specifying &lt;code>left_index=True&lt;/code> or &lt;code>right_index=True&lt;/code>. Let&amp;rsquo;s try merging using these functions.&lt;/p>
&lt;p>First, create a DataFrame for merging. The left dataframe is created as:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>idx1 &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Index([&lt;span style="color:#ae81ff">1&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>,&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">4&lt;/span>],name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;p&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df1 &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame([[&lt;span style="color:#ae81ff">1&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>],[&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">4&lt;/span>],[&lt;span style="color:#ae81ff">5&lt;/span>,&lt;span style="color:#ae81ff">6&lt;/span>],[&lt;span style="color:#ae81ff">7&lt;/span>,&lt;span style="color:#ae81ff">8&lt;/span>]], columns&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>,&lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>], index&lt;span style="color:#f92672">=&lt;/span>idx1)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The result is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> a b
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">7&lt;/span> &lt;span style="color:#ae81ff">8&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The right dataframe is created as:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>idx2 &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Index([&lt;span style="color:#ae81ff">1&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>,&lt;span style="color:#ae81ff">5&lt;/span>,&lt;span style="color:#ae81ff">6&lt;/span>],name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;q&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df2 &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame([[&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">4&lt;/span>],[&lt;span style="color:#ae81ff">5&lt;/span>,&lt;span style="color:#ae81ff">6&lt;/span>],[&lt;span style="color:#ae81ff">1&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>],[&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">4&lt;/span>]], columns&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>,&lt;span style="color:#e6db74">&amp;#34;d&amp;#34;&lt;/span>], index&lt;span style="color:#f92672">=&lt;/span>idx2)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The result is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> c d
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>q
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now let&amp;rsquo;s try merging these. On the left, we specify the index column by name with &lt;code>left_on=[&amp;quot;p&amp;quot;]&lt;/code>, and on the right, we specify that we want to use the index with &lt;code>right_index=True&lt;/code>. We specify &lt;code>how=&amp;quot;outer&amp;quot;&lt;/code> to perform an outer join, as in an RDB.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df1&lt;span style="color:#f92672">.&lt;/span>merge(df2, left_on&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#e6db74">&amp;#34;p&amp;#34;&lt;/span>], right_index&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>, how&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;outer&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the case of pandas, the result is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> p a b c d
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#ae81ff">2.0&lt;/span> &lt;span style="color:#ae81ff">3.0&lt;/span> &lt;span style="color:#ae81ff">4.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">2.0&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">3.0&lt;/span> &lt;span style="color:#ae81ff">4.0&lt;/span> &lt;span style="color:#ae81ff">5.0&lt;/span> &lt;span style="color:#ae81ff">6.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">3.0&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">5.0&lt;/span> &lt;span style="color:#ae81ff">6.0&lt;/span> NaN NaN
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">4.0&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">7.0&lt;/span> &lt;span style="color:#ae81ff">8.0&lt;/span> NaN NaN
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>NaN &lt;span style="color:#ae81ff">5&lt;/span> NaN NaN &lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#ae81ff">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>NaN &lt;span style="color:#ae81ff">6&lt;/span> NaN NaN &lt;span style="color:#ae81ff">3.0&lt;/span> &lt;span style="color:#ae81ff">4.0&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the case of cuDF,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span> a b c d
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>p
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">7&lt;/span> &lt;span style="color:#ae81ff">8&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>NA&lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The dtype of the columns with missing values differs, because pandas uses NaN for missing values, which forces those columns to float64.
The bigger difference is that pandas produces a column called &lt;code>p&lt;/code>, while cuDF does not. The name of the resulting index also differs.&lt;/p>
&lt;p>The column &lt;code>p&lt;/code> is originally the index of the left DataFrame, and it seems a strange specification that it is materialized as a column of the resulting DataFrame (such columns are only created for the special parameter combination described above). RDB joins do not create such columns. Still, the results clearly differ, so this could cause compatibility problems for user programs.&lt;/p>
&lt;p>In FireDucks, we have adjusted the code so that the results are as close as possible to those of pandas even in such cases.&lt;/p>
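&lt;p>For reference, the pandas side of this example can be reproduced with the following sketch of the code shown above:&lt;/p>

```python
import pandas as pd

# Left frame: named index "p"; right frame: named index "q".
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                   columns=["a", "b"], index=pd.Index([1, 2, 3, 4], name="p"))
df2 = pd.DataFrame([[3, 4], [5, 6], [1, 2], [3, 4]],
                   columns=["c", "d"], index=pd.Index([1, 2, 5, 6], name="q"))

# Merge on the index level name on the left and the index on the right.
out = df1.merge(df2, left_on=["p"], right_index=True, how="outer")
print(out.columns.tolist())  # pandas materializes "p" as a regular column
```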
&lt;p>As described above, there are cases where the results differ between pandas and cuDF. The GPU version of FireDucks has been implemented to absorb such differences and produce results as close as possible to those of pandas. The GPU version of FireDucks is still under development and is not yet ready for use by users, but we hope that you will try it out when it is completed.&lt;/p></description></item><item><title>Posts: Cache or Eliminate? How FireDucks increase opportunity of optimization</title><link>https://fireducks-dev.github.io/posts/20241217_liveness_analysis/</link><pubDate>Tue, 17 Dec 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20241217_liveness_analysis/</guid><description>
&lt;p>As described &lt;a href="https://fireducks-dev.github.io/docs/user-guide/02-exec-model/">here&lt;/a>, FireDucks uses a
lazy execution model with define-by-run IR generation. Since FireDucks uses the
&lt;a href="https://mlir.llvm.org">MLIR compiler framework&lt;/a> to optimize and execute IR,
the first step of execution is creating an MLIR function which holds the operations
to be evaluated. This article describes how important this function-creation
step is for optimization, and thus for performance.&lt;/p>
&lt;p>In the simple example below, execution of the IR is triggered by the &lt;code>print&lt;/code> statement,
which calls &lt;code>df2.__repr__()&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df0 &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>read_parquet(&lt;span style="color:#e6db74">&amp;#34;date.parquet&amp;#34;&lt;/span>, [])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df1 &lt;span style="color:#f92672">=&lt;/span> df0&lt;span style="color:#f92672">.&lt;/span>sort_value(&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df2 &lt;span style="color:#f92672">=&lt;/span> df1[[&lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(df2)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> :
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When execution starts, FireDucks collects the
operations required to compute &lt;code>df2&lt;/code> and puts them into an MLIR
function.
The simplest such MLIR function would be as below.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-mlir" data-lang="mlir">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">// MLIR function (simplified)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">func&lt;/span> main() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> %t0 = read_parquet(&lt;span style="color:#e6db74">&amp;#34;data.parquet&amp;#34;&lt;/span>, [])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> %t1 = sort_values(%t0, [&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>], [True])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> %t2 = project(%t1, [&lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> %t0, %t1, %t2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In this function, the &lt;code>return&lt;/code> op at the end returns three
values: &lt;code>%t0&lt;/code>, &lt;code>%t1&lt;/code> and &lt;code>%t2&lt;/code>. After execution, the returned values are bound to the
corresponding Python variables, i.e. &lt;code>df0&lt;/code>, &lt;code>df1&lt;/code> and &lt;code>df2&lt;/code>. If these variables
are used after the &lt;code>print&lt;/code> statement, the bound results are reused. One can say
that the results of the operations are &lt;strong>cached&lt;/strong> to avoid re-execution.&lt;/p>
&lt;p>This function, however, severely restricts IR optimization. Because all op
outputs, &lt;code>%t0&lt;/code>, &lt;code>%t1&lt;/code> and &lt;code>%t2&lt;/code>, are returned, an optimizer has to
preserve all of them. If, on the other hand, the &lt;code>return&lt;/code> op returned
only &lt;code>%t2&lt;/code>, an optimizer could transform the IR as below. In that IR, only the three columns
&lt;code>&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;&lt;/code> are read from the file
by the &lt;code>read_parquet&lt;/code> op, minimizing read time and memory footprint, because its
result, &lt;code>%t0&lt;/code>, is used only to compute &lt;code>%t2&lt;/code>. As this simple example shows,
returning values from a function restricts optimization opportunities, so
the returned values have to be chosen carefully.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-mlir" data-lang="mlir">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">// MLIR function (simplified)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">func&lt;/span> main() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> %t0 = read_parquet(&lt;span style="color:#e6db74">&amp;#34;data.parquet&amp;#34;&lt;/span>, [&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>]) &lt;span style="color:#75715e">// only three columns are read from a parquet file.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span> %t1 = sort_values(%t0, [&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>], [True])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> %t2 = project(%t1, [&lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> %t2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="liveness-analysis">Liveness analysis&lt;/h2>
&lt;p>To this end, FireDucks performs liveness analysis on Python variables when creating
a function. For example, if the &lt;code>print&lt;/code> statement is the last line of a Python
script, liveness analysis determines that &lt;code>df0&lt;/code> and &lt;code>df1&lt;/code> are dead at the
&lt;code>print&lt;/code> statement because no later statement in the script uses them.
With this analysis, FireDucks can &lt;strong>eliminate&lt;/strong> &lt;code>%t0&lt;/code> and &lt;code>%t1&lt;/code> from
the &lt;code>return&lt;/code> op. Conversely, if &lt;code>df1&lt;/code> is used after the &lt;code>print&lt;/code>
statement, liveness analysis reports it as live, and &lt;code>%t1&lt;/code> is not
eliminated from the &lt;code>return&lt;/code> op.&lt;/p>
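&lt;p>The idea behind the analysis can be sketched in a few lines of Python. This is only an illustration of backward liveness on a straight-line program, not FireDucks&amp;rsquo;s actual implementation, and the statement representation is invented for this example:&lt;/p>

```python
# Toy backward liveness analysis over a straight-line program (an
# illustration of the idea only, not FireDucks's implementation).
# Each statement is (defined_var, used_vars); a variable is live after a
# statement if a later statement uses it before redefining it.
def liveness(stmts):
    live = set()
    live_after = []
    for defined, used in reversed(stmts):
        live_after.append(set(live))  # live-after set of this statement
        live.discard(defined)         # kill: the statement defines it
        live.update(used)             # gen: the statement uses these
    return list(reversed(live_after))

# Mirrors the article's script, whose last line is print(df2):
program = [
    ("df0", []),       # df0 = pd.read_parquet(...)
    ("df1", ["df0"]),  # df1 = df0.sort_values(...)
    ("df2", ["df1"]),  # df2 = df1[["b", "c"]]
    (None, ["df2"]),   # print(df2)
]

print(liveness(program))  # -> [{'df0'}, {'df1'}, {'df2'}, set()]
```

&lt;p>The live-after set of the &lt;code>df2&lt;/code> assignment is &lt;code>{'df2'}&lt;/code>: &lt;code>df0&lt;/code> and &lt;code>df1&lt;/code> are dead at the &lt;code>print&lt;/code> statement, so &lt;code>%t0&lt;/code> and &lt;code>%t1&lt;/code> can be dropped from the &lt;code>return&lt;/code> op.&lt;/p>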
&lt;p>In a lazy execution system, re-execution can drastically degrade
performance. By using liveness analysis, FireDucks detects whether variables
are used after execution, increasing optimization opportunities while
avoiding re-execution. As far as we know, this is the reason why FireDucks is much faster
than Polars in &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7274024597606305792/">this
article&lt;/a>.
You can &lt;a href="https://colab.research.google.com/drive/1cN82zfB56UZcG1Ooc2O1SbTe0rpr3DQl?usp=sharing&amp;amp;ref=dailydoseofds.com">reproduce it on
Colab&lt;/a>.&lt;/p>
&lt;h2 id="how-does-it-work-in-a-notebook">How does it work in a notebook?&lt;/h2>
&lt;p>You may wonder how liveness analysis works when you are writing in a notebook
or IPython. In such an environment, because any Python variable might be used in
a future cell, liveness analysis has to treat all variables as live. This
limits optimization opportunities.&lt;/p>
&lt;p>One workaround is to use a chained style:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>pd&lt;span style="color:#f92672">.&lt;/span>read_parquet(&lt;span style="color:#e6db74">&amp;#34;date.parquet&amp;#34;&lt;/span>, [])&lt;span style="color:#f92672">.&lt;/span>sort_values(&lt;span style="color:#e6db74">&amp;#34;a&amp;#34;&lt;/span>)[[&lt;span style="color:#e6db74">&amp;#34;b&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>]]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In this style, because the results of the intermediate operations &lt;code>read_parquet&lt;/code> and
&lt;code>sort_values&lt;/code> are not bound to any Python variables, they cannot be
used in future cells. FireDucks&amp;rsquo;s liveness analysis can therefore conclude that
they are dead when evaluating the result of the last operation.&lt;/p></description></item><item><title>Posts: How to run polars-tpch benchmark with FireDucks</title><link>https://fireducks-dev.github.io/posts/20241206_update_polars-tpch/</link><pubDate>Fri, 06 Dec 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20241206_update_polars-tpch/</guid><description>
&lt;p>We have recently updated the results of the &lt;a href="https://github.com/pola-rs/tpch">polars-tpch
benchmark&lt;/a> on a 4th-generation Xeon processor.
The latest results can be found &lt;a href="https://fireducks-dev.github.io/docs/benchmarks/#2-tpc-h-benchmark">here&lt;/a> and
below in this article, along with instructions on how to reproduce them.&lt;/p>
&lt;p>For reproducibility, we used AWS EC2 for this evaluation: an
&lt;code>m7i.8xlarge&lt;/code> instance with an Ubuntu 24.04 image and a 128GB EBS SSD.
This instance provides:&lt;/p>
&lt;ul>
&lt;li>4th-generation Xeon processor: Intel(R) Xeon(R) Platinum 8488C (32 cores)&lt;/li>
&lt;li>128GB memory&lt;/li>
&lt;/ul>
&lt;h2 id="benchmark-result">Benchmark Result&lt;/h2>
&lt;p>The graph below compares the performance of four dataframe libraries
(pandas, Polars, Modin and FireDucks) as speedups over pandas. Averaged over the 22
queries:&lt;/p>
&lt;ul>
&lt;li>FireDucks is 125x faster than pandas, whereas&lt;/li>
&lt;li>Polars is 57x faster than pandas, and&lt;/li>
&lt;li>Modin is 1.0x as fast as pandas (no speedup)&lt;/li>
&lt;/ul>
&lt;p>Note that our benchmark settings are as follows:&lt;/p>
&lt;ul>
&lt;li>Scale factor is 10 (about a 10GB dataset): &lt;code>SCALE_FACTOR=10.0&lt;/code>&lt;/li>
&lt;li>Timings exclude I/O: &lt;code>RUN_IO_TYPE=skip&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="polars-tpch-sf10_20241205.webp" alt="polars-tpch benchmark result">&lt;/p>
&lt;h2 id="how-to-run-the-benchmark-with-fireducks">How to run the benchmark with FireDucks&lt;/h2>
&lt;p>After launching the instance and logging in, install Python and the build tools:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ sudo apt update
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>$ sudo apt install python3.10-venv make gcc
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then clone the benchmark code from &lt;a href="https://github.com/fireducks-dev/polars-tpch">our
repository&lt;/a>, which includes the
queries for FireDucks:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ git clone https://github.com/fireducks-dev/polars-tpch
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>$ cd polars-tpch
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then run the script, which first creates the dataset with &lt;code>SCALE_FACTOR=10.0&lt;/code> and then runs
all 22 queries once with FireDucks, using the settings above:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ ./run-fireducks.sh
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now you can see timings for all queries in &lt;code>output/run/timings.csv&lt;/code>.&lt;/p>
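&lt;p>If you want to summarize per-query timings yourself, the snippet below sketches one way to do it. Note that the column names (&lt;code>solution&lt;/code>, &lt;code>query&lt;/code>, &lt;code>seconds&lt;/code>) and the inline sample are assumptions for illustration, not the actual schema of &lt;code>timings.csv&lt;/code>; adapt them to the real file:&lt;/p>

```python
# Sketch: take the fastest of repeated runs per (library, query) pair.
# The CSV layout below is an assumption for illustration only; check the
# header of the real output/run/timings.csv and adjust the field names.
import csv
import io
from collections import defaultdict

sample = """solution,query,seconds
fireducks,q1,1.20
fireducks,q1,1.10
fireducks,q1,1.15
fireducks,q2,0.90
fireducks,q2,0.75
fireducks,q2,0.80
"""

best = defaultdict(lambda: float("inf"))
for row in csv.DictReader(io.StringIO(sample)):
    key = (row["solution"], row["query"])
    best[key] = min(best[key], float(row["seconds"]))

print(dict(best))  # -> {('fireducks', 'q1'): 1.1, ('fireducks', 'q2'): 0.75}
```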
&lt;h2 id="how-to-produce-timings-for-the-graph">How to produce timings for the graph&lt;/h2>
&lt;p>The graph shown above was created using the minimum execution time among three runs
for each query.&lt;/p>
&lt;p>To do this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>$ SCALE_FACTOR&lt;span style="color:#f92672">=&lt;/span>10.0 make tables &lt;span style="color:#75715e"># create dataset by polars&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>$ .venv/bin/pip install -U pandas polars modin &lt;span style="color:#75715e"># install latest libraries&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>$ ./run-fppm3.sh &lt;span style="color:#75715e"># run four libraries three times&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="how-queries-for-fireducks-are-implemented">How Queries for FireDucks Are Implemented&lt;/h2>
&lt;p>When we started using the TPC-H benchmark to evaluate FireDucks, we
implemented pandas versions of the queries that produce correct
results. Later, when we learned about the polars-tpch benchmark, we updated our
queries to match the Polars versions as closely as possible: the same number of operations, the same
order of operations, and so on.&lt;/p>
&lt;p>Since the two libraries have different APIs, the implementations are not 100% identical, but surprisingly
both look very similar, thanks to the flexible pandas API.
Here is query 2 as an example. Can you tell which one is Polars and which one is pandas?&lt;/p>
&lt;p>&lt;img src="q02.webp" alt="q02">&lt;/p>
&lt;p>In this post, we have described how to reproduce our TPC-H benchmark results.
Try it out and take a look at both sets of queries. If you find anything that should be improved for a
fair comparison among the libraries, please let us know.&lt;/p></description></item><item><title>Posts: What to do when FireDucks is slow</title><link>https://fireducks-dev.github.io/posts/beginner_guide/</link><pubDate>Mon, 11 Nov 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/beginner_guide/</guid><description>
&lt;p>Thank you for your interest in FireDucks.
&lt;!-- raw HTML omitted -->This article describes possible causes and remedies for slow programs using FireDucks.&lt;/p>
&lt;p>When a pandas program run with FireDucks is slow, the cause is usually one of the following:&lt;/p>
&lt;ol>
&lt;li>Using &lt;code>apply&lt;/code> or explicit loops.&lt;/li>
&lt;li>Using pandas API not implemented in FireDucks.&lt;/li>
&lt;/ol>
&lt;p>In case 1, rewriting the pandas code may make the program faster.
&lt;!-- raw HTML omitted -->
For example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sum_val &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">in&lt;/span> range(len(df)):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> df[&lt;span style="color:#e6db74">&amp;#34;A&amp;#34;&lt;/span>][i] &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sum_val &lt;span style="color:#f92672">+=&lt;/span> df[&lt;span style="color:#e6db74">&amp;#34;B&amp;#34;&lt;/span>][i]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A loop-based program like the one above can be made faster by writing it as below.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sum_val &lt;span style="color:#f92672">=&lt;/span> df[df[&lt;span style="color:#e6db74">&amp;#34;A&amp;#34;&lt;/span>] &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>][&lt;span style="color:#e6db74">&amp;#34;B&amp;#34;&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>sum()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you have difficulty confirming this yourself, or if the source code is too complex to find a suitable rewrite, feel free to ask the FireDucks community for help via the Slack channel listed below.&lt;/p>
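&lt;p>To convince yourself the rewrite is equivalent, the short check below runs both versions on a tiny frame. It uses plain pandas for illustration; with FireDucks you would import &lt;code>fireducks.pandas&lt;/code> instead:&lt;/p>

```python
# Equivalence check for the loop vs. vectorized rewrite above, on a tiny
# DataFrame. Plain pandas is used here for illustration; FireDucks keeps
# the same API (import fireducks.pandas as pd).
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 2, 5, 4], "B": [10, 20, 30, 40, 50]})

# Loop version (slow: one Python-level element access per row)
sum_loop = 0
for i in range(len(df)):
    if df["A"][i] > 2:
        sum_loop += df["B"][i]

# Vectorized version (one filtered reduction the engine can optimize)
sum_vec = df[df["A"] > 2]["B"].sum()

print(sum_loop, sum_vec)  # -> 110 110
```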
&lt;p>In case 2, we cannot speed up the program immediately, but if you report the pandas APIs that are not yet implemented in FireDucks, we will implement them and speed up such programs in the future.&lt;/p>
&lt;p>To check whether FireDucks implements the functionality your pandas program needs, set the following environment variable:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>FIREDUCKS_FLAGS&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;-Wfallback&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>After setting the environment variable, run the program.
If you see the word “Fallback” next to a pandas function, it would be helpful if you could report it.&lt;/p>
&lt;p>If you would like to report a problem, please contact us by any of the following methods.&lt;/p>
&lt;ul>
&lt;li>🦆github : &lt;a href="https://github.com/fireducks-dev/fireducks/issues/new">https://github.com/fireducks-dev/fireducks/issues/new&lt;/a>&lt;/li>
&lt;li>📧mail : &lt;a href="mailto:contact@fireducks.jp.nec.com">contact@fireducks.jp.nec.com&lt;/a>&lt;/li>
&lt;li>🤝slack : &lt;a href="https://join.slack.com/t/fireducks/shared_invite/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg">https://join.slack.com/t/fireducks/shared_invite/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Please feel free to contact us even if you have difficulty investigating fallbacks yourself.&lt;/p>
&lt;p>This concludes this article. Thank you for reading.&lt;/p></description></item><item><title>Posts: Workshop at Bangalore, India</title><link>https://fireducks-dev.github.io/posts/20240919-workshop/</link><pubDate>Thu, 19 Sep 2024 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/20240919-workshop/</guid><description>
&lt;p>We had a workshop on FireDucks with faculties from universities around Bangalore.
Thank you for joining and for the discussion.&lt;/p>
&lt;p>&lt;img src="photo.webp" alt="image">&lt;/p></description></item><item><title>Posts: Have you ever thought of speeding up your data analysis in pandas with a compiler?</title><link>https://fireducks-dev.github.io/posts/sourav_cse_demo_20240701/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/sourav_cse_demo_20240701/</guid><description>
&lt;p>In general, a data scientist spends significant effort transforming raw data into a more
digestible format before training an AI model or creating visualizations. Traditional tools such
as pandas have long been the linchpin in this process, offering powerful capabilities but not
without limitations.&lt;/p>
&lt;p>Because of its single-core implementation and inefficient data structures,
we often face performance issues when using pandas on relatively large data,
but its performance is also highly affected by the choice of APIs, their parameters and execution order.
Sometimes simply writing a pandas application efficiently can make it 10-20x faster.&lt;/p>
&lt;p>In this article, I will discuss a few commonly used examples picked up from some pandas applications.&lt;/p>
&lt;h2 id="notebook-example-01">:notebook: Example 01&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>s &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Series([&lt;span style="color:#e6db74">&amp;#34;2020-10-10&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2021-11-20&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2022-08-03&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2023-07-04&amp;#34;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;year&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>to_datetime(s)&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>year
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;month&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>to_datetime(s)&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>month
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;day&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>to_datetime(s)&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>day
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The major issue with the above code is clearly that it evaluates the expression &lt;code>pd.to_datetime(s)&lt;/code> three times.
The purpose is to extract the year, month and day fields from the series data (s) as type &amp;ldquo;datetime64[ns]&amp;rdquo;.
The method to_datetime() parses the input string column to convert it to a datetime column
(inferring the format if one is not specified). Such parsing is itself very expensive on large data,
and if the same expression on the same input is written more than once, the program will obviously
suffer a significant performance hit.&lt;/p>
&lt;p>As you might have clearly figured out, an optimized solution to this problem could be:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>df &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>s &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>Series([&lt;span style="color:#e6db74">&amp;#34;2020-10-10&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2021-11-20&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2022-08-03&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;2023-07-04&amp;#34;&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>dt_s &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>to_datetime(s)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;year&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> dt_s&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>year
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;month&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> dt_s&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>month
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df[&lt;span style="color:#e6db74">&amp;#34;day&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> dt_s&lt;span style="color:#f92672">.&lt;/span>dt&lt;span style="color:#f92672">.&lt;/span>day
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is a very basic example that can be optimized manually if the code is reviewed carefully after the application is developed.
Let&amp;rsquo;s take another example that is very common in extensive data analysis.&lt;/p>
&lt;h2 id="notebook-example-02">:notebook: Example 02&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>res &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To find industry-wise average salary of an Indian employee&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>res[&lt;span style="color:#e6db74">&amp;#34;industry_wise_avg_sal&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> employee[employee[&lt;span style="color:#e6db74">&amp;#34;country&amp;#34;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;India&amp;#34;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;industry&amp;#34;&lt;/span>)[&lt;span style="color:#e6db74">&amp;#34;salary&amp;#34;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To find industry-wise average salary of an Indian employee who are above 30&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>res[&lt;span style="color:#e6db74">&amp;#34;industry_wise_avg_sal_for_specific_age_group&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> employee[(employee[&lt;span style="color:#e6db74">&amp;#34;country&amp;#34;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;India&amp;#34;&lt;/span>) &lt;span style="color:#f92672">&amp;amp;&lt;/span> (employee[&lt;span style="color:#e6db74">&amp;#34;age&amp;#34;&lt;/span>] &lt;span style="color:#f92672">&amp;gt;=&lt;/span> &lt;span style="color:#ae81ff">30&lt;/span>)]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;industry&amp;#34;&lt;/span>)[&lt;span style="color:#e6db74">&amp;#34;salary&amp;#34;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(res)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Both of these queries evaluate a common expression during the filter operation:
checking whether an employee is located in India (comparison on a string column is much more costly than comparison on a numeric column).
Now imagine we are dealing with an extremely large employee database; evaluating &lt;code>employee[&amp;quot;country&amp;quot;] == &amp;quot;India&amp;quot;&lt;/code> twice
can be quite expensive, and it can easily be optimized with the same strategy:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To generate the required filtration masks in advance&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cond1 &lt;span style="color:#f92672">=&lt;/span> (employee[&lt;span style="color:#e6db74">&amp;#34;country&amp;#34;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;India&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cond2 &lt;span style="color:#f92672">=&lt;/span> (employee[&lt;span style="color:#e6db74">&amp;#34;age&amp;#34;&lt;/span>] &lt;span style="color:#f92672">&amp;gt;=&lt;/span> &lt;span style="color:#ae81ff">30&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>res &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To find industry-wise average salary of an Indian employee&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>res[&lt;span style="color:#e6db74">&amp;#34;industry_wise_avg_sal&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> employee[cond1]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;industry&amp;#34;&lt;/span>)[&lt;span style="color:#e6db74">&amp;#34;salary&amp;#34;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># To find industry-wise average salary of an Indian employee who are above 30&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>res[&lt;span style="color:#e6db74">&amp;#34;industry_wise_avg_sal_for_specific_age_group&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> employee[cond1 &lt;span style="color:#f92672">&amp;amp;&lt;/span> cond2]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;industry&amp;#34;&lt;/span>)[&lt;span style="color:#e6db74">&amp;#34;salary&amp;#34;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(res)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In compiler technology, such an optimization is called
&lt;strong>&lt;a href="https://en.wikipedia.org/wiki/Common_subexpression_elimination">common subexpression elimination (CSE)&lt;/a>&lt;/strong> and
is routinely performed by optimizing compilers for languages like C/C++.&lt;/p>
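&lt;p>The effect of CSE can be demonstrated with a toy example that counts how many times the shared subexpression is evaluated before and after hoisting it into a temporary (the &lt;code>expensive&lt;/code> helper is invented for this illustration):&lt;/p>

```python
# Toy demonstration of common subexpression elimination (CSE): count how
# many times a shared subexpression is evaluated. The expensive() helper
# is invented for this illustration.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1  # track evaluations
    return x * x

# Naive: the common subexpression expensive(3) is evaluated twice.
a = expensive(3) + 1
b = expensive(3) + 2
naive_evals = calls["n"]

# After CSE: evaluate once, reuse the result.
calls["n"] = 0
t = expensive(3)
a, b = t + 1, t + 2
cse_evals = calls["n"]

print(naive_evals, cse_evals)  # -> 2 1
```

&lt;p>Hoisting the repeated &lt;code>pd.to_datetime(s)&lt;/code> and filter masks in the earlier examples is exactly this transformation applied by hand.&lt;/p>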
&lt;h2 id="notebook-example-03">:notebook: Example 03&lt;/h2>
&lt;p>Let&amp;rsquo;s take another example of a compiler optimization technique:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">func&lt;/span>(x: pd&lt;span style="color:#f92672">.&lt;/span>DataFrame, y: pd&lt;span style="color:#f92672">.&lt;/span>DataFrame):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> merged &lt;span style="color:#f92672">=&lt;/span> x&lt;span style="color:#f92672">.&lt;/span>merge(y, on&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> sorted &lt;span style="color:#f92672">=&lt;/span> merged&lt;span style="color:#f92672">.&lt;/span>sort_values(by&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>) &lt;span style="color:#75715e"># is never used&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> merged&lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>max()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The method merges two input tables, x and y, followed by a groupby-aggregate operation.
A &amp;ldquo;sort&amp;rdquo; operation is also performed after merging the tables, but it is actually not
required in the context of the method (the sorted result is never used within it).
When we focus on very detailed exploration of the input data, it is quite common for such
unwanted code to remain in the application, incurring a significant performance cost.&lt;/p>
&lt;p>Hence, an optimized solution to the above method could be:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">func&lt;/span>(x: pd&lt;span style="color:#f92672">.&lt;/span>DataFrame, y: pd&lt;span style="color:#f92672">.&lt;/span>DataFrame):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> merged &lt;span style="color:#f92672">=&lt;/span> x&lt;span style="color:#f92672">.&lt;/span>merge(y, on&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> merged&lt;span style="color:#f92672">.&lt;/span>groupby(&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>max()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In compiler technology, such an optimization is called
&lt;strong>&lt;a href="https://en.wikipedia.org/wiki/Dead-code_elimination">dead code elimination&lt;/a>&lt;/strong> and
is very important for the overall performance of any application.&lt;/p>
&lt;p>An expert programmer might take care of such cases while developing a data analysis solution, but as data analysts our
primary focus is to explore the data from different angles to find meaningful insights
while solving business problems or creating important features for our training data. Hence we often overlook
such issues. Nowadays there are many good linters that will point them out, so a manual optimization is certainly
possible if you include a linter run as part of your development workflow.&lt;/p>
&lt;h2 id="point_right-new-information-alert">:point_right: New Information alert!&lt;/h2>
&lt;p>Let&amp;rsquo;s now talk about an ideal scenario!&lt;/p>
&lt;p>What if such optimizations were automatically taken care of by the Python data manipulation library we use?&lt;/p>
&lt;p>That would save us from handling these issues ourselves and could significantly improve the performance of our application, right?&lt;/p>
&lt;p>Well, the wait is over! Let me introduce a high-performance DataFrame library
named &lt;strong>&lt;a href="https://fireducks-dev.github.io/">FireDucks&lt;/a>&lt;/strong> with highly pandas-compatible APIs,
powered by the &lt;a href="https://mlir.llvm.org/">MLIR (Multi-Level Intermediate Representation)&lt;/a>
framework to provide such powerful compiler optimization abilities.&lt;/p>
&lt;p>The library has been carefully developed at the NEC R&amp;amp;D laboratory over the last 3 years and has been freely available for installation via
&amp;ldquo;pip&amp;rdquo; under the BSD license since October 2023.&lt;/p>
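&lt;p>Getting started is a single pip command; an existing pandas script can then be run unchanged through the import hook (the script name below is just a placeholder):&lt;/p>

```shell
# Install FireDucks from PyPI
pip install fireducks

# Run an existing pandas program unchanged via the import hook
python -m fireducks.pandas your_program.py
```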
&lt;h2 id="fire-bird-what-does-it-offer">:fire: :bird: What does it offer?&lt;/h2>
&lt;p>FireDucks is developed with the following points in mind:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Automatic query optimization with a lazy-execution model.&lt;/p>
&lt;ul>
&lt;li>A data analyst can focus more on exploring data, while the inbuilt compiler of FireDucks can take care of the following:
&lt;ol>
&lt;li>basic compiler optimizations, like common sub-expression elimination, dead code elimination etc.&lt;/li>
&lt;li>domain specific optimizations, like execution reordering, dropping unwanted columns in advance etc.&lt;/li>
&lt;li>pandas-specific optimizations, like choosing the right API when executing a query (by careful analysis of the application objective), or choosing the right parameters to avoid unwanted operations (like sorting of the result etc.)&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A pandas user can flexibly adapt to this library without any new learning cost.&lt;/p>
&lt;ul>
&lt;li>FireDucks is highly compatible with pandas, so any pandas application can be optimized without any manual code changes. Doesn&amp;rsquo;t it sound quite promising?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The program can leverage all the available cores in the execution environment:&lt;/p>
&lt;ul>
&lt;li>The single-core execution issue in pandas is solved!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="computer-demonstration">:computer: Demonstration&lt;/h2>
&lt;p>Here is a &lt;a href="https://colab.research.google.com/drive/1mPv6kuOlWckubPBuCaUgmTg0Ip6zQAuO?usp=sharing">link&lt;/a> for a test drive with
a sample walkthrough notebook on Google Colab which shows how easy it is to start with FireDucks and its performance gain over
pandas for a sample CSE use-case (Example 01 of this article).&lt;/p>
&lt;p>In a low-spec execution environment with only 2 cores, where pandas takes 1.3 seconds to execute a sample query,
FireDucks (with multithreading + the compiler enabled) takes only 166 milliseconds.
Such a speedup (~8x) without incurring any migration cost (no cost involved in rewriting the application from
pandas to FireDucks) or any special hardware cost is quite beneficial from the overall production cost point of view.&lt;/p>
&lt;p>If you are interested in how it performs on popular benchmarks (like TPC-H, db-benchmark etc.) in comparison to
other high-performance pandas alternatives, you may like to check &lt;a href="https://fireducks-dev.github.io/docs/benchmarks/">this&lt;/a> out.&lt;/p>
&lt;h2 id="point_right-how-does-fireducks-work">:point_right: How does FireDucks work?&lt;/h2>
&lt;p>FireDucks comes with three powerful layers:&lt;/p>
&lt;ul>
&lt;li>a python frontend highly compatible with pandas APIs&lt;/li>
&lt;li>an in-built compiler to auto-detect and optimize existing performance issues in a user program&lt;/li>
&lt;li>a multithreaded C++ backend with efficient parallel implementations of all the dataframe-related operations like join, filter, groupby, sort etc.&lt;/li>
&lt;/ul>
&lt;p>Unlike pandas, which executes library functions as soon as they are called, FireDucks creates special instructions and accumulates them until there is an explicit request for the result (printing the result, performing a reduction, calling to_csv etc.). Before execution, the in-built compiler inspects all the accumulated instructions related to the requested result and performs automatic optimization (as explained above); the optimized instructions are then executed by the multithreaded kernel backend, helping us focus more on our analytical work and be more productive.&lt;/p>
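&lt;p>To make this execution model concrete, here is a toy sketch in plain Python (my own illustration, not FireDucks&amp;rsquo; actual IR or API): operations are only recorded, a trivial optimization pass drops a sort whose ordering is discarded by a following groupby-max, and execution happens only when the result is requested.&lt;/p>

```python
class LazyFrame:
    """Toy deferred-execution frame: records ops, runs them on demand."""

    def __init__(self, data):
        self.data = data          # underlying list of (key, value) rows
        self.ops = []             # accumulated instruction list (a toy "IR")

    def sort(self):
        self.ops.append("sort")
        return self

    def group_max(self):
        self.ops.append("group_max")
        return self

    def _optimize(self):
        # Toy pass: a sort whose ordering is discarded by a following
        # group_max contributes nothing to the result, so drop it.
        opt = []
        for i, op in enumerate(self.ops):
            if op == "sort" and "group_max" in self.ops[i + 1:]:
                continue
            opt.append(op)
        return opt

    def collect(self):
        # Execution is triggered only here, after optimization.
        rows = list(self.data)
        for op in self._optimize():
            if op == "sort":
                rows.sort()
            elif op == "group_max":
                groups = {}
                for key, val in rows:
                    groups[key] = max(val, groups.get(key, val))
                rows = sorted(groups.items())
        return rows

lf = LazyFrame([("a", 1), ("b", 5), ("a", 3)])
result = lf.sort().group_max().collect()   # the sort is eliminated
print(result)                              # [('a', 3), ('b', 5)]
print(lf._optimize())                      # ['group_max']
```

&lt;p>A real compiler works on a proper intermediate representation with dependence analysis, but the accumulate-optimize-execute flow is the same.&lt;/p>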
&lt;p>&lt;img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/2917078/9d3e5cf5-048c-889e-9efd-ac60f4c3e924.png" alt="pandas_fireducks_exec_model.png">&lt;/p>
&lt;h1 id="-conclusion">✍️ Conclusion&lt;/h1>
&lt;p>FireDucks shows promise by addressing the drawbacks associated with pandas, and
its compiler optimization technology makes it one of a kind. This article introduced FireDucks&amp;rsquo; basic compiler optimization
abilities (point 1 above). In upcoming articles, I will demonstrate the other powerful optimization areas (points 2 and 3)
automatically taken care of by FireDucks, so please stay tuned to check them out as well.
You may like to try FireDucks and share your feedback; I would love to answer whatever queries you may have
to the best of my knowledge. The development team is very active, and there is a new release almost
every week with performance improvements, new features based on user requests, bug fixes etc.
You may like to get in touch with them directly through the
&lt;a href="https://join.slack.com/t/fireducks/shared_invite/zt-34qpdgr6q-_iWdIoZW4l_hGhljKS0pyg">slack channel&lt;/a>.&lt;/p>
&lt;p>You may also like to checkout one of my previous &lt;a href="https://qiita.com/qsourav/items/e87f25c4b307391d784a">articles&lt;/a> on FireDucks salient features.&lt;/p></description></item><item><title>Posts: Introduction to FireDucks: Get performance beyond pandas with zero learning cost!</title><link>https://fireducks-dev.github.io/posts/2024-05-08-medium/</link><pubDate>Wed, 08 May 2024 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/2024-05-08-medium/</guid><description/></item><item><title>Posts: Backtesting Trading Strategies with Ease: An excursion with FireDucks</title><link>https://fireducks-dev.github.io/posts/neci4/</link><pubDate>Fri, 22 Mar 2024 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/neci4/</guid><description/></item><item><title>Posts: FireDucks: Diving into API Compatibility with Pandas</title><link>https://fireducks-dev.github.io/posts/neci3/</link><pubDate>Thu, 21 Mar 2024 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/neci3/</guid><description/></item><item><title>Posts: Choosing Your Data Champion: A Side-by-Side Look at FireDucks and Polars</title><link>https://fireducks-dev.github.io/posts/neci2/</link><pubDate>Wed, 20 Mar 2024 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/neci2/</guid><description/></item><item><title>Posts: Boosting Data Analysis with FireDucks</title><link>https://fireducks-dev.github.io/posts/neci1/</link><pubDate>Tue, 05 Mar 2024 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/neci1/</guid><description/></item><item><title>Posts: A hidden fact you must know when working with pandas</title><link>https://fireducks-dev.github.io/posts/20231216-sourav/</link><pubDate>Sat, 16 Dec 2023 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20231216-sourav/</guid><description/></item><item><title>Posts: Tricks to improve computational performance of JOIN operation more than 
10x</title><link>https://fireducks-dev.github.io/posts/20231211-sourav/</link><pubDate>Mon, 11 Dec 2023 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20231211-sourav/</guid><description/></item><item><title>Posts: FireDucks - An economical and environment-friendly high-performance solution for your complex Data Analysis</title><link>https://fireducks-dev.github.io/posts/20231207-sourav/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20231207-sourav/</guid><description/></item><item><title>Posts: One thing you might be doing wrong in pandas!!</title><link>https://fireducks-dev.github.io/posts/20231206-sourav/</link><pubDate>Wed, 06 Dec 2023 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/20231206-sourav/</guid><description/></item><item><title>Posts: Acceleration technology inside FireDucks</title><link>https://fireducks-dev.github.io/posts/est/</link><pubDate>Tue, 05 Dec 2023 00:00:00 +0900</pubDate><guid>https://fireducks-dev.github.io/posts/est/</guid><description>
&lt;h2 id="switching-groupby">SWITCHING GROUPBY&lt;/h2>
&lt;p>In this article, we introduce the acceleration techniques of &amp;ldquo;groupby&amp;rdquo; used in FireDucks.&lt;/p>
&lt;p>The groupby operation is one of the most fundamental and important operations in tabular data analysis.
We can use the groupby operation to obtain important statistical properties such as the mean and variance of the data.
We can also combine it with other operations to obtain new features.&lt;/p>
&lt;p>FireDucks optimizes based on data characteristics for fast groupby operations.
One such optimization is the automatic selection of groupby algorithms based on the number of groups.
FireDucks&amp;rsquo; groupby algorithm focuses on the number of groups of data and switches between an algorithm that is fast for data with a small number of groups (Algorithm A) and an algorithm that is fast for data with a large number of groups (Algorithm B).
The number of groups is the number of distinct values that make up the target column.&lt;/p>
&lt;p>Consider the following tabular data.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>food&lt;/th>
&lt;th>category&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>0&lt;/td>
&lt;td>apple&lt;/td>
&lt;td>fruit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>carrot&lt;/td>
&lt;td>vegetable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>peach&lt;/td>
&lt;td>fruit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>onion&lt;/td>
&lt;td>vegetable&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>grape&lt;/td>
&lt;td>fruit&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>There are five distinct values in the “food” column: “apple”, “carrot”, “peach”, “onion”, and “grape”.
Therefore, the number of groups in the “food” column is 5.
Similarly, the “category” column contains two distinct values, fruit and vegetable, so its number of groups is 2.&lt;/p>
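&lt;p>These group counts can be checked with a couple of lines of plain Python; the number of groups of a column is simply the size of its set of distinct values (pandas users would typically call &lt;code>nunique()&lt;/code> for the same thing):&lt;/p>

```python
food = ["apple", "carrot", "peach", "onion", "grape"]
category = ["fruit", "vegetable", "fruit", "vegetable", "fruit"]

# Number of groups = number of distinct values in the column
print(len(set(food)))      # 5
print(len(set(category)))  # 2
```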
&lt;p>Calculating the number of groups exactly for large data sets is time-consuming.
Therefore, FireDucks uses a statistical method to estimate the number of groups without computing its exact value&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.
The estimation is performed as follows.&lt;/p>
&lt;ol>
&lt;li>Extract one element at random from the column of interest.&lt;/li>
&lt;li>Record that element.&lt;/li>
&lt;li>Again, extract one element at random from the column of interest.&lt;/li>
&lt;li>Check whether that element matches the recorded elements, and record it as well.&lt;/li>
&lt;li>Repeat steps 3 and 4 multiple times.&lt;/li>
&lt;/ol>
&lt;p>FireDucks then estimates the number of groups in the group key of interest from the number of draws and the number of matches.&lt;/p>
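&lt;p>As a rough illustration of the idea (a toy sketch of collision-based estimation, not FireDucks&amp;rsquo; actual implementation): if values are drawn uniformly from k groups, two random draws match with probability about 1/k, so k can be estimated from the observed match rate.&lt;/p>

```python
import random

def estimate_num_groups(column, trials=2000, rng=None):
    """Toy collision-based estimate of the number of distinct values.

    Draw pairs of random elements and count how often they match; for
    roughly uniform data the match probability is about 1/k for k groups.
    """
    rng = rng or random.Random(0)
    matches = 0
    for _ in range(trials):
        a = rng.choice(column)   # steps 1-2: draw and record an element
        b = rng.choice(column)   # step 3: draw another element
        if a == b:               # step 4: check for a match
            matches += 1
    return trials / matches if matches else float("inf")

# A column with 5 equally likely groups:
column = ["group%d" % (i % 5) for i in range(1000)]
print(round(estimate_num_groups(column), 1))  # close to 5
```

&lt;p>Skewed group sizes need a more careful estimator (see the reference below), but the sketch shows why a handful of random draws is enough to pick a groupby algorithm.&lt;/p>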
&lt;h2 id="evaluation">Evaluation&lt;/h2>
&lt;p>We measured the calculation speed using TPC-H, a benchmark that includes many processes related to data analysis.&lt;/p>
&lt;p>&lt;img src="compare.png" alt="compare">&lt;/p>
&lt;p>“A&amp;quot; uses only Algorithm A for groupby operations, and &amp;ldquo;B&amp;rdquo; uses only Algorithm B for groupby operations.
“auto&amp;quot; estimates the number of groups and automatically selects between Algorithm A and B for the groupby operation.&lt;/p>
&lt;p>Processes for which there is little difference in execution time between Algorithm A and Algorithm B are omitted from the above graph.
The graph shows that, with the exception of q10, the automatic selection algorithm is the faster of Algorithm A and Algorithm B.&lt;/p>
&lt;p>&amp;ldquo;total&amp;rdquo; shows the computation time for the entire TPC-H benchmark, including the queries excluded from the above graph, for Algorithm A, Algorithm B, and the automatic selection algorithm. It indicates that the automatic selection algorithm is about 3 times faster than Algorithm A, and about 1.2 times faster than Algorithm B, for TPC-H as a whole.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>M. Bressan, E. Peserico, and L. Pretto. Simple set cardinality estimation through random sampling. CoRR, abs/1512.07901, 2015.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Posts: Application example: Spicy MINT at Toyota Technical Development Corporation</title><link>https://fireducks-dev.github.io/posts/ttdc/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://fireducks-dev.github.io/posts/ttdc/</guid><description/></item></channel></rss>