Big Data: Three Different Ways to Handle Data That Can’t Fit into Memory (Part III)

June 1, 2021

Today, we are diving into the third part of this topic and talking about a library that helps us handle data that can’t fit into memory — Dask. Dask is a parallel computation library whose DataFrame API is built on top of Pandas. Dask can not only run parallel computation on a single machine, but also scale it out to hundreds of machines.
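
To make that scaling claim concrete, here is a minimal sketch of setting up a Dask client with dask.distributed. The worker counts and the scheduler address below are illustrative placeholders, not values from this series; with no arguments, Client() simply starts a local cluster on your machine.

from dask.distributed import Client

#local mode: start a scheduler and worker processes on this machine
client = Client(n_workers=4, threads_per_worker=2)

#cluster mode: the same code can instead connect to a remote scheduler
#client = Client("tcp://scheduler-address:8786")  # placeholder address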

Dask DataFrame Processing Examples:
We included some handy Dask code for your reference. As you can tell, the operations are very similar, if not identical, to their Pandas counterparts. Dask operations are lazy: they build up a task graph, and the .compute() method executes that graph and returns the result.

import dask.dataframe as dd

#read operation (lazy: nothing is loaded until a computation is triggered)
dask_df = dd.read_csv("file_path", encoding = "utf-8")

#force data into memory if needed
dask_df = dask_df.persist()

#repartition based on need; each partition is a small pandas DataFrame
dask_df = dask_df.repartition(npartitions = 25)

#simple math operation
dask_df['numerical_column'].sum().compute()

#simple sorting operation
dask_df['numerical_column'].nlargest(5).compute()

#simple apply operation
dask_df['some_column'].apply(some_func).compute()

#simple filtering operation
dask_df[dask_df['numerical_column'] > 3].compute()

#output operation (by default, Dask writes one file per partition)
dask_df.to_csv("file_path", index = False)
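
One related tip: each .compute() call evaluates its task graph from scratch, so if you need several results from the same DataFrame, dask.compute() can evaluate them together in a single pass. A minimal sketch reusing the columns above:

import dask

#evaluate both aggregations in one shared pass over the data
total, top5 = dask.compute(
    dask_df['numerical_column'].sum(),
    dask_df['numerical_column'].nlargest(5),
)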


Where to Draw the Line:
A good rule of thumb when choosing between Pandas and Dask for data processing/analytics: on a local machine, go with Pandas if your data still fits into memory and is not spread across many separate files. Pandas will be faster in many cases (though not all) and easier to use, and it covers more data processing operations than Dask. In addition, your data might look very big at the beginning, but that doesn’t mean it’s the size you will end up with further along in the process. You might want to switch back to Pandas once your data becomes more manageable.
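
Switching back is straightforward, since calling .compute() on a Dask DataFrame materializes it as a regular in-memory pandas DataFrame. A short sketch, using the illustrative filter from the examples above:

#once filtering has shrunk the data, pull it into a single pandas DataFrame
small_df = dask_df[dask_df['numerical_column'] > 3].compute()

#from here on, small_df is plain pandas
print(type(small_df))  # <class 'pandas.core.frame.DataFrame'>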

Dask becomes truly powerful when you have distributed cluster resources available, e.g. AWS Fargate, but even with just one machine, if you have a lot of separate big files to load in, Dask can still reduce load time significantly. And of course, if a file is too big to fit into memory at all, Dask’s lazy operations come to the rescue.
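
For the many-files case, dd.read_csv accepts a glob pattern and reads the matching files in parallel into one lazy DataFrame. The path below is a placeholder:

#load a whole directory of CSVs as a single Dask DataFrame
dask_df = dd.read_csv("data/part-*.csv", encoding = "utf-8")
print(dask_df.npartitions)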

Finally, here is a thoughtful quote from Dask’s official documentation: “while ‘Big Data’ tools can be exciting, they are almost always worse than normal data tools while those remain appropriate”.

This concludes the three-part series on this topic. Thank you for your attention!