Big Data: Three Different Ways to Handle Data That Can’t Fit into Memory (Part II)

May 25, 2021

Today we would like to cover the second way of handling data that can’t fit into memory: downcasting data types. This is an effective way to reduce the memory footprint of your big data, which in turn lets you read several times more rows than you could with the default types. Note: this assumes the data types in your dataset are stable or have already been validated.
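
To make the idea concrete, here is a minimal sketch of what downcasting does to a single column. The values are made up for illustration: one million small integers, which Pandas stores as int64 by default on most platforms.

import pandas as pd
import numpy as np

#a hypothetical column of small integers, stored as int64 by default
s = pd.Series(np.random.randint(0, 100, size = 1_000_000))
print(s.dtype, s.memory_usage(deep = True))          #int64, roughly 8 MB

#downcast to the smallest signed integer type that fits the values
s_small = pd.to_numeric(s, downcast = "signed")
print(s_small.dtype, s_small.memory_usage(deep = True))    #int8, roughly 1 MB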

We are going to build on top of the first way by randomly selecting a number of chunks from a big file and concatenating them into a single dataframe; this gives us a reasonable random sample of a dataset that we cannot read in its entirety.

import pandas as pd 
import random
import itertools

#pick a pseudo-random starting chunk index (the file is assumed to have at least this many chunks)
random_nth_chunk = random.randrange(1000)
number_of_chunks = 5
chunksize = 1000
print(random_nth_chunk, number_of_chunks, chunksize)

#create a chunked iterator with chunksize inside a context block so the reader is closed and can be re-initialized later
#use itertools.islice to take number_of_chunks chunks starting at the randomly generated nth chunk

chunk_df_list = []
with pd.read_csv('big_file.csv', iterator = True, chunksize = chunksize) as big_chunk_df:
    for chunk_df in itertools.islice(big_chunk_df, random_nth_chunk, random_nth_chunk + number_of_chunks):
        chunk_df_list.append(chunk_df)

total_chunk_df = pd.concat(chunk_df_list, axis = 0, ignore_index = True)

Now we are first going to check the data type of each column, including its bit width (e.g. int64 vs. int32). Then we check the memory usage of the concatenated dataframe, which shows how much memory each column or feature consumes.

#check data type of each column with bit level information
total_chunk_df.info()

#check memory usage of each column
total_chunk_df.memory_usage(deep = True)


Finally, we are going to convert the existing data types to the more compact alternatives offered by Pandas to reduce memory cost. Common conversions include:

  • string/object -> category (a space-efficient way of storing repeated strings)
  • int64 -> the smallest signed integer type that fits the values (int8, int16, or int32)
  • float64 -> float32

#loop through each column to convert data type
for column in total_chunk_df.columns:
    if total_chunk_df[column].dtype == "object":
        total_chunk_df[column] = total_chunk_df[column].astype('category')
    elif total_chunk_df[column].dtype == "int64":
        total_chunk_df[column] = pd.to_numeric(total_chunk_df[column], downcast = "signed")
    elif total_chunk_df[column].dtype == "float64":
        total_chunk_df[column] = pd.to_numeric(total_chunk_df[column], downcast = "float")
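
To verify the conversion, you can re-run the same checks from earlier; the dtype summary and memory usage will now reflect the downcast columns:

#re-check data types and memory usage after conversion
total_chunk_df.info()
total_chunk_df.memory_usage(deep = True)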

We tested this on an example dataset with the following result:

  • Pre-conversion: dtypes: int64(21), object(4); memory usage: 38.1+ MB
  • Post-conversion: dtypes: category(4), int16(2), int32(8), int8(11); memory usage: 10.1 MB

That’s almost a 75% reduction in memory usage. With the memory we save, we can simply load more chunks from the original file that was too big to read at once.
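
If you want to compute the reduction on your own data, a quick sketch like the following works, reusing total_chunk_df and the conversion loop from above:

#total memory before downcasting
before = total_chunk_df.memory_usage(deep = True).sum()

#... run the conversion loop above ...

#total memory after downcasting
after = total_chunk_df.memory_usage(deep = True).sum()
print(f"memory reduced by {(before - after) / before:.1%}")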

Hope you enjoy this week’s Data Hack Tuesday! We will see you next time.