Big Data: Three Different Ways to Handle Data That Can’t Fit into Memory (Part I)

May 18, 2021

Data professionals often need to work with datasets that are larger than their local machine's memory. Managing memory efficiently while still making use of as much data as possible is not an easy task. Today we will focus on how iterable objects can help us handle big data, both with native Python data structures and with the Pandas DataFrame.

Python Generators for Handling Text File Data:
We implemented a generator-based reader that reads a big text file in fixed-size chunks, and used itertools.islice() to control the total amount read.

import itertools

#creating a generator to read a big data file in fixed-size chunks
def read_big_file(file, chunk_size=10):
    while True:
        data_chunk = file.read(chunk_size)
        if not data_chunk:
            #stop once the end of the file is reached
            break
        yield data_chunk

#open the file within a context block,
#and take only the first 10 chunks (100 characters each) using itertools.islice()
with open('file_path', 'r') as big_file:
    for data_chunk in itertools.islice(read_big_file(big_file, chunk_size = 100), 10):
        print(data_chunk)
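
As a side note, the same chunked reading can be done without a custom generator by using the built-in iter() with an empty-string sentinel. Below is a minimal sketch under the same assumptions (the placeholder 'file_path', 100-character chunks, and only the first 10 chunks):

import functools
import itertools

#iter() calls big_file.read(100) repeatedly and stops when it returns '' at end of file
with open('file_path', 'r') as big_file:
    chunk_reader = iter(functools.partial(big_file.read, 100), '')
    for data_chunk in itertools.islice(chunk_reader, 10):
        print(data_chunk)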

Pandas Chunksize Parameter for Handling Tabular Data:
We implemented a method that picks a pseudo-random starting chunk (random_nth_chunk) to allow better sampling. Although this is not truly random sampling, it is a reasonable approximation.

import pandas as pd 
import random
import itertools

#pick a pseudo-random starting chunk index between 0 and 999
#(the file is assumed to have enough chunks; otherwise the slice below yields fewer chunks)
random_nth_chunk = random.randrange(1000)
number_of_chunks = 5
chunksize = 1000
print(random_nth_chunk, number_of_chunks, chunksize)

#create a chunked iterator with pd.read_csv within a context block so the file handle is closed automatically
#use itertools.islice to skip to the randomly chosen nth chunk and read the next number_of_chunks chunks
with pd.read_csv('big_file.csv', iterator = True, chunksize = chunksize) as big_chunk_df:
    for chunk_df in itertools.islice(big_chunk_df, random_nth_chunk, random_nth_chunk + number_of_chunks):
        print(chunk_df)
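
If you would rather work with the sampled chunks as a single DataFrame instead of printing them, they can be collected and combined with pd.concat. A minimal sketch under the same assumptions (the hypothetical 'big_file.csv' and the variables defined above):

#collect the sampled chunks and combine them into one DataFrame
with pd.read_csv('big_file.csv', iterator = True, chunksize = chunksize) as big_chunk_df:
    sampled_chunks = list(itertools.islice(big_chunk_df, random_nth_chunk, random_nth_chunk + number_of_chunks))

#ignore_index = True gives the combined sample a clean 0..n-1 index
sample_df = pd.concat(sampled_chunks, ignore_index = True)
print(sample_df.shape)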


Hope you enjoyed this week's Data Hack Tuesday. See you next week!