Data Hack Tuesday

Tips & Tricks for Your Data

Brought to you every Tuesday.

Docker: Four Useful Docker Commands that Could Boost Your Workflow

July 27, 2021

Today, we want to briefly introduce four Docker commands that might help your workflow.

1. Access the Terminal of a Running Container
A lot of times, you will want to check whether a container has the right files in it, and the way to probe it is to open a terminal inside the running container. From there, you can use all the shell commands you are familiar with to inspect the container.
$ docker exec -it container_name /bin/bash

If the container isn’t running yet and you want a shell as soon as it starts, start a new container from its image with an interactive terminal:

$ docker run -it image_name bash

2. Clean Up Everything Unused
If you want to clean up everything unused, including images, stopped containers, networks, build cache and volumes, you can use the following command. This lets you rebuild everything from scratch, if you choose to.

$ docker system prune -a --volumes

3. Copy Files between Host and Container
Copy a file from a host to a running container:

$ docker cp /file_path/file_name container_name:/file_path/file_name

Copy a file from a container to a host:

$ docker cp container_name:/file_path/file_name /file_path/file_name

4. Fetch Logs of a Running Container
This command fetches the logs of a running container, which helps you debug the container further if necessary.

$ docker logs container_name

Thank you!

Data Infrastructure: Why You Should Consider Using Serverless Containers for Deploying Your Next Data Pipeline

July 20, 2021

Last week, we briefly mentioned how to develop event-driven containers for taking runtime parameters. Today, we would like to talk about different options that we have for deploying our containers for data pipelines or data tasks.

Server vs Serverless
If you have a pipeline that only needs to run once every week, it doesn’t make financial sense to rent a server and keep it running constantly. By switching to serverless infrastructure, you only pay for the runtime, i.e. the time you actually spend running the pipeline or any other task. Not only do you end up saving 90%+ of the server costs (for a simple pipeline), but you also reduce the risk associated with constantly securing a server. You outsource those management responsibilities to a vendor like AWS, which is most likely better at handling them.

Serverless Container
By handing the deployment responsibility to a Docker container (or another type of container), you also make the pipeline more scalable, as it’s faster and easier to spin up a container from a pre-built image than to deploy manually to each server environment one by one. With the combination of containers and serverless infrastructure, the scalability of your pipeline/task increases significantly while the cost of your infrastructure drops. And since containers are automatically spun down when they finish the tasks they were given, you don’t have to shut them down yourself as you would with a managed server.

Example Tech Stack: Docker + AWS ECS/Fargate
AWS ECS (Elastic Container Service) offers both server-based and serverless options for hosting and managing your containers. By adopting Fargate on top of ECS, AWS Fargate manages your containers for you without you having to manage an EC2 server. Your container images are hosted on AWS ECR (Elastic Container Registry). You can then schedule when you want your tasks to run and define how you would like Fargate to manage your containers. In addition, this tech stack comes with monitoring and logging from AWS CloudWatch, which facilitates debugging and troubleshooting.

Source: AWS Fargate
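As a rough illustration, once a task definition is registered you could kick off a one-off Fargate task from Python with boto3; the cluster, task definition and subnet names below are placeholders, and the sketch assumes AWS credentials are already configured:

import boto3

ecs = boto3.client("ecs")

#run a single one-off task on Fargate; all resource names are placeholders
response = ecs.run_task(
    cluster="data-pipeline-cluster",
    taskDefinition="weekly-etl-task",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-12345678"],
            "assignPublicIp": "ENABLED",
        }
    },
)
print(response["tasks"][0]["lastStatus"])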

Thank you!

Docker: How to Create Event-Driven Containers for Your Next Data Project and Why

July 13, 2021

Container technology has vastly improved the efficiency and scalability of application deployment for software developers. With the help of data infrastructure solutions like AWS and the fast advancement of data engineering technologies, containers have also become a way to deploy data and model pipelines. Today, we want to briefly talk about how you can pass runtime parameters/data into Docker containers, so that your containerized data or model pipelines can be launched with parameters that are only known at runtime.

A Common Use Case:
A common use case for this pattern is an ETL pipeline that is not fixed every time it runs; it has to behave according to what is needed at runtime. Therefore, you would want to pass data into the containerized pipeline only at runtime, based on that need.

A Common Dockerfile Pattern for a Simple Python Data Project
Below is a quick, commonly used Dockerfile pattern for a simple Python data project. It assumes a project structure where main.py contains all of your workflow and project logic and is eventually invoked by Python. However, this pattern doesn’t allow you to pass in parameters that are only generated at runtime, e.g. user-provided data.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#execute the main.py to run your program
CMD ["python3", "main.py"]

Pass Event-Driven Parameters into Containers at Runtime
By setting the ENTRYPOINT to python3 main.py, Docker treats the values in the CMD statement as default arguments to that entrypoint and lets you override them with whatever you pass in when you actually run the container.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#provide an entry point for the program
ENTRYPOINT ["python3", "main.py"]

#provide additional parameters/values that are only generated at runtime
#the default values will be overwritten by values generated at runtime
CMD ["--arg_1=arg_1_value", "--arg_2=arg_2_value"]


And of course, you need to use the argparse library to handle the command-line arguments in main.py:

#main.py
import argparse

def main(arg_1, arg_2):
    #insert your project logic flow here
    pass 

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='An ETL pipeline')
    parser.add_argument('--arg_1', help='arg_1 description')
    parser.add_argument('--arg_2', help='arg_2 description')
    args = parser.parse_args()

    main(arg_1 = args.arg_1, arg_2 = args.arg_2)


Finally, you can build the container image and run the container with the following commands. The second command is where you pass in the values/data generated at runtime, which override the defaults defined in CMD.

$ docker build -t project_name:project_version .
$ docker run --rm project_name:project_version --arg_1="arg_1_runtime_value" --arg_2="arg_2_runtime_value"

Thank you!

Python: The Benefits of Passing Generators between Functions Instead of Lists

July 6, 2021

Many Python tutorials cover generators, but few of them give a practical example of how to use them. Today we want to talk about an effective and efficient code pattern that uses generators to process data items one by one. Below is a common code pattern that uses a data structure such as a list to pass data from one function to another. This forces the system to execute and process all the elements in data_object within func_1 before they can be passed into func_2 and ultimately saved into a database.

#common code pattern with lists
data_object = ['1', '2', '3', '4', '5']

#common code pattern
def func_1(data_object):
    data_list = []
    for i in data_object:
        processed_object = some_process(i)
        data_list.append(processed_object)
    return data_list 

def func_2(data_list):
    for i in data_list:
        processed_item = further_process(i)
        save_to_database(processed_item)

data_object_list = func_1(data_object) #fully executed
func_2(data_object_list) #fully executed

However, by using a generator to lazily pass data elements along, we allow the system to process and save the items one by one. Therefore, the next process, perhaps a frontend rendering step, doesn’t have to wait for all items to be processed before it can render data from the database. In addition, you avoid building up the intermediate list in memory and all the append() calls that come with it.

data_object = ['1', '2', '3', '4', '5']

#generator pattern
def func_1(data_object):
    for i in data_object:
        processed_object = some_process(i)
        yield processed_object

def func_2(data_generator):
    for i in data_generator:
        processed_item = more_process(i)
        save_to_database(processed_item)

data_generator = func_1(data_object) #lazy execution
func_2(data_generator) #everything is executed here one by one
        

Finally, you can also utilize a generator expression in this case to simplify the code:

data_object = ['1', '2', '3', '4', '5']

#generator expression pattern
def func_1(data_object):
    return (some_process(i) for i in data_object)

def func_2(data_generator):
    for i in data_generator:
        processed_item = more_process(i)
        save_to_database(processed_item)

data_generator = func_1(data_object) #lazy execution
func_2(data_generator) #everything is executed here one by one

Hope you like today’s Data Hack. We will see you next time.

Python: Why You Should Start Using Logging Instead of Print()

June 29, 2021

logging is a built-in Python library that offers capabilities to log and store different levels of information throughout a program, helping developers monitor, investigate and diagnose their programs effectively. However, many data professionals resort to print() to get information out of a program, since typing a function call is just easier and costs no setup time. For a simple program this might be fine; however, once the system becomes more complex and the program starts running outside of the development environment, the logging library will certainly save the developer time and effort. Today, we would like to explore several key features/benefits of using logging.

Registering Different Levels of Information for Effective Monitoring:
There are five levels of information that you can register to monitor a program, each indicating a different degree of importance with a numeric value attached to it: DEBUG (10), INFO (20), WARNING (30), ERROR (40) and CRITICAL (50).

Source: https://docs.python.org/3.8/library/logging.html
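As a quick illustration, here is a minimal snippet that emits one message at each level (the logger name is arbitrary); configuring the root logger at DEBUG ensures all five messages appear:

#levels_demo.py
import logging

#show every message at DEBUG level or above
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("pipeline")

logger.debug("detailed diagnostic information")                  #10
logger.info("confirmation that things are working as expected")  #20
logger.warning("something unexpected happened, still running")   #30
logger.error("a function failed to complete")                    #40
logger.critical("the program itself may be unable to continue")  #50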

We have included a code sample for the logging configuration of a prospective project. As you can see, we have specified a log level of “DEBUG” for both the StreamHandler and the FileHandler. A StreamHandler prints log messages to the console, and a FileHandler saves them into a .log file. In our example, log messages at the DEBUG level or above will be both printed and saved. We have also specified formats for both types of log messages with BASE_FORMAT and FILE_FORMAT.

#logging_config.py
import datetime

LOGGING_CONFIG = {
    "version": 1,
    "formatters": {
        "BASE_FORMAT": {
            'format': '[%(name)s][%(levelname)-6s]%(message)s',
        },
        'FILE_FORMAT':{
            'format': '[%(asctime)s] [%(name)s][%(levelname)-6s] %(message)s',
        },
    },
    "handlers": {
        "console" : {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "formatter": "BASE_FORMAT"
        },
        "file" : {
            "class" : "logging.FileHandler",
            "filename": "pipeline_{}.log".format(datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")),
            "level" : "DEBUG",
            "formatter" : "FILE_FORMAT",
        }
    },
    "root" : {
        "level" : "DEBUG",
        "handlers" : ["console", "file"]
    }
}
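To actually use this configuration, load it with logging.config.dictConfig and grab a logger; a minimal sketch, assuming the dictionary above lives in logging_config.py:

#main.py
import logging
import logging.config

from logging_config import LOGGING_CONFIG

#apply the configuration once, at program start-up
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)

logger.info("pipeline started")
logger.debug("this message goes to both the console and the .log file")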

Storing Logs for Debugging and Archiving Purposes:
With a FileHandler configured in the LOGGING_CONFIG dictionary above, a log file is saved every time we run the program. In addition, we name each log file with a timestamp so that we can easily trace a run back to the time it was recorded.

Distributing Logs as Notification:
The logging module also offers a handler class called SMTPHandler, which can notify developers of log messages via email. Upon reading a message, the developer can decide whether to jump on the issue, depending on its severity.

https://docs.python.org/3/library/logging.handlers.html
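A minimal sketch of wiring up an SMTPHandler so that ERROR-level messages are emailed; the mail server, addresses and credentials below are placeholders:

#email_alerts.py
import logging
from logging.handlers import SMTPHandler

smtp_handler = SMTPHandler(
    mailhost=("smtp.example.com", 587),          #placeholder SMTP server and port
    fromaddr="pipeline@example.com",             #placeholder sender
    toaddrs=["dev-team@example.com"],            #placeholder recipients
    subject="Pipeline error",
    credentials=("smtp_user", "smtp_password"),  #placeholder credentials
    secure=(),                                   #use STARTTLS
)
smtp_handler.setLevel(logging.ERROR)             #only email ERROR and above

logger = logging.getLogger("pipeline")
logger.addHandler(smtp_handler)
logger.error("this message would be emailed to the team")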

None of these features and benefits would be possible with a simple print(), and they will certainly make your project more robust and effective in a production environment.

VSCode: The Benefits of Using .py file with Jupyter Kernel vs. Jupyter Notebook for Your Data Project

June 22, 2021

Jupyter Notebook is one of the favorite tools of data science professionals; however, developing everything in a Jupyter notebook can be error prone. In addition, even though there are workarounds, git does not diff or track Jupyter notebooks well. That being said, a Jupyter notebook can certainly be helpful for specific parts of a data science project, including exploratory data analysis, which requires a lot of experimentation and visualization. Today, we want to explore an option offered by VSCode that combines the advantages of using a Jupyter kernel with those of using a .py file.

How to Use It:
First, install the ipykernel library within a virtual environment.

(venv) pip install ipykernel

Then, typing # %% in a .py file within VSCode is all you need to activate the Jupyter kernel for that file; it creates a code cell just like in a Jupyter notebook.

Then press Shift+Enter to execute the code cell and see its output in the interactive window on the right side. To make sure you are using the right kernel, you can run the following to see which Python you are using.

! which python
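Putting it together, a small .py file with two cells might look like the sketch below (the file name and data set are placeholders):

#explore.py

# %%
#first cell: load the data
import pandas as pd
df = pd.read_csv("data.csv")

# %%
#second cell: quick summary statistics
df.describe()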

Separating the Testing Code and Development Code:
Even though this is not a total separation of kernel environments, it’s still better practice to separate your testing code from your development code. Within a Jupyter notebook, it’s very easy to reuse the same variable names or delete a block of code containing a variable you end up using later; both can introduce bugs into your notebook. You can still explore the data and develop visualizations on the right side of the IDE; however, with this method you obviously lose the ability to save the statistical/visualization output, which a Jupyter notebook would give you.

Faster Development Speed and Better Coding Habit:
A Jupyter notebook lets you see results for each code block, which may sometimes lead new developers to form habits of writing non-Pythonic code or one-liners just to get the results. However, forcing yourself to compose larger blocks of code in your head as you write them, e.g. a class like the one sketched below, may help you design better classes and develop faster. The longer you do this, the more capable you become of writing longer, low-bug code without spending much time testing along the way. It also reduces the time spent extracting or refactoring code from your Jupyter notebooks into functions/classes, if you didn’t already do that within a notebook.
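The class below is purely illustrative (names and logic are hypothetical); it simply shows the kind of larger unit you would compose in one pass:

#loader.py
import pandas as pd

class CSVLoader:
    """Load a CSV file and apply a few basic cleaning steps."""

    def __init__(self, path):
        self.path = path

    def load(self):
        df = pd.read_csv(self.path)
        return self._clean(df)

    def _clean(self, df):
        df = df.dropna(how="all")                              #drop fully empty rows
        df.columns = [c.strip().lower() for c in df.columns]   #normalize headers
        return df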

More Distributable and Better Versioning
A lot of times, data science professionals don’t work alone; they have to collaborate with other engineering professionals. Keeping all or most of the code base in .py files makes a data science professional’s work more distributable and easier to engineer around. On the other hand, git struggles to diff and version .ipynb files. Although there are different ways to work around this, they add a bit more complexity to the project, and some of them are paid options with free tiers (public repos only).

That’s it for today’s data hack. Hope you like it!

Jupyter Notebook: Use Jupyter Notebook within Virtual Environment

June 15, 2021

Jupyter Notebook is an effective tool to explore and visualize data, but unlike calling a .py file, it requires additional steps before a notebook can use the libraries installed in the virtual environment of the project you are currently working on. Being able to use the same set of libraries as the other files in the project folder makes your project less error prone. Today, we are going to walk through the steps to align a Jupyter notebook’s kernel with the project’s virtual environment.

1. Create a project folder, create a virtual environment within it, and activate the virtual environment
mkdir project_name
cd project_name
python3 -m venv venv 
source ./venv/bin/activate

2. Install Jupyter Notebook and project kernel within the virtual environment

(venv) pip install jupyter
(venv) ipython kernel install --user --name=project_name

3. Start a Jupyter notebook server and select the kernel previously created.

#command to start a Jupyter notebook server
jupyter notebook 

Now, when you create a new notebook, select the project_name kernel so that the notebook runs against the project’s virtual environment.
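To double-check from inside the notebook that it is really using the project’s virtual environment, you can inspect the interpreter path in a cell:

#run inside a notebook cell
import sys
print(sys.executable)  #should point to .../project_name/venv/bin/python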

4. List and Delete installed Jupyter kernels.

#list installed kernels:
jupyter kernelspec list
#delete an installed kernel:
jupyter kernelspec uninstall project_name

Thank you for tuning in today. We will see you next week :).

Setting Up Data Science Development Environment on Apple M1/BigSur

June 8, 2021

Apple’s latest M1/BigSur MacBook has created many troubles for data science professionals when it comes to building basic data science libraries, or even doing basic things like installing different versions of Python. Today, we compile a list of workarounds that we have tested and found to work:

Install and Manage Python 3.8 with Pyenv:
Source: https://github.com/pyenv/pyenv/issues/1768

pyenv install --patch 3.8.7 <<(curl -sSL https://raw.githubusercontent.com/Homebrew/formula-patches/9811be33170a8f31a684fae6955542e31eb4e61e/python/3.8.7.patch) 

Build Pandas/Numpy with Homebrew:
Source: https://github.com/numpy/numpy/issues/17784

brew install openblas
OPENBLAS="$(brew --prefix openblas)" pip install numpy pandas

Build Pandas/Numpy within Docker Container without Homebrew:

  • Right click the Terminal app within the Utilities folder of your Applications folder, click Get Info in the menu, and select the “Open using Rosetta” option.
  • In your terminal, type the following command to switch to the intel chip mode:
arch -x86_64 zsh
  • In your Dockerfile, you can add the following command:
RUN pip install cython numpy pandas
  • Then build your docker image with a newly opened terminal (under Rosetta).

Build Pandas without Homebrew (This Method Doesn’t Work within Docker)
Source: https://stackoverflow.com/questions/65084318/trouble-installing-pandas-on-new-macbook-air-m1

pip install numpy cython
git clone https://github.com/pandas-dev/pandas.git
cd pandas
python3 setup.py install

Hopefully this will smooth your journey of adopting the MacBook M1/BigSur as a data science professional. Thanks!

Big Data: Three Different Ways to Handle Data That Can’t Fit into Memory (Part III)

June 1, 2021

Today, we are diving into the third part of this topic to talk about a library that helps us handle data that can’t fit into memory: Dask. Dask is a parallel computing library whose DataFrame API is built on top of Pandas. Dask can not only parallelize computation on a single machine, but also scale it out to hundreds of machines.

Dask Dataframe Processing Examples:
We have included some handy Dask code for your reference. As you can tell, the operations are very similar, if not identical, to Pandas. The .compute() method executes the lazy operations and returns the result.

import dask.dataframe as dd 

#read operation
dask_df = dd.read_csv("file_path", encoding = "utf-8")

#enforcing data into memory if needed
dask_df = dask_df.persist()

#repartition based on need, a partition is a small pandas df
dask_df = dask_df.repartition(npartitions = 25)

#simple math operation
dask_df['numerical_column'].sum().compute()

#simple sorting operation
dask_df['numerical_column'].nlargest(5).compute()

#simple apply operation
dask_df['some_column'].apply(some_func).compute()

#simple filtering operation
dask_df[dask_df['numerical_column'] > 3].compute()

#output operation
dask_df.to_csv("file_path", index = False)


Where to Draw the Line:
A good rule to follow when choosing between Pandas and Dask is: on a local machine, go with Pandas if your data still fits into memory and does not consist of many separate files. Pandas will be faster in many (though not all) cases and easier to use, and it covers more data processing operations than Dask. In addition, your data might seem very big at the beginning, but that doesn’t mean it will stay that size as you move further along the process. You might want to switch back to Pandas once your data becomes more manageable.
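One quick way to gauge how much memory a dataset actually takes is to check its in-memory footprint with Pandas (the file path is a placeholder):

import pandas as pd

df = pd.read_csv("file_path")
#total in-memory size in megabytes, including object (string) columns
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")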

Dask becomes truly powerful when you have distributed cluster resources available, e.g. AWS Fargate, but even on a single machine, if you have many separate big files to load, Dask can still reduce load time significantly. And of course, if a file is too big to fit into memory at all, Dask’s lazy operations come to the rescue.
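If you do have several cores or machines to throw at the problem, a local Dask cluster is only a few lines away; a minimal sketch (the worker counts are arbitrary, and it requires the dask[distributed] extra):

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

#spin up a local cluster and attach a client to it
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

#subsequent .compute() calls are scheduled on this cluster
dask_df = dd.read_csv("file_path")
print(dask_df['numerical_column'].sum().compute())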

Finally, here is a thoughtful quote from Dask’s official documentation: “while ‘Big Data’ tools can be exciting, they are almost always worse than normal data tools while those remain appropriate”.

This concludes the three-part talk on this topic. Thank you for your attention today!
