Data Hack Tuesday

Tips & Tricks for Your Data

Brought to you every Tuesday.

Python: Building Data Driven Classes with Dataclass

August 17, 2021


Dataclasses have been a built-in module since Python 3.7, helping developers build data-driven classes that come with data-related functionality. As Python is the main programming language for many data professionals, Python classes are often written to handle data rather than behavior. The introduction of Dataclass thus serves as a great way to differentiate data-driven classes from behavior-driven classes; in this way, a data class becomes almost like a data container. Today, we want to introduce how to build a data class and some of the key features associated with the module.

Basic Build:

Below is a basic build of a dataclass, building on an example from the Python documentation.
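A minimal sketch along those lines (the field names follow the InventoryItem example in the documentation):

from dataclasses import dataclass

@dataclass
class InventoryItem:
    """Keep track of an item in inventory."""
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand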

Key Feature: Self-Generated Special Methods

With the help of the @dataclass decorator, there is no need to write many special methods, including the __init__() method and the __repr__() method. In other words, there is no need to write statements like "self.name = name". And when you print an instance, it shows the data that the instance contains.
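For instance, continuing with the sketch above (the values are just illustrative):

item = InventoryItem(name="widget", unit_price=2.5, quantity_on_hand=10)
print(item)
#InventoryItem(name='widget', unit_price=2.5, quantity_on_hand=10)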

Key Feature: Frozen Instance

If you would like to set the instance attributes as read-only (not changeable), Dataclass offers you the capability to freeze the instance. This allows you to further regulate an instance that contains data.

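A minimal sketch of a frozen data class (the class and field names are just illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenItem:
    name: str
    unit_price: float

item = FrozenItem(name="widget", unit_price=2.5)
item.unit_price = 3.0  #raises dataclasses.FrozenInstanceError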

Key Feature: Default Factory Functions

With the help of Dataclass and its field() function, we can now safely give an attribute a mutable default such as a list or a dict via default_factory, whereas using a mutable object directly as a default argument would previously create bugs (the same object being shared across instances).
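A minimal sketch using field() with default_factory (the class and field names are just illustrative):

from dataclasses import dataclass, field

@dataclass
class Basket:
    owner: str
    items: list = field(default_factory=list)  #each instance gets its own fresh list

b1 = Basket(owner="a")
b2 = Basket(owner="b")
b1.items.append("apple")
print(b2.items)  #[] -- b2 is unaffected by b1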

Data Infrastructure: What is Infrastructure as Code and why should you consider adopting it?

August 4, 2021


Infrastructure as Code (IaC) is a trend that has become more popular as cloud infrastructure has gone mainstream over the past decade. But what is Infrastructure as Code? Infrastructure as Code is a framework that automates infrastructure provisioning, infrastructure configuration and application deployment via code.

Prior to the adoption of IaC and cloud infrastructure, DevOps engineers would have to manually set up servers, configure the servers and networks, configure the software, install and manage databases, and then deploy the applications. With IaC tools, all of these tasks become automatable. Common IaC tools include AWS CloudFormation, Azure Resource Manager, GCP Deployment Manager, Terraform and Ansible. Let's review some of the key features/characteristics of IaC.

Resource Configurations as Versioned Templates:
The configuration of a resource or an environment lives in code, which makes the configuration consistent and less error prone. In a big team with multiple groups of developers, that makes collaboration a lot easier and more efficient. In addition, being able to version the configuration also makes changes trackable. Please see the following snapshot as an example of a template:

Source: AWS CloudFormation

Deploying Resources at Scale:
By avoiding manual configuration, the deployment of infrastructure resources becomes significantly more scalable and repeatable. Beyond that, stacks of resources provisioned as an environment can be spun up and spun down together to serve different tasks. Please see the following snapshot of deployed stacks on AWS CloudFormation as an example.

Source: AWS CloudFormation

IaC + Containers:
Containers help ensure that your applications can be deployed easily across different hosting environments; however, differences in cloud infrastructure do impact the behavior and reliability of containers. With the help of IaC, the same infrastructure environment can be reproduced to serve the containers, which makes your containers even more scalable and reliable. In addition, that reproducibility of infrastructure makes testing your applications much more effective, as you can essentially reproduce the production environment to test your application in development.

 

Thank you!

 

Docker: Four Useful Docker Commands that Could Boost Your Workflow

July 27, 2021


Today, we want to briefly introduce four docker commands that might help your workflow.

1. Access the Terminal of a Running Container
A lot of times, you will want to investigate whether a container has the right files in it, and the way to probe it is to access the terminal of the running container. Then you can use all the terminal commands that you are familiar with to inspect the container.
$ docker exec -it container_name /bin/bash

If you don’t have the container running yet and you want to get a shell when it starts, run a new container from its image instead (note that docker run takes an image name rather than a container name):

$ docker run -it image_name bash

2. Clean Up Everything Unused
If you want to clean up everything unused, including images, stopped containers, networks, build cache and volumes, you can use the following command. This helps you rebuild everything from scratch, if you choose to.

$ docker system prune -a --volumes

3. Copy Files between Host and Container
Copy a file from a host to a running container:

$ docker cp /file_path/file_name container_name:/file_path/file_name

Copy a file from a container to a host:

$ docker cp container_name:/file_path/file_name /file_path/file_name

4. Fetch Logs of a Running Container
This command helps you to check the logs of a running container, which will help you to further debug your container if necessary.

$ docker logs container_name

Thank you!

Data Infrastructure: Why You Should Consider Using Serverless Containers for Deploying Your Next Data Pipeline

July 20, 2021


Last week, we briefly mentioned how to develop event-driven containers for taking runtime parameters. Today, we would like to talk about different options that we have for deploying our containers for data pipelines or data tasks.

Server vs Serverless
If you have a pipeline that needs to run once every week, it doesn’t make financial sense to rent a server and keep it running constantly. By switching to serverless infrastructure, you only pay for the runtime, i.e. the time that you actually run the pipeline or any other task. Not only do you end up saving 90%+ of the server costs (if it’s a simple pipeline), but you also reduce the risk associated with constantly securing a server. You outsource the management responsibilities to a vendor like AWS, which is most likely better at handling these tasks.

Serverless Container
By handing the deployment responsibility to a Docker container or another type of container, you also make the pipeline more scalable, as it’s faster and easier to spin up a container from a pre-built image than to deploy manually, one server environment at a time. With the combination of a container and serverless infrastructure, the scalability of your pipeline/task is significantly increased and the cost of your infrastructure is significantly reduced. And because containers are automatically spun down when they finish the tasks defined for them, you don’t have to shut them down the way you would a managed server.

Example Tech Stack: Docker + AWS ECS/Fargate
AWS ECS (Elastic Container Service) offers both server-based and serverless options for managing and hosting your containers. With Fargate on top of ECS, AWS will manage your containers for you without you managing an EC2 server. Your container images are hosted on AWS ECR (Elastic Container Registry). Then you can simply schedule when you want to run your tasks and define how you would like Fargate to manage your containers. In addition, this tech stack comes with monitoring and logging from AWS CloudWatch, which facilitates debugging and troubleshooting.

Source: AWS Fargate
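As a rough sketch, launching such a task from the AWS CLI might look like the following; the cluster, task definition and subnet values are placeholders for your own setup:

$ aws ecs run-task \
    --cluster data-pipeline-cluster \
    --launch-type FARGATE \
    --task-definition data-pipeline-task \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789],assignPublicIp=ENABLED}"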

Thank you!

Docker: How to Create Event-Driven Containers for Your Next Data Project and Why

July 13, 2021


Container technology has vastly improved the efficiency and scalability of application deployment for software developers. With the help of data infrastructure solutions like AWS and the fast advancement of data engineering technologies, containers have come to serve as a way to deploy data and model pipelines as well. Today, we want to briefly talk about how you can pass runtime parameters/data into Docker containers, so that your containerized data or model pipelines can be initiated based on those parameters.

A Common Use Case:
A common use case for this pattern is an ETL pipeline that isn’t fixed from run to run: the pipeline has to run based on what is needed at that moment. Therefore, you would want to pass data to the containerized pipeline only at runtime, based on what the need is.

A Common Dockerfile Pattern for a Simple Python Data Project
Below is a quick and commonly used Dockerfile pattern for simple Python data projects. It follows a project structure in which main.py contains all your workflow and project logic and eventually gets called by Python. However, this doesn’t allow you to pass in parameters that are only generated at runtime, e.g. user-provided data.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#execute the main.py to run your program
CMD ["python3", "main.py"]

Pass in Event-Driven Parameters into Containers on Runtime
By setting the ENTRYPOINT to python3 main.py, Docker treats the CMD statement as default arguments, which you can pass in and override when you actually run the container.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#provide an entry point for the program
ENTRYPOINT ["python3", "main.py"]

#provide additional parameters/values that are only generated at runtime
#the default values will be overwritten by values generated at runtime
CMD ["--arg_1=arg_1_value", "--arg_2=arg_2_value"]


And of course you would need to use the argparse library to handle command line arguments in the main.py:

#main.py
import argparse

def main(arg_1, arg_2):
    #insert your project logic flow here
    pass 

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='An ETL pipeline')
    parser.add_argument('--arg_1', help='arg_1 description')
    parser.add_argument('--arg_2', help='arg_2 description')
    args = parser.parse_args()

    main(arg_1 = args.arg_1, arg_2 = args.arg_2)


Finally, you can build the container image and run the container with the following commands. The second command is where you pass in the values/data that are generated at runtime.

$ docker build -t project_name:project_version .
$ docker run --rm project_name:project_version --arg_1="arg_1_runtime_value" --arg_2="arg_2_runtime_value"

Thank you!

Python: The Benefits of Passing Generators between Functions Instead of Lists

July 6, 2021


Many Python tutorials cover generators, but only a few of them give a practical example of utilizing them. Today we want to talk about an effective and efficient code pattern that uses generators to process data items one by one. Below is a common code pattern that uses a data structure such as a list to pass data from one function to another. This forces the system to process all the elements in data_object within func_1 before they can be passed to func_2 and ultimately saved to a database.

#common code pattern with lists
#some_process, further_process and save_to_database stand in for your own logic
data_object = ['1', '2', '3', '4', '5']

def func_1(data_object):
    data_list = []
    for i in data_object:
        processed_object = some_process(i)
        data_list.append(processed_object)
    return data_list 

def func_2(data_list):
    for i in data_list:
        processed_item = further_process(i)
        save_to_database(processed_item)

data_object_list = func_1(data_object) #fully executed
func_2(data_object_list) #fully executed

However, by using a generator to lazily pass data elements along, we allow the system to process and save the items one by one. Therefore, the next process, perhaps a frontend rendering step, doesn’t have to wait for all items to be processed before rendering data from the database. In addition, you avoid the cost of building the intermediate list with append().

data_object = ['1', '2', '3', '4', '5']

#generator pattern
def func_1(data_object):
    for i in data_object:
        processed_object = some_process(i)
        yield processed_object

def func_2(data_generator):
    for i in data_generator:
        processed_item = more_process(i)
        save_to_database(processed_item)

data_generator = func_1(data_object) #lazy execution
func_2(data_generator) #everything is executed here one by one
        

Finally, you can also utilize a generator expression in this case to simplify the code:

data_object = ['1', '2', '3', '4', '5']

#generator expression pattern
def func_1(data_object):
    return (some_process(i) for i in data_object)

def func_2(data_generator):
    for i in data_generator:
        processed_item = more_process(i)
        save_to_database(processed_item)

data_generator = func_1(data_object) #lazy execution
func_2(data_generator) #everything is executed here one by one

Hope you like today’s Data Hack. We will see you next time.

Python: Why You Should Start Using Logging Instead of Print()

June 29, 2021


Logging is a built-in library in Python that offers capabilities to log and store different levels of information throughout a program, helping developers monitor, investigate and diagnose their programs effectively. However, many data professionals resort to print() to get information out of a program, as typing a function is just easier and costs no time to set up. For a simple program this might be fine, but once the system becomes more complex and the program starts running outside of the development environment, the logging library will certainly save the developer time and effort. Today, we would like to explore several key features/benefits of using logging.

Registering Different Levels of Information for Effective Monitoring:
There are five levels of information that you can register to monitor a program. Each of them indicates a different level of importance with a numeric value attached to it.
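For reference, the five levels and their numeric values are DEBUG (10), INFO (20), WARNING (30), ERROR (40) and CRITICAL (50).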

Source: https://docs.python.org/3.8/library/logging.html

We have included a code sample for the logging configuration of a prospective project. As you can see, we have specified a log level of DEBUG for both the StreamHandler and the FileHandler. A StreamHandler logs messages to the console, and a FileHandler logs messages to a .log file. In our example, log messages at DEBUG level or above will be both printed and saved. We have also specified formats for both types of log messages with BASE_FORMAT and FILE_FORMAT.

#logging_config.py
import datetime  #needed for the timestamped log filename below

LOGGING_CONFIG = {
    "version": 1,
    "formatters": {
        "BASE_FORMAT": {
            'format': '[%(name)s][%(levelname)-6s]%(message)s',
        },
        'FILE_FORMAT':{
            'format': '[%(asctime)s] [%(name)s][%(levelname)-6s] %(message)s',
        },
    },
    "handlers": {
        "console" : {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "formatter": "BASE_FORMAT"
        },
        "file" : {
            "class" : "logging.FileHandler",
            "filename": "pipeline_{}.log".format(datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")),
            "level" : "DEBUG",
            "formatter" : "FILE_FORMAT",
        }
    },
    "root" : {
        "level" : "DEBUG",
        "handlers" : ["console", "file"]
    }
}
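To apply this configuration, you would load it with logging.config.dictConfig() before creating loggers. A minimal sketch, assuming the dictionary lives in logging_config.py as above (the module name and log messages are just illustrative):

#pipeline.py
import logging
import logging.config

from logging_config import LOGGING_CONFIG

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)

logger.debug("pipeline started")      #printed to the console and written to the .log file
logger.error("something went wrong")  #same handlers, higher severity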

Storing Logs for Debugging and Archiving Purposes:
With a FileHandler configured in the LOGGING_CONFIG dictionary specified above, we are able to save and store a log every time we run our program. In addition, we named the log file with a timestamp so that we can easily trace a log back to the time it was recorded.

Distributing Logs as Notification:
The logging module also offers a class called SMTPHandler, which can notify developers of log messages via email. Upon reading a log message, the developer can decide whether to jump on the issue, depending on its severity.

https://docs.python.org/3/library/logging.handlers.html

All of these features and benefits wouldn’t be possible with a simple print(), and they will certainly make your project more robust and effective in a production environment.

VSCode: The Benefits of Using .py file with Jupyter Kernel vs. Jupyter Notebook for Your Data Project

June 22, 2021


Jupyter notebooks are among the favorite tools of data science professionals; however, developing everything in a Jupyter notebook can be error prone. In addition, even though there are workarounds, git does not diff and merge Jupyter notebooks cleanly. That being said, a Jupyter notebook can certainly be helpful for specific parts of a data science project, including exploratory data analysis, which requires a lot of experimentation and visualization. Today, we want to explore an option offered by VSCode that combines the advantages of using a Jupyter kernel with those of using a .py file.

How to Use It:
First, install the ipykernel library within a virtual environment.

(venv) pip install ipykernel

Then, typing # %% in a .py file within VSCode is all you need to activate the Jupyter kernel for that file; each # %% marker creates a code cell just like in a Jupyter notebook.
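For example, a .py file with two cells might look like the following (pandas is just an illustrative choice and would need to be installed in the environment):

# %%
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# %%
#each # %% marker starts a new cell that can be run on its own
df.describe()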

Then press shift+enter to execute the code cell and see its output in the interactive window on the right side. To make sure you are using the right kernel, you can run the following to see which Python is being used.

! which python

Separating the Testing Code and Development Code:
Even though this is not a total separation of kernel environments, it’s still better practice to separate your testing code from your development code. Within a Jupyter notebook, it’s very easy to reuse a variable name or delete a cell that defines a variable you end up using later; both can create bugs in your notebook. You can still explore the data and develop visualizations on the right side of the IDE; however, with this method you obviously don’t have the notebook’s ability to save the statistical/visualization output alongside the code.

Faster Development Speed and Better Coding Habit:
A Jupyter notebook lets you see results for each code block, which can sometimes lead new developers to form habits of writing non-Pythonic code or one-liners just to get results. However, forcing yourself to compose larger blocks of code, e.g. a full class, in your head as you write may help you design better classes and develop faster. The longer you do this, the more capable you may become at writing longer, low-bug code without spending much time testing along the way. This also reduces the time spent extracting or refactoring code into functions/classes from your Jupyter notebooks, if you didn’t already do that within a notebook.

More Distributable and Better Versioning
A lot of times, data science professionals don’t work just by themselves; they collaborate with other engineering professionals. Keeping all or most of the code base in .py files makes a data science professional’s work more distributable and easier to engineer. On the other hand, git has trouble diffing and versioning .ipynb files. Although there are different ways to work around it, they add a bit more complexity to the project, and some of them are paid options with free tiers (public repos only).

That’s it for today’s data hack. Hope you like it!

Jupyter Notebook: Use Jupyter Notebook within Virtual Environment

June 15, 2021


Jupyter Notebook is an effective tool for exploring and visualizing data, but unlike calling a .py file, it requires additional steps for a particular notebook to use the libraries sitting in the virtual environment folder of the project you are currently working on. Being able to use the same set of libraries as the other files in the project folder makes your project less error prone. Today, we are going to walk through the steps to align a Jupyter notebook’s kernel with the project’s virtual environment.

1. Create a project folder, create a virtual environment within it and activate the virtual environment
mkdir project_name
cd project_name
python3 -m venv venv 
source ./venv/bin/activate

2. Install Jupyter Notebook and project kernel within the virtual environment

(venv) pip install jupyter
(venv) ipython kernel install --user --name=project_name

3. Start a Jupyter notebook server and select the kernel previously created.

#command to start a Jupyter notebook server
jupyter notebook 

Once the notebook server starts, create or open a notebook and select the project_name kernel so that the notebook is connected to the project’s virtual environment.

4. List and Delete installed Jupyter kernels.

#list installed kernels:
jupyter kernelspec list
#delete an installed kernel:
jupyter kernelspec uninstall project_name

Thank you for tuning in today. We will see you next week :).
