Docker: How to Create Event-Driven Containers for Your Next Data Project and Why

July 13, 2021

Container technology has vastly improved the efficiency and scalability of application deployment for software developers. With cloud infrastructure like AWS and the rapid advancement of data engineering tooling, containers now serve as a way to deploy data and model pipelines as well. Today, we want to briefly cover how to pass runtime parameters/data into Docker containers, so that your containerized data or model pipelines can behave according to those parameters.

A Common Use Case:
A common use case for this pattern: your ETL pipeline may not do the same work every time it runs. What it needs to do is decided only at runtime, so you want to pass that information into the containerized pipeline at runtime rather than baking it into the image.
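As a minimal sketch of such a runtime-driven pipeline (the mode names and date value here are illustrative, not from any real project):

```python
def run_pipeline(mode: str, start_date: str) -> str:
    # Branch the workflow on parameters supplied only at runtime
    if mode == "incremental":
        return f"loading rows since {start_date}"
    elif mode == "full_refresh":
        return "reloading the entire table"
    raise ValueError(f"unknown mode: {mode}")

print(run_pipeline("incremental", "2021-07-13"))
# loading rows since 2021-07-13
```

The rest of this post shows how to deliver arguments like `mode` and `start_date` to a container at `docker run` time.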

A Common Dockerfile Pattern for a Simple Python Data Project
Below is a quick, commonly used Dockerfile pattern for a simple Python data project. It assumes a project structure in which main.py contains all of your workflow and project logic and is invoked by Python when the container starts. However, this pattern doesn't let you pass in parameters that are only generated at runtime, e.g. user-provided data.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#execute the main.py to run your program
CMD ["python3", "main.py"]

Pass in Event-Driven Parameters into Containers on Runtime
By setting the ENTRYPOINT to python3 main.py, Docker lets you supply default arguments with the CMD instruction and override them when you actually run the container. With the exec form shown below, Docker appends CMD (or whatever arguments you pass after the image name on docker run, which replace CMD) to the ENTRYPOINT to form the command that runs.

#Dockerfile

#pull a python based image
FROM python:3.8-slim

#set directories and copy files into container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

#set virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

#install libraries
RUN pip install -r requirements.txt

#provide an entry point for the program
ENTRYPOINT ["python3", "main.py"]

#provide default parameter values
#anything passed after the image name on `docker run` replaces these defaults
CMD ["--arg_1=arg_1_value", "--arg_2=arg_2_value"]
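Conceptually, Docker merges the exec-form ENTRYPOINT and CMD into a single argument vector, and arguments passed on docker run replace CMD before the merge. A small sketch of that merge logic (the function name is ours, for illustration only):

```python
def container_command(entrypoint, cmd_default, run_args=None):
    # Arguments supplied on `docker run` replace CMD entirely;
    # otherwise the Dockerfile's CMD provides the defaults.
    cmd = run_args if run_args else cmd_default
    return entrypoint + cmd

entrypoint = ["python3", "main.py"]
cmd_default = ["--arg_1=arg_1_value", "--arg_2=arg_2_value"]

print(container_command(entrypoint, cmd_default))
# ['python3', 'main.py', '--arg_1=arg_1_value', '--arg_2=arg_2_value']
print(container_command(entrypoint, cmd_default, ["--arg_1=runtime_value"]))
# ['python3', 'main.py', '--arg_1=runtime_value']
```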


And of course, you need the argparse library (part of Python's standard library) to handle the command-line arguments in main.py:

#main.py
import argparse

def main(arg_1, arg_2):
    #insert your project logic flow here
    pass 

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='An ETL pipeline')
    parser.add_argument('--arg_1', help='arg_1 description')
    parser.add_argument('--arg_2', help='arg_2 description')
    args = parser.parse_args()

    main(arg_1=args.arg_1, arg_2=args.arg_2)
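As a quick sanity check, you can exercise the same parser directly by handing parse_args an explicit argument list, mimicking what the container would receive (the values here are hypothetical):

```python
import argparse

# Same parser as in main.py
parser = argparse.ArgumentParser(description='An ETL pipeline')
parser.add_argument('--arg_1', help='arg_1 description')
parser.add_argument('--arg_2', help='arg_2 description')

# Simulate the command line assembled from ENTRYPOINT + runtime arguments
args = parser.parse_args(['--arg_1=2021-07-13', '--arg_2=full_refresh'])
print(args.arg_1)  # 2021-07-13
print(args.arg_2)  # full_refresh
```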


Finally, build the container image and run the container with the following commands. The second command is where you pass in the values/data generated at runtime:

$ docker build -t project_name:project_version .
$ docker run --rm project_name:project_version --arg_1="arg_1_runtime_value" --arg_2="arg_2_runtime_value"

Thank you!