Container technology has vastly improved the efficiency and scalability of application deployment for software developers. With cloud infrastructure like AWS and the rapid advancement of data engineering tooling, containers have also become a standard way to deploy data and model pipelines. In this post, we'll briefly cover how to pass runtime parameters/data into Docker containers, so that your containerized data or model pipelines can be launched with the parameters appropriate to each run.
A Common Use Case:
A common use case for this pattern is an ETL pipeline whose behavior isn't fixed from run to run: what it should do depends on needs that are only known at runtime. In that case, you want to pass the relevant data to the containerized pipeline only at runtime, based on what is actually needed.
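As a concrete sketch of such a pipeline (the function and table names here are hypothetical, purely for illustration), imagine an ETL step whose data partition and target table are only known when a job is triggered:

```python
from datetime import date

def run_etl(run_date: str, target_table: str) -> str:
    """Toy ETL step: which partition to load and where to write it
    are decided at runtime, not baked into the image."""
    # a real pipeline would extract, transform, and load here
    return f"loaded partition {run_date} into {target_table}"

# at runtime, a scheduler or user supplies the actual values
print(run_etl(str(date.today()), "analytics.daily_sales"))
```

Because `run_date` and `target_table` change on every run, they can't be hard-coded into the container image; they have to arrive as runtime arguments.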
A Common Dockerfile Pattern for a Simple Python Data Project
Below is a quick, commonly used Dockerfile pattern for a simple Python data project. It assumes a project structure in which main.py contains all your workflow and project logic and is eventually invoked by Python. However, this pattern doesn't let you pass in parameters that are only generated at runtime, e.g. user-inputted data.
# Dockerfile
# pull a Python-based image
FROM python:3.8-slim

# set directories and copy files into the container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

# set up a virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# install libraries
RUN pip install -r requirements.txt

# execute main.py to run your program
CMD ["python3", "main.py"]
Pass Event-Driven Parameters into Containers at Runtime
By setting the ENTRYPOINT to python3 main.py, you can use the CMD statement to supply default arguments, and Docker lets you override those defaults with whatever arguments you pass when you actually run the container.
# Dockerfile
# pull a Python-based image
FROM python:3.8-slim

# set directories and copy files into the container
RUN mkdir -p /home/project_name
WORKDIR /home/project_name
COPY . /home/project_name

# set up a virtual environment
ENV VIRTUAL_ENV=/home/project_name/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# install libraries
RUN pip install -r requirements.txt

# provide an entry point for the program
ENTRYPOINT ["python3", "main.py"]

# provide default parameters/values for arguments that are only known at runtime
# these defaults will be overridden by values passed at runtime
CMD ["--arg_1=arg_1_value", "--arg_2=arg_2_value"]
And, of course, you need to use the argparse library to handle the command-line arguments in main.py:
# main.py
import argparse

def main(arg_1, arg_2):
    # insert your project logic flow here
    pass

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='An ETL pipeline')
    parser.add_argument('--arg_1', help='arg_1 description')
    parser.add_argument('--arg_2', help='arg_2 description')
    args = parser.parse_args()
    main(arg_1=args.arg_1, arg_2=args.arg_2)
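You can see the default-versus-override behavior without Docker by handing parse_args an explicit argument list. The two calls below mirror what main.py receives when the CMD defaults are used versus when you override them at docker run (argument names follow the example above):

```python
import argparse

# same parser as in main.py
parser = argparse.ArgumentParser(description='An ETL pipeline')
parser.add_argument('--arg_1', help='arg_1 description')
parser.add_argument('--arg_2', help='arg_2 description')

# what main.py sees when Docker falls back to the CMD defaults
defaults = parser.parse_args(["--arg_1=arg_1_value", "--arg_2=arg_2_value"])
print(defaults.arg_1)  # arg_1_value

# what main.py sees when you override the arguments at `docker run`
runtime = parser.parse_args(["--arg_1=arg_1_runtime_value", "--arg_2=arg_2_runtime_value"])
print(runtime.arg_1)  # arg_1_runtime_value
```

Either way, main.py itself is unchanged; only the argument list Docker hands it differs.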
Finally, you can build the container image and run the container with the following commands. The second command is where you pass in the values/data generated at runtime; if you omit the arguments, the defaults from the CMD statement are used instead.
$ docker build -t project_name:project_version .
$ docker run --rm project_name:project_version --arg_1="arg_1_runtime_value" --arg_2="arg_2_runtime_value"