Here’s Why You Should Learn Docker as a Data Scientist

Build your first Container with this essential guide to Docker for Data Scientists

I have lost count of how many times my code broke when someone else ran it. The strange part was that it worked on my machine. That’s where Docker saves the day: if it works on your machine, it will work on any other machine that runs Docker.

As of late 2020, knowing Docker is almost mandatory for data science jobs. No one says you should become an expert, but learning the basics can’t hurt. Today you’ll learn what Docker is and how to build your first container.

Docker is a tool that makes it easy to create, deploy, and run applications by using containers. You can package applications with their dependencies and deploy them as a single package.

Why should you care? Because “It works on my machine” doesn’t guarantee it will work on anyone else’s. With Docker containers, the application ships together with its dependencies, so what runs on your machine should run the same way on others.

Think of Docker as a lightweight virtual machine without its own operating system kernel. Containers share the kernel of the host they are running on. As a result, you get both better performance and smaller file sizes than with a full virtual machine. Win-win.

Here’s a bit of Docker terminology you should know before starting:

Container: a running instance of an image, a software unit that packages the code and its dependencies
Image: a read-only snapshot from which containers are started
Dockerfile: file used to build your images

Let’s see how to build your first container.

How to build your first Docker Container

You’ll be surprised by how easy it is.

To use Docker you’ll need to install it. Download Docker Desktop from this link, install it and open up the application.

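Now create the following project structure anywhere on your computer (an app.py file, a templates folder containing index.html, a requirements.txt file, and a Dockerfile):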

Image 1 — Directory structure for your Python app (image by author)
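
If you prefer to create the layout from code, here is a minimal Python sketch that creates empty versions of the four files this guide refers to. The root folder name my-flask-app and the helper name create_skeleton are assumptions made for illustration, so rename them as you like.

```python
from pathlib import Path

# Files referenced in this guide, relative to the project root.
PROJECT_FILES = [
    "app.py",
    "templates/index.html",
    "requirements.txt",
    "Dockerfile",
]


def create_skeleton(root):
    """Create empty placeholder files under root and return their paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_FILES:
        path = root / relative
        # Creates the project root and the templates/ folder when needed.
        path.parent.mkdir(parents=True, exist_ok=True)
        # touch() leaves the content of an already existing file untouched.
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

Calling create_skeleton("my-flask-app") gives you four empty files that you can fill in with the contents shown in the rest of this section.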

Let’s start with what you’re familiar with — Python. The app.py should contain the following code:

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def hello():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

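The whole purpose of this file is to instantiate the Flask application and run it on port 5000. Setting host='0.0.0.0' makes the app listen on all network interfaces, which is what lets you reach it from outside the container later on. Once the app is opened in the web browser, the index.html template is shown.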

Here’s what templates/index.html contains:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Python App</title>
</head>
<body>
    <h1>Python App</h1>
    <p>Docker is awesome!</p>
</body>
</html>

It’s a plain old boring file containing a heading and a paragraph. Still, enough to verify our app is working.

Next on the list is the requirements.txt file. It contains all the libraries needed for your app. We only need Flask, but the file can get much longer for real-world applications:

flask==1.1.2

It’s a good idea to pin the library version, so the same version gets installed on every build and a new upstream release can’t break your app unexpectedly.
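
To check that an environment really matches the pinned versions, something like the following sketch can help. It assumes Python 3.8 or newer (for importlib.metadata) and only understands simple name==version lines; the function names parse_pins and find_mismatches are hypothetical helpers, not part of pip or Flask.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} where the two strings differ.

    installed is None when the package is not installed at all.
    This is a plain string comparison, so '1.0' and '1.0.0' count as different.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, find_mismatches(parse_pins(open("requirements.txt"))) returns an empty dictionary when every pinned package is installed at exactly the pinned version.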

And finally, we have the Dockerfile (notice how it doesn’t have a file extension). This file is used to build Docker images. Here’s what you should put inside:

FROM python:3.8
COPY . /app
WORKDIR /app 
RUN pip install -r requirements.txt 
EXPOSE 5000 
CMD python ./app.py

So, what’s going on here? Here’s an overview:

FROM python:3.8 – specifies we want to use official Python 3.8 Docker image as a base
COPY . /app – copies our files to the /app folder
WORKDIR /app – defines the working directory of a Docker container
RUN pip install -r requirements.txt – installs every library listed in the requirements.txt file
EXPOSE 5000 – documents that the container listens on port 5000 at runtime (the port is only published to your machine by the -p flag of docker run)
CMD python ./app.py – specifies how to run our Python application

And that’s it! You can now build and run the Docker image. Let’s build it first. From the Terminal, navigate to the project folder and execute the following:

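docker build --tag my-flask-app .

This command will build an image called my-flask-app. The trailing dot tells Docker to use the current directory as the build context, which is where it looks for the Dockerfile. To run it, you have to execute the following: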

docker run -p 5000:5000 my-flask-app
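
If you would rather drive the build and run steps from a Python script, the two commands can be assembled as argument lists. This is only a sketch: the helper names build_command and run_command are made up for illustration, and the lists are meant to be passed to subprocess.run yourself.

```python
def build_command(tag, context="."):
    """Command that builds an image named tag from the Dockerfile in context."""
    return ["docker", "build", "--tag", tag, context]


def run_command(tag, host_port=5000, container_port=5000):
    """Command that starts a container from tag.

    The -p value has the form host_port:container_port, so the first number
    is the port on your machine and the second is the port inside the container.
    """
    return ["docker", "run", "-p", f"{host_port}:{container_port}", tag]
```

For example, subprocess.run(build_command("my-flask-app"), check=True) builds the image, and run_command("my-flask-app", host_port=8080) would make the app reachable on localhost:8080 instead.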

The app is running now on localhost:5000. Let’s verify everything is okay:

Image 2 — Testing if the app works (image by author)
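
You can also run the same check without a browser. The sketch below uses only the Python standard library; it assumes the container is running and published on localhost:5000, and the function names fetch_page and looks_like_our_app are hypothetical.

```python
from urllib.request import urlopen


def fetch_page(url="http://localhost:5000/", timeout=5):
    """Return (status_code, body_text) for a GET request to url."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
        return response.status, body


def looks_like_our_app(status, body):
    """True when the response is a 200 that contains the template's heading."""
    return status == 200 and "<h1>Python App</h1>" in body
```

With the container running, looks_like_our_app(*fetch_page()) should return True. If the container is not running, fetch_page raises an error from urllib instead.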

Wasn’t this easy? You could now deploy this Docker container to the cloud, and it will work like it did on your machine.

Conclusion

Today you’ve learned what Docker is and why it is useful in data science. You’ve also built your first container and verified it works. Docker is by far the easiest way to deploy applications and machine learning models to production.

Knowing Docker is almost always a prerequisite for data science jobs. I’m not that big of a fan of data scientists doing DevOps, but learning the basics can’t hurt.

Are you a Data Scientist using Docker? Please let me know.