Unlocking Data Science: Your Easy Docker Setup Guide

Are you an aspiring data scientist looking to streamline your development process and ensure reproducibility in your projects? Look no further! In this blog post, we’ll walk you through setting up a development environment using Docker, a powerful tool for containerization.

What is Docker, and Why Containers?

Docker is a platform that allows you to package and distribute applications as lightweight, portable containers. Containers are like virtual machines, but they are more efficient, consume fewer resources, and are incredibly easy to use.

The main benefit of using Docker is reproducibility. Docker containers encapsulate everything your application needs, including its dependencies. This ensures that your development environment remains consistent across different machines and environments.

Now, let’s dive into the steps to set up your data science development environment with Docker.

Pre-requisites:

Before we start, make sure you have:

  1. WSL (Windows Subsystem for Linux) installed on your Windows machine.
  2. Activated Hyper-V virtualization in your BIOS settings.
  3. Downloaded and installed Visual Studio Code (VS Code) and Docker Desktop.
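Once these are in place, you can sanity-check the setup from a terminal. The commands below are standard CLI checks (run them in a Windows terminal); exact output varies by version:

```shell
# Check that WSL is installed and report its status.
wsl --status

# Check that Docker Desktop's CLI is on your PATH.
docker --version

# Verify that the Docker daemon is actually running.
docker info
```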

Setting Up Extensions in VS Code:

Open VS Code and install the “Docker” and “Dev Containers” extensions. These extensions make it easier to work with Docker containers directly from your code editor.



Creating the Docker Image:

Open your project folder in VS Code. This can be a local folder or a clone of a Git repository. Open the terminal in VS Code; its working directory should be the folder you just opened. Create a new file called Dockerfile with the contents below.

# Use an appropriate base image, such as Python.
FROM python:3.XX-slim

# Set the working directory inside the container.
WORKDIR /<directory_name>

# Copy your requirements.txt file to the container.
COPY requirements.txt .

# Install the packages listed in requirements.txt, clearing the cache.
RUN pip install --no-cache-dir -r requirements.txt
Here, XX is the minor version of Python that you want to use as a base, and the -slim tag pulls a minimal variant of that base image. The second instruction creates and sets the working directory inside the container; it is convenient to give it the same name as your local project directory. The third instruction copies requirements.txt from your current local directory into the working directory inside the container. The fourth instruction installs the libraries listed in requirements.txt into the container, using the same command you would use when setting up a Python environment locally. It is okay if the requirements.txt file is blank for now.
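As a concrete sketch, assuming Python 3.11 and a project folder named my-project (both are placeholders you would replace with your own choices), the filled-in Dockerfile might look like this:

```dockerfile
# Base image: a minimal Python 3.11 (placeholder version).
FROM python:3.11-slim

# Create and set the working directory inside the container.
WORKDIR /my-project

# Copy the dependency list into the working directory.
COPY requirements.txt .

# Install the listed libraries without keeping pip's download cache.
RUN pip install --no-cache-dir -r requirements.txt
```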

Building the image:

Run the command below to create a Docker image specified by Dockerfile above.

docker build -t <new_image_name> .

If you started out with a blank requirements file, the image will contain just the base Python installation. As you add libraries to the requirements file later, you need to rebuild the Docker image using the command above.
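For example, assuming the hypothetical image name ds-env, building the image and then confirming it exists looks like this:

```shell
# Build the image from the Dockerfile in the current directory.
docker build -t ds-env .

# List the image to confirm the build succeeded.
docker images ds-env
```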

Starting the Container:

In the terminal, run one of the following commands to start the environment. The second version runs the container in interactive mode and mounts the current folder to a folder of the same name inside the container. In the Docker tab of VS Code, you should see the running container marked with a green “play” icon.

docker run <image_name>
docker run -it -v "$(pwd)":/<directory_name> <image_name>

(In a bash shell, $(pwd) expands to the current directory; in PowerShell, use ${PWD} instead.)

All changes to files in either your local directory or the mounted directory inside the container are immediately reflected in the other.
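To convince yourself the mount works, you can create a file on one side and look for it on the other. The container name below is a placeholder; use docker ps to find yours:

```shell
# On your local machine, inside the project folder:
echo "hello" > mount-test.txt

# Find the name or ID of the running container.
docker ps

# The file should appear in the mounted directory inside the container.
docker exec <container_name> ls /<directory_name>
```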

To access the container via VS Code, use the Remote Explorer extension, click the running container, and choose “Attach in New Window.” This will open another VS Code window. It looks almost identical to the already open window, but if you look at the lower-left corner, you will see that this new window is operating inside the container.



Installing Libraries:

Inside your Docker container, you can use pip to install data science libraries just as you would on your local machine. Run the command below to create the requirements file from the installed libraries:

pip freeze > requirements.txt

This command lists every installed library, including transitive dependencies, so you may have to manually prune the file down to just the key libraries, similar to the example below. To make the libraries persist in the image, close the connection to the container and rebuild the image with the updated requirements.txt as shown above.

numpy==1.19.5
pandas==1.3.3
matplotlib==3.4.3
scikit-learn==0.24.2
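If the frozen list is long, a small script can help with the pruning. This is just a sketch: KEY_PACKAGES is an assumed, project-specific set that you would edit yourself.

```python
# Sketch: filter a pip-freeze dump down to a hand-picked set of key packages.
# KEY_PACKAGES is an assumption -- replace it with your project's core libraries.
KEY_PACKAGES = {"numpy", "pandas", "matplotlib", "scikit-learn"}

def prune_requirements(freeze_lines, keep=KEY_PACKAGES):
    """Keep only the lines whose package name (before '==') is in `keep`."""
    pruned = []
    for line in freeze_lines:
        name = line.split("==")[0].strip().lower()
        if name in keep:
            pruned.append(line)
    return pruned

freeze = [
    "joblib==1.1.0",
    "numpy==1.19.5",
    "pandas==1.3.3",
    "python-dateutil==2.8.2",
    "scikit-learn==0.24.2",
]
print(prune_requirements(freeze))
# → ['numpy==1.19.5', 'pandas==1.3.3', 'scikit-learn==0.24.2']
```

You would run this against the lines of the freeze output and write the pruned result back to requirements.txt.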

In conclusion, Docker is a game-changer for data scientists, offering a way to create reproducible development environments quickly. By following these steps, you’ll be well on your way to setting up an efficient data science environment that’s ready for all your exciting projects.

Stay tuned for more tutorials on data science tools and techniques, and happy coding!
