Are you an aspiring data scientist looking to streamline your development process and ensure reproducibility in your projects? Look no further! In this blog post, we’ll walk you through setting up a development environment using Docker, a powerful tool for containerization.
What is Docker, and Why Containers?
Docker is a platform that allows you to package and distribute applications as lightweight, portable containers. Containers are like virtual machines, but they are more efficient, consume fewer resources, and are incredibly easy to use.
The main benefit of using Docker is reproducibility. Docker containers encapsulate everything your application needs, including its dependencies. This ensures that your development environment remains consistent across different machines and environments.
Now, let’s dive into the steps to set up your data science development environment with Docker.
Pre-requisites:
Before we start, make sure you have:
- WSL (Windows Subsystem for Linux) installed on your Windows machine.
- Enabled hardware virtualization in your BIOS settings (required for Hyper-V and the WSL 2 backend).
- Downloaded and installed Visual Studio Code (VS Code) and Docker Desktop.
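A quick way to confirm everything is in place is to run a few version checks in a terminal. Note that the code command is only available if VS Code was added to your PATH during installation, and wsl --status requires a reasonably recent WSL release.
wsl --status
docker --version
code --version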
Setting Up Extensions in VS Code:
Open VS Code and install the “Docker” and “Dev Containers” extensions. These extensions make it easier to work with Docker containers directly from your code editor.
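If you prefer the command line, the same extensions can also be installed with the code CLI. The extension IDs below are the Microsoft-published ones at the time of writing; verify them in the marketplace if installation fails.
code --install-extension ms-azuretools.vscode-docker
code --install-extension ms-vscode-remote.remote-containers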
Creating the Docker Image:
Open your project folder in VS Code. This can be a local folder or a clone of a Git repository. Open the terminal in VS Code; the current working directory should be the folder you have opened. Create a new file called Dockerfile. Follow the code below for its contents.
# Use an appropriate base image, such as Python.
FROM python:3.XX-slim
# Set the working directory inside the container.
WORKDIR /<directory_name>
# Copy your requirements.txt file to the container.
COPY requirements.txt .
# Install the packages listed in requirements.txt, clearing the cache.
RUN pip install --no-cache-dir -r requirements.txt
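For reference, here is a filled-in version of the same Dockerfile. Python 3.10 and the folder name my_project are only placeholder choices; substitute your own version and project name.
# Example Dockerfile with placeholder values filled in.
FROM python:3.10-slim
WORKDIR /my_project
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt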
XX is the version of Python that you want to use as the base. The -slim tag selects a minimal variant of that base image.
The 2nd command creates and sets the working directory inside the container. It is recommended that you give it the same name as your local project directory.
The 3rd command copies the requirements.txt from your current local directory to the working directory inside the container.
The 4th command installs the libraries listed in requirements.txt into the container. This is the same command you use when setting up a Python environment locally. It is okay if the requirements.txt file is blank for now.
Building the Image:
Run the command below to build the Docker image specified by the Dockerfile above. The trailing dot tells Docker to use the current folder as the build context.
docker build -t <new_image_name> .
If you started with a blank requirements file, the image will just contain the base Python. As you add libraries to the requirements file later, you will need to rebuild the Docker image using the command above.
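As a concrete example, with a hypothetical image name ds-env (any lowercase name works):
docker build -t ds-env .
Running docker images afterwards lists the images on your machine, so you can confirm that ds-env was created.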
Starting the Container:
In the terminal, run one of the following commands to start the environment. The second version runs the container in interactive mode while also mounting the current folder to a folder of the same name inside the container. In the Docker tab of VS Code, you should see the running container indicated by a green “play” button.
docker run <image_name>
docker run -it -v ${pwd}:/<directory_name> <image_name>
If you run the second command from PowerShell, ${pwd} expands to your current directory; in a bash shell, use $(pwd) instead. All changes to files in your local directory or in the mounted directory inside the container are immediately reflected in the other.
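Continuing the hypothetical names from above, the interactive command for a project folder called my_project would be:
docker run -it -v ${pwd}:/my_project ds-env
Running docker ps in another terminal lists the running containers, which is a quick way to confirm the environment started.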
To access the container via VS Code, open the Remote Explorer view, click the running container, and choose Attach in a New Window. This will open another VS Code window. It looks almost identical to the window already open, but if you look at the lower left corner, you will see that this new window is operating inside the container.
Installing Libraries:
Inside your Docker container, you can use pip to install data science libraries just as you would on your local machine. Run the command below to create the requirements file from the installed libraries. It lists all installed libraries and their dependencies, so you may have to manually prune the file to include only the key libraries, similar to the example below.
pip freeze > requirements.txt
numpy==1.19.5
pandas==1.3.3
matplotlib==3.4.3
scikit-learn==0.24.2
To make the libraries persist in the image, close the connection to the container and rebuild the image with the updated requirements.txt, as shown above.
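Putting the pieces together, a typical update cycle might look like the sketch below; the package names are only examples. Inside the attached container terminal, install what you need and regenerate the requirements file:
pip install numpy pandas
pip freeze > requirements.txt
Because the folder is mounted, the updated requirements.txt also appears in your local project folder. Close the attached window, stop the container, and rebuild the image locally so the new libraries persist:
docker build -t ds-env .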
In conclusion, Docker is a game-changer for data scientists, offering a way to create reproducible development environments quickly. By following these steps, you’ll be well on your way to setting up an efficient data science environment that’s ready for all your exciting projects.
Stay tuned for more tutorials on data science tools and techniques, and happy coding!