My deep learning development environment: TensorFlow + Docker + PyCharm, etc. What about yours?

Author: Killian

Compiled by Machine Heart

Participants: Nurhachu Null, Li Yazhou

In this article, researcher Killian describes his deep learning development environment: TensorFlow + Docker + PyCharm + OS X Fuse + Tensorboard. Every developer, however, configures a different environment depending on budget, language habits, and development needs, and runs into different problems along the way. We have therefore attached a questionnaire at the end of the article, in the hope of learning about the deep learning environments of many different developers and compiling the results into a follow-up article that offers a range of perspectives.

I spent quite a bit of time trying different things to configure a deep learning environment, so I thought I would document my current workflow in the hope that it helps others trying to do the same thing.

Goals

Before starting to create my models, I had a few clear goals in mind for the development environment I would ideally use. Here are the high-level goals that I will detail in this blog post:

  • Edit my code with PyCharm on my local machine (a standard MacBook Pro laptop)

  • Use a powerful remote machine to train my models

  • Share that remote machine with my colleagues without any conflicts

  • Run/debug my TensorFlow code in development/production mode in Docker containers on both the local and remote machines

  • While a model is training on the remote machine, monitor its performance graphically in real time on my local machine

Acknowledgements

I would like to thank my lab mate Chris Saam for pointing me to several interesting tools that I will mention in this article.

One-time installation

On the remote machine

So, before doing anything else, here are a few things you might want to do. By the way, in this post, "remote machine" refers to the super duper machine (with all the GPUs attached) on which you plan to train your deep learning models.

Install Nvidia-docker: The first thing you need to do is install Nvidia-docker. Docker is a really cool tool, but out of the box it does not give containers effective access to NVIDIA GPU hardware or the CUDA drivers, so you cannot use plain Docker to train your deep models. Nvidia-docker solves this problem for you and otherwise feels just like regular Docker: on top of the regular Docker commands, it provides options that let you manage your NVIDIA GPU hardware more effectively.
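Once nvidia-docker is installed, a quick sanity check (a minimal sketch, assuming the public nvidia/cuda image is available) is to run nvidia-smi inside a container and confirm that your GPUs show up:

# Verify that containers can see the GPUs

nvidia-docker run --rm nvidia/cuda nvidia-smi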

Figure 1: NVIDIA-Docker (Courtesy of NVIDIA-Docker)

Install Slurm: If you plan to share that deep learning machine with your colleagues, you might want to consider installing a tool like SLURM. SLURM gives you more control over what your teammates can do on the machine by limiting the set of commands that can be used by default, and forces each team member to run their code in a "job" environment with specific dedicated GPU/CPU resources. This is really useful if you want to avoid any resource contention caused by teammates accessing the machine at the same time.
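Once SLURM is in place, two standard commands give you a quick view of the machine's state (shown here just as a reference):

# List the partitions/nodes and their current state

sinfo

# List the jobs currently queued or running

squeue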

Standardize your folder setup: If you plan to share your machine with colleagues, it is also a good idea to agree on a standardized folder structure. My deep learning machine is set up like this:

  • The /home/myusername folder contains your own private project code.

  • The /data folder contains datasets that the team shares during the project.

  • The /work folder contains the specific dataset needed for the current experiment. This folder is one level below the /data folder, but it provides faster access during training (see the example after this list).
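For example, pulling the data needed for the current experiment from /data into /work might look like this (a sketch; the dataset name is hypothetical):

# Copy the current experiment's dataset to the faster /work folder

rsync -a /data/my_dataset/ /work/my_dataset/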

On the local machine

Install OS X Fuse: If, like me, you are using a recent version of OS X, you may want to install OS X Fuse. OS X Fuse allows you to mount folders from a remote machine in your local Finder over SFTP/SSH. Alternatively, if you don't want to spend time mounting your remote /home folder, you can simply use git push/pull to move code between your local machine and the remote machine, but this is not very efficient, so mounting these folders will save you a lot of time in the long run.
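If you use Homebrew, one possible way to install both OS X Fuse and sshfs is the following (a sketch; package names may vary with your Homebrew version):

# Install OS X Fuse and sshfs via Homebrew

brew cask install osxfuse

brew install sshfs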

Set up a remote Python interpreter: Using the same docker image on both your local and remote machines is another way to avoid environment configuration issues later on. PyCharm has a cool feature that lets you run your code in a docker container. Before setting up anything in PyCharm, make sure you have the correct TensorFlow docker image. On your local machine, you only need the following steps to get the TensorFlow docker image:

# Start your docker virtual machine

docker-machine start default

# Get the latest TensorFlow CPU version docker image

docker pull gcr.io/tensorflow/tensorflow:latest

Once you have the desired docker image, go and set up your PyCharm project interpreter. In PyCharm, go to Preferences > Project Interpreter > Add Remote (see below). When the docker virtual machine instance is running on your local machine, select the Docker configuration. After PyCharm connects to your docker virtual machine, you should see the TensorFlow image you just pulled in the list of available images. Once this is set up, you are good to go whenever PyCharm is connected.
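If the image does not show up in PyCharm, it can help to confirm from the command line that it was actually pulled:

# List the docker images available locally; the TensorFlow image should appear here

docker images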

Daily routine

On the local machine

Mounting remote folders: First, make sure you can access the scripts you want to run from your local machine. So mount the /home/myusername folder on your Mac using OS X Fuse, and optionally mount the deep learning data folder as well. You may want to define aliases for all of these commands, as they do get a bit long (see the example after the commands below).

# Mount your remote home folder

sshfs -o uid=$(id -u) -o gid=$(id -g) myusername@remotemachine:/home/myusername/ /LocalDevFolder/MountedRemoteHomeFolder

# Mount your remote data folder (optional)

sshfs -o uid=$(id -u) -o gid=$(id -g) myusername@remotemachine:/data/myusername/ /LocalDevFolder/MountedRemoteDataFolder

Here uid and gid are used to map the user and group IDs of the local and remote machines, since these may be different.
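Since these commands are long, it is handy to wrap them in aliases, for example in your ~/.bash_profile (a sketch; the alias names are arbitrary):

# Aliases for mounting the remote home and data folders

alias mount_home='sshfs -o uid=$(id -u) -o gid=$(id -g) myusername@remotemachine:/home/myusername/ /LocalDevFolder/MountedRemoteHomeFolder'

alias mount_data='sshfs -o uid=$(id -u) -o gid=$(id -g) myusername@remotemachine:/data/myusername/ /LocalDevFolder/MountedRemoteDataFolder'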

Start Docker on your local machine: Next, we want to make sure that PyCharm will access the correct libraries to compile our code locally. To do this, just start a Docker VM locally. If you didn't change anything in your settings, the TensorFlow CPU image should already be in your local Docker environment.

docker-machine start default

Open pycharm and select the project in the home folder you just mounted. Go to the Project Interpreter preferences and select the remote TensorFlow interpreter you created earlier in the list of available project interpreters. Pycharm should be able to compile your code correctly. At this point, you can use your code anytime, anywhere, and change anything you want.

On the remote machine

Ok, you have updated your code in pycharm with a new feature, and you want to train/test your model.

Log into your remote machine over SSH: The first thing you need to do is simply log into your deep learning machine.

ssh myusername@remotemachine

Running a SLURM job: Before you proceed, make sure that no one else on your team is running a job that would prevent yours from getting the resources it needs; it is always a good idea to check what jobs are currently running on the remote machine. To do this with SLURM, just run the squeue command, which lists the jobs currently running on the machine. If for some reason one of your previous jobs is still running, you can cancel it with the scancel command. Once you are sure that no other jobs are running, you can start a new one with the following command.

srun --pty --share --ntasks=1 --cpus-per-task=9 --mem=300G --gres=gpu:15 bash

The srun command gives you quite a few options to let you specify what resources a particular task needs. In this case, the cpus-per-task, mem, and gres options let you specify the number of CPUs, total memory, and number of GPUs, respectively, that the task needs. The pty option just provides a nice command line interface.
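Once the job starts and you get a shell, it is worth checking that the GPUs you asked for are actually visible before going any further:

# Inside the SLURM job: list the GPUs allocated to this job

nvidia-smi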

Start Nvidia docker: Now that you have the resources allocated for your task, start a docker container to run your code in the right environment. Instead of regular docker, we will use NVIDIA-Docker to take full advantage of our GPUs. Also, to fully utilize your hardware, make sure you are running the GPU version of the TensorFlow docker image rather than the CPU version. Don't forget to use the -v option to mount your project folder in the docker container. Once you are inside that container, you can simply use regular python commands to run your code.

# Start your container

nvidia-docker run -v /home/myusername/MyDeepLearningProject:/src -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash

# Don't forget to switch to your source folder

cd /src

# Run your model

python myDLmodel.py
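Before kicking off a long training run, it can be reassuring to confirm that TensorFlow actually sees the GPU from inside the container (a minimal check using the TF 1.x device-listing API):

# Check which devices TensorFlow can see from inside the container

python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"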

On the local machine

Start Tensorboard visualization: You are almost there. Your code is running smoothly now, and you want to use tensorboard to see how the variables in your model change in real time. This is actually the easiest part. First, make sure you know the IP address of your local docker machine. You can do this with the following command:

docker-machine ls

Then switch to the mounted remote home folder and start a TensorFlow docker container. Since this container runs on your local machine, make sure you start the CPU version of the docker image. As mentioned above, don't forget to mount your project folder in the docker container. To visualize, on your local machine, the model that is training on the remote machine, you also need to map the port number used by Tensorboard from the container to your local machine using the -p option.

docker run -v /LocalDevFolder/MountedRemoteHomeFolder/MyDeepLearningProject:/src -p 6006:6006 -it gcr.io/tensorflow/tensorflow:latest /bin/bash

Once you are inside the docker container, start Tensorboard by specifying the path to where your model saves variables (most likely the path to the checkpoint folder):

tensorboard --logdir=Checkpoints/LatestCheckpointFolder

If everything went well, all you need to do now is go to http://DOCKER_MACHINE_IP:6006 using your favorite browser.
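On a Mac, you can grab that IP and open Tensorboard in one go (a small convenience sketch, assuming your docker machine is named default):

# Open Tensorboard in the default browser using the docker machine's IP

open http://$(docker-machine ip default):6006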

This will show you in Tensorboard all the variables you are tracking in your model.

Figure 2. Tensorboard visualization (provided by Jimgoo)
