Start a docker container on syrius
Installing nvidia drivers
Install the CUDA toolkit and drivers from NVIDIA[1]. The version currently installed on syrius is CUDA 8.0.44, with driver version 367.48.
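After the install, a quick sanity check is possible. This is a sketch; it assumes `nvidia-smi` and `nvcc` are on the PATH, which depends on where CUDA was installed:

```shell
# Sketch: report the installed driver and CUDA toolkit versions, if the tools are present
driver_version=$(command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
  || echo "nvidia-smi not found")
toolkit_version=$(command -v nvcc >/dev/null 2>&1 \
  && nvcc --version | grep -o "release [0-9.]*" \
  || echo "nvcc not found")
echo "driver:  $driver_version"
echo "toolkit: $toolkit_version"
```

On syrius the reported versions should match the ones above (driver 367.48, CUDA 8.0).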
Installing docker and nvidia-docker
1. The docker engine was installed using the yum package manager. The version currently installed (and the one we are using right now) is Docker version 1.13.0, build 49bf474.
1.1 Follow the instructions at https://docs.docker.com/install/linux/docker-ce/centos/#install-docker-ce to add the docker-ce repo and install with yum.
1.2 To add the nvidia repository, follow https://nvidia.github.io/nvidia-docker/
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
2. To install nvidia-docker, download the most recent version of the .rpm package from here[2], then start the docker service:
service docker start
Configuring nvidia-docker-plugin
The 'nvidia-docker-plugin' is a service in charge of the communication between the drivers on the host machine and the docker container. This service is started automatically by nvidia-docker:
service nvidia-docker start
For the nvidia-docker (ndocker) engine to work, the drivers on the host machine must be stored in the same partition as the volumes used by the 'nvidia-docker-plugin'.
To change this behavior, modify how the nvidia-docker-plugin is started, as follows:
Create a directory in the same partition where the nvidia drivers are installed (typically under /usr/):
mkdir /usr/local/nvidia-docker/
Once the directory is created, modify how the service is started.
locate nvidia-docker.service
cd /usr/lib/systemd/system/
vim nvidia-docker.service
Modify the line that executes the nvidia-docker-plugin
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-docker/
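Editing the unit file under /usr/lib/systemd/system/ works, but a package update can overwrite it. As an alternative sketch (assuming a systemd version with drop-in support), the same change can live in a drop-in override; the path below is hypothetical:

```ini
# /etc/systemd/system/nvidia-docker.service.d/override.conf (hypothetical drop-in path)
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit
ExecStart=
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-docker/
```

Run systemctl daemon-reload afterwards so systemd picks up the override.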
Modify where the docker images are stored
vim docker.service
Modify the ExecStart line by adding the path to the storage
ExecStart=/usr/bin/dockerd --graph=/Sirius_Storage/docker
Reload the systemd configuration, restart both services, and cross your fingers.
systemctl daemon-reload
service nvidia-docker restart
service docker restart
Troubleshooting the volume where the drivers are installed
Got permission denied while trying to connect to the Docker daemon socket
Add the user to the docker group, then log out and back in:
sudo usermod -aG docker <your login>
Error response from daemon: error while mounting volume '/usr/local/nvidia-docker/nvidia_driver/375.26
or
Error response from daemon: get nvidia_driver_375.26: no such volume: nvidia_driver_375.26
Sometimes the volume where the drivers are stored for the docker container needs to be created. If you see these errors, this is likely the case:
First, check whether nvidia-docker has a volume:
nvidia-docker volume ls
You should see something like:
DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_375.26
If you don't see that, then you must create the volume:
nvidia-docker volume create -d nvidia-docker nvidia_driver_375.26
Modify the driver version accordingly, then restart the nvidia-docker service.
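The volume name is just the string nvidia_driver_ followed by the installed driver version, so the right name can be derived mechanically. A sketch (375.26 is an example version; substitute the output of nvidia-smi on your host):

```shell
# Sketch: build the expected volume name from the driver version
# (375.26 is an example; on a real host, query the version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader)
driver_version="375.26"
volume_name="nvidia_driver_${driver_version}"
echo "$volume_name"   # → nvidia_driver_375.26
```

The resulting name is what you pass to nvidia-docker volume create above.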
Start a container with tensorflow
The docker container will mount a local directory (your work folder), so you have access to your files, source code, etc.:
ndocker run -t -i -v /work/<your login>:/root/work gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
cuInit: CUDA_ERROR_UNKNOWN
The CUDA context fails to initialize.
Try running the "nvidia-cuda-mps-server" on the host machine. This should solve the issue and allow the CUDA MPS context to be updated.
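One way to launch it, sketched below, uses the MPS control daemon that ships with CUDA (nvidia-cuda-mps-control spawns nvidia-cuda-mps-server on demand). The pipe and log directory locations are assumptions; any writable paths work:

```shell
# Sketch: start the CUDA MPS control daemon on the host
# (directory locations are assumptions, not fixed defaults)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
  nvidia-cuda-mps-control -d   # the control daemon spawns nvidia-cuda-mps-server on first use
else
  echo "nvidia-cuda-mps-control not found (is the CUDA bin directory on the PATH?)"
fi
```

Containers started afterwards with ndocker should then be able to create a CUDA context.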