Start a docker container on syrius
Installing nvidia drivers
Install the CUDA toolkit and drivers from NVIDIA[1]. The version currently installed on syrius is CUDA 8.0.44, with driver version 367.48.
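After the install, a quick sanity check is possible. This is a sketch; it assumes `nvidia-smi` and `nvcc` are on the PATH, which depends on where CUDA was installed:

```shell
# Sketch: report the installed driver and CUDA toolkit versions, if the tools are present
driver_version=$(command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
  || echo "nvidia-smi not found")
toolkit_version=$(command -v nvcc >/dev/null 2>&1 \
  && nvcc --version | grep -o "release [0-9.]*" \
  || echo "nvcc not found")
echo "driver:  $driver_version"
echo "toolkit: $toolkit_version"
```

On syrius the reported versions should match the ones above (driver 367.48, CUDA 8.0).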
Installing docker and nvidia-docker
1. The docker engine was installed using the yum package manager. The version currently installed (and the one we are using right now) is Docker version 1.13.0, build 49bf474.
1.1 Follow the instructions at https://docs.docker.com/install/linux/docker-ce/centos/#install-docker-ce to add the docker-ce repo and install with yum.
1.2 To add the nvidia repository, follow https://nvidia.github.io/nvidia-docker/
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
2. To install nvidia-docker, download the most recent version of the .rpm package from here[2], then start the docker service:
service docker start
Configuring nvidia-docker-plugin
The 'nvidia-docker-plugin' is a service in charge of the communication between the drivers on the host machine and the docker container. This service is started automatically by nvidia-docker:
service nvidia-docker start
For the nvidia-docker (ndocker) engine to work, the drivers on the host machine must be stored in the same partition as the volumes used by the 'nvidia-docker-plugin'.
To change this behavior, modify how the nvidia-docker-plugin is started, as follows:
Create a directory in the same partition where the nvidia drivers are installed (typically under /usr/):
mkdir /usr/local/nvidia-docker/
Once the directory is created, modify how the service is started.
locate nvidia-docker.service
cd /usr/lib/systemd/system/
vim nvidia-docker.service
Modify the line that executes the nvidia-docker-plugin
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-docker/
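Editing the unit file under /usr/lib/systemd/system/ works, but a package update can overwrite it. As an alternative sketch (assuming a systemd version with drop-in support), the same change can live in a drop-in override; the path below is hypothetical:

```ini
# /etc/systemd/system/nvidia-docker.service.d/override.conf (hypothetical drop-in path)
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit
ExecStart=
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-docker/
```

Run systemctl daemon-reload afterwards so systemd picks up the override.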
Modify where the docker images are stored
vim docker.service
Modify the ExecStart line by adding the path to the storage
ExecStart=/usr/bin/dockerd --graph=/Sirius_Storage/docker
Reload the systemd configuration, restart both services, and cross your fingers.
systemctl daemon-reload
service nvidia-docker restart
service docker restart
Troubleshooting the volume where the drivers are installed
Got permission denied while trying to connect to the Docker daemon socket
Add the user to the docker group, then log out and back in:
sudo usermod -aG docker <your login>
Error response from daemon: error while mounting volume '/usr/local/nvidia-docker/nvidia_driver/375.26
or
Error response from daemon: get nvidia_driver_375.26: no such volume: nvidia_driver_375.26
Sometimes the volume where the drivers are stored for the docker container needs to be created. If you see these errors, this is likely the case:
First, check whether nvidia-docker has a volume:
nvidia-docker volume ls
You should see something like:
DRIVER              VOLUME NAME
nvidia-docker       nvidia_driver_375.26
If you don't see that, then you must create the volume:
nvidia-docker volume create -d nvidia-docker nvidia_driver_375.26
Modify the driver version accordingly, then restart the nvidia-docker service.
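The volume name is just the string nvidia_driver_ followed by the installed driver version, so the right name can be derived mechanically. A sketch (375.26 is an example version; substitute the output of nvidia-smi on your host):

```shell
# Sketch: build the expected volume name from the driver version
# (375.26 is an example; on a real host, query the version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader)
driver_version="375.26"
volume_name="nvidia_driver_${driver_version}"
echo "$volume_name"   # → nvidia_driver_375.26
```

The resulting name is what you pass to nvidia-docker volume create above.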
Start a container with tensorflow
The docker container will mount a local directory (your work folder), so you have access to your files, source code, etc.:
ndocker run -t -i -v /work/<your login>:/root/work gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
cuInit: CUDA_ERROR_UNKNOWN
The CUDA context fails to initialize.
Try running the "nvidia-cuda-mps-server" on the host machine. This should solve the issue and allow the CUDA MPS context to be updated.
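One way to launch it, sketched below, uses the MPS control daemon that ships with CUDA (nvidia-cuda-mps-control spawns nvidia-cuda-mps-server on demand). The pipe and log directory locations are assumptions; any writable paths work:

```shell
# Sketch: start the CUDA MPS control daemon on the host
# (directory locations are assumptions, not fixed defaults)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
  nvidia-cuda-mps-control -d   # the control daemon spawns nvidia-cuda-mps-server on first use
else
  echo "nvidia-cuda-mps-control not found (is the CUDA bin directory on the PATH?)"
fi
```

Containers started afterwards with ndocker should then be able to create a CUDA context.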