Problem

A GPU-enabled instance cannot use its GPU, and the NVIDIA driver does not appear to be running. Running "nvidia-smi" returns an error such as:

$> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This usually happens after an update of the operating system kernel. The NVIDIA kernel module was built for the previous kernel, so the driver must be rebuilt to be compatible with the new one.
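
To confirm that a kernel update is the likely cause before refreshing the driver, you can compare the running kernel with the installed NVIDIA module. The following is a minimal sketch, not part of the Morpheus workflow. It assumes a Linux guest where the standard "modinfo" and "lsmod" tools are on the PATH (on some systems they live in /sbin and need root). The function name report_nvidia_module is made up for this example.

```shell
#!/bin/sh
# Sketch: report whether an "nvidia" kernel module exists for the running
# kernel and whether it is currently loaded.
# report_nvidia_module is a hypothetical helper name, not a Morpheus or NVIDIA tool.

report_nvidia_module() {
    echo "Running kernel: $(uname -r)"

    # modinfo searches the module tree of the running kernel by default.
    # A failure suggests the driver was not built for this kernel
    # (or that modinfo itself is unavailable).
    if modinfo nvidia >/dev/null 2>&1; then
        echo "nvidia module built for running kernel: yes"
    else
        echo "nvidia module built for running kernel: no"
    fi

    # lsmod lists loaded modules; the trailing space in the pattern avoids
    # matching nvidia_uvm, nvidia_drm and similar companion modules.
    if lsmod 2>/dev/null | grep -q '^nvidia '; then
        echo "nvidia module loaded: yes"
    else
        echo "nvidia module loaded: no"
    fi
}

report_nvidia_module
```

If both answers are "no" shortly after a kernel upgrade, that is consistent with the situation described above, and the driver refresh in the Solution section is the next step.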

Solution

Using the Morpheus web portal:

  1. Navigate to the instance showing the problem.
  2. Click ACTIONS, then Run Workflow.
  3. Pick "Nvidia driver refresh" and click EXECUTE.
  4. Morpheus will show the progress of this operation, and after a few moments, the GPUs should be available again.


Once your instance is running, you can check whether it can see the GPU with:

$> nvidia-smi
Tue Nov 17 15:20:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.87       Driver Version: 440.87       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
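
If you want to run this check from a script instead of reading the table by eye, the exit status of nvidia-smi can be used. The sketch below relies on "nvidia-smi -L", which lists the GPUs the driver can see. The function name check_gpu and its return codes are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: scripted version of the manual nvidia-smi check.
# check_gpu is a hypothetical helper; its return codes are chosen for this example:
#   0 = at least one GPU listed, 1 = driver not reachable, 2 = nvidia-smi missing.

check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on PATH"
        return 2
    fi

    # "nvidia-smi -L" prints one line per visible GPU and exits non-zero
    # when it cannot communicate with the driver.
    if gpus=$(nvidia-smi -L 2>/dev/null) && [ -n "$gpus" ]; then
        echo "GPU visible:"
        echo "$gpus"
        return 0
    fi

    echo "nvidia-smi could not communicate with the NVIDIA driver"
    return 1
}

check_gpu || echo "GPU check failed; the driver may need to be refreshed"
```

A non-zero result after a kernel update points back to the workflow described in the Solution section.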