There are currently 2 GPUs (NVIDIA Quadro M5000), each in a separate server, available to everyone in the MEP.
The GPUs can be used by submitting jobs to the 'quadro' queue.
Current status problems
GPU processing currently has a few annoyances:
- End users have to choose which GPU to use themselves, which becomes difficult when several users want to use GPUs at the same time (this might improve with the GPU support in HDP 3.1).
- The CUDA version cannot differ per user as long as we do not use Docker (nvidia-docker). The currently installed versions are cuda-10.0 for the Quadro GPUs and cuda-9.0 for the GTX 1080Ti GPUs.
- We submit jobs to the cluster through Spark but do not use Spark's parallelization, since the GPU takes care of that. A 'dummy' SparkContext therefore has to be created; otherwise the Spark job will time out.
- TensorFlow allocates all GPU memory by default, so unless end users configure the memory-related settings, the TensorFlow process will take all of the available GPU memory. This is mostly an issue on the GTX 1080Ti GPUs: they are all located on the same server, and a job that uses only 1 GPU will still allocate the memory of all 4.
- The TensorFlow version differs between CPU nodes (1.11.0) and GPU nodes (2.0.0).
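Several of the points above (manual GPU choice, the dummy SparkContext, and TensorFlow's memory behaviour) can be worked around from inside the submitted script. The following is a minimal sketch, not the method used in the sample repository: the helper names `select_gpu`, `start_dummy_spark_context` and `enable_memory_growth` are hypothetical, and it assumes TensorFlow 2.0 and PySpark are available on the GPU node. `CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable, and the `tf.config.experimental` calls are the ones TensorFlow 2.0 provides.

```python
import os


def select_gpu(index):
    """Expose a single GPU to CUDA (hypothetical helper).

    Must be called before TensorFlow is imported, because the
    variables are read when the CUDA runtime is initialised.
    """
    # Number devices in PCI bus order so indices match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


def start_dummy_spark_context(app_name="gpu-job"):
    """Create a SparkContext that does no distributed work.

    It only exists so that the Spark application registers itself
    and does not time out. Requires pyspark (imported lazily).
    """
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    return SparkContext.getOrCreate(conf)


def enable_memory_growth():
    """Let TensorFlow allocate GPU memory on demand instead of all at once.

    Must be called before any GPU has been initialised by TensorFlow.
    Returns the list of GPUs that TensorFlow can see.
    """
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus


# Assumed call order inside the submitted script:
#   select_gpu(0)                      # agree on the index with other users
#   sc = start_dummy_spark_context()
#   enable_memory_growth()
#   ... run the TensorFlow workload ...
#   sc.stop()
```

Restricting the visible devices also keeps TensorFlow from touching the memory of the other GPUs on a multi-GPU server, since it only allocates memory on devices it can see. The GPU index still has to be agreed upon between users by hand.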
Built-in GPU support
Now that the HDP cluster has been upgraded to version 3.1.4, YARN supports GPU resources. This support still needs to be implemented on the cluster.
Once in place, it should make GPU usage more user-friendly. It will probably be preferable to run GPU jobs directly on YARN, without Spark in between.
Another option to look into is providing nvidia-docker so that users can isolate their environments further. This would give more flexibility in the choice of CUDA version.
A sample showing how to submit an application to the GPU nodes can be found at https://github.com/VITObelgium/hadoop-gpu-sample