According to the supercomputer manufacturer, deep learning problems share algorithmic similarities with applications traditionally run on massively parallel supercomputers. By optimising inter-node communication using the Cray XC Aries network and a high-performance MPI library, each training job is said to be able to leverage more compute resources and thus cut the time needed to train a model.
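The communication pattern being optimised here is the one at the heart of synchronous data-parallel training: each worker computes gradients on its own shard of the data, the gradients are averaged across all workers (an MPI allreduce), and every worker applies the same update. The sketch below is illustrative only, not Cray's or CNTK's actual code; it simulates the workers in-process, with a plain Python function standing in for `MPI_Allreduce`, to show which step the fast interconnect and tuned MPI library accelerate.

```python
# Illustrative sketch of synchronous data-parallel SGD (not Cray/CNTK code).
# Model: 1-D linear regression y = w * x, trained on data split across
# four simulated workers.

def local_gradients(shard, weight):
    """Gradient of mean squared error on one worker's data shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def allreduce_average(values):
    """Stand-in for MPI_Allreduce: every worker receives the global average.

    On a real cluster this is the communication-bound step that an
    optimised MPI library and a fast interconnect (such as Aries) speed up.
    """
    avg = sum(values) / len(values)
    return [avg] * len(values)

def train_step(shards, weight, lr=0.01):
    # 1. Each worker computes gradients on its local shard (parallel part).
    grads = [local_gradients(s, weight) for s in shards]
    # 2. Gradients are averaged across workers (allreduce).
    grads = allreduce_average(grads)
    # 3. All workers apply the identical update, keeping replicas in sync.
    return weight - lr * grads[0]

if __name__ == "__main__":
    # Data for y = 3 * x, striped across four hypothetical workers.
    data = [(x, 3 * x) for x in range(1, 9)]
    shards = [data[i::4] for i in range(4)]
    w = 0.0
    for _ in range(200):
        w = train_step(shards, w)
    print(round(w, 2))  # converges toward 3.0
```

Because every step ends with a global gradient exchange, the allreduce sits on the critical path of every iteration; this is why reducing its latency lets a single job scale across more nodes without the communication cost erasing the gains.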
As part of their work together, the companies scaled the Microsoft Cognitive Toolkit to more than 1,000 NVIDIA Tesla P100 GPU accelerators on a Cray XC50 supercomputer, a result that is said to open the door for researchers to run larger, more complex, multi-layered deep learning workloads at scale.
The hope is that this kind of development will help bring more deep learning workloads onto supercomputers. To that end, Cray is providing deep learning toolkits such as the Microsoft Cognitive Toolkit.