A new Artificial Intelligence (AI) industry benchmark has released its first set of results, and they show NVIDIA had a good day at the races.
MLPerf, initiated in May, is a collaboration of engineers and researchers working to build a new industry benchmark. It is supported by a wide consortium of technology companies, including Google, Intel, NVIDIA, AMD and Qualcomm.
It was launched because the pace at which machine learning and AI have been moving in recent years has made it difficult to get an accurate measurement of a company’s capabilities. This is compounded by the fact that ML and AI can be sprawling terms that encompass a range of techniques, making it hard to compare efforts in the field.
According to its mission statement: “The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services.”
To do this, participants have created seven test areas: image classification, object identification, translation, speech to text, recommendation, sentiment analysis and reinforcement learning.
The benchmark has two divisions: Closed and Open.
In the Open division, participants can submit any model for testing. In the Closed division, however, they must use the same model and optimizer. Alongside other build restrictions, the model for image classification is set to ResNet-50 v1.5. These requirements allow like-for-like comparisons between the competing technologies.
The results seem to indicate that NVIDIA has won this battle, with its V100 Tensor Core GPU proving itself to be the fastest AI accelerator. NVIDIA submitted results in all but one of the categories, having decided that the reinforcement learning test, which is based on an implementation of the game Go, had too much of a CPU component.
In a straight chip-to-chip comparison, the NVIDIA V100 beat its closest competitor, Google’s TPUv3, in three categories. In image classification the V100 was 1.1x faster, in translation 1.2x, and in object detection NVIDIA’s chip was shown to be 1.6x faster than the TPUv3.
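A figure such as "1.6x faster" is usually a ratio of two times to train: the slower system's time divided by the faster system's time. The following is a minimal sketch of that arithmetic in Python. The function name `speedup` and the two training times are hypothetical and chosen only for illustration. The article does not give the underlying per-chip times.

```python
def speedup(baseline_minutes: float, candidate_minutes: float) -> float:
    """Return how many times faster the candidate finished than the baseline."""
    if baseline_minutes <= 0 or candidate_minutes <= 0:
        raise ValueError("training times must be positive")
    return baseline_minutes / candidate_minutes


# Hypothetical times to train, in minutes (not taken from the MLPerf results).
times = {"chip_a": 100.0, "chip_b": 160.0}

ratio = speedup(times["chip_b"], times["chip_a"])
print(f"chip_a is {ratio:.1f}x faster than chip_b")
```

A ratio above 1 means the candidate finished sooner than the baseline. A ratio of exactly 1 means the two systems took the same time.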
NVIDIA’s Paresh Kharya commented in a blog post: “A key benchmark on which NVIDIA technology performed particularly well was language translation, training the Transformer neural network in just 6.2 minutes.”
He also listed what makes up the company’s full stack: “NVIDIA’s stack includes NVIDIA Tensor Cores, NVLink, NVSwitch, DGX systems, CUDA, cuDNN, NCCL, optimized deep learning framework containers and NVIDIA software development kits.”
A spokesperson for the company also noted that: “In the case of our at-scale submission, we’re completing these tasks in under seven minutes in all but one of the tests. GPUs delivered up to 5.3x faster results compared to the next fastest submissions.”
Google managed to get within striking distance of NVIDIA when it came to image classification. On its own blog, the company claims that, using the ResNet-50 model, Google’s TPUv3 Pod posted a time of 60 minutes, while NVIDIA managed the task in 13.9.
However, several things have to be factored in. Google says it has displayed its results as normalised to 16 accelerators, but it appears to have used 20: note the four in a circle under the sixteen in the image above. That is four more chips than the 16 GPUs NVIDIA used.
Google also entered only three of the seven categories, leaving NVIDIA to claim the top title for GPU performance in all six tasks it entered. Google also did not compare its system to NVIDIA’s DGX-2H, a server announced last month with higher performance and power.
All that said, Google did get close on image classification, which is still pretty respectable.
Intel, unfortunately, was left behind by both Google and NVIDIA, doing well only when training deep neural networks on CPUs.
Wei Li, Vice President, Core and Visual Computing Group, and General Manager, Machine Learning and Translation, commented in a blog post: “CPU hardware and software performance for deep learning has increased by a few orders of magnitude in the past few years. Training that used to take days or even weeks can now be done in hours or even minutes.”
“This level of performance improvement was achieved through a combination of hardware and software. For example, current-generation Intel Xeon Scalable processors added both the AVX-512 instruction set (longer vector extensions) to allow a large number of operations to be done in parallel, and with a larger number of cores, essentially becoming a mini-supercomputer.”