MLCommons Training v2.1 Results

MLPerf™ is a trademark of MLCommons®. If you use it and refer to MLPerf results, you must follow the results guidelines. MLCommons reserves the right to solely determine if uses of its trademark are appropriate.

Overview

This benchmark suite measures how fast systems can train models to a target quality metric. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Training benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite.

Benchmarks

Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth):

Area     | Benchmark                       | Dataset              | Quality Target                         | Reference Implementation Model
---------|---------------------------------|----------------------|----------------------------------------|-------------------------------
Vision   | Image classification            | ImageNet             | 75.90% classification accuracy         | ResNet-50 v1.5
Vision   | Image segmentation (medical)    | KiTS19               | 0.908 Mean DICE score                  | 3D U-Net
Vision   | Object detection (light weight) | Open Images          | 34.0% mAP                              | RetinaNet
Vision   | Object detection (heavy weight) | COCO                 | 0.377 Box min AP and 0.339 Mask min AP | Mask R-CNN
Language | Speech recognition              | LibriSpeech          | 0.058 Word Error Rate                  | RNN-T
Language | NLP                             | Wikipedia 2020/01/01 | 0.72 Mask-LM accuracy                  | BERT-large
Commerce | Recommendation                  | 1TB Click Logs       | 0.8025 AUC                             | DLRM
Research | Reinforcement learning          | Go                   | 50% win rate vs. checkpoint            | MiniGo (based on the AlphaGo paper)

Metric

Each benchmark measures the wallclock time required to train a model on the specified dataset to achieve the specified quality target.

To account for the substantial run-to-run variance in ML training times, final results are obtained by running each benchmark a benchmark-specific number of times, discarding the lowest and highest results, and averaging the remaining results. Even this multi-run average does not eliminate all variance: imaging benchmark results are repeatable to very roughly +/- 2.5%, and other benchmark results to very roughly +/- 5%.
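As a concrete illustration of this scoring step, here is a minimal Python sketch of the "discard the extremes, then average" calculation; the function name and the choice of minutes as the time unit are illustrative and not part of the official rules or tooling.

    # Minimal sketch (not official MLPerf scoring code).
    def final_result(run_times_minutes):
        """Drop the fastest and slowest runs, then average the rest."""
        if len(run_times_minutes) < 3:
            raise ValueError("Need at least three runs to discard the lowest and highest.")
        trimmed = sorted(run_times_minutes)[1:-1]  # discard lowest and highest results
        return sum(trimmed) / len(trimmed)

    # Example: five runs of one benchmark, times in minutes.
    print(final_result([27.1, 26.4, 28.9, 26.8, 27.3]))  # averages the middle three runs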

For non-HPC training, results that converged in fewer epochs than the reference implementation achieves with the same hyperparameters were normalized to the expected number of epochs.
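The precise normalization procedure is defined in the rules; the hypothetical sketch below only illustrates the general idea of scaling up a run that happened to converge in fewer epochs than expected (the function and parameter names are illustrative assumptions, not MLPerf tooling).

    # Hypothetical sketch of epoch normalization (the authoritative definition is in the rules).
    def normalize_time(measured_minutes, epochs_run, expected_epochs):
        """Scale a run that converged early up to the expected epoch count."""
        if epochs_run >= expected_epochs:
            return measured_minutes  # no adjustment needed
        return measured_minutes * (expected_epochs / epochs_run)

    # Example: a run that converged in 36 epochs where 40 were expected.
    print(normalize_time(100.0, 36, 40))  # about 111.1 minutes after normalization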

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model and optimizer as the reference implementation. The Open division is intended to foster faster models and optimizers and allows any ML approach that can reach the target quality.

Availability

MLPerf divides benchmark results into Categories based on availability.

  • Available systems contain only components that are available for purchase or for rent in the cloud.
  • Preview systems must be submittable as Available in the next submission round.
  • Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.

Submission Information

Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:

  • Submitter: The organization that submitted the results.
  • System: General system description.
  • Processor and count: The type and number of CPUs used, if CPUs perform the majority of ML compute.
  • Accelerator and count: The type and number of accelerators used, if accelerators perform the majority of ML compute.
  • Software: The ML framework and primary ML hardware library used.
  • Benchmark Results: Results for each benchmark as described above.
  • Details: Link to the metadata for the submission.
  • Code: Link to the code for the submission.

Each Open division row may add the following information:

  • Model used: The model used to produce the results, which may or may not match the Closed division requirement.
  • Notes: Arbitrary notes from the submitter.
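
For readers who script against the published results, the hypothetical sketch below models these row fields as Python dataclasses. The class and field names are illustrative assumptions, not a schema published by MLCommons.

    # Hypothetical data model for results-table rows; names are illustrative only.
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class ClosedDivisionRow:
        submitter: str                       # organization that submitted the results
        system: str                          # general system description
        processor: str                       # CPU type, if CPUs perform the majority of ML compute
        processor_count: int
        accelerator: Optional[str]           # accelerator type, if accelerators perform the majority of ML compute
        accelerator_count: Optional[int]
        software: str                        # ML framework and primary ML hardware library
        benchmark_results: Dict[str, float]  # benchmark name -> training time
        details_url: str                     # link to the metadata for the submission
        code_url: str                        # link to the code for the submission

    @dataclass
    class OpenDivisionRow(ClosedDivisionRow):
        model_used: Optional[str] = None     # may differ from the Closed division requirement
        notes: Optional[str] = None          # arbitrary notes from the submitter

The Open division row extends the Closed division row because it reports the same base fields plus the two optional additions.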

Rules

The rules are here.

Reference implementations

The reference implementations for the benchmarks are here.