MLCommons

MLPerf Inference: Datacenter

v2.1 Results

MLPerf™ is a trademark of MLCommons®. If you use it and refer to MLPerf results, you must follow the results guidelines. MLCommons reserves the right to solely determine if uses of its trademark are appropriate.

Overview

This benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Inference benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite.

Scenarios and Metrics

In order to enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. A given scenario is evaluated by a standard load generator generating inference requests in a particular pattern and measuring a specific metric.

| Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
|---|---|---|---|---|---|---|
| Single stream | LoadGen sends next query as soon as SUT completes the previous query | 1024 queries and 60 seconds | 1 | None | 90% | 90%-ile measured latency |
| Multiple stream (1.1 and earlier) | LoadGen sends a new query every latency constraint if the SUT has completed the prior query, otherwise the new query is dropped and is counted as one overtime query | 270,336 queries and 60 seconds | Variable, see metric | Benchmark specific | 99% | Maximum number of inferences per query supported |
| Multiple stream (2.0 and later) | LoadGen sends next query as soon as SUT completes the previous query | 270,336 queries and 600 seconds | 8 | None | 99% | 99%-ile measured latency |
| Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark specific | 99% | Maximum Poisson throughput parameter supported |
| Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
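The Server and Single stream rows above can be illustrated with a minimal sketch: one helper draws query issue times from a Poisson process, and another computes a tail-latency percentile and checks it against a bound. The function names, the nearest-rank percentile rule, and the fixed seed are assumptions made for this illustration; they are not taken from LoadGen itself.

```python
import random


def percentile_latency(latencies, pct):
    """Return the pct-th percentile of latencies using the nearest-rank rule.

    pct is an integer percentage (e.g. 90 or 99). Nearest-rank is an
    assumption of this sketch, not necessarily what LoadGen implements.
    """
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def poisson_arrival_times(rate_qps, num_queries, seed=0):
    """Generate query issue times (seconds) for a Poisson arrival process.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1 / rate_qps.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_queries):
        now += rng.expovariate(rate_qps)
        times.append(now)
    return times


def meets_latency_constraint(latencies, pct, bound):
    """True if the pct-th percentile latency does not exceed the bound."""
    return percentile_latency(latencies, pct) <= bound
```

For example, `meets_latency_constraint(latencies, 99, 0.015)` would model the 99% tail-latency check against a 15 ms bound, with latencies given in seconds.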

Benchmarks

Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth):

| Area | Task | Model | Dataset | QSL Size | Quality | Server latency constraint |
|---|---|---|---|---|---|---|
| Vision | Image classification | Resnet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms |
| Vision | Object detection | Retinanet | OpenImages (800x800) | 64 | 99% of FP32 (0.20 mAP) | 100 ms |
| Vision | Medical image segmentation | 3D UNET | KITS 2019 (602x512x512) | 16 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A |
| Speech | Speech-to-text | RNNT | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |
| Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |
| Commerce | Recommendation | DLRM | 1TB Click Logs | 204800 | 99% of FP32 and 99.9% of FP32 (AUC=80.25%) | 30 ms |
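As a worked illustration of the Quality column, the sketch below turns an FP32 reference score into the 99% or 99.9% threshold a submission would need to reach. The dictionary keys and function names are hypothetical, only some of the benchmarks are included, and the rules remain the authoritative definition of each target.

```python
# FP32 reference scores copied from the benchmark table.
# The keys are hypothetical short names chosen for this sketch.
FP32_REFERENCE = {
    "resnet50": 76.46,                   # top-1 accuracy, %
    "3d-unet": 0.86330,                  # mean DICE score
    "rnnt": 100.0 - 7.452253714852645,   # 1 - WER, expressed in %
    "bert-large": 90.874,                # F1 score, %
    "dlrm": 80.25,                       # AUC, %
}


def quality_target(benchmark, fraction=0.99):
    """Return the minimum score: fraction of the FP32 reference."""
    return FP32_REFERENCE[benchmark] * fraction


def meets_quality(benchmark, measured, fraction=0.99):
    """True if the measured score reaches the quality target."""
    return measured >= quality_target(benchmark, fraction)
```

Under these assumptions, `quality_target("resnet50")` is 0.99 × 76.46 ≈ 75.6954, and `quality_target("bert-large", 0.999)` is 0.999 × 90.874 ≈ 90.7831.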

Each Datacenter benchmark requires the following scenarios:

| Area | Task | Required Scenarios |
|---|---|---|
| Vision | Image classification | Server, Offline |
| Vision | Object detection | Server, Offline |
| Vision | Medical image segmentation | Offline |
| Speech | Speech-to-text | Server, Offline |
| Language | Language processing | Server, Offline |
| Commerce | Recommendation | Server, Offline |
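The required-scenario table can be expressed as a small lookup, which makes it easy to see which scenarios a set of results still lacks for a given task. This is a sketch only: the dictionary and the `missing_scenarios` helper are hypothetical names, not part of any MLCommons tooling.

```python
# Required Datacenter scenarios per task, as listed in the table.
REQUIRED_SCENARIOS = {
    "Image classification": {"Server", "Offline"},
    "Object detection": {"Server", "Offline"},
    "Medical image segmentation": {"Offline"},
    "Speech-to-text": {"Server", "Offline"},
    "Language processing": {"Server", "Offline"},
    "Recommendation": {"Server", "Offline"},
}


def missing_scenarios(task, submitted):
    """Return the required scenarios for task that are not in submitted."""
    return REQUIRED_SCENARIOS[task] - set(submitted)
```

For instance, `missing_scenarios("Image classification", ["Offline"])` returns `{"Server"}`.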

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation. The Open division is intended to foster innovation and allows using a different model or retraining.

Availability

MLPerf divides benchmark results into Categories based on availability.

  • Available systems contain only components that are available for purchase or for rent in the cloud.
  • Preview systems must be submittable as Available in the next submission round.
  • Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.

Submission Information

Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:

  • Submitter: The organization that submitted the results.
  • System: General system description.
  • Processor and count: The type and number of CPUs used, if CPUs perform the majority of ML compute.
  • Accelerator and count: The type and number of accelerators used, if accelerators perform the majority of ML compute.
  • Software: The ML framework and primary ML hardware library used.
  • Benchmark Results: Results for each benchmark as described above.
  • Details: Link to the metadata for the submission.
  • Code: Link to the code for the submission.

Each Open division row may add the following information:

  • Model used: The model used to produce the results, which may or may not match the Closed Division requirement.
  • Notes: Arbitrary notes from the submitter.

For results with power measurement, each row will add columns for each benchmark containing the following:

  • System power (for Server and Offline scenarios), or
  • Energy per stream (for Single stream and Multiple stream scenarios)

These metrics are computed using the measured average AC power (energy) consumed by the entire system for the duration of the performance measurements of a benchmark (e.g., a single network under a single scenario); the AC power is measured at the wall.

The measured power is only valid for the accompanying benchmark. MLPerf Power is only capable of measuring and validating the full system power. Any other references to power in any description (e.g., a TDP configuration, a power supply rating) are not measured or validated by MLCommons.
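The relationship between average power and energy described above can be sketched as follows, assuming the power analyzer yields timestamped wall-power samples for the measurement window. The sample format, the trapezoidal integration, and the function names are assumptions of this illustration and do not describe how the MLPerf Power tooling is implemented.

```python
def energy_joules(samples):
    """Integrate (time_seconds, power_watts) samples with the trapezoid rule.

    samples must be ordered by strictly increasing time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def average_power_watts(samples):
    """Average power over the window: total energy divided by duration."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration
```

With samples `[(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]`, the energy is 100 J + 150 J = 250 J over 2 seconds, so the average power is 125 W.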

Rules

The rules are here.

Reference implementations

The reference implementations for the benchmarks are here.