- 1. Overview
- 2. General rules
- 3. Benchmarks
- 4. Divisions
- 5. Basics
- 6. Data Set
- 7. RL Environment
- 8. Model
- 9. Training Loop
- 10. Software Adoption
- 11. Run Results
- 12. Benchmark Results
- 13. Reference Convergence Points (RCPs)
- 14. Appendix: Benchmark Specific Rules
- 15. Appendix: Examples of Compliant Optimizers
- 16. Appendix: v1.0 Specific Rules
- 17. Appendix: v1.1 Specific Rules
- 18. Appendix: v3.0 Specific Rules
- 19. Appendix: RCP Examples
- 20. Appendix: Utilization
This document describes how to implement the MLPerf™ Training Suite using an ML framework and how to use that implementation to measure the performance of an ML software framework or hardware.
There are separate rules for the submission, review, and publication process for all MLPerf benchmarks here.
The MLPerf name and logo are trademarks of the MLCommons® Association ("MLCommons"). In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. MLCommons reserves the right to solely determine if a use of its name or logos is acceptable.
The following definitions are used throughout this document:
Performance always refers to execution speed.
Quality always refers to a model’s ability to produce “correct” outputs.
A system consists of a defined set of hardware resources such as processors, memories, disks, and interconnect. It also includes specific versions of all software such as operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark, excluding the ML framework.
A framework is a specific version of a software library or set of related libraries, possibly with associated offline compiler, for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow.
A benchmark is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level.
A suite is a specific set of benchmarks.
A division is a set of rules for implementing benchmarks from a suite to produce a class of comparable results.
A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization.
A benchmark implementation is an implementation of a benchmark in a particular framework by a user under the rules of a specific division.
A submission implementation set is a set of benchmark implementations for one or more benchmarks from a suite under the rules of a specific division using the same framework.
A run is a complete execution of an implementation on a system, training a model from initialization to the quality target.
A run result is the wallclock time required for a run.
A reference result is the result provided by the MLPerf organization for each reference implementation.
A benchmark result is the mean of a benchmark-specific number of run results, dropping the highest and lowest results. The result is then normalized to the reference result for that benchmark. Normalization is of the form (reference result / benchmark result) such that a better benchmark result produces a higher number.
A submission result set is one benchmark result for each benchmark implementation in a submission implementation set.
A submission is a submission implementation set and a corresponding submission result set.
A custom summary result is the weighted geometric mean of an arbitrary set of results from a specific submission. MLPerf endorses this methodology for computing custom summary results but does not endorse any official summary result.
Latest version available is the last MLPerf submission suite that a benchmark was part of.
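The benchmark result and custom summary result definitions above can be illustrated with a minimal sketch. The function names `benchmark_result` and `custom_summary` are hypothetical and are not part of any MLPerf tooling. The sketch assumes at least three run results are available and that weights are positive.

```python
from math import exp, log
from statistics import mean


def benchmark_result(run_times, reference_time):
    """Mean of the run results after dropping the single highest and lowest,
    normalized as reference result / benchmark result (higher is better)."""
    if len(run_times) < 3:
        raise ValueError("need at least three run results to drop the extremes")
    trimmed = sorted(run_times)[1:-1]  # drop the lowest and the highest result
    return reference_time / mean(trimmed)


def custom_summary(results, weights):
    """Weighted geometric mean of an arbitrary set of benchmark results."""
    if not results or len(results) != len(weights):
        raise ValueError("results and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    # exp of the weighted average of logs equals the weighted geometric mean
    return exp(sum(w * log(r) for r, w in zip(results, weights)) / total_weight)
```

For instance, `benchmark_result([10.0, 12.0, 11.0, 50.0, 9.0], 22.0)` drops 9.0 and 50.0, averages the remaining runs to 11.0, and returns 2.0. The actual number of runs required is benchmark-specific.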
The following rules apply to all benchmark implementations.
Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.
The same system and framework must be used for a submission result set. Note that the reference implementations do not all use the same framework.
The framework and system should not detect and behave differently for benchmarks.
Unless part of the definition of a benchmark, the implementation should not encode any information about the content of the dataset or a successful model’s state in any form. High-level statistical information about the dataset, such as distribution of sizes, may be used.
For gpt3, manipulation of metadata which consists of the number of documents in the dataset and the size of each document is allowed as long as the data tokens are not accessed.
For benchmarks which are defined as starting from a fixed set of weights, such as a checkpoint or backbone, the implementation should start from the weights provided in the benchmark reference definition, or if that is not possible, provide information and code sufficient for reproducing how those starting weights were obtained. For v0.7, sets of weights used in v0.6 are allowed.
The benchmark suite consists of the benchmarks shown in the following table.
| Area | Problem | Dataset | Latest version available |
|---|---|---|---|
| Vision | Object detection (light weight) | A subset of OpenImages | v4.1 |
| Vision | Text to Image | LAION-400M-filtered | v4.1 |
| Language | NLP | Wikipedia 2020/01/01 | v4.1 |
| Language | Large language model | c4/en/3.0.1 | v4.1 |
| Language | Large language model | SCROLLS GovReport | v4.1 |
| Commerce | Recommendation | Criteo 3.5TB Click Logs (multi-hot variant) | v4.1 |
| Graphs | Node classification | IGBH-Full | v4.1 |
| Vision | Image classification | ImageNet | v4.0 |
| Vision | Image segmentation (medical) | KiTS19 | v4.0 |
| Vision | Object detection (heavy weight) | COCO | v3.1 |
| Language | Speech recognition | LibriSpeech | v3.1 |
| Commerce | Recommendation | Criteo 1TB Click Logs (multi-hot variant) | v2.1 |
Developing high-quality benchmarks requires significant effort, computational resources, and commitment. Therefore, each benchmark is expected to remain part of the benchmark suite for a minimum of two years or four submission rounds, whichever comes first.
A benchmark may be considered for early retirement due to reasons such as, but not limited to, low industry adoption. Early retirement requests will be reviewed by the Training working group, followed by a formal vote to determine the benchmark’s status.
MLCommons provides a reference implementation of each benchmark, which includes the following elements:
Code that implements the model in a framework.
A plain text “README.md” file that describes:
- Problem
  - Dataset/Environment
  - Publication/Attribution
  - Data preprocessing
  - Training and test data separation
  - Training data order
  - Test data order
  - Simulation environment (RL models only)
  - Steps necessary for reproducing the initial set of weights, if an initial set of non-standard weights is used. For v0.7, weights from v0.6 may be used without this information.
  - Publication/Attribution
  - List of layers
  - Weight and bias initialization
  - Loss function
  - Optimizer
- Quality
  - Quality metric
  - Quality target
  - Evaluation frequency (training items between quality evaluations)
  - Evaluation thoroughness (test items per quality evaluation)
- Directions
  - Steps to configure machine
  - Steps to download and verify data
  - Steps to run and time
A “download_dataset” script that downloads the dataset.
A “verify_dataset” script that verifies the dataset against the checksum.
A “run_and_time” script that executes the benchmark and reports the wall-clock time.
There are two divisions of the benchmark suite, the Closed division and the Open division.
The Closed division requires using the same preprocessing, model, training method, and quality target as the reference implementation.
The closed division models and quality targets are:
| Area | Problem | Model | Target | Latest version available |
|---|---|---|---|---|
| Vision | Object detection (light weight) | SSD (RetinaNet) | 34.0% mAP | v4.1 |
| Vision | Text to image | Stable Diffusion v2.0 | FID<=90 and CLIP>=0.15 | v4.1 |
| Language | NLP | BERT | 0.720 Mask-LM accuracy | v4.1 |
| Language | Large Language Model | GPT3 | 2.69 log perplexity | v4.1 |
| Language | Large Language Model | Llama2-70B-LoRA | 0.925 Eval loss | v4.1 |
| Commerce | Recommendation | DLRMv2 (DCNv2) | 0.80275 AUC | v4.1 |
| Graphs | Node classification | R-GAT | 72.0 % classification | v4.1 |
| Vision | Image classification | ResNet-50 v1.5 | 75.90% classification | v4.0 |
| Vision | Image segmentation (medical) | U-Net3D | 0.908 Mean DICE score | v4.0 |
| Vision | Object detection (heavy weight) | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP | v3.1 |
| Language | Speech recognition | RNN-T | 0.058 Word Error Rate | v3.1 |
Closed division benchmarks must be referred to using the benchmark name plus the term Closed, e.g. “for the Recommendation Closed benchmark, the system achieved a result of 7.2.”
The Open division allows using arbitrary training data, preprocessing, model, and/or training method. However, the Open division still requires using supervised or reinforcement machine learning in which a model is iteratively improved based on training data, simulation, or self-play.
Open division benchmarks must be referred to using the benchmark name plus the term Open, e.g. “for the Recommendation Open benchmark, the system achieved a result of 7.2.”
CLOSED: Random numbers must be generated using stock random number generators.
Random number generators must be seeded from the following sources:
- Clock
- System source of randomness, e.g. /dev/random or /dev/urandom
- Another random number generator initialized with an allowed seed
Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads.
From v4.1 onwards, the seeds should be logged and they need to satisfy the following requirements:
- The only way to log seeds is through `mllog`. Any seed logged via any other method is discarded.
- All seeds must be valid integers (convertible via `int()`).
- We expect all runs to log at least one seed.
- If one run logs one seed on a certain line in a certain source file, no other run can log the same seed on the same line in the same source file. What files are considered as source files are defined here.

Failing to satisfy any of the above requirements will result in seed checker failures reported by the package checker.
If any run logs more than one seed, a warning is raised by the package checker. This is a reminder to submitters to rethink their design because using multiple seeds per run should not be necessary.
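The seed requirements above can be sketched as a small validation routine. This is an illustration only and not the actual package checker. The function name `check_seeds` and the input layout (a mapping from run name to a list of `(seed, source_file, line_number)` entries already extracted from the logs) are assumptions made for this example.

```python
def check_seeds(runs):
    """Validate logged seeds.

    runs: dict mapping a run name to a list of (seed, source_file, line_number).
    Returns (errors, warnings) as two lists of strings.
    """
    errors, warnings = [], []
    seen = {}  # (seed, source_file, line_number) -> first run that logged it
    for run, entries in runs.items():
        if not entries:
            errors.append(f"{run}: no seed logged")
            continue
        if len(entries) > 1:
            # More than one seed per run is allowed but triggers a warning.
            warnings.append(f"{run}: {len(entries)} seeds logged")
        for seed, source_file, line_number in entries:
            try:
                key = (int(seed), source_file, line_number)
            except (TypeError, ValueError):
                errors.append(f"{run}: seed {seed!r} is not a valid integer")
                continue
            first_run = seen.setdefault(key, run)
            if first_run != run:
                errors.append(
                    f"{run}: seed {key[0]} at {source_file}:{line_number} "
                    f"was already logged by {first_run}"
                )
    return errors, warnings
```

Errors correspond to the conditions that the text says cause seed checker failures; the warning corresponds to the multiple-seeds reminder.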
OPEN: Any random number generation may be used. The seed is not expected to be logged.
CLOSED: The numerical formats fp64, fp32, tf32, fp16, fp8, bfloat16, Graphcore FLOAT 16.16, int8, uint8, int4, and uint4 are pre-approved for use. Additional formats require explicit approval. Scaling may be added where required to compensate for different precision.
Reference Convergence Points must be obtained using FP32 precision, or FP32 emulation with explanation of the methodology for emulation.
OPEN: Any format and scaling may be used.
CLOSED: Each reference implementation includes a script to download the input dataset and script to verify the dataset using a checksum. The data must then be preprocessed in a manner consistent with the reference implementation, excepting any transformations that must be done for each run (e.g. random transformations). The data may also be reformatted for the target system provided that the reformatting does not introduce new information or introduce duplicate copies of data.
OPEN: Any public dataset may be used for training the model, however the evaluation data must be drawn from the benchmark dataset in a manner consistent with the reference.
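The checksum verification step mentioned for the Closed division can be sketched as follows. The hash algorithm and the helper names `compute_digest` and `verify_dataset` are assumptions for illustration; each reference implementation ships its own verification script, which should be used in practice.

```python
import hashlib


def compute_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large datasets need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_dataset(path, expected_hex, algorithm="sha256"):
    """Return True when the file's digest matches the published checksum."""
    return compute_digest(path, algorithm) == expected_hex.strip().lower()
```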
You must flush the cache or restart the system prior to benchmarking. Data can start on any durable storage system such as local disks and cloud storage systems. This explicitly excludes RAM.
Only preprocessing that must be done for each run (e.g. random transformations) must be timed.
CLOSED: The same preprocessing steps as the reference implementation must be used.
OPEN: Any preprocessing steps are allowed for training data. However, each datum must be preprocessed individually in a manner that is not influenced by any other data. The evaluation data must be preprocessed in a manner consistent with reference.
CLOSED: Images must have the same size as in the reference implementation. Mathematically equivalent padding of images is allowed.
CLOSED: For benchmarks with sequence inputs, you may choose a length N and either truncate all examples to length N or throw out all examples which exceed length N. This must be done uniformly for all examples. This may only be done on the training set and not the evaluation set.
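The two permitted treatments of sequence inputs can be sketched as below, assuming each training example is a Python sequence of tokens. The function names are hypothetical. Whichever option is chosen is applied uniformly to every training example and never to the evaluation set.

```python
def truncate_to_length(examples, n):
    """Option 1: truncate every training example to at most n elements."""
    return [example[:n] for example in examples]


def drop_longer_than(examples, n):
    """Option 2: discard every training example whose length exceeds n."""
    return [example for example in examples if len(example) <= n]
```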
CLOSED: Two ways to represent the Mask R-CNN mask are permitted. One is a polygon and the other is a scalable bitmask.
OPEN: The closed division data representations restrictions only apply at the start of the run. Data may be represented in an arbitrary fashion during the run.
Input encoding data, such as language vocabulary, or the set of possible labels may be used during pre-processing or execution without counting as "touching the training data" for timing purposes. The same applies to processing metadata like the number of documents, or document sizes in a dataset.
CLOSED: If applicable, the dataset must be separated into training and test sets in the same manner as the reference implementation.
OPEN: If applicable, the test dataset must be extracted in the same manner as the reference implementation. The training data set may not contain data that appears in the test set.
CLOSED: The training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.
Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once. Modifications to data order and/or batching must be presented to the SWG group in advance of the submission deadline for approval if they could affect the ability to borrow hyperparameters and/or approximately follow the learning rate schedule defined by the RCPs.
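As a minimal sketch of conditions (1) and (2), the following shuffles all item indices uniformly and then splits them into shards so that each datum appears exactly once across shards. The function name `shuffled_shards` is hypothetical, and the seed is passed in by the caller; in a real run it would come from one of the allowed seed sources.

```python
import random


def shuffled_shards(num_items, num_shards, seed):
    """Return num_shards index lists that together cover every item exactly once."""
    rng = random.Random(seed)
    order = list(range(num_items))
    rng.shuffle(order)  # uniform random order, not tuned to improve convergence
    # Strided slicing assigns each shuffled index to exactly one shard.
    return [order[i::num_shards] for i in range(num_shards)]
```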
In the case of the DLRMv2 benchmark, the training dataset is shuffled during preprocessing (with a fixed seed) on a per-sample basis. The resulting order of samples should then be used during training, and any other extra dataset shuffling is prohibited.
OPEN: The training data may be traversed in any order. The test data must be traversed in the same order as the reference implementation.
CLOSED: The implementation must use the same RL algorithm and simulator or game as the reference implementation, with the same parameters.
OPEN: The implementation may use a different RL algorithm but must use the same simulator or game with the same parameters. If the reference implementation generates all data online, the Open division implementation must also generate all data online.
It is allowed and encouraged to parallelize and otherwise optimize (e.g. by implementing in a compiled language) the RL environment provided that the semantics are preserved.
CLOSED: The benchmark implementation must use the same model as the reference implementation, as defined by the remainder of this section.
OPEN: The benchmark implementation may use a different model.
CLOSED: Each of the current frameworks has a graph that describes the operations performed during the forward propagation of training. The frameworks automatically infer and execute the corresponding back-propagation computations from this graph. Benchmark implementations must use the same graph as the reference implementation.
CLOSED: Weights and biases must be initialized using the same constant or random value distribution as the reference implementation, unless a pre-trained set of weights, such as a checkpoint or backbone, is used by the reference.
OPEN: Weights and biases must be initialized using a consistent constant or random value distribution.
CLOSED: Frameworks are free to optimize the non-weight parts of the computation graph provided that the changes are mathematically equivalent. So optimizations and graph / code transformations of the flavor of dead code elimination, common subexpression elimination, loop-invariant code motion, and recomputation of node state are entirely allowed.
OPEN: Frameworks are free to alter the graph.
CLOSED:
By default, the hyperparameters must be the same as the reference.
Hyperparameters include the optimizer used and values like the regularization norms and weight decays.
The implementation of the optimizer must match the optimizer specified in the Appendix: Allowed Optimizer. The Appendix lists which optimizers in the popular deep learning frameworks are compliant by default. If a submission uses an alternate implementation, the submitter must describe the optimizer’s equation and demonstrate equivalence with the approved optimizers on that list.
The following table lists the tunable hyperparameters for each allowed model, optimizer combination. The value of each tunable hyperparameter must meet the listed constraint.
The MLPerf verifier script checks all hyperparameters except those with names marked with asterisks. If a hyperparameter is marked with one asterisk, it must be checked manually. If a hyperparameter is marked with two asterisks, it is also not logged and it must be checked manually in the code. If the verifier and the constraints in this table differ, the verifier (specifically, the version on the date of submission unless otherwise decided by the review committee) is the source of truth.
Model | Optimizer | Name | Constraint | Definition | Reference Code | Latest version available |
---|---|---|---|---|---|---|
bert |
lamb |
global_batch_size |
unconstrained |
The global batch size for training. |
--train_batch_size |
v4.1 |
bert |
lamb |
opt_base_learning_rate |
unconstrained |
The base learning rate. |
--learning_rate |
v4.1 |
bert |
lamb |
opt_epsilon |
unconstrained |
adam epsilon |
v4.1 |
|
bert |
lamb |
opt_learning_rate_training_steps |
unconstrained |
Step at which your reach the lowest learning late |
v4.1 |
|
bert |
lamb |
opt_learning_rate_warmup_steps |
unconstrained |
"num_warmup_steps" |
v4.1 |
|
bert |
lamb |
num_warmup_steps |
unconstrained |
Number of steps for linear warmup. |
--num_warmup_steps |
v4.1 |
bert |
lamb |
start_warmup_step |
unconstrained |
--start_warmup_step |
--start_warmup_step |
v4.1 |
bert |
lamb |
opt_lamb_beta_1 |
unconstrained |
adam beta1 |
v4.1 |
|
bert |
lamb |
opt_lamb_beta_2 |
unconstrained |
adam beta2 |
v4.1 |
|
bert |
lamb |
opt_lamb_weight_decay_rate |
unconstrained |
Weight decay |
v4.1 |
|
dlrmv2 |
adagrad |
global_batch_size |
unconstrained |
global batch size |
v4.1 |
|
dlrmv2 |
adagrad |
opt_base_learning_rate |
unconstrained |
learning rate (for both dense layers and embeddings) |
v4.1 |
|
dlrmv2 |
adagrad |
opt_adagrad_learning_rate_decay |
0.0 |
learning rate decay |
v4.1 |
|
dlrmv2 |
adagrad |
opt_weight_decay |
0.0 |
weight decay |
v4.1 |
|
dlrmv2 |
adagrad |
opt_adagrad_initial_accumulator_value |
0.0 |
adagrad initial accumulator value |
v4.1 |
|
dlrmv2 |
adagrad |
opt_adagrad_epsilon |
1e-8 |
adagrad epsilon |
v4.1 |
|
dlrmv2 |
adagrad |
opt_learning_rate_warmup_steps |
0 (disabled) |
number to steps from 0 to sgd_opt_base_learning_rate with a linear warmup |
v4.1 |
|
dlrmv2 |
adagrad |
opt_learning_rate_decay_start_step |
0 (disabled) |
step at which poly decay is started |
v4.1 |
|
dlrmv2 |
adagrad |
opt_learning_rate_decay_steps |
0 (disabled) |
the step at which the end learning rate is reached |
v4.1 |
|
gpt3 |
adam |
global_batch_size |
unconstrained |
batch size in sequences |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_adam_beta_1 |
0.9 |
adam beta1 |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_adam_beta_2 |
0.95 |
adam beta2 |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_adam_epsilon |
1e-8 |
adam epsilon |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_gradient_clip_norm |
1.0 |
Gradients are clipped above this norm threshold. |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
dropout |
0.0 |
Disable all dropouts during training. |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
sequence_length |
2048 |
sequence length |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_weight_decay |
0.1 |
weight decay |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
gradient_accumulation_steps |
unconstrained |
Numer of fwd/bwd steps between optimizer step. |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_learning_rate_warmup_steps |
ceil(265 * 1536 / global_batch_size) |
steps taken for linear warmup during initial checkpoint generation. This only affects the learning rate curve in the benchmarking region. |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_learning_rate_decay_steps |
ceil(108600 * 1536 / global_batch_size) |
Step when the end of cosine learning rate curve is reached. Learning rate cosine decay is in range (opt_learning_rate_warmup_steps + 1,opt_learning_rate_decay_steps]. |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_init_checkpoint_step |
ceil(4000 * 1536 / batch_size) |
first step after loading initial checkpoint |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_base_learning_rate |
constrained based on global_batch_size |
refer to next table in section "GPT3 learning rates" |
See PR (From NV and Google, TODO Link) |
v4.1 |
gpt3 |
adam |
opt_end_learning_rate |
10% of opt_base_learning_rate |
learning rate at the last step of decay period |
See PR (From NV and Google, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
global_batch_size |
unconstrained |
batch size in sequences |
See PR (From NV and Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
opt_gradient_clip_norm |
fixed to referance (0.3) |
Gradients are clipped above this norm threshold. |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
lora_dropout |
0.1 |
fixed to reference (0.1). |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
sequence_length |
8196 |
the sequence length - fixed to reference |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
lora_alpha |
fixed to referance (32) |
scaling factor for the LoRA weight matrices |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
opt_weight_decay |
fixed to referance (0.0001) |
weight decay |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
gradient_accumulation_steps |
unconstrained |
Numer of fwd/bwd steps between optimizer step. |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
opt_learning_rate_warmup_ratio |
unconstrained |
ratio of steps out of training for linear warmup during initial checkpoint generation. This only affects the learning rate curve in the benchmarking region. |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
opt_learning_rate_training_steps |
unconstrained |
Step when the end of cosine learning rate curve is reached. Learning rate cosine decay is in range (opt_learning_rate_warmup_steps + 1,opt_learning_rate_decay_steps]. |
See PR (From Habana, TODO Link) |
v4.1 |
llama2_70b_lora |
adamw |
opt_base_learning_rate |
unconstrained |
base leraning rate |
See PR (From Habana, TODO Link) |
v4.1 |
stable diffusion |
adamw |
global_batch_size |
unconstrained |
The global batch size for training |
v4.1 |
|
stable diffusion |
adamw |
opt_adamw_beta_1 |
0.9 |
coefficients used for computing running averages of gradient and its square |
v4.1 |
|
stable diffusion |
adamw |
opt_adamw_beta_2 |
0.999 |
coefficients used for computing running averages of gradient and its square |
v4.1 |
|
stable diffusion |
adamw |
opt_adamw_epsilon |
1e-08 |
term added to the denominator to improve numerical stability |
v4.1 |
|
stable diffusion |
adamw |
opt_adamw_weight_decay |
0.01 |
weight decay coefficient |
v4.1 |
|
stable diffusion |
adamw |
opt_base_learning_rate |
unconstrained |
base learning rate, this should be the learning rate after warm up |
v4.1 |
|
stable diffusion |
adamw |
opt_learning_rate_warmup_steps |
unconstrained |
number of steps for learning rate to warm up |
v4.1 |
|
ssd |
adam |
global_batch_size |
arbitrary constant |
reference --batch-size |
v4.1 |
|
ssd |
adam |
opt_learning_rate_warmup_epochs |
integer >= 0 |
number of epochs for learning rate to warm up |
v4.1 |
|
ssd |
adam |
opt_learning_rate_warmup_factor |
unconstrained |
the constant factor applied at learning rate warm up |
v4.1 |
|
ssd |
adam |
opt_base_learning_rate |
unconstrained |
base learning rate, this should be the learning rate after warm up and before decay |
v4.1 |
|
ssd |
adam |
opt_weight_decay |
0 |
L2 weight decay |
v4.1 |
|
gnn |
adam |
global_batch_size |
arbitrary constant |
global batch size |
v4.1 |
|
gnn |
adam |
opt_base_learning_rate |
unconstrained |
base learning rate |
v4.1 |
|
resnet |
lars |
lars_opt_base_learning_rate |
arbitrary constant |
Base "plr" in the PR linked. |
v4.0 |
|
resnet |
lars |
lars_opt_end_learning_rate* |
fixed to reference |
end learning rate for polynomial decay, implied mathemetically from other HPs |
N/A |
v4.0 |
resnet |
lars |
lars_opt_learning_rate_decay_poly_power* |
fixed to reference |
power of polynomial decay, no link needed since not tunable |
N/A |
v4.0 |
resnet |
lars |
lars_epsilon* |
Fixed to reference |
epsilon in reference |
v4.0 |
|
resnet |
lars |
lars_opt_learning_rate_warmup_epochs |
arbitrary constant |
w_epochs in PR |
v4.0 |
|
resnet |
lars |
lars_opt_momentum |
0.9 for batch<32k, otherwise arbitrary constant |
momentum in reference |
v4.0 |
|
resnet |
lars |
lars_opt_weight_decay |
(0.0001 * 2 ^ N) where N is any integer |
weight_decay in reference |
v4.0 |
|
resnet |
lars |
lars_opt_learning_rate_decay_steps |
unconstrained |
num_epochs in reference |
v4.0 |
|
resnet |
lars |
global_batch_size |
unconstrained |
global batch size in reference |
v4.0 |
|
resnet |
lars |
label smoothing** |
0 or 0.1 |
TODO |
TODO |
v4.0 |
resnet |
lars |
truncated norm initialization** |
boolean |
TODO |
TODO |
v4.0 |
resnet |
sgd |
global_batch_size |
arbitrary constant |
reference --batch_size |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_base_learning_rate |
0.001 * k where is an integer |
the learning rate |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_end_learning_rate |
10^-4 |
end learning rate for polynomial decay, implied mathemetically from other HPs |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_learning_rate_decay_poly_power |
2 |
power of polynomial decay, no link needed since not tunable |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_learning_rate_decay_steps |
integer >= 0 |
num_epochs in reference |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_weight_decay |
(0.0001 * 2 ^ N) where N is any integer |
Weight decay, same as LARS. |
See LARS |
v4.0 |
resnet |
sgd |
sgd_opt_momentum |
0.9 |
Momentum for SGD. |
See LARS |
v4.0 |
resnet |
sgd |
model_bn_span |
arbitrary constant |
number of samples whose statistics a given BN layer uses to normalize a training minibatch (may be just the portion of global_batch_size per device, but also may be aggregated over several devices) |
See LARS |
v4.0 |
resnet |
sgd |
opt_learning_rate_warmup_epochs |
integer >= 0 |
number of epochs needed for learning rate warmup |
See LARS |
v4.0 |
resnet |
sgd |
label smoothing** |
0 or 0.1 |
TODO |
TODO |
v4.0 |
resnet |
sgd |
truncated norm initialization** |
boolean |
TODO |
TODO |
v4.0 |
resnet |
lars/sgd |
opt_name |
"lars" or "sgd" |
The optimizer that was used. |
v4.0 |
|
unet3d |
sgd |
global_batch_size |
unconstrained |
global batch size |
reference --batch_size |
v4.0 |
unet3d |
sgd |
opt_base_learning_rate |
unconstrained |
base learning rate |
reference --learning_rate |
v4.0 |
unet3d |
sgd |
opt_momentum |
unconstrained |
SGD momentum |
reference --momentum |
v4.0 |
unet3d |
sgd |
opt_learning_rate_warmup_steps |
unconstrained |
number of epochs needed for learning rate warmup |
reference --lr_warmup_epochs |
v4.0 |
unet3d |
sgd |
opt_initial_learning_rate |
unconstrained |
initial learning rate (for LR warm up) |
reference --init_learning_rate |
v4.0 |
unet3d |
sgd |
opt_learning_rate_decay_steps |
unconstrained |
epochs at which the learning rate decays |
reference --lr_decay_epochs |
v4.0 |
unet3d |
sgd |
opt_learning_rate_decay_factor |
unconstrained |
factor used for learning rate decay |
reference --lr_decay_factor |
v4.0 |
unet3d |
sgd |
opt_weight_decay |
unconstrained |
L2 weight decay |
reference --weight_decay |
v4.0 |
unet3d |
sgd |
training_oversampling |
fixed to reference |
training oversampling |
reference --oversampling |
v4.0 |
unet3d |
sgd |
training_input_shape |
fixed to reference |
training input shape |
reference --input_shape |
v4.0 |
unet3d |
sgd |
evaluation_overlap |
fixed to reference |
evaluation sliding window overlap |
reference --overlap |
v4.0 |
unet3d |
sgd |
evaluation_input_shape |
fixed to reference |
evaluation input shape |
reference --val_input_shape |
v4.0 |
unet3d |
sgd |
data_train_samples |
fixed to reference |
number of training samples |
N/A |
v4.0 |
unet3d |
sgd |
data_eval_samples |
fixed to reference |
number of evaluation samples |
N/A |
v4.0 |
maskrcnn |
sgd |
global_batch_size |
arbitrary constant |
global version of reference SOLVER.IMS_PER_BATCH |
v3.1 |
|
maskrcnn |
sgd |
opt_learning_rate_decay_factor* |
fixed to reference (0.1) |
learning rate decay factor |
v3.1 |
|
maskrcnn |
sgd |
opt_learning_rate_decay_steps* |
(60000, 80000) * (1 + K / 10) * 16 / global_batch_size where K is integer |
Steps at which learning rate is decayed |
v3.1 |
|
maskrcnn |
sgd |
opt_base_learning_rate |
0.02 * K for any integer K. For global_batch_size < 16, 0.02 / K for any integer K is also allowed |
base learning rate, this should be the learning rate after warm up and before decay |
v3.1 |
|
maskrcnn |
sgd |
max_image_size* |
fixed to reference |
Maximum size of the longer side |
v3.1 |
|
maskrcnn |
sgd |
min_image_size* |
fixed to reference |
Maximum size of the shorter side |
v3.1 |
|
maskrcnn |
sgd |
num_image_candidates* |
1000 or 1000 * batches per chip |
tunable number of region proposals for given batch size |
v3.1 |
|
maskrcnn |
sgd |
opt_learning_rate_warmup_factor |
unconstrained |
the constant factor applied at learning rate warm up |
v3.1 |
|
maskrcnn |
sgd |
opt_learning_rate_warmup_steps |
unconstrained |
number of steps for learning rate to warm up |
v3.1 |
|
rnnt |
lamb |
global_batch_size |
unconstrained |
reference --batch_size |
See reference code |
v3.1 |
rnnt |
lamb |
opt_name |
"lamb" |
The optimizer that was used. |
See reference code |
v3.1 |
rnnt |
lamb |
opt_base_learning_rate |
unconstrained |
base learning rate, this should be the learning rate after warm up and before decay |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_epsilon |
1e-9 |
LAMB epsilon |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_learning_rate_decay_poly_power |
unconstrained |
Exponential decay rate |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_learning_rate_hold_epochs |
unconstrained |
Number of epochs when LR schedule keeps the base learning rate value |
See reference code |
v3.1 |
rnnt |
lamb |
opt_learning_rate_warmup_epochs |
unconstrained |
Number of epochs when LR linearly increases from 0 to base learning rate |
See reference code |
v3.1 |
rnnt |
lamb |
opt_weight_decay |
1e-3 |
L2 weight decay |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_beta_1 |
unconstrained |
LAMB beta 1 |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_beta_2 |
unconstrained |
LAMB beta 2 |
See reference code |
v3.1 |
rnnt |
lamb |
opt_gradient_clip_norm |
1 or inf |
Gradients are clipped above this norm threshold. |
See reference code |
v3.1 |
rnnt |
lamb |
opt_gradient_accumulation_steps |
unconstrained |
Number of fwd/bwd steps between optimizer steps. |
See reference code |
v3.1 |
rnnt |
lamb |
opt_learning_rate_alt_decay_func |
True |
whether to use alternative learning rate decay function |
See reference code |
v3.1 |
rnnt |
lamb |
opt_learning_rate_alt_warmup_func |
True |
whether to use alternative learning rate warmup function |
See reference code |
v3.1 |
rnnt |
lamb |
opt_lamb_learning_rate_min |
1e-5 |
LR schedule doesn’t set LR values below this threshold |
See reference code |
v3.1 |
rnnt |
lamb |
train_samples |
unconstrained |
Number of training samples after filtering out samples longer than data_train_max_duration |
See reference code |
v3.1 |
rnnt |
lamb |
eval_samples |
2703 |
Number of evaluation samples |
See reference code |
v3.1 |
rnnt |
lamb |
data_train_max_duration |
unconstrained |
Samples longer than this number of seconds are not included in the training dataset |
See reference code |
v3.1 |
rnnt |
lamb |
data_train_num_buckets |
unconstrained |
Training dataset is split into this number of buckets |
See reference code |
v3.1 |
rnnt |
lamb |
data_train_speed_perturbation_min |
0.85 |
Input audio is resampled to a random sample rate not less than this fraction of the original sample rate. |
See reference code |
v3.1 |
rnnt |
lamb |
data_train_speed_perturbation_max |
1.15 |
Input audio is resampled to a random sample rate not greater than this fraction of the original sample rate. |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_freq_n |
2 |
Number of masks for frequency bands |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_freq_min |
0 |
Minimum number of frequencies in a single mask |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_freq_max |
20 |
Maximum number of frequencies in a single mask |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_time_n |
10 |
Number of masks for time band |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_time_min |
0 |
Minimum number of masked time steps as a fraction of all steps |
See reference code |
v3.1 |
rnnt |
lamb |
data_spec_augment_time_max |
0.03 |
Maximum number of masked time steps as a fraction of all steps |
See reference code |
v3.1 |
rnnt |
lamb |
model_eval_ema_factor |
unconstrained |
Smoothing factor for Exponential Moving Average |
See reference code |
v3.1 |
rnnt |
lamb |
model_weights_initialization_scale |
unconstrained |
After random initialization of weight and bias tensors, all are scaled with this factor |
See reference code |
v3.1 |
OPEN: Hyperparameters and optimizer may be freely changed.
Since training large language models is very expensive, the task force aims to limit hyperparameter searches. Thus the allowed range of batch sizes and the corresponding learning rates are fixed as follows.
global_batch_size | opt_base_learning_rate |
---|---|
1536 | 2.0e-5 |
2048 | 2.0e-5 |
3072 | 2.0e-5 |
4096 | 3.0e-5 |
8192 | 3.0e-5 |
-
GBS<1536 or GBS>8192 - new RCP needs to be generated, reach out to the task force
-
GBS [1536,3072] - opt_base_learning_rate=2.0e-5
-
GBS (3072,4096) - opt_base_learning_rate=2.0e-5 or opt_base_learning_rate=3.0e-5
-
GBS [4096,8192] - opt_base_learning_rate=3.0e-5
If a new learning rate is needed for any GBS point, request new RCPs from the task force or normalize the score if permissible.
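The batch-size ranges above can be read as a simple lookup. The following is a minimal sketch of that reading, not an official checker; the function name `allowed_base_learning_rates` is hypothetical.

```python
def allowed_base_learning_rates(global_batch_size):
    """Return the set of opt_base_learning_rate values permitted for a GBS.

    Hypothetical helper that encodes the ranges listed in the rules text.
    Raises ValueError outside [1536, 8192], where a new RCP is required.
    """
    gbs = global_batch_size
    if gbs < 1536 or gbs > 8192:
        raise ValueError("GBS outside [1536, 8192]: a new RCP must be generated")
    if gbs <= 3072:            # GBS in [1536, 3072]
        return {2.0e-5}
    if gbs < 4096:             # GBS in (3072, 4096)
        return {2.0e-5, 3.0e-5}
    return {3.0e-5}            # GBS in [4096, 8192]
```

For example, `allowed_base_learning_rates(3500)` returns both learning rates, while `allowed_base_learning_rates(1024)` raises because that batch size needs a new RCP.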
Submitters are expected to use their best efforts to submit with optimal hyperparameters for their system. The intent of Hyperparameter Borrowing is to allow a submitter to update their submission to reflect what they would have submitted had they known about better hyperparameters before submitting, without knowing any other information (i.e., the performance of other submissions).
During the review period as described in the Submission Rules, a submitter may replace the hyperparameters, once per benchmark entry, in their implementation of a benchmark with hyperparameters from another submitter’s implementation of the same benchmark. By default, they may change batch size (local batch size, global batch size, batchnorm span), but must replace all other hyperparameters as a group.
With evidence that the resulting model, using the same batch size as the other submitter’s implementation, converges worse in terms of epochs required, the submitter may make a minimum number of additional hyperparameter changes for the purpose of improving convergence and achieving comparable, but not better, convergence in epochs compared to the other submitter’s implementation, but preserving any difference in convergence that may exist due to precision choices. In this situation, the other submitter’s implementation is considered the reference, and the new submitter must match the convergence behavior of the other submitter in a similar way as we compare any submission to the reference.
A resubmission of a benchmark with borrowed hyperparameters must use the same software (with the exceptions listed in the Software Adoption section of this document), system and system configuration (accelerators, NICs etc) as the original submission. The largest scale submission for a benchmark from a given system may be resubmitted with borrowed hyperparameters using a change of scale on that system, but only if the new scale is either larger, or enables the resubmission to achieve a faster run result. In addition, the new scale must not be larger than the largest scale used in an original submission of at least one of the benchmarks on that system in this round.
Since the hyperparameters are fixed for GPT3, hyperparameter borrowing is not allowed.
CLOSED: The same loss function used in the reference implementation must be used.
OPEN: Any loss function may be used. Do not confuse the loss function with target quality measure.
Each run must reach a target quality level on the reference implementation quality measure. By default, the time to evaluate the quality is included in the wallclock time. However, if the reference implementation generates timestamped checkpoints and evaluates the quality after the clock has been stopped, then an implementation may either perform evaluation on-the-clock or generate timestamped checkpoints, evaluate them after the clock has been stopped, and update the clock stopped time to the timestamp of the first passing checkpoint. The checkpoint timestamp may be any time after the last weight value included in the checkpoint is updated.
CLOSED: The same quality measure as the reference implementation must be used. The quality measure must be evaluated at the same frequency (in terms of number of training items between test sets) and at least as thoroughly (in terms of number of tests per set) as in the reference implementation. Where applicable, the required evaluation point may be rounded up to the nearest batch size. Typically, a test consists of comparing the output of one forward pass through the network with the desired output from the test set.
Area | Problem | Model | Evaluation frequency | Latest version available |
---|---|---|---|---|
Vision | Object detection (light weight) | SSD (RetinaNet) | Every 1 epoch | v4.1 |
Vision | Text to image | Stable Diffusion v2.0 | | v4.1 |
Language | NLP | BERT | eval_interval_samples=FLOOR(0.05*(230.23*GBS+3000000), 25000), skipping 0 | v4.1 |
Language | Large language model | GPT3 | Every 24576 sequences. CEIL(24576 / global_batch_size) steps if 24576 is not divisible by GBS | v4.1 |
Language | Large language model | Llama2_70B_LoRA | Every 384 sequences, CEIL(384 / global_batch_size) steps if 384 is not divisible by GBS. Skipping first FLOOR(0.125*global_batch_size+2) evaluations | v4.1 |
Commerce | Recommendation | DLRMv2 (DCNv2) | Every FLOOR(TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * NUM_EVAL)) samples, where TOTAL_TRAINING_SAMPLES = 4195197692 and NUM_EVAL = 20 | v4.1 |
Graphs | Node classification | R-GAT | Evaluate 20 times per epoch | v4.1 |
Vision | Image classification | Resnet-50 v1.5 | Every 4 epochs with offset 0 or 1 or 2 or 3 | v4.0 |
Vision | Image segmentation (medical) | U-Net3D | Starting at | v4.0 |
Vision | Object detection (heavy weight) | Mask R-CNN | Every 1 epoch | v3.1 |
Language | Speech recognition | RNN-T | Every 1 epoch | v3.1 |
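The evaluation-frequency expressions in the table can be evaluated directly. The sketch below makes two assumptions that the text does not state explicitly: `FLOOR(x, m)` is read in the spreadsheet sense (round `x` down to a multiple of `m`), and the DLRMv2 expression is read as `FLOOR(TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * NUM_EVAL))`. All function names are hypothetical.

```python
import math


def bert_eval_interval_samples(global_batch_size):
    """FLOOR(0.05*(230.23*GBS+3000000), 25000), assuming FLOOR(x, m)
    rounds x down to the nearest multiple of m."""
    raw = 0.05 * (230.23 * global_batch_size + 3_000_000)
    return int(raw // 25_000) * 25_000


def eval_interval_steps(interval_sequences, global_batch_size):
    """CEIL(interval / GBS): steps between evaluations, e.g. with an
    interval of 24576 sequences (GPT3) or 384 sequences (Llama2_70B_LoRA)."""
    return math.ceil(interval_sequences / global_batch_size)


def llama2_skipped_evaluations(global_batch_size):
    """FLOOR(0.125*GBS+2): number of initial evaluations that are skipped."""
    return math.floor(0.125 * global_batch_size + 2)


def dlrm_eval_interval(global_batch_size,
                       total_training_samples=4195197692, num_eval=20):
    """Value of FLOOR(TOTAL_TRAINING_SAMPLES / (GBS * NUM_EVAL))."""
    return total_training_samples // (global_batch_size * num_eval)
```

For instance, under these assumptions a BERT run with a global batch size of 256 evaluates every 150000 samples, and a GPT3 run with a global batch size of 1536 evaluates every 16 steps.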
OPEN: Arbitrary stopping criteria may be used, including but not limited to the closed quality measure, a different quality measure, the number of epochs, or a fixed time. However, the reported results must include the geometric mean of the final quality as measured by the closed quality measure.
Exceptions for GPT3 OPEN: open submissions are allowed to choose a non-English language version of the C4 dataset. When doing so, the submitter needs to make it clear that the dataset and convergence measures are different from those of closed division submissions.
Checkpoints can be created at the discretion of the submitter. No checkpoints are required to be produced or retained.
The CLOSED division allows limited exemptions to mathematical equivalence between implementations for pragmatic purposes, including:
-
Different methods can be used to add color jitter as long as the methods are of a similar distribution and magnitude to the reference.
-
If data set size is not evenly divisible by batch size, one of several techniques may be used. The last batch in an epoch may be composed of the remaining samples in the epoch, may be padded, or may be a mixed batch composed of samples from the end of one epoch and the start of the next. If the mixed batch technique is used, quality for the ending epoch must be evaluated after the mixed batch. If the padding technique is used, the first batch may be padded instead of the last batch. Additionally, in the case of DLRMv2 benchmark, the last partial training batch may be dropped.
-
Values introduced for padding purposes may be reflected in batch norm computations.
-
Adam optimizer implementations may use the very small value epsilon to maintain mathematical stability in slightly different ways, provided that methods are reviewed and approved in advance. One such method involves squaring the value of epsilon and moving epsilon inside the square root in the parameter update equation.
-
Distributed batch normalization is allowed.
Additional exemptions need to be explicitly requested and approved in advance. In general, exemptions may be approved for techniques that are common industry practice, introduce small differences that would be difficult to engineer around relative to their significance, and do not substantially decrease the required computation. Over time, MLPerf should seek to help the industry converge on standards and remove exemptions.
The OPEN division does not restrict mathematical equivalence.
For a given round of MLPerf, the "canonical version" of a software component shall be defined as the public version as of 14 days before submission. If the software is open source, the canonical version shall be the one compiled with the default compilation options. If a system software provider submits with a component whose version is other than the canonical version, then other submitters using the same component are allowed to update their submission to use that version. Those other submitters must resubmit with the updated system software before the resubmission deadline during the review period. Software adoption applies only to system software, only to the version used by the software provider’s submission, and explicitly does not cover benchmark implementations. Benchmark implementations should be borrowed as a whole only if the software provider’s submission introduces new APIs.
A run result consists of a wall-clock timing measurement for a contiguous period that includes model initialization in excess of a maximum initialization time, any data preprocessing required to be on the clock, using the dataset to train the model, and quality evaluation unless specified otherwise for the benchmark.
Prior to starting the clock, a system may use a maximum model initialization time of 30 minutes for Closed division and 4 hours for Open division. Model initialization time begins when the system first begins to construct or execute the model. This maximum initialization time is intended to ensure that model initialization is not disproportionate on large systems intended to run much larger models, and may be adjusted in the future with sufficient evidence.
The clock must start before any part of the system touches the dataset or when the maximum model initialization time is exceeded. The clock may be stopped as soon as any part of the system determines target accuracy has been reached. The clock may not be paused during the run.
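The timing rule above can be illustrated with timestamps. This is a minimal sketch under the assumption that all timestamps are in seconds on a common clock; the function and parameter names are hypothetical, and the 30-minute default corresponds to the Closed division limit.

```python
CLOSED_MAX_INIT_SECONDS = 30 * 60      # Closed division limit
OPEN_MAX_INIT_SECONDS = 4 * 60 * 60    # Open division limit


def latest_clock_start(init_begin, first_dataset_touch,
                       max_init_seconds=CLOSED_MAX_INIT_SECONDS):
    """Latest moment at which the clock may start: the first dataset
    access, or the end of the allowed untimed initialization window,
    whichever comes first."""
    return min(first_dataset_touch, init_begin + max_init_seconds)


def run_result_seconds(init_begin, first_dataset_touch, target_reached,
                       max_init_seconds=CLOSED_MAX_INIT_SECONDS):
    """Wall-clock run result; the clock is never paused."""
    start = latest_clock_start(init_begin, first_dataset_touch,
                               max_init_seconds)
    return target_reached - start
```

If initialization takes 40 minutes in the Closed division, the 10 minutes beyond the limit are counted on the clock.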
Each benchmark result is based on a set of run results. The number of results for each benchmark is based on a combination of the variance of the benchmark result, the cost of each run, and the likelihood of convergence.
Area | Problem | Minimum Number of Runs | Latest version available |
---|---|---|---|
Vision | Object detection (light weight) | 5 | v4.1 |
Vision | Stable Diffusion v2.0 | 10 | v4.1 |
Language | NLP | 10 | v4.1 |
Language | Large language model | 3 | v4.1 |
Language | Large language model Fine Tune (LoRA) | 10 | v4.1 |
Commerce | Recommendation | 10 | v4.1 |
Graphs | Node classification | 10 | v4.1 |
Vision | Image classification | 5 | v4.0 |
Vision | Image segmentation (medical) | 40 | v4.0 |
Vision | Object detection (heavy weight) | 5 | v3.1 |
Language | Speech recognition | 10 | v3.1 |
Each benchmark result is computed by dropping the fastest and slowest runs, then taking the mean of the remaining times. For this purpose, a single non-converging run may be treated as the slowest run and dropped. A benchmark result is invalid if there is more than one non-converging run.
In the case of UNET3D, due to large variance, 40 runs are required. Out of the 40 runs, the 4 fastest and 4 slowest are dropped. There can be at most 4 non-converging runs. A run is classified as non-converged if the target quality metric is not reached within CEILING(10000*168/samples_in_epoch) epochs.
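The scoring rule described above (often called olympic scoring) can be sketched as follows. The representation of a non-converging run as `math.inf` and the function names are assumptions made for illustration only.

```python
import math


def olympic_score(run_times, drop=1, max_non_converged=1):
    """Mean run time after dropping the `drop` fastest and `drop` slowest runs.

    Non-converging runs are represented as math.inf (an assumed convention),
    so they sort as the slowest runs. Returns math.inf when there are more
    non-converging runs than allowed, i.e. the benchmark result is invalid.
    """
    non_converged = sum(1 for t in run_times if math.isinf(t))
    if non_converged > max_non_converged:
        return math.inf
    kept = sorted(run_times)[drop:len(run_times) - drop]
    return sum(kept) / len(kept)


def unet3d_max_epochs(samples_in_epoch):
    """CEILING(10000*168/samples_in_epoch): epoch budget after which a
    UNET3D run counts as non-converged."""
    return math.ceil(10000 * 168 / samples_in_epoch)
```

For a 5-run benchmark, `olympic_score([10, 12, 11, 13, 9])` averages the three middle runs. For UNET3D the same function would be called with `drop=4, max_non_converged=4` on 40 runs.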
Each benchmark result should be normalized by dividing the reference result for the corresponding reference implementation by the benchmark result. This normalization produces higher numbers for better results, which better aligns with human intuition.
An MLPerf submission score is intended to represent the median expected result across a large number of runs.
To reduce statistical variance and the potential to cherry pick results, each benchmark submission is composed of a set of N independent runs, with N chosen based on the observed variation of the benchmark, as described in the table above.
Running multiple iterations of N independent runs with the goal of validating that the submission is close to a median result is encouraged but not required. Running multiple iterations of N runs to try to find the lowest one is against the spirit of MLPerf and is prohibited – see Section 2.1, “Strive to be fair”. Results that appear to be too far away from a median result may be rejected.
As a more computationally efficient method of validating that a submission is close to the median result, it is also allowed to run M>N independent runs as a group and to designate N consecutive runs from the group as the runs to be used for scoring, provided that the submitter chooses the N consecutive runs that are closest to the median result. For the purposes of calculating the median, sets of N consecutive runs that would create an invalid benchmark result should be included in the median calculation as "infinite" scores. If the median set would be an invalid benchmark result, the entire result is invalid. Submitting the full run set (vs just the N runs used for scoring) as a reference is optional, but may be required in the future. For purposes of this scoring, "consecutive" is defined as an objective and deterministic method, such as submission timestamps. Submitters are not allowed to pick different orderings to improve their score. Runs may go in parallel on the submitter’s compute resources, as long as there is a way to objectively and deterministically sort the runs, for example by timestamp.
For example, for a benchmark with N=5 runs, a submitter could pick M=10 ahead of time, launch 10 runs on their compute resources, sort the 10 runs by their launch timestamp, then take a sliding window of 5 consecutive runs over those 10 runs. That sliding window would create 6 possible sets of 5 runs. Each of those 6 sets would be olympically scored, and the set with the median runtime would be submitted as that submitter’s score. Any failed runs within those 10 runs would count as infinite time and need to be included in the olympic scoring (a failed run could be thrown away as the slowest score). It is recommended that a submitter keep the logs for all M runs, because the review committee may ask the submitter to share the M logs during the review period.
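The sliding-window selection can be sketched as below. The rules do not say how to break the tie when the number of windows is even, so this sketch assumes the higher (slower) of the two middle scores is taken via `statistics.median_high`; that choice, the `math.inf` convention for failed runs, and the function names are assumptions.

```python
import math
import statistics


def olympic_mean(times):
    """Drop the fastest and slowest run and average the rest.
    More than one failed run (math.inf) makes the set invalid."""
    if sum(1 for t in times if math.isinf(t)) > 1:
        return math.inf
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)


def pick_scoring_window(times_in_launch_order, n):
    """Score every window of n consecutive runs and return
    (start_index, score) of the window holding the median score.

    A returned score of math.inf means the median set is invalid,
    and therefore the entire result is invalid.
    """
    windows = len(times_in_launch_order) - n + 1
    scores = [olympic_mean(times_in_launch_order[i:i + n])
              for i in range(windows)]
    median = statistics.median_high(scores)  # assumed tie-break
    return scores.index(median), median
```

With M=10 runs taking 10, 11, ..., 19 minutes in launch order and N=5, the six window scores are 12 through 17, and this sketch selects the window starting at the fourth run.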
The score of an MLPerf submission may be scaled if the training committee decides so during the review period. Reasons for scaling include, but are not limited to, failing to meet the reference convergence limits imposed by the Reference Convergence Points (see following section). To facilitate automatic generation of the scaled score, the scaling factor must be provided in a JSON file named scaling.json in the directory whose scores are to be scaled.
Reference Convergence Points are used to ensure that the convergence of the submission does not deviate from the convergence of the reference. We are interested in avoiding cases where the submission convergence is faster than the reference. Reference implementation convergence sets a lower bound on epoch convergence that a valid submission should not beat. From a statistical standpoint if the submission mean epochs to converge is significantly lower than the reference mean epochs to converge, then submission convergence points belong to a different population than the reference convergence points, and thus the submission should not be accepted. Compliance to reference convergence points is validated as follows
-
Reference implementations provide at least 2N epoch convergence numbers, where N is the number of submission runs needed for each benchmark. Since convergence is affected by batch size (larger batch size means slower convergence), reference implementations provide convergence data for a few different batch sizes.
-
For GPT3 where there are two reference implementations which have been verified to be equivalent with minimum variance, each reference implementation should provide at least N epoch convergence numbers for each RCP.
-
After a set of Reference Convergence Points is gathered, we find the minimal set of these points that are needed for the fastest possible convergence. For example, if the RCP for batch size 128 is at 10 epochs, the RCP for batch size 256 is at 20 epochs, and the RCP for batch size 512 is also at 20 epochs, then we prune the RCP at the 256 batch size. Based on the assumption that convergence increases with batch size, we expect to be able to converge faster than 20 epochs at batch size 256. In practice we prune ALL RCP points that have slower convergence than the linear interpolation at the same batch size of any two surrounding points. Eventually we end up with a pruned set of RCPs which defines the fastest possible convergence of the reference code as a function of batch size.
-
A potential submitter can request generation of new RCPs by suggesting a better set of hparams to the WG or generate new RCPs by running the reference themselves. A request for a new RCP run should be backed by at least one run on either the submitter’s code or the reference code proving faster convergence. A request to generate RCPs should be made in the Training WG meeting at least 8 weeks before submission deadline and the reference owner (or a volunteer appointed by WG) should provide the RCP at least 4 weeks before submission deadline. Subject to WG’s approval, requester’s set of convergence points (2N runs) may act as temporary RCPs for that round if the RCP request is not met by a timely response.
-
For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester’s set of convergence points (2N runs) can be used as temporary RCPs.
-
Using the mean and standard deviation of the reference convergence we apply a 1-sided independent two-sample Student’s t-test with unequal sample sizes, similar variances with p-value=0.05 (explained here) to find the maximum acceptable speedup for submission convergence.
-
At submission time, the submission is matched to an RCP based on the submission batch size.
-
If there is an RCP for that batch size then mean epochs to converge of the submission is extracted from submission logs. If this does not violate the maximum acceptable speedup condition when compared to the reference then the submission is accepted, otherwise it may be rejected.
-
If there is no RCP for that batch size but there are RCPs for smaller and larger batch sizes, an interpolated RCP is created, and the mean epochs to converge is compared against the interpolated RCP just as in the previous case.
-
If the submission batch size is larger than the batch size of any RCP the submitter must provide the missing RCPs by running the reference implementation with their batch size.
-
If the submission batch size is smaller than the batch size of any RCP AND the convergence test against the RCP with the minimum batch size fails, then again the submitter must provide the missing RCPs by running the reference implementation with their batch size.
-
Accepted submissions with mean epochs lower than RCP mean (faster) but within the acceptable speedup range are normalized to (potentially interpolated) RCP mean epochs for fairness. New normalized score = Submission-olympic-score * (RCP-mean / olympic-submission-epochs)
-
Please refer to the related Appendix for examples that shed light on the RCP process.
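The pruning, interpolation, and normalization steps in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the RCP checker itself: RCPs are represented as a mapping from batch size to mean epochs, interpolation is linear in batch size, and all function names are hypothetical. The t-test step is omitted.

```python
def interpolate(lo, hi, batch_size):
    """Linear interpolation of mean epochs between two (batch_size, epochs) points."""
    (b0, e0), (b1, e1) = lo, hi
    return e0 + (e1 - e0) * (batch_size - b0) / (b1 - b0)


def prune_rcps(rcps):
    """Drop every RCP that converges slower than the interpolation between
    any two surrounding RCPs (one with a smaller, one with a larger batch size)."""
    points = sorted(rcps.items())
    kept = {}
    for bs, epochs in points:
        smaller = [p for p in points if p[0] < bs]
        larger = [p for p in points if p[0] > bs]
        slower = any(epochs > interpolate(lo, hi, bs)
                     for lo in smaller for hi in larger)
        if not slower:
            kept[bs] = epochs
    return kept


def rcp_mean_epochs(pruned, batch_size):
    """Mean epochs of the matching RCP, interpolated if needed."""
    if batch_size in pruned:
        return pruned[batch_size]
    lower = [b for b in pruned if b < batch_size]
    upper = [b for b in pruned if b > batch_size]
    if not lower or not upper:
        raise LookupError("batch size outside the RCP range; a new RCP may be needed")
    lo, hi = max(lower), min(upper)
    return interpolate((lo, pruned[lo]), (hi, pruned[hi]), batch_size)


def normalized_score(olympic_score, rcp_mean, submission_epochs):
    """Submission-olympic-score * (RCP-mean / olympic-submission-epochs)."""
    return olympic_score * (rcp_mean / submission_epochs)
```

With the example from the text, `prune_rcps({128: 10, 256: 20, 512: 20})` removes the point at batch size 256, and a submission at batch size 256 is then compared against the interpolated value between the two remaining points.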
Submitters are encouraged to run the RCP checker script prior to their submission to make sure they do not violate RCP limits.
If a submission fails the RCP test, such as S2 in the Appendix, they have the option to submit with the --rcp_bypass parameter. This will allow the submission to upload, but the submitter must notify the results chair, and prepare for the audit process described in the next section where at review time the submitter should be able to justify why their submission is valid while it failed the RCP test.
If a submission is missing the RCP for the batch size they are submitting, such as S4 and S6 in the Appendix they must provide the missing convergence points by making a PR in the logger. All missing RCPs are due 24h after the submission deadline (Exception is GPT3: where RCPs are due 5 weeks before the submission deadline). RCPs are added by making a pull request into the RCP library in the logging repository. Since the RCP may arrive after the submission deadline, the submitter can use the --rcp_bypass parameter again to have their submission accepted.
During hyperparameter borrowing, borrowers can use hyperparameters from submissions that passed or failed the RCP test. If their submission fails to pass the RCP test they can have it uploaded by using --rcp-bypass and then prepare for the audit described in the next section.
To extract submission convergence points, logs should report epochs as follows.
Benchmark | Epoch reporting | Latest version available |
---|---|---|
BERT | Training sample (integer) | v4.1 |
GPT3 | Training token starting from 0 (integer) | v4.1 |
Llama2_70B_LoRA | Training sample (integer) | v4.1 |
DLRMv2 (DCNv2) | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, …, 1.0) | v4.1 |
Stable-Diffusion | Training sample (integer) | v4.1 |
SSD (RetinaNet) | Epoch | v4.1 |
R-GAT | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, …, 1.0) | v4.1 |
RN50 | Epoch | v4.0 |
UNET3D | Epoch | v4.0 |
Mask-RCNN | Epoch | v3.1 |
RNN-T | Epoch | v3.1 |
In order to reduce the burden on the submitter as well as the Submitter’s Working Group (SWG) during the review period, submitters shall ensure compliance with RCP tests ahead of the submission deadline. Submissions that need new RCPs are required to supply those RCPs at the same time as their submission, as specified in the Training Rules document. While providing new RCPs, a submitter must also include reference run logs for the SWG and reference owner to review.
Submissions with failing RCP tests are rejected by default until the SWG approves the submission. Submitters shall notify the SWG in advance of a potential RCP failure, so they can prefetch requests for additional data and minimize churn during the review period. A submitter requesting approval for a submission with failing RCP test shall provide additional explanatory data to the SWG explaining why the WG should consider the non-compliant submission a fair comparison to compliant submissions. This list will be decided by the WG for each submission individually.
A non-exhaustive list of potential requests of data is:
-
Written statement from the submitter explaining the plausible cause of deviation. This should also be supported by data from A/B experiments.
-
Logs showing training loss of the submission vs training loss of the reference. Note that the reference run should be on the reference hardware platform in FP32.
-
Model summary showing the number of trainable_parameters (weights) in the submission model vs the same count for the reference model.
-
Debugging via comparing intermediate activations, distributions of initialization weights, and/or compliant randomization on the reference vs the submission. The SWG may further request additional information, not listed above, at their discretion.
A submitter requesting approval for their RCP failing submission during the review period shall provide requested information in a timely manner. All evidence supporting the appeal is due at the latest by the end of Review Week 1. For resubmissions during the review period, all appeal evidence is due at the time of resubmission.
The SWG must come to majority consensus to approve a submission that fails the RCP test. If the SWG cannot come to majority consensus to approve a submission, then potential alternatives are:
-
Normalize submission run epochs to reference epochs to pass RCP test irrespective of accuracy achieved
-
Submission is withdrawn due to non-compliance
-
Node Classification
-
Timed region: Graph and feature loading, training, evaluation are all timed. Graph-partitioning for multi-node runs is not timed.
-
Node features are in fp32 in the dataset, but lower precisions are allowed. Feature precision can be converted offline.
-
Any sparse format may be used for storing the graph. Offline conversion is allowed.
-
Graph partitioning algorithm and locality:
-
Any general non-data-aware partitioning algorithm that is reproducible, either using a fixed seed or a deterministic algorithm, may be used.
-
We require that each graph node’s feature can only be read from disk on one exclusive training node. Other training nodes that need this graph node’s feature should fetch it over the network
-
-
Caching: Graph caching is allowed, but feature caching is not allowed.
-
Sampler: Submitters are not expected to exactly match reference sampler implementation due to known framework differences, but must meet RCP criteria.
-
-
Stable Diffusion
-
10 runs per submission
-
Checkpoint must be collected every 512,000 images. CEIL(512000 / global_batch_size) if 512000 is not divisible by GBS.
-
The collected checkpoints may be evaluated freely (in order, out of order, some checkpoints may be skipped), provided that:
-
FID and CLIP scores must be submitted for all collected checkpoints (up to the first checkpoint with a passing score) for 1/10 of the runs.
-
FID and CLIP scores must be submitted for the last two checkpoints (the first checkpoint with a passing score and the one before it) for 9/10 of the runs.
-
-
Evaluation is done offline; its time is not counted towards the submission time.
-
A passing score is FID<=90 and CLIP>=0.15
-
-
Image Classification
-
The model may have 1000 or 1001 classes, where the 1001st is "I don’t know"
-
-
BERT
-
Clip-normalization order: The 1.0 and 1.1 exception that benchmarks may implement clip-normalization either before or after accelerator all-reduce has been extended indefinitely to future rounds.
-
--rcp-bert-train-samples log compliance parameter: For all benchmarks other than BERT, convergence for RCP purposes is reported in the last eval_accuracy line of the log file. For BERT, submitters are allowed to add an extra log line with key set to train_samples and value set to the number of samples to converge. If that is the case, the package compliance checker should be run with the --rcp-bert-train-samples command line parameter.
-
-
DLRMv2 (DCNv2)
-
Because DLRMv2 (DCNv2) benchmark is trained for at most one epoch, epoch numbering starts from 0 in this case. More precisely, it stands for the fraction of epoch iterations passed.
-
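As an illustration of the Stable Diffusion rules in this appendix, the checkpoint interval and the passing criterion can be sketched as follows. The function names and the representation of scores as `(FID, CLIP)` pairs in checkpoint order are assumptions.

```python
import math


def checkpoint_interval_steps(global_batch_size, images_per_checkpoint=512_000):
    """CEIL(512000 / GBS): training steps between collected checkpoints."""
    return math.ceil(images_per_checkpoint / global_batch_size)


def is_passing(fid, clip):
    """A checkpoint passes when FID <= 90 and CLIP >= 0.15."""
    return fid <= 90 and clip >= 0.15


def first_passing_checkpoint(scores):
    """Index of the first passing checkpoint in a list of (fid, clip)
    pairs ordered by collection time, or None if no checkpoint passes."""
    for index, (fid, clip) in enumerate(scores):
        if is_passing(fid, clip):
            return index
    return None
```

For a global batch size of 768, which does not divide 512000, a checkpoint would be collected every 667 steps under this reading.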
Analysis to support this can be found in the document "MLPerf Optimizer Review" in the MLPerf Training document area. TODO: locate the document and provide working link
Benchmark | Algorithm | Framework | Optimizer Implementations |
---|---|---|---|
Image classification | LARS | PyTorch | [No compliant implementation] |
Image classification | LARS | TensorFlow | MLPERF_LARSOptimizer |
Image classification | LARS | MxNet | SGDwFASTLARS |
Image classification | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
Image classification | SGD with Momentum | PyTorch | torch.optim.SGD |
Image classification | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
Image classification | SGD with Momentum | MxNet | [No compliant implementation] |
Object detection (heavy weight) | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
Object detection (heavy weight) | SGD with Momentum | PyTorch | torch.optim.SGD |
Object detection (heavy weight) | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
Object detection (light weight) | ADAM | PyTorch | torch.optim.Adam |
Object detection (light weight) | ADAM | TensorFlow | tf.keras.optimizers.Adam |
NLP | LAMB | PyTorch | apex.optimizers.FusedLAMB |
NLP | LAMB | TensorFlow | tf.optimizers.LAMB |
Large Language Model | Adam | PyTorch | apex.optimizers.FusedAdam |
Large Language Model | Adam | PaxML | praxis.optimizers.Adam |
Speech recognition | LAMB | PyTorch | apex.optimizers.FusedLAMB |
Speech recognition | LAMB | TensorFlow | tf.optimizers.LAMB |
Recommendation | Adagrad | PyTorch | torch.optim.Adagrad (dense layers) + torchrec.optim.Adagrad (embeddings) |
Image segmentation (medical) | SGD with Momentum | PyTorch | torch.optim.SGD |
Image segmentation (medical) | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
Image segmentation (medical) | SGD with Momentum | MXNet | mx.optimizer.NAG |
This section contains rules specific to the v1.0 round of MLPerf Training. These do not apply to future rounds, unless explicitly ratified as rules for those rounds, or unless these rules are promoted to official rules in previous sections of this document.
For v1.0 only, Mask-RCNN submitters may use the non-reference backbone located here with the understanding that it converges similarly to the reference backbone. If the non-reference backbone is shown to converge faster than the reference backbone at any scale on any submitted hyperparameter set, all uses of that backbone for any submitter are to be re-run with the reference backbone to have their submission published. For future rounds, the expectation is that all submitters will use the reference backbone, which will be fixed at reference code freeze time.
For v1.0 only, BERT submissions may implement clip-norm either before or after inter-accelerator all-reduce. For future rounds, the expectation is that submissions must use clip-norm-after-reduce, to be consistent with the most commonly used public BERT model repos.
To keep performance consistent across at-scale BERT submissions in v1.0, submitters may not use clip-norm-after-reduce as a way to enable additional overlap of communication and math. If a submitter plans to use clip-norm-after-reduce for v1.0, they must notify the committee before the submission deadline, and be prepared to show code in their submission proving that they do not overlap communication and math as a result of clip-norm-after-reduce.
Furthermore, for simplicity, the RCPs for this round will use clip-norm-before-reduce. In theory, this could allow clip-norm-after-reduce submissions that converge faster than they should, but still not faster than clip-norm-before-reduce. The Training Working Group considers this an acceptable risk for v1.0, in the interest of simplifying the RCPs for v1.0.
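The difference between the two clip-norm orderings can be illustrated with a small, framework-free sketch. It is not taken from any reference implementation: the helper names are hypothetical, the all-reduce is assumed to average gradients across workers, and clipping is assumed to be a global L2-norm clip.

```python
import math


def l2_norm(vec):
    """Global L2 norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def clip_by_norm(vec, max_norm):
    """Scale vec down so its L2 norm does not exceed max_norm."""
    norm = l2_norm(vec)
    if norm <= max_norm:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def all_reduce_mean(per_worker_grads):
    """Stand-in for inter-accelerator all-reduce (assumed to average)."""
    n = len(per_worker_grads)
    return [sum(column) / n for column in zip(*per_worker_grads)]


def clip_norm_before_reduce(per_worker_grads, max_norm):
    # Each worker clips its local gradient, then the results are reduced.
    clipped = [clip_by_norm(g, max_norm) for g in per_worker_grads]
    return all_reduce_mean(clipped)


def clip_norm_after_reduce(per_worker_grads, max_norm):
    # Gradients are reduced first, then the reduced gradient is clipped once.
    return clip_by_norm(all_reduce_mean(per_worker_grads), max_norm)


# Hypothetical gradients from two workers, chosen to make the arithmetic easy.
worker_grads = [[3.0, 4.0], [0.3, 0.4]]
before = clip_norm_before_reduce(worker_grads, max_norm=1.0)
after = clip_norm_after_reduce(worker_grads, max_norm=1.0)
print(before, l2_norm(before))  # norm is about 0.75
print(after, l2_norm(after))    # norm is about 1.0
```

With these inputs the two orderings produce different update vectors, which is why the choice of ordering can affect convergence and why the RCPs have to fix one of them.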
For v1.0 only, the allowed untimed compile time is increased from 20 minutes to 30 minutes. This enables new submitters who were close to the 20-minute limit to submit. The 20-minute number was chosen empirically for rounds prior to v1.0. For v1.1 and beyond, the Training Working Group should make a data-driven decision on what compile time is reasonable for real user applications.
This section contains rules specific to the v1.1 round of MLPerf Training. These do not apply to future rounds, unless explicitly ratified as rules for those rounds, or unless these rules are promoted to official rules in previous sections of this document.
In v1.0, BERT submissions were allowed to implement clip-norm either before or after inter-accelerator all-reduce. This exception was extended to v1.1 because of the tight schedule between rounds. For future rounds, the expectation is that submissions must use clip-norm-after-reduce, to be consistent with the most commonly used public BERT model repos.
For v1.1, we changed the policy documentation to say that a Preview submission needs to be available at the next submission after 140 days, not 180 days like it was before. However, this does not apply to Preview submissions from v1.0, which will still follow the 180 day policy. For v1.1 Preview submissions and beyond, the 140 day rule will apply. This is not necessarily an "exception," but we are listing it here as a special case for the record.
The RCP checking process is best illustrated with the following examples:
Benchmark A requires 5 submission runs. The reference implementation provides (at least) 10 convergence points, let’s say [16, 14, 16, 17, 16, 16, 15, 16, 15, 16] for batch size 128. The highest and lowest points are excluded from the mean and standard deviation computation, so in this case Mean = 15.75 epochs and Stdev = 0.43. Based on the t-test, the maximum allowed speedup for p-value=0.05 is 3.53%. In other words, the minimum mean epochs to converge for each submission with batch size 128 is 15.75 / 1.0353 = 15.21.
The reference also provides convergence points for batch size 256: [20, 21, 21, 20, 22, 22, 21, 21, 20, 20]. In this case Mean = 20.75, Stdev = 0.66 and based on the t-test the maximum allowed speedup for p-value=0.05 is 4.12%. In other words the minimum mean epochs to converge for batch-256 is 19.93.
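The reference statistics quoted above can be reproduced with a short sketch. The function names are hypothetical, the standard deviation is assumed to be the population standard deviation (which matches the quoted values), and the maximum speedup percentages are copied from the text rather than derived, since the t-test itself is not reimplemented here.

```python
import statistics


def rcp_stats(epochs_to_converge):
    """Drop the single highest and lowest point, then return (mean, stdev)."""
    trimmed = sorted(epochs_to_converge)[1:-1]
    # Population standard deviation is an assumption that matches the text.
    return statistics.fmean(trimmed), statistics.pstdev(trimmed)


def min_allowed_mean(reference_mean, max_speedup_pct):
    """Smallest mean epochs-to-converge a submission may report."""
    return reference_mean / (1.0 + max_speedup_pct / 100.0)


mean_128, std_128 = rcp_stats([16, 14, 16, 17, 16, 16, 15, 16, 15, 16])
mean_256, std_256 = rcp_stats([20, 21, 21, 20, 22, 22, 21, 21, 20, 20])

# Speedup limits (3.53% and 4.12%) are taken from the text above.
print(round(mean_128, 2), round(std_128, 2), round(min_allowed_mean(mean_128, 3.53), 2))
# prints: 15.75 0.43 15.21
print(round(mean_256, 2), round(std_256, 2), round(min_allowed_mean(mean_256, 4.12), 2))
# prints: 20.75 0.66 19.93
```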
Let’s consider now the following submission scenarios:
- Submitter S1 makes a submission for A with batch size 128, and from the logs the epochs to converge are [15, 15, 15, 16, 16]. Excluding the top and bottom runs, the mean epochs to converge is 15.33 (> 15.21), so S1 passes the RCP test for benchmark A, batch size 128.
- Submitter S2 makes a submission for A with batch size 256, and from the logs the epochs to converge are [19, 19, 19, 20, 21]. Excluding the top and bottom runs, the mean epochs to converge is 19.33 (< 19.93), so S2 fails the RCP test for benchmark A, batch size 256.
- Submitter S3 makes a submission for A with batch size 192, and from the logs the epochs to converge are [17, 18, 18, 18, 20]. There are no RCPs for batch size 192, but there are RCPs for both a larger and a smaller batch size. In this situation we interpolate the mean and standard deviation of the RCPs to batch size 192: Mean = 18.25 and Stdev = 0.547. Based on the t-test with p-value=0.05, the maximum allowed speedup is 3.68%. Excluding the top and bottom submission runs, the submission mean epochs to converge is 18, which is more than 18.25 / 1.0368 = 17.60, so the submission is accepted for batch size 192.
- Submitter S4 makes a submission for A with batch size 512. Since there is neither an RCP for that batch size nor RCPs for larger batch sizes, S4 needs to provide convergence points by running the reference with that batch size.
- Submitter S5 makes a submission for A with batch size 64 that meets the (stricter) convergence criteria for the RCP with the smallest batch size (128). In this case the submission is accepted.
- Submitter S6 makes a submission for A with batch size 64 that does not meet the convergence criteria for the RCP with the smallest batch size (128). In this case S6 needs to provide convergence points by running the reference with batch size = 64.
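The S1, S2, and S3 scenarios can be checked with the following sketch. It is illustrative only: the function names are hypothetical, the interpolation is assumed to be linear in batch size (which reproduces the quoted 18.25 and 0.547), the pass condition is assumed to be "greater than or equal to the minimum", and the speedup percentages are copied from the text rather than computed from a t-test.

```python
import statistics

REF_128 = [16, 14, 16, 17, 16, 16, 15, 16, 15, 16]
REF_256 = [20, 21, 21, 20, 22, 22, 21, 21, 20, 20]


def trimmed(values):
    """Drop the single highest and lowest value."""
    return sorted(values)[1:-1]


def interpolate(bs, bs_lo, val_lo, bs_hi, val_hi):
    """Linear interpolation in batch size (an assumption of this sketch)."""
    frac = (bs - bs_lo) / (bs_hi - bs_lo)
    return val_lo + frac * (val_hi - val_lo)


def passes_rcp(submission_epochs, reference_mean, max_speedup_pct):
    """True if the trimmed submission mean is not below the allowed minimum."""
    threshold = reference_mean / (1.0 + max_speedup_pct / 100.0)
    return statistics.fmean(trimmed(submission_epochs)) >= threshold


mean_128 = statistics.fmean(trimmed(REF_128))
mean_256 = statistics.fmean(trimmed(REF_256))
std_128 = statistics.pstdev(trimmed(REF_128))
std_256 = statistics.pstdev(trimmed(REF_256))

# S3: no RCP exists at batch size 192, so interpolate between 128 and 256.
mean_192 = interpolate(192, 128, mean_128, 256, mean_256)
std_192 = interpolate(192, 128, std_128, 256, std_256)

s1_ok = passes_rcp([15, 15, 15, 16, 16], mean_128, 3.53)
s2_ok = passes_rcp([19, 19, 19, 20, 21], mean_256, 4.12)
s3_ok = passes_rcp([17, 18, 18, 18, 20], mean_192, 3.68)

print(s1_ok, s2_ok, s3_ok)                    # prints: True False True
print(round(mean_192, 2), round(std_192, 3))  # prints: 18.25 0.547
```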
MLPerf recommends calculating utilization as model_tensor_flops / (peak_system_tensor_flops_per_second * runtime_seconds), where:
- model_tensor_flops means only the tensor (i.e., matrix multiply or convolution) operations that are required by the model definition. Vector or pointwise ops in the model, such as bias add, normalization, etc., are not counted as model_tensor_flops. Furthermore, implementations that use activation recomputation methods should not count any of the operations added by activation recomputation as model_tensor_flops.
- peak_system_tensor_flops_per_second means the peak tensor operations of the hardware, counting only tensor math throughput and not additional vector or pointwise math datapaths.
- runtime_seconds means the mean of the runtimes of the runs used to calculate the benchmark result.
Use of hardware_tensor_flops (defined as model_tensor_flops plus operations added due to activation recomputation) instead of model_tensor_flops is strongly discouraged because those are not useful flops for the model. If hardware_tensor_flops are used for calculating utilization, it is recommended to also provide an accompanying calculation with model_tensor_flops.
Note utilization is not an official MLPerf metric.
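As a minimal sketch of the recommended calculation, the function below applies the formula directly. The function name and all numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any real system or benchmark.

```python
def utilization(model_tensor_flops, peak_system_tensor_flops_per_second, runtimes_seconds):
    """Utilization as recommended above.

    runtimes_seconds holds the runtimes of the runs used to calculate the
    benchmark result; their mean is used as runtime_seconds.
    """
    runtime_seconds = sum(runtimes_seconds) / len(runtimes_seconds)
    return model_tensor_flops / (peak_system_tensor_flops_per_second * runtime_seconds)


# Hypothetical inputs: 3e18 tensor FLOPs required by the model definition,
# a system peak of 1e16 tensor FLOP/s, and three runs averaging 600 seconds.
example = utilization(
    model_tensor_flops=3.0e18,
    peak_system_tensor_flops_per_second=1.0e16,
    runtimes_seconds=[590.0, 600.0, 610.0],
)
print(f"{example:.2%}")  # prints: 50.00%
```

Passing hardware_tensor_flops instead of model_tensor_flops would give a higher figure for the same run, which is why the text recommends reporting the model_tensor_flops calculation alongside it.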