The following is a glossary of domain-specific terminology. Although benchmarks are a seemingly simple domain, they have a surprising amount of complexity. It is therefore useful to ensure that the vocabulary used to describe the domain is consistent and precise to avoid confusion.
- metric: a name of a quantifiable metric being measured (e.g., instruction count).
- artifact: a specific rustc binary labeled by some identifier tag (usually a commit sha or some sort of human readable id like "1.51.0" or "test").
- benchmark suite: an entire collection of benchmarks, either compile-time or runtime.
- benchmark: the source of a crate which will be used to benchmark rustc. For example, "hello world".
- profile: a compilation configuration.
  - `check` corresponds to running `cargo check`.
  - `debug` corresponds to running `cargo build`.
  - `opt` corresponds to running `cargo build --release`.
  - `doc` corresponds to running rustdoc.
- scenario: describes the incremental cache state and an optional change in the source since last compilation.
  - `full`: incremental compilation is not used.
  - `incr-full`: incremental compilation is used, with an empty incremental cache.
  - `incr-unchanged`: incremental compilation is used, with a full incremental cache and no code changes made.
  - `incr-patched`: incremental compilation is used, with a full incremental cache and some code changes made.
- backend: the codegen backend used for compiling Rust code.
  - `llvm`: the default codegen backend.
- category: a high-level group of benchmarks. Currently, there are three categories: primary (mostly real-world crates), secondary (mostly stress tests), and stable (old real-world crates, only used for the dashboard).
- artifact type: describes what kind of artifact the benchmark builds: either `library` or `binary`.
- stress test benchmark: a benchmark that is specifically designed to stress a certain part of the compiler. For example, projection-caching stresses the compiler's projection caching mechanisms. Corresponds to the `secondary` category.
- real world benchmark: a benchmark based on a real world crate. These are typically copied as-is from crates.io. For example, serde is a popular crate and the benchmark has not been altered from a release of serde on crates.io. Corresponds to the `primary` or `stable` categories.
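The profile and scenario definitions above can be summarised as small lookup tables. The following is only an illustrative sketch: the names `Profile`, `PROFILE_COMMAND`, `ScenarioInfo`, and `SCENARIOS` are hypothetical and are not taken from any real codebase; only the profile and scenario identifiers and their descriptions come from the glossary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Profile(Enum):
    """The four profiles named in the glossary."""
    CHECK = "check"
    DEBUG = "debug"
    OPT = "opt"
    DOC = "doc"


# What each profile corresponds to running, as stated in the glossary.
PROFILE_COMMAND = {
    Profile.CHECK: "cargo check",
    Profile.DEBUG: "cargo build",
    Profile.OPT: "cargo build --release",
    Profile.DOC: "rustdoc",
}


@dataclass(frozen=True)
class ScenarioInfo:
    """Hypothetical summary of a scenario (field names are assumptions)."""
    incremental: bool           # is incremental compilation used?
    cache_full: Optional[bool]  # None when no incremental cache is involved
    code_changed: bool          # were code changes made since the last compilation?


SCENARIOS = {
    "full": ScenarioInfo(incremental=False, cache_full=None, code_changed=False),
    "incr-full": ScenarioInfo(incremental=True, cache_full=False, code_changed=False),
    "incr-unchanged": ScenarioInfo(incremental=True, cache_full=True, code_changed=False),
    "incr-patched": ScenarioInfo(incremental=True, cache_full=True, code_changed=True),
}
```

Read this way, `incr-unchanged` and `incr-patched` differ only in whether code changes were made, while `full` is the only scenario that does not use incremental compilation.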
- benchmark (runtime): a function compiled by rustc; the compiled function is then benchmarked.
- benchmark group: a crate that contains a set of runtime benchmarks.
- test case: a combination of parameters that describe the measurement of a single (compile-time or runtime) benchmark, i.e., a single `test`.
  - For compile-time benchmarks, it is a combination of a benchmark, a profile, and a scenario.
  - For runtime benchmarks, it is currently only the benchmark name.
- test: the act of running an artifact under a test case. Each test is composed of many iterations.
- test iteration: a single iteration that makes up a test. Note: currently, 3 test iterations are normally run for each test.
- test result: the result of the collection of all statistics from running a test. Currently, the minimum value of a statistic from all the test iterations is used for analysis calculations and the website.
- statistic: a single measured value of a metric in a test result.
- statistic description: the combination of a metric and a test case which describes a statistic.
- statistic series: statistics for the same statistic description over time.
- run: a set of tests for all currently available test cases measured on a given artifact.
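To make the relationship between test cases, test iterations, and test results concrete, here is a minimal sketch. It assumes that each test iteration yields one value per metric; the names `BenchTestCase`, `compile_time_test_cases`, and `min_over_iterations` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BenchTestCase:
    """A test case: benchmark + profile + scenario for compile-time
    benchmarks, or only the benchmark name for runtime benchmarks."""
    benchmark: str
    profile: Optional[str] = None   # compile-time only
    scenario: Optional[str] = None  # compile-time only


def compile_time_test_cases(benchmarks, profiles, scenarios) -> List[BenchTestCase]:
    """Enumerate every (benchmark, profile, scenario) combination."""
    return [BenchTestCase(b, p, s) for b, p, s in product(benchmarks, profiles, scenarios)]


def min_over_iterations(iterations: List[Dict[str, float]]) -> Dict[str, float]:
    """Reduce the test iterations of one test to a test result by keeping,
    for each metric, the minimum value seen across all iterations."""
    metrics = iterations[0].keys()
    return {metric: min(it[metric] for it in iterations) for metric in metrics}
```

For example, with made-up values, three iterations measuring `{"instructions": 105.0}`, `{"instructions": 100.0}`, and `{"instructions": 103.0}` reduce to a test result of `{"instructions": 100.0}`. A run would then be the collection of such test results for all available test cases on one artifact.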
- artifact comparison: the comparison of two artifacts. This is composed of many test result comparisons. The comparison page shows a single artifact comparison between two artifacts.
- test result comparison: the delta between two test results for the same test case but different artifacts. The comparison page lists all the test result comparisons as percentages between two runs.
- significance threshold: the threshold at which a test result comparison is considered "significant" (i.e., a real change in performance and not just noise). You can see how this is calculated here.
- significant test result comparison: a test result comparison above the significance threshold. Significant test result comparisons can be thought of as being "statistically significant".
- relevant test result comparison: a test result comparison can be significant but still not be relevant (i.e., worth paying attention to). Relevance depends on the test result comparison's significance and magnitude. Comparisons are considered relevant if they are significant and have at least a small magnitude.
- test result comparison magnitude: how "large" the delta is between the two test results under comparison. This is determined by the average of two factors: the absolute size of the change (e.g., a change of 5% is larger than a change of 1%) and the amount above the significance threshold (e.g., a change that is 5x the significance threshold is larger than a change that is 1.5x the significance threshold).
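The comparison terms above can be sketched as follows. This is a rough illustration under stated assumptions, not the actual implementation: the percentage is assumed to be the relative change from the old value, the significance threshold is taken as an input because its calculation is not described here, and the cutoff values and label names used to bucket the two magnitude factors are hypothetical placeholders.

```python
from bisect import bisect_right

# Hypothetical magnitude labels, ordered from smallest to largest.
LABELS = ["very small", "small", "medium", "large", "very large"]

# HYPOTHETICAL cutoffs, chosen only for illustration.
ABS_CUTOFFS = [0.2, 1.0, 2.0, 5.0]      # absolute size of the change, in percent
RATIO_CUTOFFS = [1.5, 3.0, 5.0, 10.0]   # change as a multiple of the threshold


def percent_change(old: float, new: float) -> float:
    """Delta between two test results, as a percentage of the old value."""
    return (new - old) * 100.0 / old


def is_significant(change_pct: float, threshold_pct: float) -> bool:
    """A comparison is significant when it is above the threshold."""
    return abs(change_pct) > threshold_pct


def magnitude_rank(change_pct: float, threshold_pct: float) -> int:
    """Average of two factors: the absolute size of the change and the
    amount above the threshold. Assumes threshold_pct > 0."""
    size_rank = bisect_right(ABS_CUTOFFS, abs(change_pct))
    ratio_rank = bisect_right(RATIO_CUTOFFS, abs(change_pct) / threshold_pct)
    return (size_rank + ratio_rank) // 2


def is_relevant(change_pct: float, threshold_pct: float) -> bool:
    """Relevant = significant and at least a small magnitude."""
    return (is_significant(change_pct, threshold_pct)
            and magnitude_rank(change_pct, threshold_pct) >= LABELS.index("small"))
```

With these placeholder cutoffs, a 0.3% change against a 0.25% threshold is significant but not relevant, because both factors fall in the lowest buckets, whereas a 3% change against a 0.5% threshold is both significant and relevant.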
- bootstrap: the process of building the compiler from a previous version of the compiler.
- compiler query: a query used inside the compiler query system.