Modify the min query constraint for the Offline scenario #286

Merged: 3 commits, Oct 29, 2024
inference_rules.adoc (3 additions, 1 deletion)

@@ -139,10 +139,12 @@ described in the table below.
|Scenario |Query Generation |Duration |Samples/query |Latency Constraint |Tail Latency | Performance Metric
|Single stream |LoadGen sends next query as soon as SUT completes the previous query | 600 seconds |1 |None |90%* | 90%-ile early-stopping latency estimate
|Server |LoadGen sends new queries to the SUT according to a Poisson distribution | 600 seconds |1 |Benchmark specific |99%* | Maximum Poisson throughput parameter supported
-|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24,576 |None |N/A | Measured throughput
+|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24,576** |None |N/A | Measured throughput
|Multistream | LoadGen sends next query as soon as SUT completes the previous query | 600 seconds | 8 | None | 99%* | 99%-ile early-stopping latency estimate|
|===

+** - If the dataset used for the accuracy run of the benchmark task has fewer than 24,576 samples, say `N`, then the Offline scenario query only needs to contain at least `N` samples.
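
In other words, the minimum Offline query size is min(24,576, N). A minimal sketch of that rule, assuming a hypothetical helper name (`offline_min_samples` is not part of LoadGen or the rules text):

```python
# Illustrative helper, not part of the official rules or the LoadGen API.
# The single Offline query must contain at least the smaller of 24,576 and
# the size N of the accuracy dataset for the benchmark task.

OFFLINE_MIN_QUERY_SAMPLES = 24_576

def offline_min_samples(accuracy_dataset_size: int) -> int:
    """Return the minimum sample count for the Offline scenario query."""
    return min(OFFLINE_MIN_QUERY_SAMPLES, accuracy_dataset_size)

# A task whose accuracy set has only 5,000 samples needs a query of at least
# 5,000 samples; a larger dataset still requires at least 24,576.
assert offline_min_samples(5_000) == 5_000
assert offline_min_samples(1_000_000) == 24_576
```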

An early stopping criterion (described in more detail in <<appendix-early_stopping>>) allows for runs with a relatively small number of processed queries to be valid, with the penalty that the effective computed percentile will be slightly higher. This penalty counteracts the increased variance inherent to runs with few queries, where there is a higher probability that a particular run will, by chance, report a lower latency than the system should reliably support.
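
For intuition only, this effect can be illustrated with a simple binomial confidence bound: the reported latency is an order statistic chosen so that, with a chosen confidence, it does not fall below the true target percentile, and the index of that order statistic divided by the number of processed queries is the effective percentile. The sketch below is not the normative procedure from <<appendix-early_stopping>>; the confidence parameter `delta` is an assumed illustrative value.

```python
# Illustrative only; the normative algorithm is defined in the early-stopping
# appendix of the rules.
from scipy.stats import binom

def effective_percentile(n: int, target: float = 0.90, delta: float = 0.05) -> float:
    """Smallest k/n such that P(Binomial(n, target) >= k) <= delta.

    The k-th smallest of n observed latencies then upper-bounds the true
    target-percentile latency with confidence at least 1 - delta.
    """
    for k in range(1, n + 1):
        if binom.sf(k - 1, n, target) <= delta:
            return k / n
    raise ValueError("n is too small for the requested confidence")

# The effective percentile approaches the target as more queries are processed,
# i.e. the early-stopping penalty shrinks for longer runs.
for n in (100, 1_000, 100_000):
    print(n, round(effective_percentile(n), 4))
```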

In the above table, tail latency percentiles with an asterisk represent the theoretical lower limit of measured percentile for runs processing a very large number of queries. Submitters may opt to run for longer than the time listed in the "Duration" column, in order to decrease the effect of the early stopping penalty. See the following table for a suggested starting point for how to set the minimum number of queries.