Adding throughput and latency modes to raft-ann-bench #1920

Merged

Conversation

cjnolet
Member

@cjnolet cjnolet commented Oct 20, 2023

Separating the way the benchmarks are measured into throughput and latency modes.

  • latency mode accumulates the time taken to process each batch, then estimates QPS and reports the average time spent processing on the GPU. For a batch size of 1, this is a fairly accurate estimate of average latency per query. For larger batches, it is a fairly accurate estimate of time spent per batch.

  • throughput mode pipelines the individual batches using a thread pool (and stream pool for the GPU algos). For both smaller and larger batches, this gives a good estimate of the amount of data we can push through the hardware in a period of time.

A good comprehensive comparison will include both of these numbers.
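
To make the distinction between the two modes concrete, here is a minimal sketch in Python. It is not the raft-ann-bench implementation: `search_batch` is a hypothetical stand-in that sleeps instead of searching, and the function names, batch count, and thread count are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def search_batch(batch):
    """Hypothetical stand-in for an ANN search call; sleeps to simulate work."""
    time.sleep(0.002)
    return len(batch)


def run_latency_mode(batches):
    """Process batches one at a time and accumulate the per-batch times."""
    total_time = 0.0
    n_queries = 0
    for batch in batches:
        start = time.perf_counter()
        n_queries += search_batch(batch)
        total_time += time.perf_counter() - start
    return {
        "avg_batch_time_s": total_time / len(batches),
        "qps": n_queries / total_time,
    }


def run_throughput_mode(batches, n_threads):
    """Pipeline batches through a thread pool and time the run end to end."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        n_queries = sum(pool.map(search_batch, batches))
    elapsed = time.perf_counter() - start
    return {"elapsed_s": elapsed, "qps": n_queries / elapsed}


# 32 batches of 100 queries each (sizes are illustrative assumptions).
batches = [list(range(100)) for _ in range(32)]
latency = run_latency_mode(batches)
throughput = run_throughput_mode(batches, n_threads=8)
print(f"latency mode:    {latency['avg_batch_time_s'] * 1e3:.3f} ms per batch")
print(f"throughput mode: {throughput['qps']:.0f} queries/s")
```

In this sketch, latency mode answers "how long does one batch take?", while throughput mode answers "how much work completes per unit of wall-clock time when batches overlap?", which is why the two numbers are reported side by side.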

@cjnolet cjnolet added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Oct 20, 2023
@cjnolet cjnolet self-assigned this Oct 20, 2023
@cjnolet cjnolet requested review from a team as code owners October 20, 2023 22:27
@cjnolet cjnolet requested a review from a team as a code owner October 21, 2023 01:36
@github-actions github-actions bot added the CMake label Oct 21, 2023
@cjnolet
Member Author

cjnolet commented Oct 21, 2023

I'm finding that googlebench appears to reserve the right to iterate state as it chooses, which makes it really hard to run the iteration all the way through to the end and rely on the end-to-end time. I could be wrong about this, but that's the behavior I'm seeing so far. Need to dig in a bit further.

So far I've tried stream per thread, the stream pool, various different numbers of threads, syncing after each search, syncing only at the end. The behavior I'm seeing just doesn't match my expectations and while the GPU seems to be getting higher utilization than without the thread pool, it's still not getting high enough to match the cost of the threading. Something else is going on here.

@cjnolet
Member Author

cjnolet commented Oct 23, 2023

So far, the results seem too good to be real, for both HNSW and RAFT. I'm still investigating to make sure we're computing the timings correctly. From what I've explored so far, it seems we are computing them properly and I do notice the GPU utilization stays near 100% during queries. Here are some initial results at batch size 100.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations        GPU     Recall items_per_second      itopk          k  n_queries total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_cagra.dim32/0/manual_time/threads:1       0.167 ms        0.188 ms         4193     182.7u    0.92665       598.517k/s         32         10        100        419.3k
raft_cagra.dim32/0/manual_time/threads:2       0.099 ms        0.206 ms         6218   401.106u    0.92665       1.00952M/s         32         20        200        621.8k
raft_cagra.dim32/0/manual_time/threads:4       0.094 ms        0.339 ms         6208   1.51917m    0.92665       1.06342M/s         32         20        200        620.8k
raft_cagra.dim32/0/manual_time/threads:8       0.085 ms        0.581 ms         8856   5.46048m    0.92665       1.18062M/s         32         20        200        885.6k
raft_cagra.dim32/0/manual_time/threads:16      0.085 ms         1.39 ms        10192  0.0242496    0.92665       1.17793M/s         32         20        200       1.0192M
raft_cagra.dim32/0/manual_time/threads:32      0.088 ms         3.16 ms         9248  0.0978729    0.92665       1.13207M/s         32         20        200        924.8k
raft_cagra.dim32/1/manual_time/threads:1       0.216 ms        0.295 ms         3253   288.051u    0.97804        462.07k/s         64         10        100        325.3k
raft_cagra.dim32/1/manual_time/threads:2       0.200 ms        0.406 ms         2904   798.239u    0.97806       500.294k/s         64         20        200        290.4k
raft_cagra.dim32/1/manual_time/threads:4       0.182 ms        0.651 ms         4276   2.91943m    0.97814       550.059k/s         64         20        200        427.6k
raft_cagra.dim32/1/manual_time/threads:8       0.179 ms         1.22 ms         5104  0.0114741    0.97792       560.146k/s         64         20        200        510.4k
raft_cagra.dim32/1/manual_time/threads:16      0.137 ms         2.44 ms         5792  0.0391384    0.97802       731.858k/s         64         20        200        579.2k
raft_cagra.dim32/1/manual_time/threads:32      0.119 ms         4.94 ms         6656   0.129464      0.978       840.364k/s         64         20        200        665.6k
raft_cagra.dim32/2/manual_time/threads:1       0.333 ms        0.341 ms         2104   335.359u    0.99482       300.496k/s        128         10        100        210.4k
raft_cagra.dim32/2/manual_time/threads:2       0.306 ms        0.614 ms         2274   1.23051m    0.99467       326.557k/s        128         20        200        227.4k
raft_cagra.dim32/2/manual_time/threads:4       0.285 ms         1.01 ms         2476   4.56206m    0.99473       351.309k/s        128         20        200        247.6k
raft_cagra.dim32/2/manual_time/threads:8       0.258 ms         1.76 ms         2824  0.0165841    0.99479       386.917k/s        128         20        200        282.4k
raft_cagra.dim32/2/manual_time/threads:16      0.218 ms         3.97 ms         3296  0.0641081    0.99469       458.497k/s        128         20        200        329.6k
raft_cagra.dim32/2/manual_time/threads:32      0.215 ms         9.64 ms         3840   0.248017    0.99469       465.242k/s        128         20        200          384k
raft_cagra.dim64/0/manual_time/threads:1       0.234 ms        0.241 ms         2988   235.637u    0.97793        427.89k/s         32         10        100        298.8k
raft_cagra.dim64/0/manual_time/threads:2       0.194 ms        0.407 ms         3746   815.833u    0.97793       515.125k/s         32         20        200        374.6k
raft_cagra.dim64/0/manual_time/threads:4       0.177 ms        0.645 ms         4176   2.85429m    0.97793       563.628k/s         32         20        200        417.6k
raft_cagra.dim64/0/manual_time/threads:8       0.164 ms         1.12 ms         4832   0.010387    0.97793       608.476k/s         32         20        200        483.2k
raft_cagra.dim64/0/manual_time/threads:16      0.159 ms         2.78 ms         5088  0.0458893    0.97793       629.203k/s         32         20        200        508.8k
raft_cagra.dim64/0/manual_time/threads:32      0.118 ms         5.01 ms         6560   0.129813    0.97793       850.063k/s         32         20        200          656k
raft_cagra.dim64/1/manual_time/threads:1       0.331 ms        0.396 ms         2113   390.565u    0.99491       301.993k/s         64         10        100        211.3k
raft_cagra.dim64/1/manual_time/threads:2       0.317 ms        0.681 ms         2374   1.36584m    0.99494        315.45k/s         64         20        200        237.4k
raft_cagra.dim64/1/manual_time/threads:4       0.329 ms         1.16 ms         2500   5.27079m    0.99504       304.066k/s         64         20        200          250k
raft_cagra.dim64/1/manual_time/threads:8       0.246 ms         1.69 ms         2928  0.0158084    0.99489       406.391k/s         64         20        200        292.8k
raft_cagra.dim64/1/manual_time/threads:16      0.210 ms         3.81 ms         2848   0.062437    0.99498       475.464k/s         64         20        200        284.8k
raft_cagra.dim64/1/manual_time/threads:32      0.197 ms         8.79 ms         3904   0.226191    0.99492       508.751k/s         64         20        200        390.4k
raft_cagra.dim64/2/manual_time/threads:1       0.570 ms        0.579 ms         1228   573.078u    0.99861        175.33k/s        128         10        100        122.8k
raft_cagra.dim64/2/manual_time/threads:2       0.633 ms         1.27 ms         1282   2.53863m    0.99874       157.985k/s        128         20        200        128.2k
raft_cagra.dim64/2/manual_time/threads:4       0.641 ms         2.26 ms          868  0.0102623    0.99868        156.12k/s        128         20        200         86.8k
raft_cagra.dim64/2/manual_time/threads:8       0.653 ms         4.42 ms          800  0.0418484     0.9987       153.093k/s        128         20        200           80k
raft_cagra.dim64/2/manual_time/threads:16      0.444 ms         8.15 ms         1696   0.126361    0.99863       225.432k/s        128         20        200        169.6k
raft_cagra.dim64/2/manual_time/threads:32      0.369 ms         17.1 ms         2112   0.430651   0.998788       270.922k/s        128         20        200        211.2k
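
As a quick sanity check on the timings, the `items_per_second` column in the rows above appears consistent with dividing the batch size (100) by the reported `Time`, and `total_queries` with `Iterations` × 100. The sketch below recomputes both for two rows copied from the table. The 1% tolerance and the helper name are assumptions chosen for illustration, and `Time` is rounded in the output, so only approximate agreement is expected.

```python
BATCH_SIZE = 100   # "initial results at batch size 100"
TOLERANCE = 0.01   # assumed tolerance; Time is printed with limited precision


def relative_error(estimate, reported):
    """Relative difference between a recomputed value and the reported one."""
    return abs(estimate - reported) / reported


# (Time in ms, Iterations, reported items_per_second, reported total_queries)
# Copied from the raft_cagra.dim32/0 rows for threads:1 and threads:2.
rows = [
    (0.167, 4193, 598.517e3, 419.3e3),
    (0.099, 6218, 1.00952e6, 621.8e3),
]

results = []
for time_ms, iterations, reported_qps, reported_total in rows:
    est_qps = BATCH_SIZE / (time_ms * 1e-3)   # queries per second
    est_total = iterations * BATCH_SIZE       # queries across all iterations
    qps_err = relative_error(est_qps, reported_qps)
    total_err = relative_error(est_total, reported_total)
    results.append((qps_err, total_err))
    print(f"qps error {qps_err:.4%}, total_queries error {total_err:.4%}")
```

The threads:2 row matches with a batch of 100 per iteration even though its `n_queries` counter reads 200, which suggests that counter is aggregated across threads. That reading is an inference from the numbers, not something confirmed in this thread.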



---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations        GPU     Recall         ef items_per_second          k  n_queries total_queries
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
hnswlib.M12/0/manual_time/threads:1        4.98 ms         5.05 ms          136    5.0398m    0.65857         10       20.0986k/s         10        100         13.6k
hnswlib.M12/0/manual_time/threads:2        2.59 ms         5.91 ms          258   0.011853    0.65857         10       38.6522k/s         20        200         25.8k
hnswlib.M12/0/manual_time/threads:4        1.33 ms         6.03 ms          500   0.024254    0.65857         10       74.9113k/s         20        200           50k
hnswlib.M12/0/manual_time/threads:8       0.727 ms         5.79 ms          944  0.0470166    0.65857         10        137.57k/s         20        200         94.4k
hnswlib.M12/0/manual_time/threads:16      0.417 ms         7.53 ms         1664   0.122489    0.65857         10       239.781k/s         20        200        166.4k
hnswlib.M12/0/manual_time/threads:32      0.301 ms         9.31 ms         2304   0.309402   0.653458         10       331.823k/s         20        200        230.4k
hnswlib.M12/1/manual_time/threads:1        7.77 ms         8.79 ms           88   8.78484m   0.795375         20       12.8625k/s         10        100          8.8k
hnswlib.M12/1/manual_time/threads:2        3.91 ms         8.10 ms          174  0.0161867   0.794943         20       25.5521k/s         20        200         17.4k
hnswlib.M12/1/manual_time/threads:4        2.10 ms         8.53 ms          336  0.0343662   0.794929         20       47.5493k/s         20        200         33.6k
hnswlib.M12/1/manual_time/threads:8        1.09 ms         11.1 ms          640  0.0894582   0.793362         20       92.1564k/s         20        200           64k
hnswlib.M12/1/manual_time/threads:16      0.595 ms         10.1 ms         1168   0.163849   0.791493         20       167.987k/s         20        200        116.8k
hnswlib.M12/1/manual_time/threads:32      0.439 ms         15.6 ms         1600   0.514879    0.78264         20       227.656k/s         20        200          160k
hnswlib.M12/2/manual_time/threads:1        12.6 ms         12.6 ms           55  0.0125655   0.893291         40       7.96743k/s         10        100          5.5k
hnswlib.M12/2/manual_time/threads:2        6.52 ms         14.1 ms          108  0.0282381   0.893556         40       15.3282k/s         20        200         10.8k
hnswlib.M12/2/manual_time/threads:4        3.41 ms         13.6 ms          200  0.0546417    0.89048         40       29.3324k/s         20        200           20k
hnswlib.M12/2/manual_time/threads:8        1.78 ms         15.1 ms          400   0.121527    0.89048         40       56.3352k/s         20        200           40k
hnswlib.M12/2/manual_time/threads:16      0.945 ms         17.4 ms          736   0.280031   0.891065         40       105.866k/s         20        200         73.6k
hnswlib.M12/2/manual_time/threads:32      0.688 ms         21.3 ms         1024   0.705419   0.884563         40       145.437k/s         20        200        102.4k
hnswlib.M12/3/manual_time/threads:1        17.8 ms         21.6 ms           39  0.0215467    0.93459         60       5.62176k/s         10        100          3.9k
hnswlib.M12/3/manual_time/threads:2        8.84 ms         17.7 ms           80  0.0354302    0.93485         60       11.3077k/s         20        200            8k
hnswlib.M12/3/manual_time/threads:4        4.65 ms         19.0 ms          148  0.0761206   0.933622         60       21.4946k/s         20        200         14.8k
hnswlib.M12/3/manual_time/threads:8        2.43 ms         21.2 ms          288   0.170511   0.932778         60       41.1567k/s         20        200         28.8k
hnswlib.M12/3/manual_time/threads:16       1.28 ms         20.4 ms          544   0.327821   0.932412         60       78.1867k/s         20        200         54.4k
hnswlib.M12/3/manual_time/threads:32      0.938 ms         28.9 ms          768   0.962142   0.930042         60       106.558k/s         20        200         76.8k
hnswlib.M12/4/manual_time/threads:1        22.9 ms         22.9 ms           31  0.0228834   0.954742         80       4.37475k/s         10        100          3.1k
hnswlib.M12/4/manual_time/threads:2        11.3 ms         22.6 ms           64  0.0452897     0.9545         80       8.84423k/s         20        200          6.4k
hnswlib.M12/4/manual_time/threads:4        5.90 ms         26.4 ms          120   0.106017   0.954567         80       16.9384k/s         20        200           12k
hnswlib.M12/4/manual_time/threads:8        3.06 ms         26.0 ms          232    0.20948   0.954241         80       32.7187k/s         20        200         23.2k
hnswlib.M12/4/manual_time/threads:16       1.62 ms         25.8 ms          432   0.414281   0.953741         80       61.8592k/s         20        200         43.2k
hnswlib.M12/4/manual_time/threads:32       1.19 ms         37.1 ms          608    1.21724   0.952211         80       84.1989k/s         20        200         60.8k
hnswlib.M12/5/manual_time/threads:1        31.2 ms         31.2 ms           23  0.0312053   0.975261        120       3.20625k/s         10        100          2.3k
hnswlib.M12/5/manual_time/threads:2        15.7 ms         31.4 ms           44  0.0627129   0.975136        120       6.38563k/s         20        200          4.4k
hnswlib.M12/5/manual_time/threads:4        8.28 ms         38.2 ms           88   0.153079   0.975136        120       12.0736k/s         20        200          8.8k
hnswlib.M12/5/manual_time/threads:8        4.27 ms         36.3 ms          168   0.291057    0.97481        120       23.4284k/s         20        200         16.8k
hnswlib.M12/5/manual_time/threads:16       2.26 ms         40.7 ms          320   0.653214    0.97485        120       44.1926k/s         20        200           32k
hnswlib.M12/5/manual_time/threads:32       1.65 ms         56.5 ms          448    1.86197   0.976429        120       60.4288k/s         20        200         44.8k
hnswlib.M12/6/manual_time/threads:1        48.1 ms         48.2 ms           15  0.0481652   0.991267        200       2.07765k/s         10        100          1.5k
hnswlib.M12/6/manual_time/threads:2        25.0 ms         50.1 ms           28   0.100225   0.990929        200       3.99462k/s         20        200          2.8k
hnswlib.M12/6/manual_time/threads:4        12.6 ms         50.4 ms           40   0.202055     0.9903        200       7.92324k/s         20        200            4k
hnswlib.M12/6/manual_time/threads:8        6.61 ms         52.8 ms          112    0.42358   0.990929        200       15.1182k/s         20        200         11.2k
hnswlib.M12/6/manual_time/threads:16       3.50 ms         60.4 ms          208   0.969122   0.990538        200       28.6005k/s         20        200         20.8k
hnswlib.M12/6/manual_time/threads:32       2.52 ms         78.8 ms          288    2.58284   0.990889        200       39.6704k/s         20        200         28.8k
hnswlib.M12/7/manual_time/threads:1        89.1 ms         98.5 ms            8  0.0984822   0.997375        400       1.12248k/s         10        100           800
hnswlib.M12/7/manual_time/threads:2        45.1 ms         90.2 ms           16   0.180337   0.997375        400       2.21914k/s         20        200          1.6k
hnswlib.M12/7/manual_time/threads:4        23.6 ms         94.3 ms           32   0.377762   0.997375        400       4.23773k/s         20        200          3.2k
hnswlib.M12/7/manual_time/threads:8        12.0 ms         96.2 ms           56   0.770891   0.997429        400        8.3061k/s         20        200          5.6k
hnswlib.M12/7/manual_time/threads:16       6.40 ms          102 ms          112    1.64027   0.997429        400       15.6133k/s         20        200         11.2k
hnswlib.M12/7/manual_time/threads:32       4.81 ms          151 ms          160    4.92579     0.9976        400       20.7971k/s         20        200           16k
hnswlib.M12/8/manual_time/threads:1         131 ms          149 ms            6   0.148537      0.998        600         765.97/s         10        100           600
hnswlib.M12/8/manual_time/threads:2        64.6 ms          140 ms           12   0.280549      0.998        600       1.54799k/s         20        200          1.2k
hnswlib.M12/8/manual_time/threads:4        34.2 ms          137 ms           24   0.547471      0.998        600        2.9239k/s         20        200          2.4k
hnswlib.M12/8/manual_time/threads:8        17.7 ms          141 ms           40    1.13302     0.9982        600       5.65117k/s         20        200            4k
hnswlib.M12/8/manual_time/threads:16       9.35 ms          168 ms           80    2.69565     0.9982        600       10.6998k/s         20        200            8k
hnswlib.M12/8/manual_time/threads:32       6.68 ms          209 ms           96    6.84756      0.998        600       14.9612k/s         20        200          9.6k
hnswlib.M12/9/manual_time/threads:1         170 ms          187 ms            4   0.186576     0.9995        800        588.695/s         10        100           400
hnswlib.M12/9/manual_time/threads:2        86.2 ms          173 ms            8   0.345035     0.9995        800       1.15977k/s         20        200           800
hnswlib.M12/9/manual_time/threads:4        44.4 ms          178 ms           16   0.710705     0.9995        800       2.25218k/s         20        200          1.6k
hnswlib.M12/9/manual_time/threads:8        22.9 ms          191 ms           32    1.52913     0.9995        800       4.37226k/s         20        200          3.2k
hnswlib.M12/9/manual_time/threads:16       12.1 ms          193 ms           64    3.08743     0.9995        800       8.29498k/s         20        200          6.4k
hnswlib.M12/9/manual_time/threads:32       8.60 ms          270 ms           96    8.81168   0.999333        800       11.6253k/s         20        200          9.6k
hnswlib.M16/0/manual_time/threads:1        5.97 ms         6.11 ms          118   6.10208m    0.71387         10       16.7618k/s         10        100         11.8k
hnswlib.M16/0/manual_time/threads:2        3.03 ms         6.07 ms          222  0.0121589    0.71387         10       32.9574k/s         20        200         22.2k
hnswlib.M16/0/manual_time/threads:4        1.63 ms         6.72 ms          436  0.0271239    0.71387         10       61.4848k/s         20        200         43.6k
hnswlib.M16/0/manual_time/threads:8       0.867 ms         7.36 ms          800  0.0597051    0.71387         10       115.283k/s         20        200           80k
hnswlib.M16/0/manual_time/threads:16      0.473 ms         8.52 ms         1488   0.138509   0.712946         10       211.348k/s         20        200        148.8k
hnswlib.M16/0/manual_time/threads:32      0.343 ms         10.6 ms         2016   0.352289   0.707111         10       291.423k/s         20        200        201.6k
hnswlib.M16/1/manual_time/threads:1        9.38 ms         10.6 ms           74   0.010586   0.835473         20       10.6575k/s         10        100          7.4k
hnswlib.M16/1/manual_time/threads:2        4.69 ms         9.64 ms          150  0.0192586   0.836133         20       21.3261k/s         20        200           15k
hnswlib.M16/1/manual_time/threads:4        2.49 ms         9.90 ms          284   0.039904   0.838254         20       40.1732k/s         20        200         28.4k
hnswlib.M16/1/manual_time/threads:8        1.29 ms         10.3 ms          552  0.0828572   0.839058         20       77.5009k/s         20        200         55.2k
hnswlib.M16/1/manual_time/threads:16      0.699 ms         12.6 ms         1008   0.203783   0.834587         20       143.076k/s         20        200        100.8k
hnswlib.M16/1/manual_time/threads:32      0.514 ms         22.4 ms         1376   0.751689       0.83         20       194.418k/s         20        200        137.6k
hnswlib.M16/2/manual_time/threads:1        16.2 ms         20.9 ms           43  0.0208731   0.923698         40       6.17857k/s         10        100          4.3k
hnswlib.M16/2/manual_time/threads:2        7.99 ms         20.7 ms           86  0.0414618   0.923698         40       12.5119k/s         20        200          8.6k
hnswlib.M16/2/manual_time/threads:4        4.15 ms         16.5 ms          172  0.0664555   0.923698         40       24.1102k/s         20        200         17.2k
hnswlib.M16/2/manual_time/threads:8        2.11 ms         16.8 ms          328   0.135516   0.923634         40       47.2953k/s         20        200         32.8k
hnswlib.M16/2/manual_time/threads:16       1.12 ms         20.1 ms          624   0.323418   0.922308         40       88.9668k/s         20        200         62.4k
hnswlib.M16/2/manual_time/threads:32      0.845 ms         26.3 ms          864   0.866235   0.917259         40       118.348k/s         20        200         86.4k
hnswlib.M16/3/manual_time/threads:1        20.9 ms         20.9 ms           34  0.0208671   0.955735         60       4.79614k/s         10        100          3.4k
hnswlib.M16/3/manual_time/threads:2        10.6 ms         24.3 ms           66  0.0486461   0.955636         60       9.42025k/s         20        200          6.6k
hnswlib.M16/3/manual_time/threads:4        5.66 ms         22.6 ms          128  0.0907344   0.955156         60       17.6572k/s         20        200         12.8k
hnswlib.M16/3/manual_time/threads:8        2.94 ms         23.4 ms          248   0.188468   0.955194         60       34.0073k/s         20        200         24.8k
hnswlib.M16/3/manual_time/threads:16       1.54 ms         24.5 ms          464   0.393798   0.954655         60       65.1042k/s         20        200         46.4k
hnswlib.M16/3/manual_time/threads:32       1.15 ms         35.8 ms          640    1.17539     0.9526         60       87.2199k/s         20        200           64k
hnswlib.M16/4/manual_time/threads:1        26.8 ms         26.8 ms           26  0.0267923   0.972077         80        3.7356k/s         10        100          2.6k
hnswlib.M16/4/manual_time/threads:2        13.8 ms         30.3 ms           52  0.0606796   0.972077         80       7.25318k/s         20        200          5.2k
hnswlib.M16/4/manual_time/threads:4        7.13 ms         28.5 ms          100    0.11419    0.97164         80        14.026k/s         20        200           10k
hnswlib.M16/4/manual_time/threads:8        3.68 ms         31.8 ms          192   0.255124      0.972         80       27.1454k/s         20        200         19.2k
hnswlib.M16/4/manual_time/threads:16       1.94 ms         31.0 ms          368    0.49777    0.97187         80       51.4939k/s         20        200         36.8k
hnswlib.M16/4/manual_time/threads:32       1.45 ms         45.2 ms          512    1.48193   0.974625         80        69.178k/s         20        200         51.2k
hnswlib.M16/5/manual_time/threads:1        38.6 ms         40.8 ms           18  0.0408008   0.986222        120       2.59315k/s         10        100          1.8k
hnswlib.M16/5/manual_time/threads:2        19.0 ms         38.0 ms           36  0.0760592   0.986222        120       5.26508k/s         20        200          3.6k
hnswlib.M16/5/manual_time/threads:4        10.0 ms         40.1 ms           72   0.160572   0.986222        120       9.97427k/s         20        200          7.2k
hnswlib.M16/5/manual_time/threads:8        5.23 ms         41.8 ms          136   0.335277   0.987235        120       19.1034k/s         20        200         13.6k
hnswlib.M16/5/manual_time/threads:16       2.71 ms         49.0 ms          256   0.786114   0.988437        120       36.9446k/s         20        200         25.6k
hnswlib.M16/5/manual_time/threads:32       2.01 ms         62.9 ms          384    2.06379   0.988333        120       49.6538k/s         20        200         38.4k
hnswlib.M16/6/manual_time/threads:1        59.9 ms         60.0 ms           12  0.0599787   0.995417        200       1.66855k/s         10        100          1.2k
hnswlib.M16/6/manual_time/threads:2        29.1 ms         58.4 ms           24   0.116704   0.995417        200       3.43062k/s         20        200          2.4k
hnswlib.M16/6/manual_time/threads:4        15.8 ms         63.1 ms           48   0.253098   0.995417        200       6.32509k/s         20        200          4.8k
hnswlib.M16/6/manual_time/threads:8        8.00 ms         63.9 ms           96   0.512451   0.995417        200       12.4965k/s         20        200          9.6k
hnswlib.M16/6/manual_time/threads:16       4.18 ms         66.8 ms          176    1.07186   0.995455        200       23.8985k/s         20        200         17.6k
hnswlib.M16/6/manual_time/threads:32       3.10 ms         96.3 ms          224    3.17522   0.995857        200       32.2705k/s         20        200         22.4k
hnswlib.M16/7/manual_time/threads:1         108 ms          108 ms            7   0.108325   0.998286        400        923.704/s         10        100           700
hnswlib.M16/7/manual_time/threads:2        53.6 ms          121 ms           14   0.241055   0.998286        400       1.86432k/s         20        200          1.4k
hnswlib.M16/7/manual_time/threads:4        28.7 ms          115 ms           24   0.459723   0.998167        400       3.48243k/s         20        200          2.4k
hnswlib.M16/7/manual_time/threads:8        14.8 ms          118 ms           48    0.94718   0.998167        400        6.7607k/s         20        200          4.8k
hnswlib.M16/7/manual_time/threads:16       7.78 ms          139 ms           96    2.23137   0.998167        400       12.8607k/s         20        200          9.6k
hnswlib.M16/7/manual_time/threads:32       5.88 ms          185 ms          128    6.03873    0.99825        400        17.008k/s         20        200         12.8k
hnswlib.M16/8/manual_time/threads:1         154 ms          154 ms            5   0.154021     0.9992        600        649.631/s         10        100           500
hnswlib.M16/8/manual_time/threads:2        78.2 ms          157 ms           10   0.313056     0.9992        600       1.27837k/s         20        200            1k
hnswlib.M16/8/manual_time/threads:4        40.4 ms          161 ms           20   0.647298     0.9992        600       2.47273k/s         20        200            2k
hnswlib.M16/8/manual_time/threads:8        21.1 ms          168 ms           32    1.34948      0.999        600       4.74515k/s         20        200          3.2k
hnswlib.M16/8/manual_time/threads:16       11.2 ms          178 ms           64    2.85717      0.999        600       8.96466k/s         20        200          6.4k
hnswlib.M16/8/manual_time/threads:32       7.99 ms          285 ms           96    9.29697   0.998667        600       12.5125k/s         20        200          9.6k
hnswlib.M16/9/manual_time/threads:1         197 ms          197 ms            4   0.196989      0.999        800        507.813/s         10        100           400
hnswlib.M16/9/manual_time/threads:2        98.9 ms          198 ms            8   0.395882      0.999        800       1.01098k/s         20        200           800
hnswlib.M16/9/manual_time/threads:4        52.7 ms          211 ms           16   0.843438      0.999        800       1.89766k/s         20        200          1.6k
hnswlib.M16/9/manual_time/threads:8        26.4 ms          211 ms           24    1.68874   0.998667        800       3.79114k/s         20        200          2.4k
hnswlib.M16/9/manual_time/threads:16       14.0 ms          224 ms           48    3.59585   0.998667        800       7.12169k/s         20        200          4.8k
hnswlib.M16/9/manual_time/threads:32       9.94 ms          312 ms           64     10.181      0.998        800       10.0612k/s         20        200          6.4k
hnswlib.M24/0/manual_time/threads:1        7.22 ms         7.23 ms           95   7.22629m   0.762663         10       13.8555k/s         10        100          9.5k
hnswlib.M24/0/manual_time/threads:2        3.77 ms         8.48 ms          180  0.0170233   0.764833         10       26.5571k/s         20        200           18k
hnswlib.M24/0/manual_time/threads:4        2.00 ms         9.93 ms          344  0.0399798   0.763453         10       49.9122k/s         20        200         34.4k
hnswlib.M24/0/manual_time/threads:8        1.06 ms         9.70 ms          656  0.0784288   0.763085         10       94.5169k/s         20        200         65.6k
hnswlib.M24/0/manual_time/threads:16      0.578 ms         11.5 ms         1200   0.186486   0.760333         10       172.983k/s         20        200          120k
hnswlib.M24/0/manual_time/threads:32      0.424 ms         13.1 ms         1664   0.435502   0.754192         10       235.693k/s         20        200        166.4k
hnswlib.M24/1/manual_time/threads:1        11.9 ms         11.9 ms           58   0.011922   0.876121         20       8.39692k/s         10        100          5.8k
hnswlib.M24/1/manual_time/threads:2        6.03 ms         15.0 ms          116  0.0301655   0.876121         20       16.5965k/s         20        200         11.6k
hnswlib.M24/1/manual_time/threads:4        3.13 ms         14.1 ms          228   0.056683   0.876035         20       31.9425k/s         20        200         22.8k
hnswlib.M24/1/manual_time/threads:8        1.62 ms         14.0 ms          432   0.112565   0.876593         20       61.7565k/s         20        200         43.2k
hnswlib.M24/1/manual_time/threads:16      0.867 ms         14.0 ms          816   0.226048   0.874059         20       115.342k/s         20        200         81.6k
hnswlib.M24/1/manual_time/threads:32      0.649 ms         20.3 ms         1088   0.672552   0.866912         20       154.125k/s         20        200        108.8k
hnswlib.M24/2/manual_time/threads:1        20.1 ms         25.2 ms           35  0.0251623   0.947057         40       4.98152k/s         10        100          3.5k
hnswlib.M24/2/manual_time/threads:2        9.87 ms         20.4 ms           70  0.0406969   0.947057         40       10.1319k/s         20        200            7k
hnswlib.M24/2/manual_time/threads:4        5.28 ms         21.0 ms          132   0.084533   0.946576         40       18.9559k/s         20        200         13.2k
hnswlib.M24/2/manual_time/threads:8        2.73 ms         26.9 ms          256   0.216443   0.945844         40         36.67k/s         20        200         25.6k
hnswlib.M24/2/manual_time/threads:16       1.42 ms         22.6 ms          496   0.365003   0.945903         40       70.2463k/s         20        200         49.6k
hnswlib.M24/2/manual_time/threads:32       1.07 ms         40.8 ms          672    1.34072   0.944476         40       93.2149k/s         20        200         67.2k
hnswlib.M24/3/manual_time/threads:1        27.7 ms         27.7 ms           26   0.027685   0.973577         60       3.61551k/s         10        100          2.6k
hnswlib.M24/3/manual_time/threads:2        13.9 ms         27.8 ms           52  0.0555316   0.973577         60       7.21022k/s         20        200          5.2k
hnswlib.M24/3/manual_time/threads:4        7.17 ms         28.6 ms           96   0.114805   0.973208         60       13.9509k/s         20        200          9.6k
hnswlib.M24/3/manual_time/threads:8        3.78 ms         36.9 ms          192   0.295891   0.973208         60       26.4736k/s         20        200         19.2k
hnswlib.M24/3/manual_time/threads:16       1.97 ms         33.8 ms          352   0.544039   0.973455         60        50.731k/s         20        200         35.2k
hnswlib.M24/3/manual_time/threads:32       1.44 ms         53.2 ms          480    1.74798   0.975867         60       69.3669k/s         20        200           48k
hnswlib.M24/4/manual_time/threads:1        35.6 ms         35.6 ms           19  0.0355842      0.985         80       2.81245k/s         10        100          1.9k
hnswlib.M24/4/manual_time/threads:2        17.7 ms         40.0 ms           40  0.0799514    0.98535         80       5.65945k/s         20        200            4k
hnswlib.M24/4/manual_time/threads:4        9.18 ms         41.8 ms           76   0.167677      0.985         80       10.8981k/s         20        200          7.6k
hnswlib.M24/4/manual_time/threads:8        4.74 ms         37.8 ms          152   0.303604      0.985         80       21.1026k/s         20        200         15.2k
hnswlib.M24/4/manual_time/threads:16       2.49 ms         43.8 ms          288    0.70388   0.985167         80       40.1466k/s         20        200         28.8k
hnswlib.M24/4/manual_time/threads:32       1.84 ms         63.1 ms          416     2.0671   0.986077         80       54.3023k/s         20        200         41.6k
hnswlib.M24/5/manual_time/threads:1        49.4 ms         77.3 ms           15  0.0771654   0.993933        120         2.023k/s         10        100          1.5k
hnswlib.M24/5/manual_time/threads:2        25.3 ms         66.5 ms           28   0.133007   0.993857        120       3.94902k/s         20        200          2.8k
hnswlib.M24/5/manual_time/threads:4        13.1 ms         65.2 ms           56   0.261221   0.993857        120       7.60818k/s         20        200          5.6k
hnswlib.M24/5/manual_time/threads:8        6.62 ms         52.9 ms          112   0.424026   0.993857        120       15.1039k/s         20        200         11.2k
hnswlib.M24/5/manual_time/threads:16       3.48 ms         56.0 ms          208   0.899826   0.993692        120       28.7224k/s         20        200         20.8k
hnswlib.M24/5/manual_time/threads:32       2.58 ms         80.5 ms          288    2.64001      0.994        120       38.8123k/s         20        200         28.8k
hnswlib.M24/6/manual_time/threads:1        76.2 ms         76.3 ms           10  0.0762692     0.9982        200       1.31183k/s         10        100            1k
hnswlib.M24/6/manual_time/threads:2        38.3 ms         76.3 ms           20   0.153328     0.9982        200       2.61061k/s         20        200            2k
hnswlib.M24/6/manual_time/threads:4        19.3 ms         81.8 ms           36    0.32732   0.998667        200       5.17958k/s         20        200          3.6k
hnswlib.M24/6/manual_time/threads:8        10.1 ms         80.7 ms           72   0.647152   0.998667        200       9.89482k/s         20        200          7.2k
hnswlib.M24/6/manual_time/threads:16       5.35 ms         92.5 ms          144    1.48414   0.998667        200       18.6875k/s         20        200         14.4k
hnswlib.M24/6/manual_time/threads:32       4.12 ms          148 ms          192    4.93204     0.9985        200       24.2758k/s         20        200         19.2k
hnswlib.M24/7/manual_time/threads:1         144 ms          144 ms            5   0.143567     0.9994        400        696.433/s         10        100           500
hnswlib.M24/7/manual_time/threads:2        71.2 ms          142 ms           10   0.284842     0.9994        400       1.40484k/s         20        200            1k
hnswlib.M24/7/manual_time/threads:4        37.5 ms          166 ms           20   0.662684     0.9994        400       2.66835k/s         20        200            2k
hnswlib.M24/7/manual_time/threads:8        19.3 ms          185 ms           40    1.48264     0.9994        400       5.18424k/s         20        200            4k
hnswlib.M24/7/manual_time/threads:16       10.1 ms          191 ms           80    3.06474     0.9994        400       9.89745k/s         20        200            8k
hnswlib.M24/7/manual_time/threads:32       7.36 ms          234 ms           96    7.78081   0.999667        400       13.5863k/s         20        200          9.6k
hnswlib.M24/8/manual_time/threads:1         198 ms          198 ms            4    0.19808    0.99975        600        505.064/s         10        100           400
hnswlib.M24/8/manual_time/threads:2        98.6 ms          197 ms            8   0.394749    0.99975        600       1.01376k/s         20        200           800
hnswlib.M24/8/manual_time/threads:4        51.8 ms          207 ms           16   0.829305    0.99975        600       1.93002k/s         20        200          1.6k
hnswlib.M24/8/manual_time/threads:8        26.8 ms          228 ms           32    1.82482    0.99975        600       3.73682k/s         20        200          3.2k
hnswlib.M24/8/manual_time/threads:16       13.8 ms          245 ms           48    3.91569   0.999667        600       7.24733k/s         20        200          4.8k
hnswlib.M24/8/manual_time/threads:32       9.95 ms          310 ms           64    10.1965     0.9995        600       10.0493k/s         20        200          6.4k
hnswlib.M24/9/manual_time/threads:1         242 ms          242 ms            3   0.242041   0.999667        800        413.301/s         10        100           300
hnswlib.M24/9/manual_time/threads:2         122 ms          244 ms            6   0.488984   0.999667        800        818.324/s         20        200           600
hnswlib.M24/9/manual_time/threads:4        64.1 ms          261 ms           12    1.04573   0.999667        800       1.56072k/s         20        200          1.2k
hnswlib.M24/9/manual_time/threads:8        33.3 ms          266 ms           24    2.13001   0.999667        800       3.00573k/s         20        200          2.4k
hnswlib.M24/9/manual_time/threads:16       17.5 ms          280 ms           48    4.48014   0.999667        800       5.71509k/s         20        200          4.8k
hnswlib.M24/9/manual_time/threads:32       12.7 ms          398 ms           64    13.0016     0.9995        800        7.8773k/s         20        200          6.4k
hnswlib.M36/0/manual_time/threads:1        9.28 ms         11.1 ms           74  0.0111364    0.79677         10       10.7773k/s         10        100          7.4k
hnswlib.M36/0/manual_time/threads:2        4.71 ms         9.44 ms          146  0.0188777   0.797616         10       21.2328k/s         20        200         14.6k
hnswlib.M36/0/manual_time/threads:4        2.48 ms         10.9 ms          284   0.043924   0.798901         10       40.3889k/s         20        200         28.4k
hnswlib.M36/0/manual_time/threads:8        1.27 ms         10.1 ms          544  0.0817863   0.799779         10       78.4687k/s         20        200         54.4k
hnswlib.M36/0/manual_time/threads:16      0.690 ms         10.9 ms         1008   0.177156   0.794889         10       144.834k/s         20        200        100.8k
hnswlib.M36/0/manual_time/threads:32      0.505 ms         16.1 ms         1344   0.533914   0.789476         10       198.053k/s         20        200        134.4k
hnswlib.M36/1/manual_time/threads:1        14.7 ms         15.9 ms           47  0.0159139   0.902234         20       6.80197k/s         10        100          4.7k
hnswlib.M36/1/manual_time/threads:2        7.43 ms         14.9 ms           94  0.0297814   0.902234         20       13.4544k/s         20        200          9.4k
hnswlib.M36/1/manual_time/threads:4        3.83 ms         15.3 ms          184  0.0613364   0.901826         20       26.1165k/s         20        200         18.4k
hnswlib.M36/1/manual_time/threads:8        1.99 ms         15.8 ms          360   0.127421     0.9014         20       50.3182k/s         20        200           36k
hnswlib.M36/1/manual_time/threads:16       1.05 ms         18.5 ms          672   0.299218    0.90069         20       95.3738k/s         20        200         67.2k
hnswlib.M36/1/manual_time/threads:32      0.782 ms         24.3 ms          928   0.802356   0.895448         20       127.842k/s         20        200         92.8k
hnswlib.M36/2/manual_time/threads:1        24.6 ms         28.5 ms           29  0.0285256   0.963034         40       4.06855k/s         10        100          2.9k
hnswlib.M36/2/manual_time/threads:2        12.5 ms         25.2 ms           58  0.0503383   0.963034         40       7.98698k/s         20        200          5.8k
hnswlib.M36/2/manual_time/threads:4        6.57 ms         29.3 ms          108   0.117416   0.962407         40       15.2287k/s         20        200         10.8k
hnswlib.M36/2/manual_time/threads:8        3.33 ms         28.7 ms          216   0.230639   0.962407         40       30.0604k/s         20        200         21.6k
hnswlib.M36/2/manual_time/threads:16       1.74 ms         27.6 ms          400   0.445302    0.96156         40       57.5587k/s         20        200           40k
hnswlib.M36/2/manual_time/threads:32       1.30 ms         44.1 ms          544    1.45338   0.962118         40       76.6378k/s         20        200         54.4k
hnswlib.M36/3/manual_time/threads:1        34.3 ms         37.0 ms           21   0.036955   0.983333         60       2.91917k/s         10        100          2.1k
hnswlib.M36/3/manual_time/threads:2        17.2 ms         34.5 ms           42  0.0689138   0.983333         60       5.80863k/s         20        200          4.2k
hnswlib.M36/3/manual_time/threads:4        9.02 ms         36.0 ms           80   0.144422    0.98375         60       11.0899k/s         20        200            8k
hnswlib.M36/3/manual_time/threads:8        4.65 ms         41.7 ms          152   0.334705   0.983368         60       21.5208k/s         20        200         15.2k
hnswlib.M36/3/manual_time/threads:16       2.43 ms         43.0 ms          288   0.690773   0.983667         60       41.2197k/s         20        200         28.8k
hnswlib.M36/3/manual_time/threads:32       1.79 ms         63.5 ms          416    2.08356   0.984769         60       55.8247k/s         20        200         41.6k
hnswlib.M36/4/manual_time/threads:1        41.9 ms         48.9 ms           17  0.0488259   0.991941         80       2.38709k/s         10        100          1.7k
hnswlib.M36/4/manual_time/threads:2        21.4 ms         42.8 ms           32  0.0855323   0.992375         80       4.68081k/s         20        200          3.2k
hnswlib.M36/4/manual_time/threads:4        11.2 ms         57.3 ms           64   0.229526   0.992375         80       8.94934k/s         20        200          6.4k
hnswlib.M36/4/manual_time/threads:8        5.83 ms         46.5 ms          120    0.37312   0.992267         80       17.1657k/s         20        200           12k
hnswlib.M36/4/manual_time/threads:16       3.06 ms         48.7 ms          224   0.783276   0.992071         80       32.7128k/s         20        200         22.4k
hnswlib.M36/4/manual_time/threads:32       2.29 ms         77.2 ms          352    2.53204   0.992182         80       43.6435k/s         20        200         35.2k
hnswlib.M36/5/manual_time/threads:1        58.7 ms         58.7 ms           13  0.0587323   0.996462        120       1.70346k/s         10        100          1.3k
hnswlib.M36/5/manual_time/threads:2        30.3 ms         60.6 ms           24   0.121138   0.996583        120       3.30444k/s         20        200          2.4k
hnswlib.M36/5/manual_time/threads:4        15.4 ms         61.7 ms           48   0.247146   0.996583        120       6.47723k/s         20        200          4.8k
hnswlib.M36/5/manual_time/threads:8        8.15 ms         65.1 ms           96   0.521956   0.996583        120       12.2681k/s         20        200          9.6k
hnswlib.M36/5/manual_time/threads:16       4.25 ms         67.8 ms          176    1.08877   0.996727        120       23.5293k/s         20        200         17.6k
hnswlib.M36/5/manual_time/threads:32       3.14 ms         98.3 ms          256     3.2185     0.9965        120       31.8356k/s         20        200         25.6k
hnswlib.M36/6/manual_time/threads:1        91.3 ms         91.3 ms            8  0.0912819    0.99925        200       1.09586k/s         10        100           800
hnswlib.M36/6/manual_time/threads:2        45.4 ms         91.0 ms           16   0.181882    0.99925        200       2.20048k/s         20        200          1.6k
hnswlib.M36/6/manual_time/threads:4        23.7 ms         94.7 ms           32   0.379176    0.99925        200       4.22157k/s         20        200          3.2k
hnswlib.M36/6/manual_time/threads:8        12.3 ms         98.6 ms           56   0.790261   0.999571        200       8.10256k/s         20        200          5.6k
hnswlib.M36/6/manual_time/threads:16       6.51 ms          105 ms          112    1.68106   0.999571        200        15.359k/s         20        200         11.2k
hnswlib.M36/6/manual_time/threads:32       5.08 ms          159 ms          160     5.2027     0.9996        200       19.7001k/s         20        200           16k
hnswlib.M36/7/manual_time/threads:1         167 ms          181 ms            5    0.18104          1        400        597.821/s         10        100           500
hnswlib.M36/7/manual_time/threads:2        84.1 ms          168 ms           10   0.336406          1        400       1.18957k/s         20        200            1k
hnswlib.M36/7/manual_time/threads:4        44.6 ms          178 ms           16   0.714272          1        400       2.24099k/s         20        200          1.6k
hnswlib.M36/7/manual_time/threads:8        23.0 ms          184 ms           32    1.47525          1        400       4.34047k/s         20        200          3.2k
hnswlib.M36/7/manual_time/threads:16       12.2 ms          216 ms           64    3.45394          1        400        8.2023k/s         20        200          6.4k
hnswlib.M36/7/manual_time/threads:32       8.75 ms          274 ms           96    8.96059          1        400       11.4315k/s         20        200          9.6k
hnswlib.M36/8/manual_time/threads:1         227 ms          227 ms            3   0.226666   0.999667        600        441.348/s         10        100           300
hnswlib.M36/8/manual_time/threads:2         116 ms          234 ms            6   0.467168   0.999667        600        862.811/s         20        200           600
hnswlib.M36/8/manual_time/threads:4        61.3 ms          246 ms           12   0.985194   0.999667        600       1.63106k/s         20        200          1.2k
hnswlib.M36/8/manual_time/threads:8        31.2 ms          250 ms           24    1.99801   0.999667        600       3.20438k/s         20        200          2.4k
hnswlib.M36/8/manual_time/threads:16       16.5 ms          263 ms           48    4.22029   0.999667        600       6.06758k/s         20        200          4.8k
hnswlib.M36/8/manual_time/threads:32       11.8 ms          370 ms           64    12.1066     0.9995        600       8.45976k/s         20        200          6.4k
hnswlib.M36/9/manual_time/threads:1         289 ms          289 ms            3   0.289458   0.999667        800        345.574/s         10        100           300
hnswlib.M36/9/manual_time/threads:2         147 ms          303 ms            6   0.606009   0.999667        800        678.289/s         20        200           600
hnswlib.M36/9/manual_time/threads:4        73.8 ms          295 ms            8    1.18156     0.9995        800       1.35442k/s         20        200           800
hnswlib.M36/9/manual_time/threads:8        38.4 ms          307 ms           16    2.45557     0.9995        800       2.60714k/s         20        200          1.6k
hnswlib.M36/9/manual_time/threads:16       20.9 ms          334 ms           32    5.35832     0.9995        800       4.77889k/s         20        200          3.2k
hnswlib.M36/9/manual_time/threads:32       15.1 ms          471 ms           64    15.4244     0.9995        800       6.64232k/s         20        200          6.4k

@cjnolet cjnolet force-pushed the fea-2312-benchmarks_throughput_mode branch from 48f192a to 18b57cc Compare October 25, 2023 20:09
Member

@benfred benfred left a comment

looks good! some minor comments/questions below:

cpp/bench/ann/src/common/ann_types.hpp (outdated, resolved)
python/raft-ann-bench/src/raft-ann-bench/run/__main__.py (outdated, resolved)
cpp/bench/ann/src/raft/raft_benchmark.cu (resolved)
cpp/bench/ann/src/hnswlib/hnswlib_wrapper.h (outdated, resolved)
Member

@benfred benfred left a comment

looks great!

Member Author

cjnolet commented Oct 27, 2023

/merge

@rapids-bot rapids-bot bot merged commit 9ad76fa into rapidsai:branch-23.12 Oct 28, 2023
61 checks passed
Contributor

@tfeher tfeher left a comment

Hi @cjnolet, I am late to the party, but I would still like to comment on the PR. Overall this looks great, and I am happy to see this change. Still, I think we should have a follow-up PR to improve the benchmark columns and the explanations we give.

docs/source/raft_ann_benchmarks.md (resolved)
docs/source/raft_ann_benchmarks.md (resolved)
cpp/bench/ann/src/common/benchmark.hpp (resolved)
cpp/bench/ann/src/common/benchmark.hpp (resolved)
cpp/bench/ann/src/common/benchmark.hpp (resolved)
Comment on lines +282 to +285
if (state.thread_index() == 0) {
auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
state.counters.insert({{"end_to_end", duration}});
}
Contributor

I do not see why we need end_to_end. I would prefer that we have two values, Latency and Throughput. Throughput is already there as items_per_second, and average latency can be defined as follows:

Suggested change
- if (state.thread_index() == 0) {
-   auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
-   state.counters.insert({{"end_to_end", duration}});
- }
+ auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
+ state.counters.insert(
+     {"Latency", {duration / double(state.iterations()), benchmark::Counter::kAvgThreads}});

Using kAvgThreads ensures that we see an average value over all threads instead of an accumulated value over all threads.
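
To make the difference concrete, here is a minimal host-only sketch of the two reporting modes. It does not call Google Benchmark; `report_summed` and `report_thread_average` are hypothetical helper names that only mimic the arithmetic, assuming each thread contributes one counter value and that per-thread values are summed by default.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: a counter reported as the plain sum of the values
// contributed by each thread (the accumulated value mentioned above).
double report_summed(const std::vector<double>& per_thread) {
  return std::accumulate(per_thread.begin(), per_thread.end(), 0.0);
}

// Hypothetical helper: the same counter reported as a thread average,
// which is the behavior requested by the kAvgThreads flag.
double report_thread_average(const std::vector<double>& per_thread) {
  if (per_thread.empty()) return 0.0;
  return report_summed(per_thread) / static_cast<double>(per_thread.size());
}
```

With four threads that each record a latency of 0.25 s, the summed report grows with the thread count while the thread average stays at the per-thread value.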

Member Author

End-to-end time is the most important indicator of latency here AFAIC and this is an important measurement. Given that this is a pipelined benchmark and gbench guarantees all threads will reach the beginning and end of the state loop together, we have the ability to capture the time spent in the state loop by all threads. This is essentially the time a user can expect to wait to process total_queries. This is probably the most realistic measure of end to end latency we could provide.

Please also note that the measure you are suggesting here would yield the same result, but would be an average. The end_to_end field is intentionally not an average. One thing I'd like to discuss in the future is allowing the number of iterations to be locked to the number of queries (10k in most cases) so that end-to-end time becomes more directly comparable across experiments. It's an easy argument to set on the gbench side.
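
The relationship between the two quantities can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: `RunSummary` and the three functions are hypothetical names, all threads are assumed to enter and leave the timed loop together, and every thread is assumed to run the same number of iterations with a fixed batch size.

```cpp
#include <cstddef>

// Hypothetical summary of one pipelined run.
struct RunSummary {
  double end_to_end_s;                // wall time spent in the timed loop
  std::size_t iterations_per_thread;  // batches processed by each thread
  std::size_t n_threads;
  std::size_t batch_size;             // queries per batch
};

// Total number of queries processed across all threads.
double total_queries(const RunSummary& r) {
  return static_cast<double>(r.iterations_per_thread) *
         static_cast<double>(r.n_threads) * static_cast<double>(r.batch_size);
}

// Average time one thread spends per batch (the suggested "Latency").
double avg_batch_latency_s(const RunSummary& r) {
  if (r.iterations_per_thread == 0) return 0.0;
  return r.end_to_end_s / static_cast<double>(r.iterations_per_thread);
}

// Queries per second over the whole run (total_queries / end_to_end).
double throughput_qps(const RunSummary& r) {
  if (r.end_to_end_s <= 0.0) return 0.0;
  return total_queries(r) / r.end_to_end_s;
}
```

Under these assumptions the average latency is just end_to_end rescaled by the iteration count, so the two carry the same information; end_to_end only becomes directly comparable across experiments when the iteration count is fixed.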

Contributor

I agree, since threads are synced end_to_end time and Latency are redundant. But they answer different questions, so let's keep them both.

The number of iterations should not be locked to the number of queries: sometimes we need short benchmarks.

Member Author

> The number of iterations should not be locked to the number of queries: sometimes we need short benchmarks.

My point in mentioning that we can do that is because sometimes we want to lock the number of iterations so we can directly compare the end-to-end times across experiments. This is something that's possible today and I actually found it somewhat useful while I was verifying the various measurements in this PR.

if (cudart.found()) {
- state.counters.insert({{"GPU Time", gpu_timer.total_time() / state.iterations()},
-                        {"GPU QPS", queries_processed / gpu_timer.total_time()}});
+ state.counters.insert({{"GPU", gpu_timer.total_time() / double(state.iterations())}});
Contributor

I would keep it as "GPU Time" or "GPU Latency", since that is more descriptive.

Suggested change
- state.counters.insert({{"GPU", gpu_timer.total_time() / double(state.iterations())}});
+ double gpu_latency_per_iteration = gpu_timer.total_time() / double(state.iterations());
+ state.counters.insert({"GPU Latency", {gpu_latency_per_iteration, benchmark::Counter::kAvgThreads}});

Contributor

I think we should look into the gpu_timer as a separate issue. What is the expectation for the ANN::search() methods that we call within the benchmark loop: are they expected to sync? Currently they do, but I would like to review it later because different streams and events are synced.

Member Author

I got a little stuck here. Ideally we would be using the same stream to queue up the start/end events that we are using in the raft device_resources objects, right? That way we know that the events being measured on that stream are blocking only while the actual kernels for search() are being executed, so we are getting an accurate measure of only the time spent in the GPU (and making this data available on each stream for profiling). Right now, I don't think that's the case, but I could be mistaken.

At the moment, the device_resources instances are using cuda_default_stream_per_thread but the gpu_timer is creating a new stream "with flags". I didn't do a ton of investigating here, but is there a default stream per thread with flags? Does that even make sense? I take it the flags part is needed to queue up the start/stop events?

I had actually removed the GPU column during the search step altogether but added it back before merging so that we could have this conversation (since it's technically not changing any behavior from the way it was done before threading- it was still using the same streams before).

Contributor

Yes, the "Does that even make sense?" question also occurred to me and I suspect the answer is no (but @achirkin might prove me wrong).

Contributor

@achirkin achirkin Oct 30, 2023

I may be missing something, but as far as I remember, the new-stream per test is for better robustness and fairness against other implementations.
First of all, raft-ann-bench had been used with other GPU algorithms, including third-party, internal, and, sometimes, unfinished ones. This means:

  1. An implementation may be unaware of any raft machinery
  2. An implementation may use its own thread/stream pools, and we know nothing about the streams used internally.
  3. An implementation may mess up the used stream and make debugging harder

I created a new stream with flags to make sure it's non-blocking against whatever streams are allocated internally in any implementation. Hence all synchronization would happen only via explicit waiting on start/stop events.
Also, in this setup, the GPU timing includes not only the kernel time, but also all CPU time. It really should be the same as the CPU wall time, because the start event is submitted to an empty stream on each iteration before the algorithm starts. The GPU timing is a high-precision timing; it is supposed to be more accurate than the wall time if the iteration time is very small.

Member Author

From my understanding and the behavior I've seen, a new gpu_timer stream is created for each thread, and the call to search() for each RAFT index type is synchronous with respect to its thread. The gpu_timer therefore waits for the search to complete before each lap, and that wait causes the same time gap between the start and end events.

I'm not sure the FAISS indexes will function this way, as I believe they sync on the stream that's passed into search() (plus any additional syncs that happen on the stream in the corresponding StandardGpuResources instance). We should be able to set the stream on that instance, so I believe we could set the CUDA stream per thread on it, which would make it function more like RAFT.

Contributor

@tfeher tfeher Oct 31, 2023

I think the assumption here is that we will have a cudaStreamSync() or equivalent while calling ivf_pq::search(). That would block the CPU thread, and delay scheduling the stop_ event.

[Update]: I missed that we actually pass the stream from the timer to search: ivf_pq::search(, gpu_timer.stream()), so the ANN algorithm has a chance to (and is expected to) sync with it.

Member Author

> I think the assumption here is that we will have a cudaStreamSync() or equivalent while calling ivf_pq::search(). That would block the CPU thread, and delay scheduling the stop_ event.

This should be the case, though, shouldn't it?

Contributor

@achirkin achirkin Oct 31, 2023

The idea behind cuda_lap is to not require any extra synchronization, such as cudaStreamSync(). The only requirement is that ivf_pq::search() does something in the stream passed to it.
What cuda_lap does:

  1. Waits till previous work in stream_ finishes and submits the start_ event into it. A somewhat risky assumption here is that the event triggers immediately. Maybe, in reality it could trigger some time between the submission and the first use of stream_ (which could be a synchronization or a data transfer or a kernel)? To fix it, we can just add another cudaEventSynchronize or cudaStreamSynchronize right after submitting the event (but I'm not sure this needs fixing).
  2. Submits the stop_ event into the stream_ after the call to the implementation and immediately synchronizes on it. Hence it captures everything the implementation has done in the stream_. However, since the implementation is called in the same thread, the stop_ event is submitted no earlier than the CPU side of the algorithm finishes. Hence, the stop_ time is the larger of the two (cpu completion time, gpu completion time).

As a result, the two events together capture the total execution time, including the GPU work that could still be happening in the stream_ after the CPU side has already finished. Yet, the GPU time cannot be larger than our recorded wall time, because we synchronize on the stop_ event within the benchmark iteration (in the destructor of the cuda_lap).

The implementation must mark its work in the stream_, but not necessarily via synchronizing with the host.
Currently, both FAISS and raft_ivf_pq wrappers have their own streams created as a part of their states. That is, we don't create a new raft_resources handle with this stream and don't set this as the main stream in FAISS! This is done to make sure we can cache the algo handles between runs without using already destroyed streams. In both cases, I use an extra sync event to establish a dependency between the internal implementation streams and the passed cuda_lap stream.
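
The scoped-lap pattern described above can be illustrated with a host-only analogue. This is a sketch, not the actual cuda_lap: it swaps the CUDA start_/stop_ events for `std::chrono::steady_clock` readings, and `host_timer`, `lap`, and `mean_lap_seconds` are hypothetical names. What it keeps is the structure: the start is taken when the lap is constructed, and the stop is taken and accumulated in the lap's destructor, inside the benchmark iteration.

```cpp
#include <chrono>
#include <cstddef>

// Host-only stand-in for a GPU timer: accumulates the duration of scoped laps.
class host_timer {
 public:
  using clock_type = std::chrono::steady_clock;

  // RAII lap: records the start on construction; on destruction it records
  // the stop and adds the elapsed time to the owning timer.
  class lap {
   public:
    explicit lap(host_timer& owner) : owner_(owner), start_(clock_type::now()) {}
    ~lap() {
      const std::chrono::duration<double> elapsed = clock_type::now() - start_;
      owner_.total_s_ += elapsed.count();
      ++owner_.laps_;
    }
    lap(const lap&) = delete;
    lap& operator=(const lap&) = delete;

   private:
    host_timer& owner_;
    clock_type::time_point start_;
  };

  double total_time() const { return total_s_; }
  std::size_t lap_count() const { return laps_; }

 private:
  double total_s_ = 0.0;
  std::size_t laps_ = 0;
};

// Runs `work` once per iteration inside a lap and returns the mean lap time,
// mirroring the total_time() / iterations counter discussed in this thread.
template <typename Fn>
double mean_lap_seconds(host_timer& timer, std::size_t iterations, Fn&& work) {
  for (std::size_t i = 0; i < iterations; ++i) {
    host_timer::lap scoped_lap(timer);
    work();
  }
  return timer.lap_count() == 0
             ? 0.0
             : timer.total_time() / static_cast<double>(timer.lap_count());
}
```

In the GPU version the stop is an event recorded in the stream and then synchronized on, which is why the recorded time covers whichever finishes later, the CPU side or the GPU side.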

}

// This will be the total number of queries across all threads
state.counters.insert({{"total_queries", queries_processed}});
Contributor

I would also add the dataset rows as a benchmark column.

Member Author

That's not a bad idea.

Comment on lines +394 to +401
* The following are important for getting accuracy QPS measurements on both CPU
* and GPU These make sure that
* - `end_to_end` ~ (`Time` * `Iterations`)
* - `items_per_second` ~ (`total_queries` / `end_to_end`)
* - `Time` = `end_to_end` / `Iterations`
*
* - Latency = `Time`
* - Throughput = `items_per_second`
Contributor

I think this still does not have the desired effect of making Time == Latency. As I recommended above, I would add a specific latency column.

Member Author

I've been finding empirically that Time == Latency, because we are using the process time to measure the CPU time and "Time" is based on the average of the individual latencies spent in the thread loops. It's not exactly what you would get from end_to_end, but it's very, very close. Close enough that I think it's well above the noise level. I'm hoping we can utilize the gbench timers where they make sense, so that we can spare users the pain of having a bunch of additional columns that contain similar values. If there are measurements that are obviously wrong, I think we should consider adding additional columns with the correct measurements. I was at least able to consistently see that Time was pretty much Latency with our current settings. I was even able to compute values that were very close to it using our manually measured columns.

Contributor

For some benchmarks I have measured a small but statistically significant decrease in Time when increasing the number of threads. That cannot be the case if Time is the actual latency. To highlight the issue, I have added a dummy ANN class in #1940, which only sleeps for a fixed amount of time. Let's continue the discussion there.

Member Author

> That cannot be the case if Time is actual latency.

I think we might be saying different things here when we say "latency". I'm referring to latency as the average latency of each individual batch (for each thread), not the total latency summed together. From my understanding, this is why Time becomes ~CPU / n_threads up until the saturation point, where the hardware starts to push back and stall while it waits to schedule new work. Have you also observed this behavior? IVF-PQ is a great example: after 8 threads or so, you notice the Time field stays the same or even increases.
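
The two readings of "latency" in this thread can be compared with a small deterministic model of a dummy ANN that only sleeps for a fixed time per batch. This is a sketch under stated assumptions, not the class from #1940 and not a description of how Google Benchmark computes its columns: `sleep_model`, `summary`, and `simulate` are hypothetical names, and the model assumes the threads never saturate the hardware.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model of an ANN whose search only sleeps per batch.
struct sleep_model {
  double batch_latency_s;     // fixed sleep per batch, in seconds
  int batch_size;             // queries per batch
  int iterations_per_thread;  // batches processed by each thread
};

struct summary {
  double end_to_end_s;               // wall time of the whole run
  double avg_batch_latency_s;        // average latency of one batch in one thread
  double queries_per_second;         // total queries / wall time
  double wall_per_total_iteration_s; // wall time / iterations summed over threads
};

// Assumption: threads sleep concurrently and never contend, so the wall
// time does not depend on the number of threads.
inline summary simulate(const sleep_model& m, int n_threads) {
  const double end_to_end = m.batch_latency_s * m.iterations_per_thread;
  const long long total_iterations =
    static_cast<long long>(n_threads) * m.iterations_per_thread;
  const long long total_queries = total_iterations * m.batch_size;
  return {end_to_end,
          m.batch_latency_s,
          static_cast<double>(total_queries) / end_to_end,
          end_to_end / static_cast<double>(total_iterations)};
}
```

With a 10 ms sleep, batch size 10, and 100 iterations per thread, the per-batch latency stays at 10 ms for both 1 and 4 threads, while throughput grows from 1000 to 4000 queries per second. A per-iteration time obtained by dividing wall time by the iterations summed over all threads would instead drop from 10 ms to 2.5 ms, which is the kind of decrease with thread count described above.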

@tfeher tfeher mentioned this pull request Oct 31, 2023
Labels
CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python Vector Search