BenchmarkTools.hpp
// This file is part of the Acts project.
//
// Copyright (C) 2019-2020 CERN for the benefit of the Acts project
//
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.

#pragma once

#include "Acts/Utilities/TypeTraits.hpp"

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <iomanip>
#include <numeric>
#include <ostream>
#include <type_traits>
#include <utility>
#include <vector>

namespace Acts {
namespace Test {

// === INTRODUCTION ===
//
// This header contains tools which can be used to avoid unrealistic
// compiler optimizations in microbenchmarks, such as elimination of unused
// benchmark computations or deduplication of benchmark loops which repeat
// the same operation N times with unchanging inputs.
//
// ### WARNING ON HARDWARE OPTIMIZATIONS ###
//
// When using these tools, you must keep in mind that compiler optimizations
// are only a part of the story, and CPUs will also perform a bunch of dynamic
// optimizations on their own, including but by no means limited to...
//
// - Moving small benchmark inputs into fast caches
// - Buffering and caching small benchmark outputs
// - Instruction-level parallelization (pipelining, superscalar
//   execution...) of supposedly unrelated benchmark iterations
// - "Turbo" overclocking of CPU cores which is only sustainable in specific
//   circumstances (short-running code sections, single-core execution...)
// - Branch predictor training for specific loops and conditionals
//
// Unfortunately, there is no way to disable those hardware optimizations in
// a way which meets even basic sanity requirements:
//
// - Easily portable to all CPU models of interest
// - Fine-grained enough to target CPU optimizations which are an
//   unrealistic artifact of the microbenchmark, without also disabling
//   optimizations which will happen in realistic usage scenarios (like
//   hot code remaining resident in the CPU instruction cache)
// - Generic enough for all benchmark inputs, outputs, and code
// - Minimal impact on overall benchmark timings
//
// You will thus need to live with the fact that any microbenchmark has
// hardware bias, which is not as egregious as compiler bias (since the CPU
// can't just delete gigantic sections of the code like the compiler does)
// but still very much significant on simple benchmarks.
//
// The proper way to detect and correct this bias is not to try to write more
// clever microbenchmarks. It is to use realistic benchmarks of full app
// workloads whenever possible, and treat microbenchmarks only as a tool for
// detecting performance regressions in known-critical components and
// easily fine-tuning the performance of these components, in a manner whose
// effectiveness will later have to be checked on a more realistic benchmark.
//
// #########################################
//
//
// === OPTIMIZATION BARRIERS ===
//
// Mark some data as read and written in the eye of the compiler, so it can't
// optimize under the assumption that the data isn't used or does not change.
//
// The current implementation has limitations that you should bear in mind:
//
// - "clobber" data which resides in CPU registers must be flushed to memory
//   and reloaded from memory afterwards if re-used. This will increase memory
//   traffic and cache footprint in a potentially unrealistic fashion.
// - Putting this optimization barrier on every iteration of a loop will
//   prevent compiler loop optimizations like autovectorization, which is
//   generally too strong. Consider putting the barrier every N loop
//   iterations instead (and checking for various N), or at the very end of
//   the loop/benchmark if the memory traffic of storing all the end results
//   doesn't bias your benchmark too much.
// - The barrier's implementation uses a very basic subset of the GCC dialect
//   of inline assembly, which, although widely supported (from a quick godbolt
//   check, in addition to GCC it works on clang, djgpp, ellcc, icc and zapcc),
//   is most notably not supported by MSVC. I have not yet found an MSVC or
//   portable equivalent which is UB-free and works even when passing in a
//   pointer to a local variable; suggestions and patches are welcome.
//
#ifdef __GNUC__

template <typename T>
inline void assumeAccessed(T&& clobber) {
  // This optimization barrier leverages the fact that inline ASM isn't smart,
  // and couldn't get smarter as the compiler would then need to be able to
  // parse and understand the side effects of arbitrary assembly statements,
  // which would be a ridiculous amount of work for compiler devs:
  //
  // - The compiler only performs trivial analysis of the assembly (e.g.
  //   number of instructions for inlining purposes), so it can't leverage the
  //   fact that it's empty to optimize out its inputs. It must assume that
  //   the assembly statements will use the declared inputs, emit the declared
  //   outputs, and read/write the contents of declared clobbers.
  // - A pointer to "clobber" is declared as an input to assembly, and the
  //   "memory" clobber allows the assembly statement to read and write memory
  //   targeted by this pointer. Therefore, the compiler must compute and
  //   write down the value of "clobber" in memory, and reload it into
  //   registers if it's used later on by the program.
  // - A volatile marker is used, therefore the compiler can't optimize out
  //   the asm statement even though from its specification it should have no
  //   program-visible side effects. It also prevents the compiler from moving
  //   the asm statement out of a loop, which would be bad... But note that it
  //   does _not_ prevent it from moving within a loop iteration.
  //
  __asm__ volatile("" : : "g"(&clobber) : "memory");
}

#else

template <typename T>
void assumeAccessed(T&& clobber) {
  // FIXME: Find a reliable optimization barrier for MSVC. As a litmus test,
  //        the assembly generated for the following code should store "42" in
  //        memory twice, not once:
  //
  //        ```
  //        int x = 42;
  //        assumeAccessed(x);
  //        x = 42;
  //        assumeAccessed(&x);
  //        ```
  //
  static_assert(false, "No optimization barrier available for this compiler");
}

#endif

// Mark some data as read in the eye of the compiler, so it can't optimize out
// the source computation under the assumption that its output isn't used.
//
// Has the same caveats as assumeAccessed().
//
template <typename T>
inline void assumeRead(const T& clobber) {
  // FIXME: I don't know of a finer-grained compiler optimization barrier that
  //        1/ can be used when one only wants to fake a read and 2/ works for
  //        all inputs (not just machine types), so for now this function is
  //        only a way to clarify developer intent.
  assumeAccessed(clobber);
}

// Mark some data as written in the eye of the compiler, so it can't optimize
// out repetitive dependent computations under the assumption that the result
// will always be the same (and more generally can't assume anything about the
// value of this input beyond type-level properties).
//
// Has the same caveats as assumeAccessed().
//
template <typename T>
inline void assumeWritten(T& clobber) {
  // FIXME: I don't know of a finer-grained compiler optimization barrier that
  //        1/ can be used when one only wants to fake a write and 2/ works for
  //        all outputs (not just machine types), so for now this function is
  //        only a way to clarify developer intent.
  assumeAccessed(clobber);
}
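
// Example: a minimal sketch of using the barriers above in a hand-rolled
// timing loop. `expensiveFunction` and `makeInput` are hypothetical
// placeholders standing in for the code under test and its input; only
// assumeWritten/assumeRead are provided by this header.
//
// ```
// #include <chrono>
// #include <iostream>
//
// void manualBenchmark() {
//   auto input = makeInput();  // hypothetical input factory
//   constexpr int iterations = 100000;
//   const auto start = std::chrono::steady_clock::now();
//   for (int i = 0; i < iterations; ++i) {
//     // Pretend the input may have changed, so the compiler cannot hoist the
//     // computation out of the loop or collapse all iterations into one...
//     assumeWritten(input);
//     auto result = expensiveFunction(input);  // hypothetical workload
//     // ...and pretend the result is read, so the computation cannot be
//     // removed as dead code. As noted above, a per-iteration barrier can
//     // inhibit loop optimizations; applying it every N iterations or once
//     // after the loop is a valid alternative.
//     assumeRead(result);
//   }
//   const auto elapsed = std::chrono::steady_clock::now() - start;
//   std::cout << std::chrono::duration<double, std::nano>(elapsed).count() /
//                    iterations
//             << " ns per iteration\n";
// }
// ```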
//
//
// === MICROBENCHMARK HARNESS ===

// Results of a microbenchmark
//
// Holds the timings of each benchmark run, and allows the user to query various
// statistical information about them.
//
struct MicroBenchmarkResult {
  using Duration = std::chrono::duration<double, std::nano>;

  size_t iters_per_run = 0;
  std::vector<Duration> run_timings;

  // Total benchmark running time
  //
  // Unless your machine has been tuned for extremely stable timings (minimal
  // background activity, no CPU frequency scaling, no disk spin-down...), you
  // shouldn't give much credence to benchmarks with running times smaller than
  // a few hundreds of milliseconds.
  //
  Duration totalTime() const {
    return std::accumulate(run_timings.cbegin(), run_timings.cend(),
                           Duration());
  }

  // Robust estimator of the mean benchmark iteration time
  //
  // Computed as the average iteration time of the median benchmark run, this
  // estimator provides a tunable compromise between the mean and median
  // estimators of the benchmark iteration time:
  //
  // - As a mean iteration time, it can be measured with low overhead by tuning
  //   iters_per_run up until the impact of time measurement on benchmark
  //   timings is negligible. It also converges with optimal efficiency on
  //   unbiased data as iters_per_run increases.
  // - Being based on the median run timing, this estimator is also robust
  //   against outlier measurements, such as timing spikes originating from
  //   bursts of background system load. The more benchmark runs you measure,
  //   the higher the robustness.
  //
  // This analysis assumes that the run timing distribution is roughly
  // symmetric (and therefore the median can be used as an estimator of the
  // mean), but makes no assumption about iteration time distributions.
  //
  Duration iterTimeAverage() const {
    assert(iters_per_run > 0);
    return runTimeMedian() / iters_per_run;
  }

  // Standard error on the estimator of the mean benchmark iteration time
  //
  // This analysis assumes that the run timing distribution is normal because
  // the underlying `runTimeError` analysis does. It also assumes that
  // per-iteration times are independent and identically distributed.
  //
  Duration iterTimeError() const {
    assert(iters_per_run > 0);
    return runTimeError() / std::sqrt(iters_per_run);
  }
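
  // For instance (illustrative numbers only): with iters_per_run = 100 and a
  // run time standard error of 1 µs, the per-iteration standard error above
  // is 1 µs / sqrt(100) = 100 ns.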

  // Sorted benchmark run times, used for computing outlier-robust statistics
  std::vector<Duration> sortedRunTimes() const {
    std::vector<Duration> sorted_timings = run_timings;
    std::sort(sorted_timings.begin(), sorted_timings.end());
    return sorted_timings;
  }

  // Median time per benchmark run
  //
  // This is an outlier-robust estimator of the mean benchmark run time if the
  // run time distribution is roughly symmetric.
  //
  Duration runTimeMedian() const {
    assert(run_timings.size() >= 1);
    const std::vector<Duration> sorted_timings = sortedRunTimes();
    const size_t midpoint = sorted_timings.size() / 2;
    if (sorted_timings.size() % 2 == 0) {
      return (sorted_timings[midpoint - 1] + sorted_timings[midpoint]) / 2;
    } else {
      return sorted_timings[midpoint];
    }
  }

  // First and third quartiles of the benchmark run timings
  std::pair<Duration, Duration> runTimeQuartiles() const {
    // Unfortunately, quartile computations on datasets whose size is not a
    // multiple of 4 are not standardized. We use an interpolation- and
    // symmetry-based definition that satisfies all commonly agreed-upon
    // properties:
    //
    // - When the dataset size is a multiple of 4, the first quartile is
    //   the mean of the last point of the first quarter of the sorted dataset
    //   (called first_point below) and the first point of the second quarter
    //   of the dataset (which is the next point). This is universally agreed
    //   upon.
    // - More generally, when the dataset size is a multiple of 2, the first
    //   quartile is the median of the first half of the sorted dataset. Most
    //   commonly used quartile definitions agree on this, but some don't.
    // - The third quartile is defined symmetrically with respect to the first
    //   one, starting from the end of the sorted dataset and going downwards.
    //
    assert(run_timings.size() >= 2);
    const std::vector<Duration> sorted_timings = sortedRunTimes();
    const size_t first_point = (sorted_timings.size() - 2) / 4;
    const size_t offset = (sorted_timings.size() - 2) % 4;
    const size_t third_point = (sorted_timings.size() - 1) - first_point;
    if (offset == 0) {
      return {sorted_timings[first_point], sorted_timings[third_point]};
    } else {
      const auto first_quartile = ((4 - offset) * sorted_timings[first_point] +
                                   offset * sorted_timings[first_point + 1]) /
                                  4;
      const auto third_quartile = ((4 - offset) * sorted_timings[third_point] +
                                   offset * sorted_timings[third_point - 1]) /
                                  4;
      return {first_quartile, third_quartile};
    }
  }
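
  // Worked example for the definition above: with 8 sorted timings
  // t[0] <= ... <= t[7], first_point = (8 - 2) / 4 = 1, offset = (8 - 2) % 4
  // = 2 and third_point = 7 - 1 = 6, so the first quartile is
  // (2 * t[1] + 2 * t[2]) / 4 = (t[1] + t[2]) / 2 (the median of the first
  // half) and the third quartile is symmetrically (t[5] + t[6]) / 2.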

  // Robust estimator of benchmark run timing standard deviation
  //
  // Assuming that the run time distribution is normal aside from occasional
  // outlier pollution, an outlier-robust, unbiased, and consistent estimator
  // of its standard deviation can be built from the interquartile range via
  // the formula estimated_stddev = IQR / (2 * sqrt(2) * erf^-1(1/2)).
  //
  // This analysis assumes that the run timing distribution is roughly normal,
  // outlier measurements aside.
  //
  Duration runTimeRobustStddev() const {
    auto [firstq, thirdq] = runTimeQuartiles();
    return (thirdq - firstq) / (2. * std::sqrt(2.) * 0.4769362762044698733814);
  }
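
  // Numerically, 2 * sqrt(2) * erf^-1(1/2) is approximately 1.349, so the
  // estimator above amounts to roughly IQR / 1.349, the textbook conversion
  // factor from the interquartile range of a normal distribution to its
  // standard deviation.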

  // Standard error on the median benchmark run time
  //
  // This analysis assumes that the run timing distribution is approximately
  // normal, a few outliers aside.
  //
  Duration runTimeError() const {
    return 1.2533 * runTimeRobustStddev() / std::sqrt(run_timings.size());
  }

  // Standardized display for benchmark statistics
  //
  // The underlying data analysis assumes the full mathematical contract stated
  // in the documentation of `runTimeError`, `iterTimeAverage` and
  // `iterTimeError`. It also assumes that _both_ the run and iteration timings
  // follow an approximately normal distribution, so that the standard 95%
  // confidence interval formulation can be applied and that the median run
  // time can be used as an estimator of the mean run time.
  //
  friend std::ostream& operator<<(std::ostream& os,
                                  const MicroBenchmarkResult& res) {
    auto old_precision = os.precision();
    auto old_flags = os.flags();
    os << std::fixed << res.run_timings.size() << " runs of "
       << res.iters_per_run << " iteration(s), " << std::setprecision(1)
       << res.totalTime().count() / 1'000'000 << "ms total, "
       << std::setprecision(4) << res.runTimeMedian().count() / 1'000 << "+/-"
       << 1.96 * res.runTimeError().count() / 1'000 << "µs per run, "
       << std::setprecision(3) << res.iterTimeAverage().count() << "+/-"
       << 1.96 * res.iterTimeError().count() << "ns per iteration";
    os.precision(old_precision);
    os.flags(old_flags);
    return os;
  }
};
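
// Example: the raw `run_timings` vector can also be consumed directly, e.g.
// dumped to a file for offline analysis as suggested by the warmup
// measurement protocol documented before `microBenchmark` below. A minimal
// sketch, with a hypothetical output file name:
//
// ```
// #include <cstddef>
// #include <fstream>
//
// void dumpRunTimings(const MicroBenchmarkResult& result) {
//   std::ofstream out("run_timings.csv");  // hypothetical output file
//   out << "run,nanoseconds\n";
//   std::size_t run = 0;
//   for (const auto& timing : result.run_timings) {
//     out << run++ << ',' << timing.count() << '\n';
//   }
// }
// ```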

// Implementation details, scroll down for more public API
namespace benchmark_tools_internal {

// General iteration of microBenchmark with inputs and outputs
template <typename Callable, typename Input, typename Result>
struct MicroBenchmarkIterImpl {
  static inline void iter(const Callable& iteration, const Input& input) {
    assumeWritten(iteration);
    assumeWritten(input);
    const auto result = iteration(input);
    assumeRead(result);
  }
};

// Specialization for void(Input) functors, where there is no output
template <typename Callable, typename Input>
struct MicroBenchmarkIterImpl<Callable, Input, void> {
  static inline void iter(const Callable& iteration, const Input& input) {
    assumeWritten(iteration);
    assumeWritten(input);
    iteration(input);
  }
};

// Specialization for Result(void) functors, where there is no input
template <typename Callable, typename Result>
struct MicroBenchmarkIterImpl<Callable, void, Result> {
  static inline void iter(const Callable& iteration) {
    assumeWritten(iteration);
    const auto result = iteration();
    assumeRead(result);
  }
};

// Specialization for void() functors, where there is no input and no output
template <typename Callable>
struct MicroBenchmarkIterImpl<Callable, void, void> {
  static inline void iter(const Callable& iteration) {
    assumeWritten(iteration);
    iteration();
  }
};

template <typename T, typename I>
using call_with_input_t = decltype(std::declval<T>()(std::declval<I>()));

template <typename T>
using call_without_input_t = decltype(std::declval<T>()());

// If Callable is a callable that takes the expected input argument type, then
// this specialization will be selected...
template <typename Callable, typename Input = void>
struct MicroBenchmarkIter {
  constexpr static bool is_callable =
      Concepts::exists<call_with_input_t, Callable, Input>;
  static inline void iter(const Callable& iteration, const Input* input) {
    static_assert(is_callable, "Gave callable that is not callable with input");
    if constexpr (is_callable) {
      using Result = std::invoke_result_t<Callable, const Input&>;
      MicroBenchmarkIterImpl<Callable, Input, Result>::iter(iteration, *input);
    }
  }
};

// If Callable is a callable that takes no argument, this specialization will be
// picked instead of the one above...
template <typename Callable>
struct MicroBenchmarkIter<Callable, void> {
  constexpr static bool is_callable =
      Concepts::exists<call_without_input_t, Callable>;

  static inline void iter(const Callable& iteration, const void* = nullptr) {
    static_assert(is_callable,
                  "Gave callable that is not callable without input");
    if constexpr (is_callable) {
      using Result = std::invoke_result_t<Callable>;
      MicroBenchmarkIterImpl<Callable, void, Result>::iter(iteration);
    }
  }
};

// Common logic between iteration-based and data-based microBenchmark
template <typename Callable>
MicroBenchmarkResult microBenchmarkImpl(Callable&& run, size_t iters_per_run,
                                        size_t num_runs,
                                        std::chrono::milliseconds warmup_time) {
  using Clock = std::chrono::steady_clock;

  MicroBenchmarkResult result;
  result.iters_per_run = iters_per_run;
  result.run_timings = std::vector<MicroBenchmarkResult::Duration>(num_runs);

  const auto warmup_start = Clock::now();
  while (Clock::now() - warmup_start < warmup_time) {
    run();
  }

  for (size_t i = 0; i < num_runs; ++i) {
    const auto start = Clock::now();
    run();
    result.run_timings[i] = Clock::now() - start;
  }

  return result;
}

}  // namespace benchmark_tools_internal

// Run a user-provided benchmark `iteration` function that takes no argument
// in batches of `iters_per_run` iterations, for `num_runs` batches, after some
// warmup period of `warmup_time` has elapsed. Return execution statistics.
//
// The output of `iteration` is marked as read, so the compiler cannot optimize
// it out except by const-propagation (i.e. it is a function of a constexpr
// quantity). If `iteration` is a lambda which captures inputs from a higher
// scope, those are marked as written on every iteration as well, so the
// compiler cannot optimize out the iteration loop into a single iteration.
//
// Together, these precautions remove the need for manually calling assumeRead
// and assumeWritten in all simple cases where the benchmark iteration function
// ingests some inputs from an outer scope and emits the output of its
// computations as a return value.
//
// We do batched runs instead of using a single iteration loop because:
// - It allows us to provide error bars on the timing measurement, which allows
//   comparing two measurements and detecting timing jitter problems emerging
//   from e.g. excessive background system activity.
// - For short-running iteration functions, it allows you to keep benchmark runs
//   much shorter than one OS scheduling quantum (typically ~1ms), which enables
//   high-precision measurements devoid of any scheduling-induced timing jitter,
//   while still aggregating as much statistics as you want via `num_runs`.
//
// For optimal timing resolution on modern x86 CPUs, you will want to tune
// `iters_per_run` so that the median run timing is around a few tens of µs:
// - The TSC CPU clock which is most likely used by your operating system's
//   high-resolution clock has an overhead of a few tens of CPU cycles, so for
//   a ~GHz CPU clock this gives you a clock-related bias and noise of ~10ns. At
//   ~10µs run times, that only accounts for ~1/1000 of the observed timings.
// - The main source of benchmark noise is that your operating system's
//   scheduler disturbs the program every few milliseconds. With run timings of
//   ~10µs, this disturbance affects less than 1% of data points, and is thus
//   perfectly eliminated by our outlier-robust statistics.
//
// As an extra, and unfortunately sometimes conflicting, suggestion, it is
// advised to keep `iters_per_run` above 30, which is the classic rule of thumb
// for the validity of the central limit theorem. This allows you to safely
// assume that the random run timing distribution follows a roughly normal
// distribution, no matter what the actual underlying iteration timing
// probability law is, and the statistical analysis methods of
// `MicroBenchmarkResult` pervasively rely on this normality hypothesis. At
// lower `iters_per_run`, please cross-check yourself that your run timings
// follow a normal distribution, and otherwise use a statistical analysis
// methodology that is appropriate for your run timing distribution.
//
// You shouldn't usually need to adjust the number of runs and warmup time, but
// here are some guidelines for those times where you need to:
// - `num_runs` is a somewhat delicate compromise between several concerns:
//   * Quality of error bars (more runs mean more precise error bars)
//   * Outlier rejection (more runs mean better outlier rejection)
//   * Benchmark running time (more runs take longer)
//   * Handling of short-lived background disturbances (these have a higher
//     chance of occurring in longer-lived benchmarks, but if they only take
//     a small portion of the benchmark's running time, they can be taken
//     care of by outlier rejection instead of polluting the results)
// - `warmup_time` should be chosen based on the time it takes for run timings
//   to reach a steady state on your system after a benchmark starts. Here is a
//   possible measurement protocol for that:
//   * Set up a long-running benchmark (several seconds) with no warmup.
//   * Dump benchmark run timings to a file at the end of every execution.
//   * Run this benchmark a couple of times (say, 5-10 times).
//   * Plot the resulting run timing time series against their cumulative sum
//     (which is a good approximation of the elapsed time).
//   * Note after how much elapsed time the timings typically become steady.
//

template <typename Callable>
MicroBenchmarkResult microBenchmark(
    Callable&& iteration, size_t iters_per_run, size_t num_runs = 20000,
    std::chrono::milliseconds warmup_time = std::chrono::milliseconds(2000)) {
  return benchmark_tools_internal::microBenchmarkImpl(
      [&] {
        for (size_t iter = 0; iter < iters_per_run; ++iter) {
          benchmark_tools_internal::MicroBenchmarkIter<Callable, void>::iter(
              iteration);
        }
      },
      iters_per_run, num_runs, warmup_time);
}
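
// Example usage of the overload above, as a sketch: the benchmarked
// expression (a square root of a captured value) is a hypothetical
// placeholder for real code under test.
//
// ```
// #include <cmath>
// #include <iostream>
//
// void runSqrtBenchmark() {
//   double x = 42.0;
//   const auto result = Acts::Test::microBenchmark(
//       // The lambda captures `x` from the outer scope (marked as written on
//       // every iteration) and returns its output (marked as read), so the
//       // computation can neither be optimized out nor hoisted out of the
//       // benchmark loop.
//       [&] { return std::sqrt(x); },
//       // iters_per_run, tuned so that one run lasts ~tens of µs (see above)
//       10000);
//   std::cout << result << std::endl;  // prints the timing statistics
// }
// ```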

// Same idea as above, but the iteration function takes one argument, which is
// taken from the `inputs` collection. A run is one iteration through `inputs`.
//
// This variant is convenient when you want to test on random data in order to
// avoid the bias of always benchmarking on the same data, but don't want to
// "see" the cost of random number generation in your benchmark timings.
template <typename Callable, typename Input>
MicroBenchmarkResult microBenchmark(
    Callable&& iterationWithInput, const std::vector<Input>& inputs,
    size_t num_runs = 20000,
    std::chrono::milliseconds warmup_time = std::chrono::milliseconds(2000)) {
  return benchmark_tools_internal::microBenchmarkImpl(
      [&] {
        for (const auto& input : inputs) {
          benchmark_tools_internal::MicroBenchmarkIter<Callable, Input>::iter(
              iterationWithInput, &input);
        }
      },
      inputs.size(), num_runs, warmup_time);
}
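
// Example usage of the input-based overload, as a sketch: the random inputs
// and the benchmarked expression are hypothetical placeholders.
//
// ```
// #include <cmath>
// #include <iostream>
// #include <random>
// #include <vector>
//
// void runRandomSqrtBenchmark() {
//   // Pre-generate random inputs so that RNG cost stays out of the timings
//   std::mt19937 rng(42);
//   std::uniform_real_distribution<double> dist(0., 1000.);
//   std::vector<double> inputs(10000);
//   for (auto& value : inputs) {
//     value = dist(rng);
//   }
//   // One run is one pass over `inputs`; statistics are gathered over 500 runs
//   const auto result = Acts::Test::microBenchmark(
//       [](double x) { return std::sqrt(x); }, inputs, 500);
//   std::cout << result << std::endl;
// }
// ```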

}  // namespace Test
}  // namespace Acts