The author disables SMT (hyperthreading) like this:
disabled_cpus=(1 3 5 7 9 11 13 15)
for cpu_no in "${disabled_cpus[@]}"; do
  echo 0 | sudo tee /sys/devices/system/cpu/cpu$cpu_no/online
done
But there is an easier way on Linux that doesn't require parsing /sys/devices/system/cpu/cpu*/topology/thread_siblings_list:
sudo tee /sys/devices/system/cpu/smt/control <<< off
muziq 158 days ago [-]
Windows has been driving me crackers the last few days trying to benchmark hard-to-measure optimisations, and I tend to end up doing long runs, looking for the minimum time. Usually closing Chrome results in an immediate ~10% performance boost, even when it's minimised. I'd love to see an option for a developer to lock off some cores totally, so nothing runs on them unless it's in an approved list. At least then I could profile on those cores and get a reasonable result.
jamwaffles 152 days ago [-]
You can! I needed to run some realtime networking stuff on an isolated core and followed this [1].
I used Windows 11, and the two cores I isolated show no CPU usage in Task Manager until you run something that's pinned to those cores.
[1]: https://learn.microsoft.com/en-us/windows/iot/iot-enterprise...
These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application. However, this won't be sufficient for many dynamic, JIT-compiled languages, which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices of the runtime. A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
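As a rough sketch of that idea (hypothetical numbers; assumes Python with scipy available, and is not any particular framework's API): compare two sets of run times with a rank-based test and report medians rather than means.
from statistics import median
from scipy.stats import mannwhitneyu  # rank-based test; any rank-sum test would do

# Hypothetical wall-clock times (seconds) for repeated runs of old and new builds.
baseline  = [10.2, 10.4, 10.1, 10.3, 10.9, 10.2, 10.5]
candidate = [ 9.8,  9.9, 10.0,  9.7, 10.6,  9.9,  9.8]

stat, p = mannwhitneyu(baseline, candidate, alternative="two-sided")
print(f"median: baseline={median(baseline):.2f}s candidate={median(candidate):.2f}s p={p:.3f}")
# A small p-value says the shift is unlikely to be pure run-to-run noise;
# it says nothing about whether the effect size matters in practice.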
Sesse__ 152 days ago [-]
> In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
Agreed. Do you have any suggestions? :-)
janwas 152 days ago [-]
I like taking the trimmed mean of 10-20 runs, or if a run is quick, the (half-sample) mode of more runs. See robust_statistics.h.
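For concreteness, a hedged sketch of those two estimators in plain Python (not the robust_statistics.h API; the data is made up):
# Trimmed mean: drop the lowest and highest `trim` fraction of runs, average the rest.
def trimmed_mean(xs, trim=0.25):
    xs = sorted(xs)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] or xs
    return sum(kept) / len(kept)

# Half-sample mode: repeatedly keep the half of the sorted data with the smallest range.
def half_sample_mode(xs):
    xs = sorted(xs)
    while len(xs) > 3:
        h = (len(xs) + 1) // 2
        i = min(range(len(xs) - h + 1), key=lambda j: xs[j + h - 1] - xs[j])
        xs = xs[i:i + h]
    return sum(xs) / len(xs)

runs = [10.2, 10.1, 10.1, 10.3, 10.2, 12.9, 10.2, 10.1, 10.4, 10.2]  # one noisy run
print(trimmed_mean(runs), half_sample_mode(runs))  # both shrug off the 12.9 outlier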
menaerus 152 days ago [-]
> These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application.
and
> However, this won't be sufficient for many dynamic, JIT-compiled languages, which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices of the runtime.
are not mutually exclusive. Any sufficiently complex statically compiled application will suffer from the same variance issues.
> A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
Multiple runs do not solve the problem. For example, if you have your 1st test-run reporting 11%, 2nd test-run 8%, 3rd test-run 18%, 4th test-run 9%, and 5th test-run 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
krona 152 days ago [-]
> Multiple runs do not solve the problem. For example, if you have your 1st test-run reporting 11%, 2nd test-run 8%, 3rd test-run 18%, 4th test-run 9%, and 5th test-run 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
In your 5-sample example, you can't determine if there are any outliers. You need more samples, each containing a multitude of observations. Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.
Outliers are only important if you have to throw away data; good measures of central tendency should be robust to them unless your data is largely noise.
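To make "nonparametric measures of dispersion and central tendency" concrete, a minimal sketch (hypothetical data, plain Python): summarize each run by its median, then summarize the medians.
from statistics import median

# Hypothetical: three benchmark runs, each with several per-iteration timings (ms).
samples = [
    [99, 98, 99, 100, 99, 101, 250],   # one iteration hit a noisy moment
    [100, 101, 99, 100, 102, 100, 99],
    [102, 101, 102, 103, 102, 104, 100],
]
medians = [median(s) for s in samples]              # [99, 100, 102]
center  = median(medians)                           # robust central tendency
mad     = median(abs(m - center) for m in medians)  # robust dispersion (MAD)
print(f"center={center} MAD={mad} per-run medians={medians}")
# The 250 ms outlier barely moves these; a mean over the first run would.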
> Any sufficiently complex statically compiled application will suffer from the same variance issues.
Sure, it's a rule of thumb.
menaerus 152 days ago [-]
> In your 5-sample example, you can't determine if there are any outliers. You need more samples
I think the same issue is present no matter how many samples we collect. The statistical apparatus of choice may indeed tell us that a given sample is an outlier in our experiment setup, but what I am wondering is: what if that sample was actual signal that we measured, and not noise?
Concrete example: in 10/100 test-runs you see a regression of 10%. The rest of the test-runs show 3%. You can 10x or 100x that example if you wish. Are those 10% regressions outliers because "the environment was noisy", or did our code really run slower for whatever conditions/reasons in those experiments?
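One way to at least put a number on "noise or signal" here is a permutation test on before/after runs. Everything below is hypothetical data; note that it only tests whether the overall shift exceeds label-shuffling noise, not whether the slow runs form a genuine second mode - that still needs them to be reproducible.
import random
from statistics import median

baseline  = [100.0] * 95 + [101.0] * 5    # per-run times before the change
candidate = [103.0] * 90 + [110.0] * 10   # 90 runs ~3% slower, 10 runs ~10% slower

observed = median(candidate) - median(baseline)

pooled, n = baseline + candidate, len(baseline)
hits, trials = 0, 2000
for _ in range(trials):
    random.shuffle(pooled)
    if abs(median(pooled[n:]) - median(pooled[:n])) >= abs(observed):
        hits += 1
print(f"observed shift={observed:.2f}, permutation p ~ {hits / trials:.4f}")
# p ~ 0 here: the slowdown as a whole is clearly not shuffling noise, but the test
# alone cannot say whether the 10% runs are environment or a real slow path.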
> Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.
In theory yes, and for sufficiently large N (samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.
Sesse__ 151 days ago [-]
> In theory yes, and for sufficiently large N (samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.
I think at that point, we should get better at saying “OK, we just don't know”. If I can't show within reasonable resource spend that my optimization is worthwhile, then perhaps don't add it. (Of course, it depends on how much it uglifies the code, whether I have some external reason to believe it's better, and so on. But in general, people tend to overestimate how much anything of anything will help :-) )
menaerus 151 days ago [-]
I agree, and I am totally happy to say "I tried to measure but the result I found is inconclusive" or "I believe that this is at worst a neutral commit - i.e. it won't add any regression". Having spent probably thousands of hours e2e benchmarking the code I wrote, I'm always skeptical about the benchmarking frameworks, blogs, etc.
The latest example being the paper from Meta, where they claim that they can detect 0.005% regressions. I really don't think this is possible in sufficiently complex e2e system tests. IME it is extremely challenging to detect regressions below 5% with high confidence.
Link: https://tangchq74.github.io/FBDetect-SOSP24.pdf
Sesse__ 151 days ago [-]
It really depends on your benchmark and how much bias you're willing to trade for your variance. I mean, SQLite famously uses Callgrind and claims to be able to measure 0.005%, which they definitely can, but only on the CPU that Callgrind simulates, which may or may not coincide with reality. Likewise, I've used similar strategies to the one Meta describes, where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise (I've reliably found -- and later verified by other means -- 0.2% wins in large systems), but won't catch cases like e.g. large-scale code bloat.
The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we have pretty much no way of countering that currently. And the more locked you are to a single code path (i.e., your histogram is very skewed), the worse these issues are.
menaerus 150 days ago [-]
EDIT: sorry for the wall of text, but I really find this topic to be quite interesting to discuss.
> where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise
Yes, but I have always found that approach to be insufficient.
For example, let's say that function profiling data shows that f(x) improved by X% after my change; however, when I run the E2E system tests, the results I get are one of the following:
1. E2E system tests over M different workloads show no difference in performance. The correlation between the change and E2E performance in all M workloads is zero.
2. E2E system tests over M different workloads show that performance improved. The correlation between the change and E2E performance is therefore positive.
3. E2E system tests over M different workloads show that performance degraded. The correlation between the change and E2E performance is negative.
IME the distribution of probabilities across (#1, #2, #3) is roughly [.98, .01, .01].
Hypothesis #1: None of the M workloads were sufficient to show that there is a positive or negative correlation between the change and E2E performance. In other words, we haven't found that particular M+1st workload yet that shows that there really is a change in performance.
Hypothesis #2: There is simply no correlation between the change and E2E performance as experiment results have shown.
Hypothesis #3: Our benchmark measurement is insufficient to catch the change. Resolution might be lacking. Precision might be lacking. Accuracy also.
I find hypothesis #2 to be the most probable when experiment results are repeatable (precision).
This also means that the majority of changes we developers make for the sake of "optimization gains" can be easily disproved. E.g. you could have made 10s or 100s of "small optimizations" and yet there is no measurable impact on E2E runtime performance.
> The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we have pretty much no way of countering that currently.
I agree, and I see this as a problem of having to hard-code all the random variables in our system. Otherwise, we don't have the same initial conditions for each experiment run - which, in reality, we really don't.
And pretty much everything is a random variable. Compiler. Linker. Two consecutive builds of the same source do not necessarily produce the same binary, e.g. code layout may change. The kernel has state. The filesystem has state. Our NVMe drives have state. Then there is the page cache. The I/O scheduler. The task scheduler. NUMA. CPU throttling.
So there's a bunch of multidimensional random variables spread across time, all of which impact the experiment results - a stochastic process by definition.
Sesse__ 150 days ago [-]
> E.g. you could have made 10s or 100s of "small optimizations" and yet there is no measurable impact on E2E runtime performance.
My experience actually diverges here. I've had cases where I've done a bunch of optimizations in the 0.5% range, and then when you go and benchmark the system against the version that was three months ago, you actually see a 20% increase in speed.
Of course, this is on a given benchmark, which you have to hope is representative; it's impossible to say exactly how every user is going to use it in the wild. But if you accept that the goal is to do better on a given E2E benchmark, it absolutely is possible (and again, see SQLite here). But you sometimes have to be able to distinguish between hope and what the numbers are telling you; it really sucks when you have an elegant optimization and you just have to throw it in the bin after a week because the numbers just don't agree with you. :-)
menaerus 149 days ago [-]
> My experience actually diverges here. I've had cases where I've done a bunch of optimizations in the 0.5% range, and then when you go and benchmark the system against the version that was three months ago, you actually see a 20% increase in speed.
Yeah, not IME, really. First, I don't know how to measure at 0.5% resolution reliably. Second, this would imply that YoY we should be able to see a [~20, ~20+x]% runtime performance improvement in the software we are working on, and this doesn't resemble my experience at all - it's usually the other way around, and it's mostly about "how to add a new feature without making the rest of this ~5 MLoC software regress". Big optimization wins have been quite rare.
Amdahl's law says that the "overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used", so throwing a bunch of optimizations at the code does not, in most cases, result in an overall improvement. I can replace my notorious use of std::unordered_map with absl::flat_hash_map and still see no improvement at all.
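As a quick back-of-the-envelope illustration of that point (the numbers are made up): Amdahl's law for a part that takes fraction p of total runtime and becomes s times faster.
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Hash-map lookups are 5% of E2E runtime and the new map is 2x faster there:
print(amdahl_speedup(0.05, 2.0))   # ~1.026, i.e. about a 2.6% overall win
# The same 2x win on something that is 60% of runtime:
print(amdahl_speedup(0.60, 2.0))   # ~1.43, i.e. ~43% overall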
> it really sucks when you have an elegant optimization and you just have to throw it in the bin after a week because the numbers just don't agree with you. :-)
It really does, and I've been there many times. However, I learned to read this as "I have code that I thought should improve our runtime, but I found no signal to support my theory". This automatically makes such changes difficult to merge, especially considering that most optimizations aren't exactly "clean code" practice.
Sesse__ 146 days ago [-]
I see Amdahl's Law as an opportunity, not a limit. :-) If you optimize something, it means the remainder is now even more valuable to optimize, percentage-wise. In a way like compound interest.
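A back-of-the-envelope sketch of the compound-interest view (made-up numbers): many small wins multiply.
cut = 0.005                         # each change shaves 0.5% off total runtime
n = 40                              # number of such changes
remaining = (1 - cut) ** n          # 0.995**40 ~= 0.818
print(f"runtime left: {remaining:.3f} of original -> {1 / remaining - 1:.1%} faster")
# Roughly 22% faster overall - the same ballpark as the "bunch of 0.5%
# optimizations" adding up to ~20% mentioned upthread.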