At work, about 50% of my job fits into the category of "The Spice Must Flow".
After 2 years of work (incorporating about 15 mini-projects) we have finally got our downtime for the year down to about 32 minutes (none of which we were the root cause of).
The biggest remaining threat, and the cause of a number of near-misses, is load regressions: our code is quite complex, and it doesn't take much for someone to accidentally introduce an extra O(log n) factor on top of some existing O(n log n) code and load-spike something they shouldn't.
To try and deal with this, in addition to our regular pre-release load testing runs, we've started accumulating a benchmark suite using the same structure, and for the same reasons, as a test suite.
The idea is to produce several dozen or several hundred individual benchmarks that run nightly in a controlled environment (one might term it "smoke benching") to catch performance regressions as they occur, instead of the current scenario where they are only caught just before (or just after) release.
Our current efforts are breaking down at only four or five regression tests, for a variety of reasons. So I've started to experiment with a modified (but completely compatible) version of Benchmark::Timer that I'm calling internally Benchmark::Lilburne (named after the team member who does our performance testing, and who just happens to also have a name starting with "B").
B:Lilburne already comes with tracking of statistical certainty, courtesy of Benchmark::Timer. To this base I've added a maximum iteration count and a maximum runtime, to prevent benchmarks from running too long in the face of unreliable performance. That's common in our setup, and can leave benchmarks running for hours trying to reach statistical certainty.
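The stopping logic can be sketched in plain Perl using only the core Time::HiRes module. This is a minimal sketch, not the real B:Lilburne code, and the option names `max_trials` and `max_seconds` are my invention for illustration:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code repeatedly until the relative standard error of the mean
# drops below $opt{error}, but never beyond $opt{max_trials} trials or
# $opt{max_seconds} of wall clock. Option names are illustrative, not
# the real B:Lilburne interface.
sub bench_with_caps {
    my ($code, %opt) = @_;
    my $error       = $opt{error}       || 0.05;   # 5% relative error
    my $max_trials  = $opt{max_trials}  || 1000;
    my $max_seconds = $opt{max_seconds} || 60;

    my @samples;
    my ($sum, $sumsq) = (0, 0);
    my $started = [gettimeofday];

    while (1) {
        my $t0 = [gettimeofday];
        $code->();
        my $elapsed = tv_interval($t0);
        push @samples, $elapsed;
        $sum   += $elapsed;
        $sumsq += $elapsed ** 2;

        my $n = @samples;
        if ( $n >= 2 ) {
            my $mean   = $sum / $n;
            my $var    = ( $sumsq - $sum ** 2 / $n ) / ( $n - 1 );
            my $stderr = sqrt( $var > 0 ? $var / $n : 0 );
            last if $mean > 0 and $stderr / $mean < $error;
        }

        # The new caps: give up on certainty rather than run for hours
        last if $n >= $max_trials;
        last if tv_interval($started) >= $max_seconds;
    }

    my $n = @samples;
    return ( $n, $sum / $n );
}
```

The point of the caps is that either one fires on its own: a noisy benchmark that never reaches the requested certainty still terminates at the trial or wall-clock limit.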
I've also provided a mechanism to integrate with "enterprisey" code, which will often do its own timing capture. The new ->add method lets you add an elapsed time to a benchmark that has been captured independently outside of the benchmark script, while still allowing you to retain the statistics-driven iteration of trials.
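To illustrate the idea behind ->add, here is a toy accumulator with the same shape. This is a self-contained sketch, not B:Lilburne itself: externally measured elapsed times are fed in, and the statistics-driven loop still decides when enough samples have been collected.

```perl
use strict;
use warnings;

# A toy accumulator with the same shape as the proposed ->add API.
package Toy::Bench;

sub new {
    my ($class, %opt) = @_;
    return bless { error => $opt{error} || 0.05, samples => {} }, $class;
}

# Record an externally captured elapsed time against a tag
sub add {
    my ($self, $tag, $elapsed) = @_;
    push @{ $self->{samples}{$tag} }, $elapsed;
    return $self;
}

# True while we still need trials to reach the requested certainty
sub need_more_samples {
    my ($self, $tag) = @_;
    my $samples = $self->{samples}{$tag} || [];
    my $n = @$samples;
    return 1 if $n < 2;
    my $sum  = 0; $sum += $_ for @$samples;
    my $mean = $sum / $n;
    my $var  = 0; $var += ($_ - $mean) ** 2 for @$samples;
    $var /= ( $n - 1 );
    my $stderr = sqrt( $var / $n );
    return $mean > 0 ? ( $stderr / $mean >= $self->{error} ) : 0;
}

package main;

my $bench = Toy::Bench->new( error => 0.10 );
while ( $bench->need_more_samples('query') ) {
    # In real code this elapsed time would come from the application's
    # own timing capture, not from the benchmark script itself.
    my $elapsed = 0.100 + rand(0.002);
    $bench->add( query => $elapsed );
}
```

The benchmark script never wraps the timed code itself; it just keeps asking the application for more measurements until the certainty target (or a cap) is hit.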
Finally, we've reached the point where we absolutely have to get rid of Benchmark-style formatted output. Instead, B:Lilburne comes with options to output to STDOUT for capture by an external harness, in the same way the Test:: family of modules uses a protocol to report test results.
Longer term I'll probably switch to JSON so we can include less tabular data, such as a "verbose" option to spit out the detailed timings. For simplicity I'm just using CSV as my output format in the short term, since that doesn't require me to define a META.yml-like tree structure.
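As a sketch of what that short-term CSV output might look like (the column set here is my guess at a reasonable minimum, not the committed format):

```perl
use strict;
use warnings;

# Build one CSV row per benchmark for an external harness to capture.
# Column names are illustrative, not B:Lilburne's actual format.
sub csv_report {
    my (%results) = @_;
    my @lines = ("name,trials,mean_seconds,stderr_seconds");
    for my $name ( sort keys %results ) {
        my $r = $results{$name};
        push @lines, sprintf "%s,%d,%.6f,%.6f",
            $name, $r->{trials}, $r->{mean}, $r->{stderr};
    }
    return join( "\n", @lines ) . "\n";
}

print csv_report(
    'create_invoice' => { trials => 20, mean => 0.012345, stderr => 0.000210 },
    'search_index'   => { trials => 35, mean => 0.104200, stderr => 0.004900 },
);
```

A flat row-per-benchmark format like this is trivial for a nightly harness to diff against the previous run, which is all the smoke-benching use case needs.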
Structurally, our bench suite is laid out similarly to a test suite.
We have a benchmark directory, with a collection of files ending with
I'll report more on our experiments as they continue, but if you know of any other prior art in this area, feel free to link me to it in the comments.
If you'd like to see the changes I've made so far, you can see the merge of B:Lilburne features back into Benchmark::Timer in my repository.