A common use case for performance engineers is to determine the effect of a system change relative to an established baseline. If the effect is large and obvious then our work here is done. More often than not, however, effects are not obvious and we are left wondering whether to endorse the change or not. Statistical methods were, in essence, developed to enable us to make statements regarding the behavior of a system (population) given a small glimpse (sample) of how it behaved at one time or another.
Consider 1,000 response time samples collected over an extended period from our web application. This constitutes our performance baseline. We make changes to our application and conduct a quick test, collecting 100 response time samples. The histograms below show how the response times are distributed in each case. The new configuration has some outliers at the higher end of the scale but a narrower distribution around the peak than the baseline. To compare the distributions more equitably our strategy is to repeatedly re-sample the baseline data, calculating the mean of each sample. By the Central Limit Theorem (CLT), this so-called sampling distribution of the sample mean will approach a Gaussian distribution as more and more samples are taken from the baseline data.
As per the CLT, the resulting Gaussian will have the same mean as the original distribution (our baseline data) and variance equal to the original divided by the sample size. In this case our sample size is limited by the size of the test sample i.e. 100. The sequence of graphs below show how our Gaussian distribution forms as the number of bootstrap samples increases, converging on a mean ($\mu$) of 0.767 and standard deviation ($\sigma$) 0.0146.
Gaussian in hand, we can leverage the properties of this well-known distribution to make statements regarding our system. The curve can be used to determine the likelihood of our baseline system of generating samples that fall within a specific range. In this case the test mean is 0.793, higher than the baseline mean, so we want to determine the probability of baseline samples being greater than 0.793. In the figure below, this accounts for the shaded area under the right tail which can be calculated using Z-tables or widgets like this. In our case, this area amounts to 0.0367 (3.7%), which is interpreted as the probability that our baseline system would produce a sample like the test. This is very weak support for the notion that our changes have not regressed system performance.
Loosely speaking, we have conducted a one-sided hypothesis test and the principle applies equally for test samples lower than the baseline mean. In such a test we make the null-hypothesis that there has been no change to the system and then determine the strength of evidence that our test results would occur by chance alone i.e. no change to the system. If this support falls below a selected threshold (5% is common), we reject the null-hypothesis.
If one wanted to improve the accuracy of our estimates above, we would need to decrease the variance of our distribution of the sample mean. In the Bayesian sense, this variance represents the uncertainty in our estimate of the actual mean from the data we have available. For the most accurate estimate, we should maximize the sample size since in the sampling distribution of the sample mean, variance is proportional to the inverse of the sample size. Therefore, in order to improve the accuracy of our estimates, we need to collect more test data with the new configuration. Intuitively this should make sense.
Below we show how the variance of the sampling distribution of the sample mean decreases with sample size. These graphs illustrate location and shape only and probabilities (y-axis) should be ignored.
So there you have it: although we do not have a definitive answer regarding the effect of our changes to the system, we have made some steps towards quantifying our uncertainty. This might not seem like a lot to engineers accustomed to micrometers and 3-point precision but it's a fundamental step towards controlling a given process. Uncertainty is inevitable in any system we choose to measure and an important qualifier for any actionable metrics. Ask Nate Silver.
Notes:
In this example we used web server response time as a metric but this analysis applies equally to any system metric. Data representing our two systems was collected by pointing curl at two different websites:
The python code for this example is available as an IPython notebook here