As per the CLT, the resulting Gaussian will have the same mean as the original distribution (our baseline data) and variance equal to the original divided by the sample size. In this case our sample size is limited by the size of the test sample i.e. 100. The sequence of graphs below show how our Gaussian distribution forms as the number of bootstrap samples increases, converging on a mean ($\mu$) of 0.767 and standard deviation ($\sigma$) 0.0146.
Gaussian in hand, we can leverage the properties of this well-known distribution to make statements regarding our system. The curve can be used to determine the likelihood of our baseline system of generating samples that fall within a specific range. In this case the test mean is 0.793, higher than the baseline mean, so we want to determine the probability of baseline samples being greater than 0.793. In the figure below, this accounts for the shaded area under the right tail which can be calculated using Z-tables or widgets like this. In our case, this area amounts to 0.0367 (3.7%), which is interpreted as the probability that our baseline system would produce a sample like the test. This is very weak support for the notion that our changes have not regressed system performance.
Loosely speaking, we have conducted a one-sided hypothesis test and the principle applies equally for test samples lower than the baseline mean. In such a test we make the null-hypothesis that there has been no change to the system and then determine the strength of evidence that our test results would occur by chance alone i.e. no change to the system. If this support falls below a selected threshold (5% is common), we reject the null-hypothesis.
If one wanted to improve the accuracy of our estimates above, we would need to decrease the variance of our distribution of the sample mean. In the Bayesian sense, this variance represents the uncertainty in our estimate of the actual mean from the data we have available. For the most accurate estimate, we should maximize the sample size since in the sampling distribution of the sample mean, variance is proportional to the inverse of the sample size. Therefore, in order to improve the accuracy of our estimates, we need to collect more test data with the new configuration. Intuitively this should make sense.
Below we show how the variance of the sampling distribution of the sample mean decreases with sample size. These graphs illustrate location and shape only and probabilities (y-axis) should be ignored.
So there you have it: although we do not have a definitive answer regarding the effect of our changes to the system, we have made some steps towards quantifying our uncertainty. This might not seem like a lot to engineers accustomed to micrometers and 3-point precision but it's a fundamental step towards controlling a given process. Uncertainty is inevitable in any system we choose to measure and an important qualifier for any actionable metrics. Ask Nate Silver.
In this example we used web server response time as a metric but this analysis applies equally to any system metric. Data representing our two systems was collected by pointing curl at two different websites:
The python code for this example is available as an IPython notebook here