

Monte Carlo Bootstrap Error Analysis

The premise of bootstrapping error analysis is fairly straightforward. For a time series containing N points, choose a set of N points at random, allowing duplication (that is, sample with replacement). Compute the average from this ``fake'' data set. Repeat this procedure a number of times and compute the standard deviation of the averages of the ``fake'' data sets. This standard deviation is an estimate of the statistical uncertainty in the average computed using the real data. What this technique really measures is the heterogeneity of the data set, relative to the number of points present. For a large enough number of points, the average computed using the faked data will be very close to the value from the real data, with the result that the standard deviation will be low. If you have relatively few points, the deviation will be high. The technique is quite robust, easy to implement, and correctly accounts for time correlations in the data. Numerical Recipes has a good discussion of the basic logic of this technique. For a more detailed discussion, see ``An introduction to the bootstrap'', by Efron and Tibshirani (Chapman and Hall/CRC, 1994). Please note: bootstrapping can only characterize the data you have. If your data is missing contributions from important regions of phase space, bootstrapping will not help you figure this out.
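To make the procedure concrete, here is a minimal sketch of bootstrapping the error in a simple average, written in Python with NumPy purely for illustration; the function name and defaults are my own choices and are not part of the WHAM code itself.

    import numpy as np

    def bootstrap_mean_error(data, n_trials=1000, rng=None):
        """Estimate the statistical uncertainty of the mean of `data`
        by resampling it with replacement n_trials times."""
        rng = np.random.default_rng() if rng is None else rng
        data = np.asarray(data, dtype=float)
        n = len(data)
        # Each trial: draw N points with replacement and average them
        fake_means = np.array([rng.choice(data, size=n, replace=True).mean()
                               for _ in range(n_trials)])
        # The spread of the fake averages estimates the error in the real average
        return data.mean(), fake_means.std()

For example, mean, err = bootstrap_mean_error(time_series) returns the average of the real data along with the bootstrap estimate of its statistical uncertainty.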

In principle, the standard bootstrap technique could be applied directly to WHAM calculations: generate a fake data set for each time series, perform the WHAM iterations, and repeat the whole calculation many times. However, this would be inefficient, since it would require either a) writing many fake time series to the file system or b) holding the time series in memory. Neither strategy is particularly satisfying, the former because it generates a large number of files and the latter because it consumes a very large amount of memory. My implementation of WHAM is very memory efficient precisely because it does not store the time series; it doesn't even store the whole histogram of each time series, just the nonzero portion.

However, there is a more efficient alternative. The principle behind bootstrapping is that you are trying to establish the variance of averages calculated with N points sampled from the true distribution function, using your current N points of data as an estimate of that true distribution. The histogram of each time series is precisely that: an estimate of the probability distribution. So, all we have to do is pick random numbers from the distribution defined by that histogram. Once again, Numerical Recipes shows us how to do it: compute the normalized cumulant function $c(x)$, generate a random number $R$ uniformly distributed between 0 and 1, and solve $c(x) = R$ for $x$. Thus, a single Monte Carlo trial is computed in the following manner:

  1. For each simulation window, use the computed cumulant of the histogram to generate a new histogram with the same number of points (a sketch of this sampling step follows the list).

  2. Perform the WHAM iterations on the set of generated histograms.

  3. Accumulate the normalized probability and free energy for each bin, along with their squares, so that the averages and fluctuations can be computed once all trials are finished.
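Here is a minimal sketch of the sampling in step 1, assuming each histogram is stored as an array of per-bin counts; this is only an illustration of the inverse-transform sampling described above, not the actual WHAM source code.

    import numpy as np

    def resample_histogram(counts, rng=None):
        """Generate a 'fake' histogram with the same total number of
        points by sampling from the normalized cumulant of `counts`."""
        rng = np.random.default_rng() if rng is None else rng
        counts = np.asarray(counts, dtype=float)
        n_points = int(counts.sum())
        # Normalized cumulant c(x), increasing from 0 to 1
        cumulant = np.cumsum(counts) / counts.sum()
        # Draw R uniformly on [0,1) and solve c(x) = R for the bin index x
        r = rng.random(n_points)
        bins = np.searchsorted(cumulant, r)
        return np.bincount(bins, minlength=len(counts))

Each simulation window's histogram is resampled this way, and the WHAM iterations are then run on the full set of fake histograms.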

There's a subtlety to how you compute fluctuations in the free energy estimates, since the potential of mean force is only defined up to a constant. I have chosen to align the PMFs by computing them from the normalized probabilities, which is effectively the same as setting the Boltzmann-averaged free energies equal. This is a somewhat arbitrary choice (for example, one could also set the unweighted averages equal), but it seems reasonable. If you want something bulletproof, use the probabilities and their associated fluctuations, which don't have this problem.
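As an illustration of how the per-bin averages and fluctuations become error bars, here is a sketch that simply stores all the trials and computes the statistics at the end (the accumulation of sums and squares described above is equivalent); the function name and the value of kT are example choices of mine, not taken from the WHAM code.

    import numpy as np

    def trial_statistics(trial_probs, kT=0.596):
        """Given one normalized-probability array per bootstrap trial,
        return per-bin means and standard deviations for the probability
        and for the PMF.  kT = 0.596 kcal/mol (roughly 300 K) is only an
        example value; use the units of your own calculation."""
        p = np.asarray(trial_probs)      # shape: (n_trials, n_bins)
        # PMFs computed from normalized probabilities are implicitly
        # aligned, since each trial's probabilities sum to 1
        pmf = -kT * np.log(p)
        return p.mean(axis=0), p.std(axis=0), pmf.mean(axis=0), pmf.std(axis=0)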

The situation is slightly more complicated when one attempts to apply the bootstrap procedure in two dimensions, because the cumulant is not uniquely defined. My approach is to flatten the two-dimensional histogram into a one-dimensional distribution and take the cumulant of that. The rest of the procedure is the same as in the 1-D case. In release 2.0.4, the option to do 2-D bootstrapping has been commented out. I'm not sure whether there is a programming problem or whether implementing the improved 1-D procedure simply revealed a deeper issue, but 2-D bootstrapping is currently broken.
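The flattening idea itself is simple enough to sketch. The following is only an illustration in Python/NumPy of the approach just described, not the code that is currently commented out in release 2.0.4.

    import numpy as np

    def resample_histogram_2d(counts2d, rng=None):
        """Resample a 2-D histogram by flattening it to 1-D, sampling from
        the cumulant of the flattened distribution, and mapping the sampled
        flat indices back onto the original grid."""
        rng = np.random.default_rng() if rng is None else rng
        counts2d = np.asarray(counts2d, dtype=float)
        flat = counts2d.ravel()
        cumulant = np.cumsum(flat) / flat.sum()
        flat_bins = np.searchsorted(cumulant, rng.random(int(flat.sum())))
        fake_flat = np.bincount(flat_bins, minlength=flat.size)
        return fake_flat.reshape(counts2d.shape)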

There is one major caveat throughout all of this analysis: thus far, we have assumed that the correlation time of each time series is shorter than the snapshot interval. To put it another way, we've assumed that all of the data points are statistically independent. However, this is unlikely to be the case in a typical molecular dynamics setting, which means that the sample size used in the Monte Carlo bootstrapping procedure is effectively too large, and the bootstrapping procedure therefore underestimates the statistical uncertainty.

My code deals with this by allowing you to set the correlation time for each time series used in the analysis, in effect reducing the number of points used in generating the fake data sets (see the section describing the file formats). For instance, if a time series had 1000 points, and you determined by other means that the correlation time was 10x the time interval of the time series, then you would set ``correl time'' to 10, and each fake data set would have 100 points instead of 1000. If the value is unset or is greater than the number of data points, then the full number of data points is used. Please note that the actual time values in the time series are not used in any way in this analysis; for purposes of specifying the correlation time, the interval between consecutive points is always taken to be 1.

The question of how to determine the correlation time is in some sense beyond the scope of this document. In principle, one could simply compute the autocorrelation function for each time series; if the autocorrelation is well approximated by a single exponential, then 2x the decay time (the time it takes the autocorrelation to drop to $1/e$) would be a good choice. If it's multiexponential, then you'd use the longest time constant. However, be careful: you really want the longest correlation time sampled in the trajectory, and the reaction coordinate may fluctuate rapidly yet still be coupled to slower modes.
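As a rough illustration of that recipe, here is a sketch that estimates the $1/e$ decay time of the autocorrelation function and then applies the effective-sample-size rule from the previous paragraph; the function names and the simple $1/e$ criterion are my own choices and are not part of the WHAM code.

    import numpy as np

    def acf_decay_time(series):
        """Return the first lag (in snapshot intervals) at which the
        normalized autocorrelation of `series` drops below 1/e."""
        x = np.asarray(series, dtype=float)
        x = x - x.mean()
        acf = np.correlate(x, x, mode='full')[len(x) - 1:]
        acf /= acf[0]
        below = np.where(acf < 1.0 / np.e)[0]
        return int(below[0]) if below.size else len(x)

    def effective_points(n_points, correl_time=None):
        """Number of points per fake data set: an unset or too-large
        correlation time falls back to the full number of points."""
        if correl_time is None or correl_time > n_points:
            return n_points
        return max(1, n_points // correl_time)

Following the rule of thumb above, a reasonable setting would be correl_time = 2 * acf_decay_time(series), assuming the decay is close to single-exponential.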

It is important to note that the present version of the code uses the correlation times only for the error analysis, not for the actual PMF calculation. This isn't likely to be an issue, as the raw PMFs aren't that sensitive to the correlation times unless they vary by factors of 10 or more.

