Laboratory 7. Bayesian Analyses and Post hoc analyses
Using the Claytonia dataset from your homework, we will be investigating the influence of run length on parameter and tree estimation. For this lab, you will need to use Tracer v 1.5 (Rambaut & Drummond, 2007), R Studio, and Text Wrangler and you will need several sets of files. Three folders of files were sent to you today via email—these contain the files from three independent Bayesian analyses conducted on the same dataset, Claytonia_reduced_plastid.nex, and you should have one folder of files from your own Bayesian analysis. Place these four folders of files in your working directory.
The goals of this lab are to:
- Compare across Bayesian analyses varying in run length (i.e. ngen).
- See how changes in run length affect posterior probability estimates for clades in a tree.
- Understand how post hoc analyses can be used to examine Bayesian parameter estimates.
Working directory. — Create a working directory in a convenient place called Lab7 (no space). In this directory, you should have the folders…
Claytonia_1milliongens,
Claytonia_2.5milliongens,
Claytonia_5milliongens, and
the folder of results from Homework 6 (name this Claytonia_500Kgens).
Comparing average standard deviations of split frequencies in R (i.e., visualizing frequency data with histograms). — We are going to first look at the .mcmc files in each of your folders. These files contain information about the mcmc chain. In Text Wrangler, open the .mcmc files in each of your four run folders and delete the text preceding the tab-delimited table.
First, the file will look like this…
After deleting the extraneous header information, your file should look like this….
Save this file without changing the file name and modify each of the other files in turn.
Open R Studio, navigate to your working directory, and set your working directory as Claytonia_500Kgens.
dta <- read.table(Claytonia_reduced_plastid.nex.mcmc, header = TRUE, sep = “\t”) ##reads a tab-delimited file, you may have to modify the file name…
colnames(dta) ##shows the column names of your .mcmc file—we are especially interested in the last column, “AvgStdDev.s.”
sd <-(dta[,dta$AvgStdDev.s.]) ## make a new object that includes only our column of interest
dim(sd) ##check the dimensions of this column. Is this the number of values you were expecting given the sampling frequency you specified? You may have to check the Bayes block in the nexus (input) file, which is in your job folder.
fivemil<- hist(sd, breaks=100, col=”red”, xlab=”average SD of split frequencies”, main=”500,000 generations”) ##make an object that is a plot of the average standard deviation of split frequencies for your run.
summary(sd) ##find the minimum average standard deviation of split frequencies and report this in the space provided on the next page. Also record the median and mean average SD.
Export the histogram you made to your working directory. Then set your working directory as Claytonia_1milliongens. Read in the Claytonia_1mil.nex.mcmc file using the commands above and make a histogram for this set of runs. Export the histogram as before and repeat the procedure for each of the jobs. In total you should have four histograms showing the change in average standard deviation of split frequencies over the course of your run.
How do the values change when you increase the length (number of generations) of the run?
500,000 generations
Final (minimum) average standard deviation of split frequencies:
Median:
Mean:
1,000,000 generations
Final (minimum) average standard deviation of split frequencies:
Median:
Mean:
2,500,000 generations
Final (minimum) average standard deviation of split frequencies:
Median:
Mean:
5,000,000 generations
Final (minimum) average standard deviation of split frequencies:
Median:
Mean:
When you have recorded your results, compare your findings with the rest of the class. What do you notice about variation in standard deviations for each of the analyses?
Now we have looked at variance associated with the tree search. Let’s examine variance associated with parameter estimation…
Open Tracer. From the File menu, go to Import Trace File… navigate to the folder containing the files from your Homework 6 assignment (i.e., ngen=500,000) and import both Claytonia_1mil.nex.run1.p and Claytonia_1mil.nex.run2.p – these should both appear in Tracer. You may need to change the import file type from BEAST log (*.log) Files to All Files in order to select and import the parameter files.
In Tracer, you should see your two parameter files in the top, left corner of your screen. If you select Combined, you will see the results of both runs.
There are four analysis tabs to choose from:
- Estimates shows the mean, standard deviation, confidence intervals and other statistics about the selected parameter (under Traces on the left side of the screen).
- Density shows the Bayesian posterior density plot for the selected parameter.
- Joint-Marginal appears if exactly two parameters are chosen (hold down shift to select multiple parameters). It then plots one against the other to look at their joint-marginal distribution.
- Trace shows the trace of the parameter against state or generation number. Use this to check mixing, choose a suitable burn-in and look for trends that might suggest problems with convergence.
For your first set of files (500K), examine each tab for each of your two runs. Do the Estimates differ much between runs?
Take a look at the Traces menu on the left side of your screen. In red, you will see ESS (effective sample size) statistics. ESS is the number of independent samples in the trace; a low number means that many correlated samples (often times adjacent each other in the analysis) were detected from the trace and that the trace may not represent the posterior distribution well. Do the ESS values differ between your two runs?
The ESS is sometimes used to determine how much longer a Bayesian analysis would need to run in order to stabilize. Ideally, we would like an ESS greater than 200. If the ESS for run 1 is 50, then we’d need to run the analysis for at least 4 times longer than we did. Given your ESS values for 500,000 generations, how many generations might you specify next time?
Now go to the Trace tab at the top of your screen. This shows you the log likelihood values over your 500,000 generations. We are looking for stabilization in the variance between adjacent runs over sequential generations. Does it look as though your runs are stabilizing?
Go to the Density tab next for each of the parameters—we would like the posterior probability density to be a nice, bell-shaped curve. Select both runs by holding down the shift key and selecting both runs. Do the distributions overlap?
Tracer is used as a post hoc test to determine whether a Bayesian analysis has run long enough. Go back to Import Trace File… and import the parameter files from the other Bayesian analyses I sent to you. These include
Claytonia_1mil.nex.run1.p
Claytonia_1mil.nex.run2.p
Claytonia_2.5mil.nex.run1.p
Claytonia_2.5mil.nex.run2.p
Claytonia_5mil.nex.run1.p
Claytonia_5mil.nex.run2.p
Examine each pair of runs and fill in the tables below:
Some workers have argued that maximum-likelihood and Bayesian analyses will fail to recover the ‘true tree’ and accurate parameter estimates more frequently if the appropriate model of evolution is not used. In our Bayesian analyses, we can see how well the model fits the data by examining the posterior distribution and variance among runs in Tracer. Is the difference in variance between runs low in the Bayes analysis that ran for 5 million generations?
In what respects might a phylogenetic tree based on the 5 million generation analysis differ from the 500 thousand generation analysis you did for Homework 6? Please refer to tree shape, branch length, posterior probabilities, etc.
Software:
Rambaut, A. and A.J. Drummond. 2007. Tracer v1.4, Available from http://beast.bio.ed.ac.uk/Tracer
RStudio (2013). RStudio: Integrated development environment for R (Version 0.97) [Computer software]. Boston, MA. Available from http://www.rstudio.org/