Laboratory 6: DNA sequence alignment, model estimation, and Bayesian inference
In this lab, we will use Bayesian Inference to estimate parameters and phylogenies for two datasets. One is a nuclear ribosomal ITS dataset including ITS1, 5.8S and ITS2. The other is a chloroplast dataset including petA-psbJ, rpl14-rpl46, and trnQ-rps16. In all cases, the characters are nucleotides and the taxa are… Claytonia! Before we estimate trees, though, let’s take a little time to revisit the basics—homology assessment (a.k.a. DNA sequence alignment) and model estimation.
The goals of this lab are to:
- Explore DNA sequence alignments to identify potentially challenging assumptions…
- Edit alignments to reflect your best assessment of homology.
- Estimate models of evolution for multiple DNA matrices and partitions.
- Use model estimates as priors in Bayesian analyses.
To begin, make sure that Seaview and MrBayes 3.2 are in your Applications folder and that you have a working directory for today’s lab called Lab6. You will also need to use a text editor to modify your nexus files later on. The four files I emailed to you earlier today should be in your Lab6 working directory.
Exploring and editing alignments. — Open the Seaview application, go to File, then Open in the main menu, and navigate to your working directory. Open the file called Claytonia_OQuinn_petA-psbJ. Maximize the Seaview window so that you can view as much sequence as possible. Make a few general observations about what you see here…. What are you looking at?
What you see here are several DNA sequences that have not been aligned yet. You can probably see areas that line up and others that look like they might line up if you moved them over. What we want, ultimately, is a DNA alignment (matrix) in which the character states (A, C, T and/or G) for each character (column number) are assigned to the correct character.
- Go to Align and choose Align all to conduct an initial alignment of your sequences. An initial alignment is much faster that aligning sequences “by-hand”. The ClustalX application you have will conduct more thorough alignments, but you’ll have to bring them into an editor like Seaview to make any modifications “by eye”.
- When the alignment is finished, use the scroll bar along the bottom of your window to view the changes. How do the sequences look different now?
- If you notice areas of the alignment that look off (i.e., were not sufficiently aligned by the algorithm in Seaview), modify the alignment yourself by inserting or deleting gaps until you are satisfied with your alignment. Save the alignment as a nexus file.
- Notice that the ends of the sequences are not even—some of them are longer than others. There are many causes for length differences in sequences, some of them with an evolutionary basis and some with a mechanical (user error) basis. What are these reasons?
- Remove the uneven ends from the 3’ and 5’ ends of your sequence alignment by selecting base positions and hitting Delete.
When you are satisfied with your modifications to the first alignment, save it as a Nexus file in your working directory, then open each of the other alignments in turn and edit them using steps 1-5 above. Make notes when you run into a challenging area of the alignment.
Finally, open your petA-psbJ, rpl14-rpl46, and trnQ-rps16 alignments in Seaview (3 different windows should be open), go to the File menu, and choose Concatenate from the menu to make a concatenated plastid dataset. Save the concatenated dataset as a nexus file called Claytonia_OQuinn_plastid.nex.
Model Estimation. — We will estimate model parameters in R for the nuclear and plastid datasets the same way we did in Lab 4. Return to lab 4 and follow the instructions for importing an alignment to R via R Studio. This time let’s open one alignment, estimate a model for that alignment using the AIC, and the results for the first matrix before moving on to the second dataset.
Save your model parameters to your scratch pad. How many parameters were estimated for each dataset?
Tree estimation using Bayesian inference. — Now that you have two alignments— Claytonia_Stoughton_ITS.nex and Claytonia_OQuinn_plastid.nex—let’s run some analyses. Open your ITS dataset and follow the instructions below. When you are finished, follow the instructions for your plastid dataset.
- Make sure that your .nex alignments are in your working directory.
- Although MrBayes reads nexus format, you have to do minor editing to your input file. Open Claytonia_Stoughton_ITS.nex of your nexus files in a text editor and check if you have in the DATA datatype=NUCLEOTIDE. If this is the case, just replace it with datatype=dna.
- While you have this file open, scroll to the end of the file where it says
end;
and paste the following text:
begin mrbayes;
log start replace;
Prset statefreqpr=dirichlet(1,1,1,1);
Lset nst=mixed rates=gamma;
mcmc ngen=10000 printfreq=1000 samplefreq=100;
end;
Consider replacing the prior state frequencies in your Bayes block (1,1,1,1) with the state frequencies calculated in your model estimate.
- Save your file.
- Open Terminal and type mb. At the prompt, type exe, then a space, then drag your nexus file from your working directory into your console and hit return. The analysis should start running immediately.
Note: In MrBayes you can use the command help to see options.
Type help mcmc for options of the mcmc chain.
Change number of generations (ngen), how often the information about the chain is displayed on the screen (printfreq), and how often the Markov chain is sampled (samplefreq), typing the following:
mcmc ngen=10000 printfreq=1000 samplefreq=100
Check the values for the average standard deviation of split frequencies. You should see a value approaching <0.01 for getting convergence of the two runs (although some people use <0.05). Did your runs converge? If they did, at what generations?
You can always run a longer analysis. Just type y (for “yes”) when prompted to continue with the analysis, then enter the number of extra generations. In this case, run 90000 generations, to have a total of 100,000. Check the split frequencies.
Stop the analysis by typing n when the program finishes all generations. As you ran 100,000 generations and sampled each 100, the number of samples (trees) you have for each run is 1,000.
Check the graph by typing plot
- You should see a plot of generation versus log probability of the data.
Check the summary of the parameters, typing sump
- Potential scale reduction factors (PSRF) should approach to 1 to get convergence.
Check the summary of trees (how many trees you have) by typing sumt
Obtain a majority rule consensus tree (50%) from all trees obtained by entering the command sumt conformat=simple.
By default, MrBayes will discrad the first 25% of the trees as burnin for doing the majority rule consensus tree. Check to see if your runs converged at 25%– if they didn’t, you should increase the burnin value typing sumt burnin=x, where x is the number of samples to be treated as burnin.
MrBayes will create a number of files:
*.con – Resulting consensus tree(s) for each run in Newick format.
*.mcmc – Information about the mcmc chain.
*.part – posterior probabilities of the partitions.
*.run#.p – parameters.
*.run#.t – trees.
*.trprobs – trees sorted by posterior probabilities.
For homework, you will run a longer analysis for the plastid dataset and visualize trees with posterior probability estimates in R. In the next lab, we will talk about post hoc tests to see whether we ran our chains long enough to sufficiently estimate both trees and parameters.