Homework 6: DNA sequence alignment, model estimation, and Bayesian inference
You will need the following software for this assignment: Mr. Bayes v3.2.2 and R Studio.
You also need this handout, your chloroplast alignment from class, and 2 nexus files I emailed you: Claytonia_reduced_plastid.nex and Claytonia_28_samples.nex
Homology Assessment. — We aligned three plastid datasets from O’Quinn (unpublished) with ClustalΩ and Muscle algorithms this week. Please look up the plastid map published by Lee et al. in BMC Genomics 2006 7:61 (doi:10.1186/1471-2164-7-6)– can you locate the petA-psbJ, rpl14-rpl46, and trnQ-rps16 loci in the chloroplast map for cotton? Please draw arrows to the three areas where these loci occur.
Several assumptions are commonly made concerning the chloroplast. Let’s do some fact checking. Answer TRUE or FALSE to the following generalizations and provide a one-sentence explanation of your reasoning.
- The chloroplast is maternally inherited.
- The chloroplast has a relatively small genome compared to the plant nuclear genome.
- Different regions of the chloroplast evolve at different rates.
- The chloroplast genome does not experience recombination, but the nuclear genome does.
- The chloroplast genome (as the sole source of data) is useful for testing hypotheses about hybridization.
- The chloroplast genome evolves slowly relative to the nuclear genome in plants.
Let’s examine the chloroplast dataset you worked with in class this week. There are several ways of looking at data—one of them is to examine how the variation is distributed. Open R Studio and set your working directory to get started. Then….
library(phangorn) ##load the phangorn package
plastid <- read.phyDat(“your_data_set.nex”, format=”nexus”, type=”DNA”)
plastid ##how many sequences are in your dataset? How many characters? If you do NOT have approximately 2900 characters in your matrix, email me for a new alignment##
colormatrix <- as.DNAbin(plastid) ##create object of type DNAbin for alignment visualization
image(colormatrix, show.labels=FALSE) ##plot matrix with default colors & labels; omit taxon labels
Wow! You should see, in black, the characters for which we are missing data. Please list a few reasons why we might have missing data for a subset of the taxa—what processes might lead to this pattern?
Refer to the list of samples I provided to you– this includes all of the samples and loci concatenated to make the plastid matrix you’re working with.
- Are there some taxa for which only one or two loci were sequenced?
- Are there some samples for which the sequence labels were likely misspelled?
In an ideal world, phylogenetic analyses are conducted on observations. “Missing data” are special— we assume the existence of a state at every character, even if we have not observed the states empirically. While the inclusion of missing data is admissible, we must try to minimize it if we want to have a reasonable amount of confidence in our phylogeny estimate. Assuming we want to minimize missing data, we have at least two choices:
- Remove the taxa for which we have only sequenced one or two loci (not three).
- Remove the characters (loci) for which we have many missing taxa.
First, we should go through and change all of the sample names that have simply been misspelled. For example, all three loci for Lewisia triphylla and L. pigmaea were sequenced, but one of the names for each of these samples was spelled differently. In the final alignment, these samples occur as two sequences each instead of one…
Place a star next to each of the names on the previous page that you think may have been mislabeled.
I went ahead and changed these names in our alignment, which reduced the dataset by 10 or 11 samples, then made the following observations…
- If we remove all samples for which less than three loci are available, we reduce the dataset to 28 samples.
- If we retain samples with only
petA-psbJ and trnQ-rps16 … 39 samples remain.
trnQ-rps16 and rpl14-rpl46… 36 samples remain.
petA-psbJ and rpl14-rpl46… 43 samples remain.
- Removing all samples for which only one locus is available reduces the dataset to 51 samples.
Let’s try analyzing the dataset including all samples that have at least 2 loci sequenced. This is the dataset I sent you called Claytonia_reduced_plastid.nex
library(phangorn) ##load the phangorn package
reduced <- read.phyDat(“Claytonia_reduced_plastid.nex”, format=”nexus”, type=”DNA”)
reduced ##how many sequences are in your dataset? How many characters? If you do NOT have approximately 2900 characters in your matrix, email me for a new alignment##
colormatrix <- as.DNAbin(reduced) ##create object of type DNAbin for alignment visualization
image(colormatrix, show.labels=FALSE) ##plot matrix with default colors & labels; omit taxon labels
Now modify the commands above to import the dataset Claytonia_28_samples.nex and visualize the matrix.
How many sequences and characters are included in the Claytonia_28_samples dataset? Visualizing the matrix, have we excluded all the missing data from the alignment?
If this were your study (with your study plants), could you avoid having to exclude data from your analyses through some aspect of the experimental design? (3-4 sentences).
Phylogeny estimation with Bayesian Inference. — Next, let’s use the Claytonia_reduced_plastid.nex file to run a Bayesian analysis. If you finished the lab assignment, you will have generated an initial alignment of the chloroplast regions petA- psbJ, rpl14-rpl46, and trnQ-rps16.
If you open MrBayes and execute your sequence matrix without a command block, the output will contain details about the dataset, but no analyses will run. Open the Claytonia_reduced_plastid.nex matrix in a text editor, and paste the following Bayes command block at the end of the file, right after the end; command.
begin mrbayes;
charset trnQrps16 = 1-607;
charset rpl14rpl36 = 608-1714;
charset petApsbJ = 1715-2904;
Prset statefreqpr=dirichlet(1,1,1,1);
Lset nst=mixed rates=gamma;
unlink revmat=(all) tratio=(all) statefreq=(all) shape=(all); mcmc ngen=100000 printfreq=10000 samplefreq=1000 checkpoint=yes checkfreq=10000;
end;
When you have pasted this block at the end of your nexus file, save your nexus file under the same name, Claytonia_reduced_plastid.nex, and open Terminal. Enter…
mb
…after the prompt.
At the next prompt, type exe, then a space, then drag your Claytonia_reduced_plastid.nex file into the console to specify both the path and file name of your matrix and Bayes command block.
Execute the job. When the analysis has run for 100000 generations, you will be prompted…
Continue with analysis? (yes/no):
Record the average standard deviation of split frequencies. Is it less than 0.01? If so, type no to complete the analysis, then continue to the sump command below. If not, type yes.
When prompted for Additional number of generations: type 400000 and hit return. When 400000 more generations have completed, record the average standard deviation of split frequencies, type in no and continue to the sump command below.
SUMP and SUMT: Summarizing parameters and trees. — At the prompt, type in sump to summarize the posterior probabilities. MrBayes, will (by default) exclude the first 25% of your runs as burnin.
At the next prompt, type in sumt conformat=simple to summarize the posterior probabilities. Again, the default for MrBayes is to exclude the first 25% of your runs as burnin.
As you can see, MrBayes has output a bunch of information about model parameters to your console. Fortunately, this information has also been saved to your working directory! Close the MrBayes terminal and open R Studio.
1. In R Studio, navigate to your working directory (i.e., the folder where your Claytonia_reduced_plastid.nex file resides, along with the Bayes output files). Set this as your working directory in R Studio. Find the file in your working directory called Claytonia_reduced_plastid.nex.con.tre
2. In your R console, load the required packages and then read in your tree file…
library(ape)
library(phangorn)
library(vegan) ##you may need to install this one…
tr<-read.nexus(“Claytonia_reduced_plastid.nex.con.tre”) ##read your tree into R.
tr ##view information about this file.
3. As you can see, there are two trees in this file. It is very likely that they’re similar, so let’s work with just one of them now…
tr<- tr[[1]] ##make a set that includes only the 1st of your two trees.
tr ##gives a brief description of your topology, which has tip labels, node labels, etc.
4. Now let’s make it look nicer…
tr<-root(tr, “Lewtriphylla_ROQ515”, node = NULL, resolve.root = FALSE, interactive = FALSE) ##root the tree with Lewisia.
tr<-ladderize(tr) ##sort branches in your tree from long-short so that it’s easier to read.
plot.phylo(tr, use.edge.length=TRUE, direction=”r”, show.node.label=TRUE, show.tip.label=TRUE, label.offset=0.001, cex=.5) ##plot your tree.
5. What about the posterior probability (PP) values? Right now they are stored with four significant digits, but I’d like to show them with just two. PP values are stored as names (vectors) in your tree…
names(tr) ##list the elements of your tree.
tr$node.label ##lists the PP values associated with internal nodes.
In order to reduce the number of significant digits, we must change them from “characters” of clades to numbers.
Plot your tree again and add your PP values with just two significant digits.
plot.phylo(tr, use.edge.length=TRUE, direction=”r”, show.node.label=FALSE, show.tip.label=TRUE, label.offset=0.001, cex=.5)
nodelabels(pp, cex=.5, bg=”lightpink”, frame=”r”) ##puts PP values on nodes.
I still think this looks crowded. Let’s try using simple symbols rather than actual PP values.
p <- pp ##create a vector, p, that is identical to pp.
p[pp >= 0.95] <- “red” ##values of p that match pp >=0.95 will be red.
p[pp < 0.95 & pp >= 0.75] <- “lightpink”
p[pp < 0.75] <- “white”
Okay, let’s try plotting the phylogeny again…
plot.phylo(tr, use.edge.length=TRUE, direction=”r”, show.node.label=FALSE, show.tip.label=TRUE, label.offset=0.001, cex=.5)
nodelabels(pch=21, cex=0.75, bg=p) ##place circles colored by p as node labels.
Looks great! I love pink. The last thing I want to do is change the edge (branch) colors to show which of these Claytonia samples were collected from southern California…
tr$tip.label ##lists all tip labels and their tip numbers
wh<-which.edge(tr,c(16:18,23:27,30)) ##specify tips to have colored branches.
wh ##shows edges corresponding to our taxa of interest.
colo<- rep(“black”, dim(tr$edge)[1]) ##repeat the color black for all edges in the tree.
colo[wh]<-“orange” ##specify which edges (wh) should be orange.
plot.phylo(tr, use.edge.length=TRUE, direction=”r”, edge.color=colo, edge.width=3, show.tip.label=TRUE, label.offset=0.001, cex=.5)
nodelabels(pch=21, cex=0.75, bg=p)
Looks great! Export your tree as a .pdf (expand the length dimension so the branches are not super short) and send it to me.