Laboratory 8. Statistical tests on phylogenies: discordance among hypotheses
We have been talking about criteria used for phylogeny estimation (parsimony, maximum likelihood, Bayesian inference) and how to use post-hoc tests in Bayesian analyses to investigate whether parameter and tree distributions have been estimated sufficiently.
The goals of today’s lab are to:
- Observe the affect of increasing bootstrap replicates on clade support
- Compare bootstrap and posterior probability estimates
- Visualize trees with bootstrap values subtending clades
- Visualize correspondence between multiple trees
You should have three character matrices in your working directory for today. The datasets (PNW_Pyrola_nuclear.nex, PNW_Pyrola_plastid.nex, and PNW_Pyrola_concatenated.nex) are for samples belonging to a species complex in genus Pyrola collected from the Pacific Northwest U.S. Open these datasets in your text editor and record their dimensions (ntax and nchar).
Open R Studio and open the file called Rscript_lab8.R—this contains all the code you need for this lab– you can execute the commands easily using the Run button or by hitting the ⌘ and return keys simultaneously. Navigate to the folder containing your datasets and set your working directory to Lab8. Load the packages you’ll need for today:
Read each of your matrices in to R one at a time, giving the datasets different names. Consider names that are short—below you’ll see I use abbreviations for the full dataset names.
con<-read.phyDat(“PNW_Pyrola_concatenated.nex”, format=”nexus”, type=”DNA”)
con ##how many sequences?_____ Characters?_____ Different site patterns?_____
nuc<-read.phyDat(“PNW_Pyrola_nuclear.nex”, format=”nexus”, type=”DNA”)
nuc ##how many sequences?_____ Characters?_____ Different site patterns?_____
pla<-read.phyDat(“PNW_Pyrola_plastid.nex”, format=”nexus”, type=”DNA”)
pla ##how many sequences?_____ Characters?_____ Different site patterns?_____
Model estimation—Next, let’s estimate models for the first dataset, con. Complete steps 1-6 for con before estimating models for nuc and then pla.
mod<- modelTest(con, tree=NULL, model = c(“JC”, “F81”, “K80”, “HKY”, “SYM”, “GTR”), G = TRUE, I = TRUE, k = 4, control = pml.control(epsilon = 1e-08, maxit = 3, trace = 1), multicore = FALSE)
mod[order(mod$AIC),] ##visualize and order rows the table by Akaike Information Criterion.
Record the best model for con in the table below before going to the next step.
env = attr(mod, “env”) ##examining “env” for your first analysis, mod
ls(env=env) ## a list of models (and trees!)—each one is associated with a set of trees (and the underlying data) for which likelihood scores have been calculated.
mod.eval <- get(“GTR+G”, env) ##modify this command for each dataset to match the model chosen by AIC
con.mod<- eval(mod.eval, env=env) ##store this object to use later in bootstrap analysis
con.mod ##view your results
Record the model summary in a text document by copying/pasting from your R console. We will use these values later. Then, repeat steps 1-6 for nuc and then pla.
Record the best model for each dataset:
~~ At this point, you should have estimated a model of evolution for each character ~~ matrix and created a text file containing those models.
Statistical tests on phylogenies— The modelTest function is very cool because, concurrent to estimating a model, it also estimates a phylogeny. Today, we will use the trees estimated by modelTest to conduct bootstrap analyses.
You should have three models called con.mod, nuc.mod, and pla.mod. When you simply type the name of this object, nuc.mod, the model of nucleotide evolution (including rate matrix) is shown. Let’s look at the other elements of one of these models.
As you can see, this model has several elements, including a “tree”. Let’s look at it.
rooted<-root(nuc.mod$tree, “Ecampanulata”, node = NULL, resolve.root = FALSE, interactive = FALSE)
plot(lad, cex=.5, type=”phylogram”, use.edge.length=FALSE)
This looks great, but we don’t know anything about statistical support for the clades shown here. Fortunately, we can run a bootstrap analysis on the data and tree using the bootstrap.pml function in phangorn.
bs.nuc10 = bootstrap.pml(nuc.mod, bs=10, optNni=TRUE, control = pml.control(trace = 0))
nuclear10<- plotBS(nuc.mod$tree, bs.nuc10, type=”phylogram”, cex=.5)
Rather than rooting and ladderizing our tree in R, let’s export our tree as a .tre file to open and modify in FigTree later.
Your tree file should appear in your working directory.
Repeat the bootstrap analysis on this dataset with 100 generations and write that tree to a file, too. Repeat the two bootstrap analyses (10 BS replicates and 100 BS replicates) for your plastid model, pla.mod. For the concatenated dataset, write only the tree for which you conducted 100 BS replicates.
When you are finished, you should have five tree files:
nucBS10.tre (10 bootstrap replicates)
nucBS100.tre (100 bootstrap replicates)
plaBS10.tre (10 bootstrap replicates)
plaBS100.tre (100 bootstrap replicates)
conBS100.tre (10 bootstrap replicates)
When you are finished computing bootstrap statistics and have written your trees to files, open them in FigTree. Do you see any clades that are supported by less that 65%?
The names for samples each have a suffix appended to them—either ap, cr, de, or pi. These each stand for the four different species represented in the Ingroup for these datasets. Do the phylogenies for the nuclear and plastid datasets appear to be congruent? In other words, does each of the species form a monophyletic group?
SH test—While I could not find an ILD test in R, it is useful to provide some justification for combining two or more datasets for an analysis. As we talked about before, the benefits of combining datasets are that you will have more synapomorphies supporting the clades in your phylogeny estimate. One of the drawbacks is that we know some loci have different evolutionary histories, which may provide us with slightly (or massively) different topologies.
The SH (Shimodaira-Hasegawa) test is a resampling method that uses site likelihood scores to determine whether two sets of plausible trees (like bootstrap trees for ML analyses) are incongruent. The simple null hypothesis we are testing is, ‘The two trees being compared are congruent’ and a more accurate way of wording this is, ‘Differences in tree fit to data are what we would expect by random chance, alone.’ Rejection of the null hypothesis is justification for not combining the datasets.
fit1 <- pml(concat100, con) ##resampling topologies with optimal lnLik from treespace
fit2 <- pml(plastid100, pla)
fit3 <- pml(nuclear100, nuc)
fit1 <- optim.pml(fit1) ## optimize edge weights
fit2 <- optim.pml(fit2)
fit3 <- optim.pml(fit3)
SH.test(fit1, fit2, fit3, B=500) ##conduct SH test; rejection of null hypothesis (<.05) means the nuclear and plastid datasets are significantly different from the concatenated dataset (justification for not combining partitions).
You should see something like this:
> SH.test(fit1, fit2, fit3, B=500)
Trees ln L Diff ln L p-value
[1,] 1 -5033.737 0.0000 0.768 ##shows no diff in lnL among tree1 reps
[2,] 2 -5868.339 834.6015 0.010 ##shows significant diff in lnL concat and both tree1 (nuc)and tree 2 (pla)
[3,] 3 -6389.012 1355.2748 0.000
To show yourself what it looks like when two congruent trees are compared, rerun the SH test above but compare fit1 with itself:
SH.test(fit1, fit1, B=500) ##Do you see the difference?
Plotting two facing phylogenies using cophylo—In some situations, an SH test (or similar test for incongruence) may indicate that two or more datasets for a set of taxa are more dissimilar than we’d expect by chance alone. BUT, there may be evolutionary processes that can account for discordance among different genetic loci (for example, hybridization among species) and we may want to visualize the discordance. Let’s use the trees you developed to show the correspondence between two loci (one nuclear, one chloroplast) for a single set of taxa.
First, we need to root and ladderize two trees that we want to compare.
tree1<-nuc.mod$tree ##tree from our model estimate for the nuclear dataset
rooted<-root(tree1, “Ecampanulata”, node = NULL, resolve.root = FALSE, interactive = FALSE)
tree2<-pla.mod$tree ##tree from our model estimate for the plastid dataset
rooted<-root(tree2, “Ecampanulata”, node = NULL, resolve.root = FALSE, interactive = FALSE)
Next, we need to create an association matrix consisting of tow columns—in the first column we need all of the tip labels for tree1 (so, each row represents a sample) and in column two we need all the tip labels for tree2. Because we want the tip labels in tree1 and tree2 to correspond exactly in this case, we’ll need to sort the tip labels before making the matrix.
tips1<- sort(tree1$tip.label) ##sort labels for tree1 (nuclear tree) alphabetically
tips2<-sort(tree2$tip.label) ##sort labels for tree2 (plastid tree) alphabetically
To combine both lists of labels into a matrix (63 rows, 2 columns), use the cbind command.
cophy<-cbind(tips1, tips2) ##here we make a two-column matrix called cophy—in the first column is the list of taxa associated with tree1 (identical to those in tree2) and the second column will be the taxa for tree2. This matrix will specify the associations (links) between terminals in our nuclear phylogeny and terminals in our chloroplast phylogeny.
cophy ##visualize this small matrix in your console. Are column 1 and column 2 identical?
cophyloplot(tree1, tree2, assoc=cophy, use.edge.length=FALSE, space=40, length.line=-3, gap=2, type=”phylogram”, rotate=TRUE, show.tip.label=FALSE, font=3) ##this is your “cophylogeny” showing where certain taxa come out in the nuclear and chloroplast phylogenies.
When the argument rotate=TRUE is used, you can click on nodes with your mouse and rotate clades so that the terminals match up better. Try this out!
Also try using the argument show.tip.label=TRUE to see which clades represent the different species. When you have something you like, save it as a pdf and turn it in with your homework.
As homework, modify the nuclear and chloroplast phylogenies in FigTree. Your trees should have the following qualities:
- Root the phylogeny using Enkianthus campanulata (Ecampanulata).
- Ladderize (order) the branches so nested relationships are easy to see.
- Show bootstrap support values.
- Color clades (or singles branches, in some cases) according to species—ap, cr, de, pi. You can leave the outgroup taxa black.
- Export each of the four trees as a pdf document, then fit them all onto a single page in Microsoft Word or PowerPoint) for comparison.
- Label each tree with the locus (nuclear, plastid) and the number of bootstrap (BS) replicates.
- For each tree, circle the nodes that are supported by less than 65% BS.
- On the same sheet, provide two or three sentences describing how the trees differ from each other.
- Print the cophylogeny you made and turn it in to me.