Laboratory 4: Characteristics of molecular data—modelling variation
In this lab, you will use R (in R Studio) to examine characteristics of a nucleotide sequences.
The goals of this lab are to:
- Explore different ways of looking at DNA characters.
- Know how and why to partition sets of genetic characters.
- Estimate models of evolution for genetic characters.
- Observe how specifying a model of evolution influences phylogeny estimation.
In the prelab, you downloaded a Clusia dataset from GenBank. This dataset is relatively straightforward, so let’s use it today. Parsimony analyses of these data were published by Gustafsson and Bittrich in a 2002 Nordic Journal of Botany article called Evolution of morphological diversity and resin secretion in flowers of Clusia (Clusiaceae): insights from ITS sequence variation.
Open R Studio and set your working directory. Load the packages you need today using the library command rather than loading them from your package tab
total<-read.phyDat(“Clusia_total.nex”, format=”nexus”, type=”DNA”) ##read data into R
total ##gives the number of sequences, number of characters, and character states
ITS <-as.DNAbin(total) ##create object of type DNAbin for alignment visualization
image(ITS, show.labels=FALSE) ##plot matrix with default colors & labels; omit taxon labels
image(ITS, c(“n”,”-“), “grey”, show.labels=FALSE) ##plot only missing data (n) & indels (-) in grey
You may try other variations on these plots. If you specify show.labels=TRUE you can identify individuals that have large amounts of missing sequence data. In your work, you will need to carefully consider whether individuals like this should be omitted from your analyses. Why is missing data so important in phylogenetic estimation?
Partitioning molecular data
The dataset we have here is a nuclear ribosomal region called ITS—the authors of the paper sequences ITS1, 5.8S, and ITS2. While these are adjacent to each other physically, they are likely experiencing different rates of evolution. Why is that?
We can partition the data, breaking the 671 base pairs of sequence into three sections to estimate independent models of evolution, but we need some information about where these sections begin and end along the length of total …
How would we obtain information on where the spacers and 5.8S begin and end?
I took the annotations from Genbank and split the dataset into three separate matrices; ITS1 (1:281), 5.8S (282:448), ITS2 (449:671). Find these in your working directory and import them.
Copy the command for importing phyDat at the beginning of this document, paste it into your R console, and modify the command so that you create four different matrix objects called…
its1, its2, and five8S
…corresponding to the datasets in your working directory.
The Paradis (2012) text for this class provides great detail on the assumptions that go into nucleotide and amino acid substitution models. These days we estimate models from the data themselves, using the frequencies of nucleotides A, T, C, and G across taxa and characters in the data matrix. A substitution model specifies rates of change among the character states A, T, C, and G. In this next exercise, let’s estimate models of evolution for the Clusia datasets using the modelTest function in phangorn.
mod<- modelTest(total, tree=NULL, model = c(“JC”, “F81”, “K80”, “HKY”, “SYM”, “GTR”), G = TRUE, I = TRUE, k = 4, control = pml.control(epsilon = 1e-08, maxit = 3, trace = 1), multicore = FALSE)
The procedure should take about one minute to complete. The modelTest function will calculate the negative log likelihood of each model and output a table of test results.
mod ##view the table of test results—the highest score (smallest negative value) indicate the most probable model.
mod[order(mod$logLik),] ##order rows the table by log likelihood scores.
Next, order the table of test results by AIC and then by BIC. Record the models (and their scores) found using each of these three criteria (i.e., logLik, AIC, BIC).
Find the table on the last page of this document and record the optimal model under each of the criteria: log likelihood, AIC, and BIC. Next, run the modelTest function for the three ITS partitions and record your results in the table. You will need to give each analysis a different name—I suggest naming the additional analyses mod1, mod2, mod3. What you are estimating here is a model of character evolution for each of your partitions…
!?!? But what do all these models specify, anyway ?!?!
Let’s find out! The modelTest function produces a data.frame, mod, with attribute “env”, an environment that contains all the trees, the data and the calls that allow us to choose a substitution model for further analyses.
Copy the following commands into your console one at a time:
env = attr(mod, “env”) ##examining “env” for your first analysis, mod
ls(env=env) ## a list of models (and trees!)—each one is associated with a set of trees (and the underlying data) for which likelihood scores have been calculated
Let’s look at the details of the models that best describe our character matrices.
mod.eval1 <- get(“GTR+G+I”, env) ##model chosen under both logLik and AIC
eval(mod.eval1, env=env) ##shows the summary of model results
The model details should look something like this:
unconstrained loglikelihood: -3797.682
Proportion of invariant sites: 0.09444846
Discrete gamma model
Number of rate categories: 4
Shape parameter: 0.6007271
a c g t
a 0.000000 1.1706485 3.1524465 1.225787
c 1.170649 0.0000000 0.5129253 6.146962
g 3.152447 0.5129253 0.0000000 1.000000
t 1.225787 6.1469623 1.0000000 0.000000
0.2336418 0.2663005 0.2794328 0.2206249
What you’ve got here are the model parameters (estimated from your data) that can be used to constrain your phylogenetic analysis! Copy these results and paste them into your scratch pad in R Studio. At the top of your scratch pad, preceding the data, write TOTAL, GTR+G+I then Save the file as Clusia modelTest results.
Let’s take a look at the alternative model…
mod.eval2 <- get(“SYM+G”, env) ## model chosen under BIC
Again, paste these results into your scratch pad below the data for GTR+G+I model and add the title, TOTAL, SYM+G. How does the SYM+G model compare to the GTR+G+I model? If you were trying to choose conservatively between these two models, which would you choose?
Now obtain details for each of the models for each of your partitions. For total there were two models, and for the other partitions there may be one or three. How do you decide which criterion—logLik, AIC, or BIC—is best?
The model tests you just conducted allow you to see whether any or all of the partitions are evolving in a way from the whole region. I want to make you aware of another set of models that should be used for chloroplast coding sequences—these specify models of amino acid evolution and are listed here (taken from page 146 of Paradis, 2012**).
Estimating phylogeny under a specific model of evolution
Now that we have a model of evolution, we can apply it to our phylogenetic estimation procedure. In an earlier lab, we discussed different ways of constraining analyses by making assumptions about transition order among character states. This time, we will use information about transition rates also!
For now, we will use the parsimony criterion for choosing optimal trees. In lecture, Mark contrasted Fitch (unordered) parsimony with Sankoff (step matrix) parsimony. In Sankoff parsimony, we specify a cost matrix for character transitions—in this case, the cost of transitioning from A to C, A to G, C to T, etc. Look back at the model results you saved. Each of the sets of model details includes a rate matrix specifying the frequency of transition between each pair of bases. We will use these rates to develop our cost matrix!
In this example below, I am specifying a model for total based on the rate matrix estimated under the GTR+G+I model. Check the rate matrix above to make sure you understand where I got the “weights” from, i.e. cost= c(A->C, A->G, A->T, C->G, C->T, G->T).
weighted<-pratchet(total, start=NULL, method=”sankoff”, cost= c(1.1706485, 3.1524465, 1.225787, 0.5129253, 6.146962, 1.000000), maxit=1000, k=50, trace=1, all=FALSE, rearrangements=”NNI”)
See how many trees you recovered by typing
As you can see, we produced an unrooted phylogeny. Fortunately, I know what the outgroup taxon is here (Tripetalum cymosum), so we can specify it. To use the root function, type
rooted<-root(weighted, “AY145206”, node = NULL, resolve.root = FALSE, interactive = FALSE)
Let’s visualize your tree using the plot() function.
Hmmm… this diagram still looks a little funny—I see that it is rooted with the proper taxon now, but the terminal names are in this huge font and it’s hard to see which terminals belong to clades. Let’s use the ladderize and cex functions to change that.
Export this tree as a .pdf to look at later. Now let’s run the same analysis with equal weights and see if there is a difference.
eqweights<-pratchet(total, start=NULL, method=”sankoff”, cost= c(1.000000, 1.000000, 1.000000, 1.000000, 1.000000, 1.000000), maxit=1000, k=50, trace=1, all=FALSE, rearrangements=”NNI”)
rooted<-root(eqweights, “AY145206”, node = NULL, resolve.root = FALSE, interactive = FALSE)
Export this tree as a .pdf also. For your homework, I would like you to do a few things:
1. Compare your weighted (using GTR+I+G) and equal-weights trees to each other and to the phylogenies in Gustafsson and Bittrich’s paper.
2. Using the most complex model of evolution estimated for each of your ITS partitions, generate Sankoff parsimony trees for each of them.
3. Turn in to me:
a. Eight trees (total.weighted, its1.weighted, 5.8S.weighted, its2.weighted, total.unweighted, its1.unweighted, 5.8S.unweighted, its2.unweighted)
b. Write one paragraph to one page comparing and contrasting each weighted/equal-weights pair of trees. Also compare results among the different partitions. Do your results conflict or coincide with Gustafsson and Bittrich?
**Paradis, E. 2012. Analysis of Phylogenetics and Evolution with R [electronic resource], second edition. Springer Science+Business Media, LLC, New York, NY. [ISBN 978-1-4614-1743-9 Online]