Laboratory 3. Multivariate statistical approaches
In this lab, you will use R (in R Studio) to try out a number of multivariate statistical procedures for looking at continuous data. The goals of this lab are to:
- Learn how to interpret summary (or, moment) statistics and boxplots
- Understand the use of multivariate statistics for data exploration and hypothesis testing
- Become familiar with data transformation techniques
- Understand ‘which analyses for which hypotheses?’
You will need the following packages for this lab: ggplots, lattice, outliers, vegan
Using the prelab assignment, import the dataset C_robustipina_2004.csv to R in the R Studio
environment. Check the dimensions and the first few rows of the data set.
As you can see, the first three columns provide information about the individuals sampled (INDIV, POP, TAXON). Additionally, there are different types of characters in this dataset, discrete (categorical) characters and continuous (quantitative) characters. There are several methods for testing how categorical characters like exposure (EXP) and shrubs present (SHRUB) relate to continuous characters, like plant height (H), but for now let’s work with the continuous data (cont).
Check the dimensions and column names for these characters. Are these what you expected? For this lab, we ask the question,
Can subtaxa be distinguished based on morphological characters?
We will focus on just two ways of answering this question using data publishes recently in an AJB paper*; in the first scenario, we have a large sample of individuals that belong to the species, Coryphanthe robustispina, and we want to know if there are any subtaxa that can be distinguished by morphological characteristics.
1. If discrete clusters of individuals are recovered in ordination analyses of multiple characters, then these clusters may be circumscribed as subtaxa associated with different suites of traits (PCA).
In the second scenario, each individual in our dataset has already been assigned to a particular subtaxon and we want to know which characters can be used to best discriminate among subtaxa.
2. If subtaxa are known a priori, then they are likely distinguished based on morphological variation, but some characters will be better than others at predicting which subtaxon the individuals belong to (DFA).
Both PCA (principal components analysis) and DFA (discriminant function analysis) are ordination procedures that are used to test different hypotheses. In conducting either analysis, we are making a few assumptions about the characters.
One of the best ways to identify outliers is to use boxplots. These can also give us a first look at possible differences in means among taxa. To understand what boxplots show, let’s make one for the first character, height (H).
boxplot (cont[,4]~cont$TAXON, main=”H”, las=1)
How does this plot relate to summary statistics for each taxon?
To compare all characters side-by-side, set up a page with multiple panels. Each boxplot will sequentially be added to your plot.
par(mfrow=c(3,4)) #set up a page with 12 panels
boxplot (cont[,4]~cont$TAXON, main=”H”, las=1)
boxplot (cont[,5]~cont$TAXON, main=”D”, las=1)
boxplot (cont[,6]~cont$TAXON, main=”MST”, las=1)
boxplot (cont[,7]~cont$TAXON, main=”IST”, las=1)
boxplot (cont[,8]~cont$TAXON, main=”CSP”, las=1)
boxplot (cont[,10]~cont$TAXON, main=”RSA”, las=1)
boxplot (cont[,12]~cont$TAXON, main=”TSP”, las=1)
boxplot (cont[,13]~cont$TAXON, main=”GSN”, las=1)
sequentially add boxplots for each character, split by taxon, to the panels until you’ve plotted all the characters.
Outliers: When you have identified outlying data points, you may wish to remove them from the dataset for your analyses so that they don’t sway your results wildly in one direction. If you wan to do this, but intend to use all of your characters for ordination analyses, then you must remove the individual (i.e., the entire row) associated with the outlier. There are functions, including rm.outlier in the outlier package, that will identify and remove outliers from your dataset, but remember, outliers are “real” in the sense that they may represent variation you haven’t accounted for in your classification yet, so removing outliers should be done with extreme care and probably only in retrospect.
I have used the rm.outlier command to remove outliers for each character taxon-by-taxon, but the function identified outliers that, according to boxplots, fell within the 1st and 3rd, so I am not using this method. It is probably best to remove outliers if necessary by hand.
For today, let’s use for our analyses a subset of the characters that appear to be more desirable based on the boxplots. Here I am making a new dataset, cont1, with the characters that we should keep:
Check the dimensions (446,13) and column names of cont1 to make sure you selected the characters you intended to.
Now let’s take a look at the distributions of our characters the same way we did in the prelab. First, run the script for paired comparisons that you saved to your working directory last time. If you don’t have it, here it is again:
panel.cor <- function(x, y, digits = 2, prefix = “”, cex.cor, …)
usr <- par(“usr”) on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)
txt <- paste0(prefix, txt)
cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r)
# panel.hist <- function(x, …)
usr <- par(“usr”)
par(usr = c(usr[1:2], 0, 1.5))
h <- hist(x, plot = FALSE)
breaks <- h$breaks
nB <- length(breaks)
y <- h$counts
y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = “cyan”, …)
Next, let’s plot our characters.
pairs(cont1[,4:13], lower.panel=panel.smooth, diag.panel=panel.hist, upper.panel=panel.cor)
Which characters are highly correlated?
Examine the histograms (bar plots along center diagonal) to see whether any characters need to be transformed. Are some skewed in one direction?
Let’s log-transform some of the characters and see if it changes their distributions.
cont2<-logb(cont1[,c(4,10,13)],100) #log transform some of the characters
cont3<-data.frame(cont1[,c(1:3,5:9,11,12)], cont2) #make a new dataset containing the log- transformed characters and the untransformed characters.
pairs(cont3[,4:13], lower.panel=panel.smooth, diag.panel=panel.hist, upper.panel=panel.cor)
What if I have characters that are way larger or way smaller in my dataset? For this, there is a scale function, but we won’t use it today.
PRINCIPAL COMPONENTS ANALYSIS
Use the command you used in the prelab exercise to run a basic PCA on your data:
You learned yesterday that in ordination analyses, there are typically as many components (axes) estimated as there are characters in your matrix. Let’s look at the distribution of variation accounted for by those components by making a Scree plot. Earlier, we specified that we wanted to write boxplots to a series of panels—now we must undo this if we want to view our plots at a normal scale!
As you can see, Comp.1 accounts for a large amount of variation in the entire dataset. You can also type in…
…to view the variation accounted for by each component. In many papers, ordination analyses are represented in two or three dimensions because those are easiest to represent on a page. Based on your Scree plot, how many dimensions do you think best account for overall variation?
Now let’s look at the loadings of the characters on these components. These values indicate how much the variation associated with each character is explained by each component. Values that are much highly negative or positive (deviating from zero) indicate that the character is highly associated with that particular component.
Another way to view the loadings is with a biplot:
Do the loadings of each character match up with your loadings list? Again, let’s look at how the subtaxa are distributed on the graph:
plot(PCA1,PCA2,pch=21, col=tax) legend(4.5,-1.5,levels(tax), pch=21,col=c(“black”,”blue”,”green”,”red”)
Try plotting different components by each other. Does this change the way the clusters look?
Do you think there are four subtaxa here? Which characters are responsible for the most differentiation?
For your homework assignment, you will be comparing your findings to those in the American Journal of Botany publication for which these data were initially used.
* Baker, M.A. and C.A. Butterworth. 2013. Geographic distribution and taxonomic circumscription of populations within Coryphanhe section Robustipina (Cactaceae). American Journal of Botany 100(5): 984-997.