Prelab 4. Transferring DNA sequence data from GenBank to R
GenBank is a database of annotated nucleotide sequences that can be queried to locate sequences (or alignments of sequences). These data are publicly available, and scientists are expected to deposit their sequences here when they publish a peer-reviewed paper using those data.
You can access the website by going to http://www.ncbi.nlm.nih.gov/
Once there, find the link on the right side of your browser that says Nucleotide and follow that link. In the new page, place your cursor in the field at the top of the page (right of the menu that says Nucleotide), type in Clusia, and hit return. We have just queried the database to find all nucleotide sequences that have metadata containing the word “Clusia”.
Let’s pursue this genus, Clusia, because Manuel will likely write his dissertation on this member of the wonderful family Clusiaceae.
In the new window I see that my query results in 685 sequences. The first one is a chloroplast region called matK—this is the short name for the maturase K gene. I also see some sequences associated with nuclear ribosomal RNA genes next on the list. Remember, there may be functions associated with these regions of sequence or with the regions flanking these sequences, but today we are interested in looking at nucleotide variation among many sequences rather than what these regions actually do.
The fourth sample on the list is a sequence for the nuclear ribosomal ITS region, which includes the genes and spacers given after the voucher name below.
I am drawn to this sequence because it’s associated with a PopSet, which is a collection of sequences that were all used to construct a character matrix like the one you made for the caminalcules, with samples (species and voucher name) in rows and characters (nucleotide position) in columns. Follow the link that says PopSet just below the accession number for the Clusia valerioi sample (above).
On the next page you can see the title of the paper these data belong to (wow!) and then a long list of sequences that were analyzed for the paper. If you scroll down the page, you will get to a graphical view of the alignment where you can visualize polymorphic characters (ones with two or more character states when you look across taxa) as red color.
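To make the idea of a polymorphic character concrete, here is a minimal sketch in base R. The three-species alignment is made up for illustration (it is not Clusia data), and the choice to ignore gaps when counting character states is an assumption.

```r
# Toy alignment: samples in rows, nucleotide positions in columns.
# These sequences are invented for illustration only.
aln <- rbind(
  sp1 = c("a", "c", "g", "t", "a"),
  sp2 = c("a", "c", "a", "t", "a"),
  sp3 = c("a", "t", "g", "t", "-")
)

# A column is polymorphic if it shows two or more nucleotide states.
# Assumption: gaps ("-") and ambiguity codes are not counted as states.
is_polymorphic <- apply(aln, 2, function(column) {
  states <- unique(column[column %in% c("a", "c", "g", "t")])
  length(states) >= 2
})

# Positions of the polymorphic columns
which(is_polymorphic)
```

In this toy matrix only the second and third positions vary among the samples, so those are the columns that the graphical view would highlight.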
Let’s transfer these data into R!
Scroll back up to the top of the page and find a drop-down menu called Send To. Click the drop down menu symbol and choose File as your destination. When prompted for data Format, choose Clustal from the list. You may now minimize or close your internet browser.
Create a working directory called Clusia in a location of your choice (e.g., Desktop or Documents). Move your new file, “popset_alignclustal.txt”, into your working directory. Rename this file “Clusia_ITS.txt” so that you can identify it easily.
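If you would rather do this step from R than with your file browser, the sketch below shows one way. The helper name move_alignment and both default folder paths are hypothetical, so adjust them to match where your browser saves downloads and where you want the Clusia folder to live.

```r
# Sketch: create the Clusia folder and move the downloaded alignment into it.
# move_alignment() is a made-up helper; both default paths are assumptions.
move_alignment <- function(download_dir = file.path(path.expand("~"), "Downloads"),
                           work_dir = file.path(path.expand("~"), "Desktop", "Clusia")) {
  # Create the working directory if it does not exist yet
  dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)

  src  <- file.path(download_dir, "popset_alignclustal.txt")
  dest <- file.path(work_dir, "Clusia_ITS.txt")

  if (!file.exists(src)) {
    message("No download found at: ", src)
    return(invisible(FALSE))
  }

  # Copy the file under its new name, then delete the original download
  ok <- file.copy(src, dest, overwrite = TRUE)
  if (ok) file.remove(src)
  invisible(ok)
}

# Usage, once the paths are right for your computer:
# move_alignment()
```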
Open RStudio.
In RStudio, use the Files pane to navigate to the Clusia folder you just made. When you can see the Clusia_ITS.txt file listed in this folder, set the folder as your working directory (in the Files pane, choose More, then Set As Working Directory).
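The same thing can be done from the console. This is a minimal sketch that assumes the Clusia folder is on your Desktop; the path is hypothetical, so change it if you put the folder somewhere else.

```r
# Hypothetical location of the working directory; edit to match your computer.
clusia_dir <- file.path(path.expand("~"), "Desktop", "Clusia")

# Only change directory if the folder really exists
if (dir.exists(clusia_dir)) {
  setwd(clusia_dir)
}

getwd()                          # confirm where R is currently working
list.files()                     # Clusia_ITS.txt should appear in this list
file.exists("Clusia_ITS.txt")    # TRUE if R can see the alignment file
```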
Load the phangorn package—ideally, any dependencies will load at the same time. The packages that should load automatically are ape, lattice, Matrix, igraph, and rgl.
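A minimal sketch of this step is below. It assumes phangorn has already been installed; which dependencies end up attached can differ between phangorn versions, so check the output of search() rather than relying on a fixed list.

```r
# Load phangorn if it is installed; otherwise say how to get it.
if (requireNamespace("phangorn", quietly = TRUE)) {
  library(phangorn)
  # List the attached packages to see which dependencies came along
  print(search())
} else {
  message('phangorn is not installed; run install.packages("phangorn") first.')
}
```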
Read your alignment file into R using the read.phyDat command.
its <- read.phyDat("Clusia_ITS.txt", format = "clustal", type = "DNA")
Finally, check the dimensions of your matrix by typing
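A minimal sketch of one way to do this is below. It assumes your alignment is stored in an object called its, as above; if Clusia_ITS.txt is not in the working directory, the code falls back on a tiny made-up alignment so that the commands still run.

```r
library(phangorn)

if (file.exists("Clusia_ITS.txt")) {
  its <- read.phyDat("Clusia_ITS.txt", format = "clustal", type = "DNA")
} else {
  # Toy stand-in (invented sequences), used only when the real file is absent
  toy <- rbind(
    sp1 = c("a", "c", "g", "t"),
    sp2 = c("a", "c", "a", "t"),
    sp3 = c("a", "t", "g", "t")
  )
  its <- phyDat(toy, type = "DNA")
}

# Printing a phyDat object summarizes the number of sequences and characters
print(its)

# Converting to a character matrix gives rows (samples) by columns (positions)
dim(as.character(its))
```

The first number returned by dim is the number of samples and the second is the number of aligned nucleotide positions.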
In class we will explore the characters contained in this matrix! If you are dying to make a tree NOW, go back to the handout for Lab 2 and try using the pratchet function for estimating phylogenies using the parsimony criterion.
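For the impatient, here is a hedged sketch of a parsimony search with pratchet. It is not necessarily the same sequence of commands as the Lab 2 handout. It assumes an alignment object called its already exists; if it does not, the code borrows the Laurasiatherian example data that ships with phangorn so that it still runs.

```r
library(phangorn)

# Assumption: `its` was created with read.phyDat as described above.
# If it is missing, use phangorn's bundled example alignment instead.
if (!exists("its")) {
  data(Laurasiatherian)
  its <- Laurasiatherian
}

# Parsimony ratchet search; trace = 0 keeps the console output quiet
tree <- pratchet(its, trace = 0)

# Parsimony score (number of steps) of the tree that was found
parsimony(tree, its)

# Draw the tree
plot(tree, cex = 0.6)
```

The ratchet is a heuristic search, so repeated runs can return different, equally short trees.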