Week 4 Population Structure
4.1 The aggregate
and apply
functions
Today we’ll two useful functions that you can use for dataframes, apply
and aggregate
. Let’s use the palmerpenguins
dataset again:
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year mean
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 997.
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 1011.
## 3 Adelie Torgersen 40.3 18 195 3250 female 2007 876.
## 4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 925.
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 975.
One thing we might want to do is find out which individual has the highest value for each column. We can use the apply
function to do the same thing to each column:
?applyapply(penguins[,3:6],MARGIN=2,which.max)
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 186 20 216 170
Notice that we only used columns 3-6 because these are numeric (rather than categorical). The second argument, MARGIN, is asking whether we should be doing the function for each row (MARGIN=1), or each column (MARGIN=2). The final argument is the function we want to use. which.max
asks for the index (in this case, which row) has the maximum value.
Another useful function is aggregate
. Sometimes you might want to summarize groups within a dataframe. aggregate
does this easily. For example, what if we want to average values for each species:
?aggregateaggregate(penguins[,3:6],by=list(penguins$species),mean)
## Group.1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie NA NA NA NA
## 2 Chinstrap 48.83382 18.42059 195.8235 3733.088
## 3 Gentoo NA NA NA NA
This gives us the mean for each set of rows that share a value in the “species” column. The second argument, by, tells how to group. The third argument is the function to use.
These commands will come in handy today!
Let’s take just the quantitative data for a moment:
4.2 Running a clustering analysis and deciding on a K value
The example file for use in class today can be downloaded here
A common way to analyze population structure is using hierarchical clustering. Probably the most heavily used program to do this is called STRUCTURE. However, we’ll stick to similar tools you can use in R. Today we’ll use LEA
for our clustering analysis. LEA cannot be downloaded from the normal R database, so use the following code:
install.packages("BiocManager")
::install("LEA") BiocManager
Then load the library
library(LEA)
For the analysis today, we’re going to use SNP data. Open the file you downloaded called week4_example.geno using a text editor. This file is in geno format. That means there is one line per individual and genotypes are coded as 0,1,or 2. 0 and 2 are homozygotes and 1 is a heterozygote. 9 means missing data. Notice that this only works when you are looking at SNPs with just 2 alleles (biallelic).
We’ll use the package LEA to read this file in and look at population structure. Make sure you have the right path!!
<- snmf("data/week4_example.geno",
project K = 1:5,
entropy = TRUE,
repetitions = 10,
project = "new")
Remember what K values are? That means ‘how many populations should we try to split this data in to?’ Here we run 10 repetitions per K value. That’s because each run is slightly different so you want to have confidence in your answer. You’ll see what entropy means in a second. You’ll get a bunch of output in your console when you run this command. It will also create a new folder on your computer called “example.snmf.” It contains a bunch of different files for each run of the clustering program. We will be able to use these files to analyze and visualize our results.
Now we’ll chose the “best” value of K. Know that there are several ways to do this and they don’t always agree, so this should be interpreted with caution. Here we will use cross-entropy loss. This basically asks how consistently our model is able to categorize the samples. The lower the cross-entropy the better the model is doing. Let’s plot all the values of K:
plot(project,col="blue",pch=19)
When I ran this, K=4 had the lowest cross entropy. This plot might look slightly different each time you run it.
4.3 The Q matrix
Now that we’ve decided which value of K to look at, we can look at the ancestry estimates for each individual.
<- Q(project,K=4,run=1) qmatrix
This command pulls out the ancestry matrix (called the Q matrix) for the first run with K=3. Let’s look at the Q matrix.
head(qmatrix)
## V1 V2 V3 V4
## [1,] 0.054572900 0.000099991 0.740739000 0.20458800
## [2,] 0.034752500 0.020577600 0.939103000 0.00556706
## [3,] 0.000099982 0.000099982 0.493632000 0.50616900
## [4,] 0.000099990 0.164248000 0.713826000 0.12182700
## [5,] 0.421772000 0.000099982 0.000099982 0.57802800
## [6,] 0.000099982 0.947744000 0.000099982 0.05205620
Here, each row is an individual and each column is one of the inferred clusters. First, lets give the individuals “IDs” as rownames. You’ll see that this is important later. Next, one thing we could do is look at which cluster has the highest ancestry for each individual. The which.max
function just gives you the index (which column in this case) of the highest value.
rownames(qmatrix) <- 1:nrow(qmatrix)
<- apply(qmatrix,1,which.max)
maxCluster maxCluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 3 3 4 3 4 2 4 3 2 4 4 3 1 4 1 1 3 1 2 1 1 2 1 3 4 1 3 2 3 2 4 4 4 2
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 3 1 2 2 4 2 2 3 1 4 1 3 2 2 1 1
If we pretend maxCluster
gives refers to the population from which the individual came, we can use the aggregate
function to look at average cluster membership across each population:
aggregate(qmatrix,by=list(maxCluster),mean)
## Group.1 V1 V2 V3 V4
## 1 1 0.72256454 0.12944352 0.01253162 0.13546017
## 2 2 0.10454099 0.76702862 0.08949171 0.03893860
## 3 3 0.01798562 0.04743362 0.85725858 0.07732228
## 4 4 0.07057110 0.02581350 0.07891886 0.82469650
NOTE: Although we are using maxCluster
as an indicator of population here, usually we would use some sort of a priori information we had, such as geographic location or maybe morphological species, depending on our question. This distinction will be important for your homework!
4.4 Plotting
Now lets plot the ancestry results. Basically, we just want to make a barplot of the qmatrix. You can do that with the barchart
function in the LEA package:
barchart(project, K=4, run=1,
border=NA, space=0)
## $order
## [1] 13 15 16 18 20 21 23 26 36 43 45 49 50 6 9 19 22 28 30 34 37 38 40 41 47 48 1 2 4 8 12 17 24
## [34] 27 29 35 42 46 3 5 7 10 11 14 25 31 32 33 39 44
We can also try to label each bar with information. For example, lets try to label with the population assignment that we inferred, maxCluster
:
barchart(project, K=4, run=1,
border=NA, space=0) -> bp
axis(1,at=1:length(bp$order),label=maxCluster[bp$order],las=2,cex.axis=0.4)
The grayscale is boring!! R has many many options for color packages and you could even make one of your own! For example, here is a fun one inspired by favorite La Croix flavors!
install.packages("devtools")
::install_github("johannesbjork/LaCroixColoR") devtools
And now use it in our bar chart:
library(LaCroixColoR)
=lacroix_palette("PassionFruit",n=4,type="discrete")
my.colsbarchart(project, K=4, run=1,
border=NA, space=0,col=my.cols) -> bp
axis(1,at=1:length(bp$order),label=maxCluster[bp$order],las=2,cex.axis=0.4)
4.5 Homework
For homework, we will analyze the data from Sendell-Price et al. (2021). I am providing two relevant files: price_genos.geno is the genotypes file for a subset of the SNP markers used in the paper price_pops.csv is a file with the sampling location info for each indidividual included in the genotypes file. Note that they are in the same order as in the genotypes file.
4.5.1 Homework 4: Write a script that does the following:
Read in the metadata file. Print a table showing how many individuals we have from each population.
Run the clustering analysis using the snmf command. Create a cross-entropy plot.
Get a Q matrix for K=5 (used in the paper). Add sample IDs from the metadata file as rownames to the Q matrix. Print the head of the Q matrix.
For the “TAS” population, what are the average ancestry proportions from each cluster?
Plot the ancestry barplot with populations labelled at the bottom. Color palette is up to you :)