Week 4 Population Structure

4.1 The `aggregate` and `apply` functions

Today we’ll two useful functions that you can use for dataframes, apply and aggregate. Let’s use the palmerpenguins dataset again:

library(palmerpenguins)
head(penguins)

## # A tibble: 6 × 9
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year  mean
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int> <dbl>
## 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007  997.
## 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007 1011.
## 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007  876.
## 4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007   NA 
## 5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007  925.
## 6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007  975.

One thing we might want to do is find out which individual has the highest value for each column. We can use the apply function to do the same thing to each column:

?apply
apply(penguins[,3:6],MARGIN=2,which.max)

##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##               186                20               216               170

Notice that we only used columns 3-6 because these are numeric (rather than categorical). The second argument, MARGIN, is asking whether we should be doing the function for each row (MARGIN=1), or each column (MARGIN=2). The final argument is the function we want to use. which.max asks for the index (in this case, which row) has the maximum value.

Another useful function is aggregate. Sometimes you might want to summarize groups within a dataframe. aggregate does this easily. For example, what if we want to average values for each species:

?aggregate
aggregate(penguins[,3:6],by=list(penguins$species),mean)

##     Group.1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1    Adelie             NA            NA                NA          NA
## 2 Chinstrap       48.83382      18.42059          195.8235    3733.088
## 3    Gentoo             NA            NA                NA          NA

This gives us the mean for each set of rows that share a value in the “species” column. The second argument, by, tells how to group. The third argument is the function to use.

These commands will come in handy today!

Let’s take just the quantitative data for a moment:

4.2 Running a clustering analysis and deciding on a K value

The example file for use in class today can be downloaded here

A common way to analyze population structure is using hierarchical clustering. Probably the most heavily used program to do this is called STRUCTURE. However, we’ll stick to similar tools you can use in R. Today we’ll use LEA for our clustering analysis. LEA cannot be downloaded from the normal R database, so use the following code:

install.packages("BiocManager")
BiocManager::install("LEA")

Then load the library

library(LEA)

For the analysis today, we’re going to use SNP data. Open the file you downloaded called week4_example.geno using a text editor. This file is in geno format. That means there is one line per individual and genotypes are coded as 0,1,or 2. 0 and 2 are homozygotes and 1 is a heterozygote. 9 means missing data. Notice that this only works when you are looking at SNPs with just 2 alleles (biallelic).

We’ll use the package LEA to read this file in and look at population structure. Make sure you have the right path!!

project <- snmf("data/week4_example.geno",
               K = 1:5, 
               entropy = TRUE, 
               repetitions = 10,
               project = "new")

Remember what K values are? That means ‘how many populations should we try to split this data in to?’ Here we run 10 repetitions per K value. That’s because each run is slightly different so you want to have confidence in your answer. You’ll see what entropy means in a second. You’ll get a bunch of output in your console when you run this command. It will also create a new folder on your computer called “example.snmf.” It contains a bunch of different files for each run of the clustering program. We will be able to use these files to analyze and visualize our results.

Now we’ll chose the “best” value of K. Know that there are several ways to do this and they don’t always agree, so this should be interpreted with caution. Here we will use cross-entropy loss. This basically asks how consistently our model is able to categorize the samples. The lower the cross-entropy the better the model is doing. Let’s plot all the values of K:

plot(project,col="blue",pch=19)

When I ran this, K=4 had the lowest cross entropy. This plot might look slightly different each time you run it.

4.3 The Q matrix

Now that we’ve decided which value of K to look at, we can look at the ancestry estimates for each individual.

qmatrix <- Q(project,K=4,run=1)

This command pulls out the ancestry matrix (called the Q matrix) for the first run with K=3. Let’s look at the Q matrix.

head(qmatrix)

##               V1          V2          V3         V4
## [1,] 0.054572900 0.000099991 0.740739000 0.20458800
## [2,] 0.034752500 0.020577600 0.939103000 0.00556706
## [3,] 0.000099982 0.000099982 0.493632000 0.50616900
## [4,] 0.000099990 0.164248000 0.713826000 0.12182700
## [5,] 0.421772000 0.000099982 0.000099982 0.57802800
## [6,] 0.000099982 0.947744000 0.000099982 0.05205620

Here, each row is an individual and each column is one of the inferred clusters. First, lets give the individuals “IDs” as rownames. You’ll see that this is important later. Next, one thing we could do is look at which cluster has the highest ancestry for each individual. The which.max function just gives you the index (which column in this case) of the highest value.

rownames(qmatrix) <- 1:nrow(qmatrix)
maxCluster <- apply(qmatrix,1,which.max)
maxCluster

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
##  3  3  4  3  4  2  4  3  2  4  4  3  1  4  1  1  3  1  2  1  1  2  1  3  4  1  3  2  3  2  4  4  4  2 
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  3  1  2  2  4  2  2  3  1  4  1  3  2  2  1  1

If we pretend maxCluster gives refers to the population from which the individual came, we can use the aggregate function to look at average cluster membership across each population:

aggregate(qmatrix,by=list(maxCluster),mean)

##   Group.1         V1         V2         V3         V4
## 1       1 0.72256454 0.12944352 0.01253162 0.13546017
## 2       2 0.10454099 0.76702862 0.08949171 0.03893860
## 3       3 0.01798562 0.04743362 0.85725858 0.07732228
## 4       4 0.07057110 0.02581350 0.07891886 0.82469650

NOTE: Although we are using maxCluster as an indicator of population here, usually we would use some sort of a priori information we had, such as geographic location or maybe morphological species, depending on our question. This distinction will be important for your homework!

4.4 Plotting

Now lets plot the ancestry results. Basically, we just want to make a barplot of the qmatrix. You can do that with the barchart function in the LEA package:

barchart(project, K=4, run=1,
         border=NA, space=0)

## $order
##  [1] 13 15 16 18 20 21 23 26 36 43 45 49 50  6  9 19 22 28 30 34 37 38 40 41 47 48  1  2  4  8 12 17 24
## [34] 27 29 35 42 46  3  5  7 10 11 14 25 31 32 33 39 44

We can also try to label each bar with information. For example, lets try to label with the population assignment that we inferred, maxCluster:

barchart(project, K=4, run=1,
         border=NA, space=0) -> bp
axis(1,at=1:length(bp$order),label=maxCluster[bp$order],las=2,cex.axis=0.4)

The grayscale is boring!! R has many many options for color packages and you could even make one of your own! For example, here is a fun one inspired by favorite La Croix flavors!

install.packages("devtools")
devtools::install_github("johannesbjork/LaCroixColoR")

And now use it in our bar chart:

library(LaCroixColoR)
my.cols=lacroix_palette("PassionFruit",n=4,type="discrete")
barchart(project, K=4, run=1,
         border=NA, space=0,col=my.cols) -> bp
axis(1,at=1:length(bp$order),label=maxCluster[bp$order],las=2,cex.axis=0.4)

4.5 Homework

For homework, we will analyze the data from Sendell-Price et al. (2021). I am providing two relevant files: price_genos.geno is the genotypes file for a subset of the SNP markers used in the paper price_pops.csv is a file with the sampling location info for each indidividual included in the genotypes file. Note that they are in the same order as in the genotypes file.

4.5.1 Homework 4: Write a script that does the following:

Read in the metadata file. Print a table showing how many individuals we have from each population.
Run the clustering analysis using the snmf command. Create a cross-entropy plot.
Get a Q matrix for K=5 (used in the paper). Add sample IDs from the metadata file as rownames to the Q matrix. Print the head of the Q matrix.
For the “TAS” population, what are the average ancestry proportions from each cluster?
Plot the ancestry barplot with populations labelled at the bottom. Color palette is up to you :)