Week 3 Genetic Diversity
3.1 Manipulating dataframes
As we saw last week, the most common class we work with in R is a dataframe. This week, we’ll learn a few more ways to manipulate dataframes. This time we’ll use the built in palmerpenguins
dataset:
install.packages("palmerpenguins")
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year mean
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 997.
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 1011.
## 3 Adelie Torgersen 40.3 18 195 3250 female 2007 876.
## 4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 925.
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 975.
One quick way to explore the data is using summary
:
summary(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30 Median :197.0 Median :4050
## Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300
## NA's :2 NA's :2 NA's :2 NA's :2
## sex year mean
## female:165 Min. :2007 Min. : 738.9
## male :168 1st Qu.:2007 1st Qu.: 950.9
## NA's : 11 Median :2008 Median :1076.6
## Mean :2008 Mean :1115.9
## 3rd Qu.:2009 3rd Qu.:1257.5
## Max. :2009 Max. :1646.3
## NA's :2
For quantitative columns, this gives you basic summary statistics and for categorical columns it gives you counts of each value. We could have also looked at the categorical data using the table
command:
table(penguins$Species)
## Warning: Unknown or uninitialised column: `Species`.
## < table of extent 0 >
Let’s take just the quantitative data for a moment:
<- penguins[,3:6]
quant head(quant)
## # A tibble: 6 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <int> <int>
## 1 39.1 18.7 181 3750
## 2 39.5 17.4 186 3800
## 3 40.3 18 195 3250
## 4 NA NA NA NA
## 5 36.7 19.3 193 3450
## 6 39.3 20.6 190 3650
For easy summaries of rows or columns, we can us the commands colSums
, colMeans
, rowSums
, and rowMeans
<- rowMeans(quant) means
Maybe we want to add a column to the initial iris dataframe that has the mean of the different measurements. We can do that in two different ways:
$mean <- rowMeans(quant) # Caluclate and name the new column at the same time or...
penguins<- rowMeans(quant) # Calculate means
means2 <- cbind(penguins,means2) # cbind (column bind) to iris dataframe
newpenguins head(newpenguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year mean
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 997.200
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 1010.725
## 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 875.825
## 4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 924.750
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 974.975
## means2
## 1 997.200
## 2 1010.725
## 3 875.825
## 4 NA
## 5 924.750
## 6 974.975
3.2 Using the adegenet package to calulate heterozygosity
Today we’ll use microsatellite data from the Weeks et al. (2017) paper using genetic analysis to look at the effects of population recovery after an introduction. You can download it here.
Let’s read in this file:
<- read.csv("data/weeks_genotypes.csv") # Read in a file data
Take a look at the data:
head(data)
## ID Pop Year LOC1 LOC2 LOC3 LOC4 LOC5 LOC6 LOC7 LOC8 LOC9 LOC10
## 1 1 MtBuller 2010 311/311 245/249 303/303 195/201 218/218 211/211 321/321 240/240 157/157 113/113
## 2 2 MtBuller 2010 311/311 245/249 303/303 195/201 218/218 211/211 321/321 240/240 157/157 113/113
## 3 3 MtBuller 2010 311/311 245/249 303/303 NA/NA 218/218 211/211 321/321 240/240 157/157 113/113
## 4 4 MtBuller 2010 311/311 249/249 303/303 195/195 218/218 211/211 321/321 240/240 157/157 113/113
## 5 5 MtBuller 2010 309/311 249/249 303/303 195/201 218/218 211/211 321/321 240/240 157/157 113/115
## 6 6 MtBuller 2010 309/311 NA/NA 303/303 201/201 218/218 211/211 321/321 240/240 157/157 113/115
## LOC11 LOC12 LOC13 LOC14 LOC15 LOC16 LOC17 LOC18 LOC19 LOC20 LOC21 LOC22
## 1 178/178 125/125 309/318 119/119 144/146 214/214 241/241 160/160 150/150 117/117 141/141 137/137
## 2 178/178 125/125 309/309 119/148 144/146 214/214 241/241 160/160 150/150 117/117 141/141 137/137
## 3 178/178 125/125 309/309 119/119 144/144 214/216 241/241 160/160 150/150 117/117 141/141 137/137
## 4 178/178 125/125 309/309 119/119 144/146 214/214 241/241 160/160 150/150 117/117 141/141 137/137
## 5 178/178 125/125 309/318 148/148 144/146 216/216 241/241 160/160 150/150 117/117 141/141 137/141
## 6 178/178 125/125 309/318 119/148 144/144 214/214 241/241 160/160 150/150 117/117 141/141 137/137
## LOC23 LOC24
## 1 157/157 177/177
## 2 157/157 177/177
## 3 157/157 177/177
## 4 157/157 177/177
## 5 157/157 177/177
## 6 157/157 177/177
The first column is an id assigned to each individual. The second column gives the populations from which the individual was sampled. The third column gives the year of sampling. The remaining columns give the microsatellite data. Each individual has two numbers which represent the fragment lengths (read on a gel). If the two fragment lengths are the same, the individual is a heterozygote. If they are different, the individual is a homozygote.
We will use the package adegenet to read in the microsatellite data and calculate heterozygosity. The manual for this package is here. Install the package and call the library:
install.packages("adegenet")
library(adegenet)
The first thing we need to do is let the adegenet read in the microsatellite data. We can do this with the df2genind
. Let’s ask for help on that command:
?df2genind
Look at the arguments. How can we format our allele data to meet the requirements of the package?
<- data[,4:ncol(data)] alleles
The function ncol
gives us the number of columns. So here we are asking for all columns except the first three, which do not contain genotype data. Okay, let’s see if that works
<- df2genind(alleles,sep="/",NA.char="NA/NA")
genind summary(genind)
##
## // Number of individuals: 524
## // Group sizes: 524
## // Number of alleles per locus: 5 5 10 7 8 3 7 16 3 11 10 2 6 10 5 2 10 2 5 3 2 10 9 4
## // Number of alleles per group: 155
## // Percentage of missing data: 1.03 %
## // Observed heterozygosity: 0.12 0.68 0.29 0.6 0.48 0.3 0.36 0.38 0.35 0.64 0.43 0.22 0.54 0.6 0.5 0.21 0.38 0.25 0.25 0.17 0.17 0.39 0.26 0.25
## // Expected heterozygosity: 0.12 0.64 0.36 0.7 0.58 0.32 0.5 0.56 0.51 0.74 0.56 0.45 0.64 0.61 0.63 0.25 0.44 0.46 0.29 0.19 0.35 0.49 0.33 0.38
The object summary has most of the information we want:
<- summary(genind)
results names(results) # Show the different output values we can look at
## [1] "n" "n.by.pop" "loc.n.all" "pop.n.all" "NA.perc" "Hobs" "Hexp"
$Hobs # Show observed heterozygosity results
## LOC1 LOC2 LOC3 LOC4 LOC5 LOC6 LOC7 LOC8 LOC9 LOC10
## 0.1150097 0.6812749 0.2913386 0.5984405 0.4799235 0.2972973 0.3556405 0.3754864 0.3512476 0.6382979
## LOC11 LOC12 LOC13 LOC14 LOC15 LOC16 LOC17 LOC18 LOC19 LOC20
## 0.4335260 0.2160612 0.5393474 0.5965583 0.4990366 0.2103250 0.3754789 0.2471264 0.2480620 0.1739962
## LOC21 LOC22 LOC23 LOC24
## 0.1685824 0.3850575 0.2553191 0.2528736
Notice that observed and expected heterozygosity have been calculated, but across the entire dataset. What if we supply different populations? We use the seppop
function to tell it we want separate measurements for each population
<- df2genind(alleles,sep="/",
genindPop NA.char="NA/NA",
pop=data$Pop)
<- seppop(genindPop)
genindPop genindPop
## $MtBuller
## /// GENIND OBJECT /////////
##
## // 420 individuals; 24 loci; 155 alleles; size: 316.4 Kb
##
## // Basic content
## @tab: 420 x 155 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-16)
## @loc.fac: locus factor for the 155 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3,
## drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 420-420)
##
## $MtHigginbotham
## /// GENIND OBJECT /////////
##
## // 104 individuals; 24 loci; 155 alleles; size: 102.9 Kb
##
## // Basic content
## @tab: 104 x 155 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-16)
## @loc.fac: locus factor for the 155 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3,
## drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 104-104)
Notice the two locations are now separate. We can get the heterozygosity separately for each population:
<- summary(genindPop$MtBuller)
MtBuller <- summary(genindPop$MtHigginbotham)
MtHiggenbotham $Hobs # Observed heterozygosity for all Mt Buller Samples MtBuller
## LOC1 LOC2 LOC3 LOC4 LOC5 LOC6 LOC7 LOC8 LOC9 LOC10
## 0.1196172 0.6526055 0.2120482 0.5649038 0.3937947 0.2583732 0.2911695 0.3325359 0.3381295 0.6201923
## LOC11 LOC12 LOC13 LOC14 LOC15 LOC16 LOC17 LOC18 LOC19 LOC20
## 0.3317422 0.2696897 0.5083532 0.5751790 0.4892086 0.2625298 0.2559809 0.3086124 0.1654676 0.1527446
## LOC21 LOC22 LOC23 LOC24
## 0.1360382 0.3110048 0.1558753 0.1531100
$Hexp # Expected heterozygosity for all Mt Buller Samples MtBuller
## LOC1 LOC2 LOC3 LOC4 LOC5 LOC6 LOC7 LOC8 LOC9 LOC10
## 0.1129982 0.6064596 0.2329279 0.6085111 0.4022334 0.2458735 0.2869544 0.3583595 0.3399151 0.6384350
## LOC11 LOC12 LOC13 LOC14 LOC15 LOC16 LOC17 LOC18 LOC19 LOC20
## 0.3431457 0.3045010 0.5145363 0.5734332 0.5079217 0.3030058 0.2639076 0.3282291 0.1647056 0.1750275
## LOC21 LOC22 LOC23 LOC24
## 0.1629775 0.3484438 0.1637021 0.1735194
3.2.1 Plotting heterozygosity
Let’s use a barplot to look at variation in heterozygostiy across all our markers:
barplot(MtBuller$Hobs)
When we have a bunch of markers, we often take the mean across all those markers:
mean(MtHiggenbotham$Hobs) # Mean observed heterozygosity in the Mt Higgenbotham population
## [1] 0.5262125
Compare this value to the paper.
3.3 Homework
Now that we know how to estimate observed and expected heterozygosity, we’ll look at changes over time. Start a new script for your homework
3.3.1 Homework 3: Write a script that does the following:
Read in “genotypes.csv” and subset the dataframe to Mt Buller samples only. How many samples do you have from each year?
Calculate mean observed heterozygosity for each year. Put these into a single dataframe.
Calculate mean expected heterozygosity for each year. Add these to the dataframe from question 2.
Plot mean expected heterozygosity over time.
Add a vertical lines to your plot to show when introductions took place (hint - use the
abline
command)