10 Week 9: Genome wide association study (GWAS)
This week we’re going to show you how to perform a genome wide association study or GWAS.
10.1 Download the data
The data for this week consist of several files: our genotypes in an imputed beagle format, a genome index file (.fai), and a few small text files.
Remember we used files that contain each samples genotypes (in .bam format) to generate a genotype likelihood file after we learned to map sample reads to a reference genome.
It is highly recommended to “impute” the genotype likelihood files prior to running the GWAS. What does impute mean?
Our genotype file does not contain a single genotype for each sample at every site, but instead has a range of likelihoods for each possible genotype for every site. Imputing this file reduces this range of possibility and also infers the genotype of un-sampled sites. Imputation has been shown to greatly improve the accuracy of GWA studies.
Imputation can be done in beagle version 3, which can be found here or a direct link to the jar file
Instructions to impute the genotype likelihood file can be found here
Let’s start by downloading the data we will be working with this week:
wget https://raw.githubusercontent.com/BayLab/MarineGenomicsData/main/Week9_2022.tar.gz
tar -xzvf Week9_2022.tar.gz
Move the data from MarineGenomicsData into MarineGenomics and change its name to Week9:
mv MarineGenomicsData/Week8 MarineGenomics/Week9
and one more file that was too big to upload with the others
cd MarineGenomics/Week9
wget https://raw.githubusercontent.com/BayLab/MarineGenomicsData/main/salmon_chr6_19ind.BEAGLE.PL.gz
Now that we’ve finished downloading the data, let’s open Rsutdio and install the package necessary for our today’s lesson
10.2 install packages in R
We’ll be using the package qqman
to make a manhattan plot. Make sure not to copy the # sign into the R console/script.
# install.packages("qqman")
10.3 The data
The data we’re working on for the GWAS comes from the paper by Kijas et al. 2018 where they studied the genetic basis of sex determination in the atlantic Salmon. The paper can be found here or on Canvas. In short they examined the genetic basis of sex in 19 salmon for which they have whole genome sequence data. We’ll only be looking at two chromosomes (2 and 6) of this data.
10.4 take Beagle file and generate lrt file
To do the test of genome wide association we need to take our Beagle file and test whether there is an association with our phenotype (in this case whether a fish has a male or female phenotype). The phenotypes are coded as 0 = Female and 1 = Male in the phenobin
file.
Let’s move back to the terminal and execute this command:
$HOME/angsd/angsd -doMaf 4 -beagle salmon_chr2_19ind.BEAGLE.PL.gz -yBin phenobin -doAsso 2 -fai Salmon.fai
This example looks for associations between the genotypes (the genotype likelihood data - beagle file) and the phenotypes (male/female - binary ‘phenobin’ file). We are using the ‘Salmon.fai’ file as an index (remember we are using an index for the genome so that it is easier to search against it).
This will generate several output files labeled angsdput
. We’ll use the file with the lrt0
extension to plot our manattan plot.
Let’s go back to Rstudio:
10.5 take lrt file and make a manhattan plot
LRT is the likelihood ratio statistic. This statistic is chi square distributed with one degree of freedom. Sites that fails one of the filters are given the value -999.
#Set the working directory for this week's directory
#setwd("~/MarineGenomics/Week9")
#read in the data
<- read.table(gzfile("angsdput.lrt0.gz"), header=T, sep="\t")
lrt
#look at it
str(lrt)
## 'data.frame': 224460 obs. of 8 variables:
## $ Chromosome : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Position : int 13 59 74 153 548 550 613 1182 1361 1533 ...
## $ Major : chr "G" "G" "G" "C" ...
## $ Minor : chr "T" "A" "A" "G" ...
## $ Frequency : num 0.237 0.5 0.316 0.184 0.316 ...
## $ N : int 19 19 19 19 19 19 19 19 19 19 ...
## $ LRT : num -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
## $ high_WT.HE.HO: chr "10/9/0" "1/17/1" "7/12/0" "12/7/0" ...
#we have a few LRT values that are -999, that means that the test failed. we should remove them. How many do we have?
length(which(lrt$LRT == -999))
## [1] 210597
The answer is 210597.
How many LRT values (in this case, loci) do we have in total?
length(lrt$LRT)
## [1] 224460
The answer is 224460, meaning that most are filtered out.
How many do we have left when we remove loci with failed tests?
length(lrt$LRT)-length(which(lrt$LRT == -999))
## [1] 13863
The answer is 13863.
Let’s remove the values that are not -999 and or negative and assign them to a new object called ‘lrt_filt.’
<- lrt[-c(which(lrt$LRT == -999),which(lrt$LRT <= 0)),] lrt_filt
We can look at a histogram of the filtered LRT results:
hist(lrt_filt$LRT, breaks=50)
Everything looks good to proceed to making our manhattan plot.
The function ‘manhattan’ requires each SNP to have it’s own “name.” Let’s make a vector for row numbers that start with the letter ‘r.’
$SNP <- paste("r",1:length(lrt_filt$Chromosome), sep="") lrt_filt
We also need to convert our LRT values to pvalues. We’ll use the command dchisq
to get pvalues from the LRT values.
#get pvalues
$pvalue<-dchisq(lrt_filt$LRT, df=1)
lrt_filt
#we also need to make sure we don't have any tricky values like those below
<- lrt_filt[!(lrt_filt$pvalue=="NaN" | lrt_filt$LRT=="Inf"| lrt_filt$pvalue=="Inf"),] lrt_filt
Create the manhattan plot:
::manhattan(lrt_filt, chr="Chromosome", bp="Position", p="pvalue", suggestiveline = F, genomewideline = F) qqman
Let’s look at a qq-plot of our pvalues to check the model fit
qq(lrt_filt$pvalue)
This looks a bit weird, we would expect it to be mostly a straight line with some deviations at the upper right. If we were moving forward with this analyses we’d want to do more filtering of our data.
We can highlight the values that exceed a threshold. There are several ways to determine a threshold, but one is to make a vector of random phenotypes and re-run our association test. We can then set the highest LRT value from the random phenotype test as our upper limit for our plot with the actual phenotypes.
#make a vector with 19 1's and 0's
<- sample(c(1,0), 19, replace=T)
x
#write this to our week 9 directory
write.table(x, "rando_pheno", row.names = F, col.names = F)
And now use that phenotype file to run our association test again, making sure to specify a different output file. Let’s go back to the terminal:
$HOME/angsd/angsd -doMaf 4 -beagle salmon_chr2_19ind.BEAGLE.PL.gz -yBin rando_pheno -doAsso 2 -fai Salmon.fai -out randotest
And rerun the code in R to see what our maximum LRT values are in this random phenotype test.
<- read.table(gzfile("randotest.lrt0.gz"), header=T, sep="\t")
lrt_rando
#we need to remove those -999 values again
<- lrt_rando[!(lrt_rando$LRT==-999 | lrt_rando$LRT<= 0),]
rando_filt
summary(rando_filt$LRT, na.rm=T)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000001 0.214816 0.669551 Inf 2.029913 Inf 1
#we have some Inf values we also need to remove, let's add those to our filtering line above
<- lrt_rando[!(lrt_rando$LRT==-999 | lrt_rando$LRT<= 0 | lrt_rando$LRT == Inf),]
rando_filt
#let's look at the maximum value of LRT
max(rando_filt$LRT, na.rm=T)
## [1] 12.33287
The maximum value is ~12 This value was generated at random, so any value higher than this is expected to be generated not at random (which is what we are looking for). So we can highlight all of the SNPs that have an LRT greater than 12 in our association test.
#make a list of the candidates
<- lrt_filt[which(lrt_filt$LRT > 12),]$SNP candidates
Let’s highlight this list of candidate loci in our manhattan plot:
::manhattan(lrt_filt, chr="Chromosome", bp="Position", p="pvalue", highlight = candidates, suggestiveline = F, genomewideline = F) qqman
Comparing our results to the Kijas et al. 2018 paper, we have a similar pattern of many SNPs across the chromosome showing a relationship with phenotype (sex). Interestingly this paper found that there are three chromosomes that are associated with sex in Atlantic Salmon, but not all chromosomes give a strong signal in all individuals. For example, only three male individuals were found to have a clear association with chromosome 2, and the other males in the study were found to have an association with chromosomes 3 and 6.
These results highlight the fluid nature of sex determination in animals, even those with a genetic basis to sex determination.
For these exercises you’ll take a closer look at chromosome 6, where you’ll try to find the individuals that are males from a PCA plot.
10.6 Exercises
1.For this exercise, we want to see which individuals actually show genomic levels of variation at chromosome 6. Using the code that we ran for week 6. Make a PCA plot of the salmon_chr3_19ind.BEAGLE.PL.gz file. Use the function
indentify()
to find the points that are clustering apart from the other points. Verify that these are males by referring to the table in the Kijas et al. paper here. Note their individuals are in the same order as our samples though the names don’t match.
Solution
Run pcangsd on our chr6 data
pcangsd -b salmon_chr6_19ind.BEAGLE.PL.gz -o pca6_out -t 28
In R make a PCA plot
#read in the data
<- as.matrix(read.table("pca6_out.cov"))
cov
#compute the eigen values
<- eigen(cov)
e
#how much variation are we explaining here?
$values/sum(e$values) e
## [1] 0.1379496011 0.1246126748 0.0969103029 0.0920252715 0.0774570788
## [6] 0.0685079248 0.0605808079 0.0554445095 0.0493884028 0.0444119246
## [11] 0.0423030396 0.0368221717 0.0329494215 0.0316195242 0.0269138675
## [16] 0.0184940705 0.0149661517 -0.0006450986 -0.0107116466
It looks like ~13% is explained by the first axis.
Let’s create the PCA plot:
#First, we'll install a new package called 'plotly' that lets us create interactive plots:
#install.packages("plotly")
#create a dataframe from the first two PC in object e
<- as.data.frame(e$vectors[,1:2]) e_df
make a plot of the first two axes
::plot_ly(e_df, type = "scatter", mode = "markers", x=e_df$V1, y=e_df$V2, text=row.names(e_df)) plotly
Use the cursor to identify the points that are clustered apart from the others. If you select the points on the left most of the screen, you will find they are rows 1, 2, and 12
The other two points that are leftmost from the remaining points are 16 and 19, but according to Table 1 in the Kilas paper, they are females. Therefore, it is possible that more filtering is needed to identify the particular informative SNPs that correlate well with sex in this fish species.
- Make a manhattan plot for chromosome 6. Use a phenotype file called
pheno_chr6
.
Solution
#In the terminal, run the command in angsd again using the chr6 BEAGLE file and the 'pheno_chr6' file.
# $HOME/angsd/angsd -doMaf 4 -beagle salmon_chr6_19ind.BEAGLE.PL.gz -yBin pheno_chr6 -doAsso 2 -fai Salmon.fai -out chr6_out
#and then we can remake our manhattan plot
<- read.table(gzfile("chr6_out.lrt0.gz"), header=T, sep="\t")
lrt2
#look at it
str(lrt2)
#remove the values that are not -999 and that are negative
<- lrt2[-c(which(lrt2$LRT == -999),which(lrt2$LRT <= 0)),]
lrt_filt2
#make a vector for row numbers that start with the letter 'r'.
$SNP <- paste("r",1:length(lrt_filt2$Chromosome), sep="")
lrt_filt2
#get pvalues
$pvalue <- dchisq(lrt_filt2$LRT, df=1)
lrt_filt2
#we also need to make sure we don't have any tricky values like those below
<- lrt_filt2[!(lrt_filt2$pvalue=="NaN" | lrt_filt2$LRT=="Inf"| lrt_filt2$pvalue=="Inf"),]
lrt_filt2
#make our plot
::manhattan(lrt_filt2, chr="Chromosome", bp="Position", p="pvalue") qqman