10 Week 9: F_st outlier analysis and chunky quails

The lecture for this week focused on F_st and outlier analysis can be found hereWeek 9 Slides

Last week we used PCA to take a broad look at whether populations were genetically distinct. This week we will look at SNPs that differentiate between populations that may represent signals of selection. If you want to learn morea bout how different methods for detecting signatures of selection I recommend visiting this page on the marineomics website: https://marineomics.github.io/POP_10_Signatures_of_Selection.html

We will use data from the following paper focused on genomic variation in common quails:

Sanchez-Donoso et al. (2022) Massive genome inversion drives coexistence of divergent morphs in common quails.Current Biology

We will get more into what the data focuses on later in the lesson, but a few findings from the study is that the quails with and without the inversion differ in their throat coloration, weight (& fat reserves), wing shape and migratory behavior.

Sanchez-Donoso et al. (2022) Phenotypes of the common quail with and without the inversion from the graphical abstract

We will download the data and take a first look in bash and then we will do analysis and generate the plots in RStudio.

10.1 Main Objectives:

Learn how to identify particular SNPs that are driving patterns of divergence between populations
Connect what we learned in previous lessons using PCAs and Structure Plots and how these methods differ from today’s outlier analyses
Practice keeping scripts organized to ensure reproducibility

10.2 Download the data

We first need to download the data.

10.2.1 On Farm

Open a terminal and navigate to your directory. Then use wget to download the data folder. We’ll then use the tar command to un-compress it.

cd /group/rbaygrp/eve198-genomics/yourdirectory

mkdir Week9

cd Week9

wget https://raw.githubusercontent.com/mlarmstrong/IntroGenomics_Data/main/week9.tar.gz


tar -xzvf week9.tar.gz

10.2.2 On Your Own Computer

Similar to last week you need to navigate to the class github page where all of the data is stored https://github.com/mlarmstrong/IntroGenomics_Data Click on week9.zip and then after it takes you to a new page click the three dots in the top right and hit download.

10.3 Unzip the Folder & Look at Data

Regardless of download method, unzip the week9 directory and navigate inside. What do you see?

The SNP data is in a vcf file, just like for previous weeks. Take a look at the head of the metadata file. What columns do you see?

10.4 Downloading R Packages

Open R Studio on your desktop or on Farm. The first thing we need to do is download a few different packages.

Today we will be using the SeqArray and SNPRelate packages to read our data. You used these last week but you were on your desktop, so you will need to install these packages again for Farm. You will need to use Biocmanager to install both of these packages, so run if (!require("BiocManager", quietly = TRUE)).

For the outlier analysis, we will use a package called OutFLANK which also needs the package qvalue to run (we call this a ‘dependency’).

install.packages("BiocManager") # type "yes" when prompted
library(BiocManager)
BiocManager::install(c("qvalue", "SeqArray", "SNPRelate") # When prompted, type "a" - this will take some time
install_github("whitlock/OutFLANK")

10.5 R hygiene

Now that you are getting more comfortable coding in R, it’s a good time to talk about how to make your scripts clean, readable, and reusable. The first and likely most important piece of advice is to ALWAYS do lots of commenting with # so when you come back to your code you know what you did and why

It’s good practice to put three main pieces of information at the top of your code. Your name (for when you share code), the last modified date, and the purpose of the code. Looks like this:

## Code for finding outliers using OUTFlank
## Rachael Bay
## Modified May 28, 2025

After that, I like to have all the libraries that will be needed for the code loaded:

library(SeqArray)

## Warning: package 'SeqArray' was built under R version 4.4.1

## Loading required package: gdsfmt

## Warning: package 'gdsfmt' was built under R version 4.4.1

library(SNPRelate)

## Warning: package 'SNPRelate' was built under R version 4.4.1

library(OutFLANK)

## Loading required package: qvalue

library(gdsfmt)

Finally, I like to make objects for all the files that I will use. Again, this is so that if I want to come back to the code later and run the same analysis on different files, it is obvious what I need to change rather than being buried in the code:

#setwd("/group/rbaygrp/eve198-genomics/rachael")
vcf.path="ChunkyQuails.vcf.gz"
meta.path="ChunkyQuails_meta.csv"

Make sure your working directory is set up to the folder your data is stored in for week 9. Now that our script is organized, we are ready to use the data!

10.6 Finding outliers using OutFLANK

We will read in the vcf file just like we did last week when we did PCA. Give it a try on your own!

Class Exercise 1

Using info from Week 8, read in your data!

Q1. Convert your vcf file to a gds file called “ChunkyQuails.gds”. Then read make a gds object called ‘genofile’. How many samples do you have?

Q2. Read in your metadata file to an object named ‘meta’ (hint: read.csv()). Take a look at the different columns. “Loc1” is the breeding location. How many unique breeding locations are represented?

10.7 Pulling SNP data from the gds file

Later, we will want information on the location of each SNP in the genome. We can pull information from the gds file like this:

chromosomes <- read.gdsn(index.gdsn(genofile, "chromosome"))
positions <- read.gdsn(index.gdsn(genofile, "position"))
variant.id <- read.gdsn(index.gdsn(genofile, "variant.id"))

Then put everything together in a single dataframe like this:

snp.pos <- data.frame(LocusName=variant.id,
                      chr=as.factor(chromosomes),
                      pos=positions)

We also want the matrix of SNP genotypes. We can pull this from the gds object using the snpgdsGetGeno() command:

G <- snpgdsGetGeno(genofile)

## Genotype matrix: 80 samples X 112195 SNPs

Take a look at this matrix. Here I’m just looking at the head of the first 10 columns because the matrix is large!

head(G[,1:10])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    2    2    2    2    2    2    2     2
## [2,]    1    2    2    2    2    2    2    2    2     2
## [3,]    2    2    2    2    2    2    2    1    2     2
## [4,]    2    2    2    2    2    2    2    2    2     2
## [5,]    1    2    2    2    2    2    2    2    2     2
## [6,]    2    2    2    1    2    2    2    2    2     2

We have four different values: 0,1,2, and NA. These represent the reference homozygote, heterozygote, alternate homozygote, and missing data respectively. The program that we want to use, OUTFLank requires a ‘9’ instead of ‘NA’ for missing data. We can do the replacement like this:

G[is.na(G)] <- 9

Take a look at the head of the dataframe again. Did it work?

10.8 Calculating F_st and identifying outliers

What are the dimensions of G? How do we interpret that?

Now we can caluclate F_st for each SNP. We’re going to calculate differentiation among locations (the Loc1 column in your metadata).

WAIT! First we need to determine whether our metadata is in the same order as our genotype matrix.

Class Exercise 2

Is your metadata in the same order as your genotype matrix?

Q1. Get the sample names from your gds object (hint: how did you extract chromosome and position from gds? There is also information in ‘sample.id’)

Q2. Determine whether the sample IDs you extracted from the gds are in the same order as in the metadata file (hint: you did this last week too!)

Phew - NOW we can calculate F_st!

fst <- MakeDiploidFSTMat(G,locusNames=variant.id,popNames=meta$Loc1)
head(fst)

hist(fst$FST,breaks=50)

Once we’ve calculated F_st between the two populations for each SNP individually, we want to determine whether some SNPs are outliers - that is, more differentiated than we would expect. OutFLANK does this by fitting a Chi-Squared distribution to the data and looking to see if the tails of the Chi-Squared distribution have more SNPs than expected:

OF <- OutFLANK(fst,LeftTrimFraction=0.01,RightTrimFraction=0.01,
               Hmin=0.05,NumberOfSamples=9,qthreshold=0.01)

OutFLANKResultsPlotter(OF,withOutliers=T,
                       NoCorr=T,Hmin=0.1,binwidth=0.005,
                       Zoom=F,RightZoomFraction=0.05,titletext=NULL)

It’s a little hard to tell from these plots, but there may be some SNPs with high F_st even where the distribution predicts there should be none. To find these SNPs, we ask which SNPs are statistical outliers?

P1 <- pOutlierFinderChiSqNoCorr(fst,Fstbar=OF$FSTNoCorrbar,
                                dfInferred=OF$dfInferred,qthreshold=0.05,Hmin=0.1)
outliers <- P1$OutlierFlag==TRUE
table(outliers)

## outliers
## FALSE  TRUE 
## 31775   285

Now we can make a manhattan plot! There are lots of packages to make very beautiful manhattan plots and even highlight particular genes of interest. Here we will just use base R. First, take a look at the dataframe that has information on F_st and outliers:

head(P1)

##    LocusName        He           FST            T1         T2  FSTNoCorr
## 1          1 0.2997116 -0.0159898161 -2.407181e-03 0.15054464 0.04630126
## 9          9 0.3408941 -0.0049619118 -8.528996e-04 0.17188930 0.07862600
## 16        16 0.4946028  0.0465988082  1.168634e-02 0.25078625 0.11595269
## 19        19 0.2448980 -0.0812639538 -9.952130e-03 0.12246672 0.02331883
## 25        25 0.2087950  0.0009384393  9.887713e-05 0.10536337 0.08458284
## 26        26 0.1778913  0.0353452459  3.182881e-03 0.09005117 0.10384667
##       T1NoCorr   T2NoCorr meanAlleleFreq    pvalues pvaluesRightTail   qvalues
## 1  0.006982284 0.15080116     0.18354430 0.46013402        0.7699330 0.9999994
## 9  0.013545112 0.17227268     0.21794872 0.89656240        0.4482812 0.9999994
## 16 0.029138558 0.25129696     0.44805195 0.38400724        0.1920036 0.9999994
## 19 0.002864526 0.12284177     0.14285714 0.09330932        0.9533453 0.9999994
## 25 0.008934381 0.10562877     0.11842105 0.79343073        0.3967154 0.9999994
## 26 0.009370817 0.09023704     0.09868421 0.51556265        0.2577813 0.9999994
##    OutlierFlag
## 1        FALSE
## 9        FALSE
## 16       FALSE
## 19       FALSE
## 25       FALSE
## 26       FALSE

We are missing chromosome information, which we want to make a beautiful manhattan plot that is colored by chromosome. Remember how we made the snp.pos dataframe earlier? Basically we want to merge P1 and snp.pos.

Class Exercise 3

Merge two dataframes based on one column

Q1. Merge the P1 and snp.pos dataframes based on the LocusName column. There are many ways to do this. Two of my favorites are ‘merge()’ or ‘left_join()’ which is part of dplyr package. Use the help file for one of them and try to merge the two dataframes into a single called “merged.P1”

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Yay! Now that we have a merged dataframe we can make a manhattan plot:

plot(merged.P1$LocusName,P1$FST,xlab="Position",ylab="FST",col=merged.P1$chr,pch=19,cex=0.5)

What do you notice about this plot? Can you think of a possible explanation for this result?

10.9 Group Work Activity: PCA & Outflank

This week’s activity will combine skills you learned last week and this week using the quail dataset we familiarized ourselved with during class:

Rerun OUTFlank, but this time use the “Karyotype_cluster” cluster instead of geographic populations. This column tells you what inversion genotype a particular quail has. Compare this plot to the plot we made in class - why are there similarities and differences?
Create a PCA of the quail data. You can choose to color by geography or by karyotype (or get fancy with different shapes and colors!) Interpret the figure you make in one to two sentences

Copy the lines in your script used to answer these questions into your canvas submission. Be sure to include the answers for question 1, interpretations for the figures and the two figures you made as well.

10.10 Key Points

Standard organization of R scripts increases reproducibility
Fst outliers can be used to find genomic signals of local adaptation

Class Exercise Solutions

Class Exercises: Solutions

Exercise 1

Convert your vcf file to a gds file called “ChunkyQuails.gds”. Then read make a gds object called ‘genofile’.

seqVCF2GDS(vcf.path,"ChunkyQuails.gds")

## Wed Aug 27 13:46:42 2025
## Variant Call Format (VCF) Import:
##     file:
##         ChunkyQuails.vcf.gz (16.8M)
##     file format: VCFv4.0
##     genome reference: <unknown>
##     # of sets of chromosomes (ploidy): 2
##     # of samples: 80
##     genotype field: GT
##     genotype storage: bit2
##     compression method: LZMA_RA
##     # of samples: 80
##     INFO: NS,DP,AF
##     FORMAT: AD,DP,GQ,PL
## Output:
##     ChunkyQuails.gds
##     [Progress Info: ChunkyQuails.gds.progress]
## Parsing 'ChunkyQuails.vcf.gz':
## + genotype/data   { Bit2 2x80x112195 LZMA_ra(14.1%), 612.4K }
## Digests:
##     sample.id  [md5: 7491c6b4eaee9ba3078493a0d2907177]
##     variant.id  [md5: 59d8acf06ab4f39a290c4fc4a5e1016a]
##     position  [md5: 2536048a36b118218ef1c95e788a198f]
##     chromosome  [md5: a73e4a30c0efeea7a9842dde9cbc96f8]
##     allele  [md5: 9a28d049bceced835db2dfb86ed8cdfb]
##     genotype  [md5: 1d122df67114b974b6d70b98c3553153]
##     phase  [md5: e09164064bd27eb2cd5fbbe89cb2bd59]
##     annotation/id  [md5: a21597987108313d8c18eed6ac3cd45e]
##     annotation/qual  [md5: cc3abe146f504492261c55294bb7af3e]
##     annotation/filter  [md5: 50f223a513d64b2fed133f812af41f1f]
##     annotation/info/NS  [md5: 66886a2dd285c7b02a4d723ba71d3597]
##     annotation/info/DP  [md5: 66886a2dd285c7b02a4d723ba71d3597]
##     annotation/info/AF  [md5: 6e6dcb39f4b68c5a7dc6ad670c6ff366]
##     annotation/format/AD  [md5: 93096ea89bda017c7fd98c0875c519ef]
##     annotation/format/DP  [md5: 219e34546fd54cfbdaf77f852b8c9240]
##     annotation/format/GQ  [md5: fcbed5e7153bf05cf0718c7a549ba14f]
##     annotation/format/PL  [md5: 316019d9afcd850edd107263716149cc]
## Done.
## Wed Aug 27 13:47:27 2025
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'ChunkyQuails.gds' (15.8M)
##     # of fragments: 1017
##     save to 'ChunkyQuails.gds.tmp'
##     rename 'ChunkyQuails.gds.tmp' (15.8M, reduced: 11.1K)
##     # of fragments: 74
## Wed Aug 27 13:47:27 2025

genofile <- seqOpen("ChunkyQuails.gds")

Read in your metadata file to an object named ‘meta’. How many unique breeding locations are represented?

meta <- read.csv(meta.path)
table(meta$Loc1)

## 
## Canary Islands       CE Spain          Italy        Madeira        Morocco 
##             18              5              5             10             13 
##       NE Spain       NW Spain     S Portugal        S Spain 
##              6              6             10              7

length(unique(meta$Loc1))

## [1] 9

Exercise 2

Get the sample names from your gds object

samples <- read.gdsn(index.gdsn(genofile, "sample.id"))

Determine whether the sample IDs you extracted from the gds are in the same order as in the metadata file

identical(samples,meta$SampleID)

## [1] TRUE

Exercise 3

Merge the P1 and snp.pos dataframes based on the LocusName column.

merged.P1 <- merge(P1,snp.pos,by="LocusName")
library(dplyr)
merged.P1 <- left_join(P1,snp.pos,by="LocusName")

10 Week 9: Fst outlier analysis and chunky quails

10.1 Main Objectives:

10.2 Download the data

10.2.1 On Farm

10.2.2 On Your Own Computer

10.3 Unzip the Folder & Look at Data

10.4 Downloading R Packages

10.5 R hygiene

10.6 Finding outliers using OutFLANK

Class Exercise 1

10.7 Pulling SNP data from the gds file

10.8 Calculating Fst and identifying outliers

Class Exercise 2

Is your metadata in the same order as your genotype matrix?

Class Exercise 3

Merge two dataframes based on one column

10.9 Group Work Activity: PCA & Outflank

10.10 Key Points

Class Exercises: Solutions

Exercise 1

Exercise 2

Exercise 3

10 Week 9: F_st outlier analysis and chunky quails

10.8 Calculating F_st and identifying outliers