3 Week 2- Working with Files

We will be following some of the data carpentry tutorial again (Copyright 2016 @ Software Carpentry) “Introduction to the command line for genomics”. We have made some modifications to the data carpentry tutorial to fit our course.

3.1 Main Objectives:

  • Learn the general sequencing pipeline
  • Learn how to use navigational shortcuts and repeat recently used commands to make coding more efficient
  • Learn how to view and work with hidden directories and hidden files
  • Learn how to view and search file contents

3.3 Sequencing Pipeline

To give you some context on the data we are working with today let’s talk about sequencing! It is important to understand the Illumina sequencing pipeline that produced the fastq files we will be working with. This video from Illumina explains the steps well, and I highly recommend checking it out: https://youtu.be/fCd6B5HRaZ8?si=68ZqTO2p2wRX-tUW. The sequencing workflow involves library preparation, cluster generation, sequencing and data analysis. We won’t get into the details of library preparation and cluster generation today, but the video does a good job of explaining it!

Below is also a figure highlighting the different steps of illumina sequencing, starting with the randomly fragmented genomic DNA and ending with sequence data! The figure is from this paper if you want to see more: https://www.sciencedirect.com/science/article/pii/S1359644608001244#fig3

Figure 3. Illumina/Solexa process overview. Genomic DNA is randomly fragmented and adapters are added to each fragment end (a). Single-stranded fragments are then attached randomly inside flow cells (b). Bridge amplification is performed to generate double-stranded fragments (c). Following repeated cycles of denaturation and bridge amplification, millions of DNA copies are present in each flow cell (d). DNA sequence is determined with four-labeled reversible terminators primers and polymerase (e). Unincorporated terminators are washed, and the sequence is determined in each flow cell following laser excitation. The blocked 3′ terminus and fluorophore are removed from the incorporated base and the cycle is repeated (figure adapted from http://www.illumina.com). After we get our sequencing data back in .fq format, we can then use the package “fastqc” to process that data to fastq files which will help us understand the quality of of sequences. Now let’s talk about those files!

3.4 Details on the FASTQ format

A fastq file is a text file with information on the quality of the sequence data

Although it looks complicated (and it is), it’s easy to understand the fastq format with a little decoding. Some rules about the format include…

Line Description
1 Always begins with ‘@’ and then information about the read
2 The actual DNA sequence
3 Always begins with a ‘+’ and sometimes the same info in line 1
4 Has a string of characters which represent the quality (phred) scores; must have same number of characters as line 2

With a full dataset you will also get an html file for each sample that give you more visual representations of your sample quality. These screenshots of a good and bad QC come from tutorials from the bioinformatics core linked here: https://bioinfo-core.org/index.php/9th_Discussion-28_October_2010 We will not have these visuals for our dataset today, but it is helpful to know what else you’d be working with if this was your own dataset!

3.5 Our data set: FASTQ files

We can use the command wget which needs a link to the file that we want to download. If there’s a file saved on a website somewhere (anywhere on the internet) wget will download it for you. If our data file is on github, which is where most of our data will be stored, we’ll use the command git-clone

In this example we’re going to download all the material in our individual directory. Navigate there and then type the following command:

$ wget https://raw.githubusercontent.com/mlarmstrong/IntroGenomics_Data/main/week2.tar.gz

Use the tar command to uncompress the file. This will also automatically make a week2 directory in your individual directory:

tar -xzvf week2.tar.gz

Check out what is in this directory with the ls command. Remember you can do ls -l to see more information on the contents

untrimmed_fastq sra_metadata TableS2_QTL_Bay_2017.txt

Now that we know how to navigate around our directory structure, let’s start working with our sequencing files. We did a sequencing experiment and have two results files, which are stored in our untrimmed_fastq directory.

3.6 Wildcards

Navigate to your untrimmed_fastq directory in your week2 individual directory

$ cd /group/rbaygrp/eve198-genomics/yourdirectory/week2/untrimmed_fastq

We are interested in looking at the FASTQ files in this directory. We can list all files with the .fastq extension using the command:

$ ls *.fastq
SRR097977.fastq  SRR098026.fastq

The * character is a special type of character called a wildcard, which can be used to represent any number of any type of character. Thus, *.fastq matches every file that ends with .fastq.

This command:

$ ls *977.fastq
SRR097977.fastq

lists only the file that ends with 977.fastq.

This command:

$ ls /usr/bin/*.sh
/usr/bin/gettext.sh  /usr/bin/rescan-scsi-bus.sh

Lists every file in /usr/bin that ends in the characters .sh. Note that the output displays full paths to files, since each result starts with /.

This directory contains two files with .fastq extensions. FASTQ is a format for storing information about sequencing reads and their quality. We will be learning more about FASTQ files in a later lesson.

echo is a built-in shell command that writes its arguments, like a line of text to standard output. The echo command can also be used with pattern matching characters, such as wildcard characters. Here we will use the echo command to see how the wildcard character is interpreted by the shell.

$ echo *.fastq
SRR097977.fastq SRR098026.fastq

The * is expanded to include any file that ends with .fastq. We can see that the output of echo *.fastq is the same as that of ls *.fastq.

What would the output look like if the wildcard could not be matched? Compare the outputs of echo *.missing and ls *.missing.

$ echo *.missing
*.missing
$ ls *.missing
ls: cannot access '*.missing': No such file or directory

Class Exercise 1

Do each of the following tasks from your current directory using a single ls command for each:

  1. List all of the files in /usr/bin that start with the letter ‘c’.
  2. List all of the files in /usr/bin that contain the letter ‘a’.
  3. List all of the files in /usr/bin that end with the letter ‘o’.

Bonus: List all of the files in /usr/bin that contain the letter ‘a’ or the letter ‘c’.

Hint: The bonus question requires a Unix wildcard that we haven’t talked about yet. Try searching the internet for information about Unix wildcards to find what you need to solve the bonus problem.

3.7 Command History

If you want to repeat a command that you’ve run recently, you can access previous commands using the up arrow on your keyboard to go back to the most recent command. Likewise, the down arrow takes you forward in the command history.

A few more useful shortcuts:

  • Ctrl+C will cancel the command you are writing, and give you a fresh prompt.
  • Ctrl+R will do a reverse-search through your command history. This is very useful.
  • Ctrl+L or the clear command will clear your screen.

You can also review your recent commands with the history command, by entering:

$ history

to see a numbered list of recent commands. You can reuse one of these commands directly by referring to the number of that command.

For example, if your history looked like this:

259  ls *
260  ls /usr/bin/*.sh
261  ls *R1*fastq

then you could repeat command #260 by entering:

$ !260

Type ! (exclamation point) and then the number of the command from your history. You will be glad you learned this when you need to re-run very complicated commands. For more information on advanced usage of history, read section 9.3 of Bash manual.

3.8 Examining Files

We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?

One way to examine a file is to print out all of the contents using the program cat.

Enter the following command from within the untrimmed_fastq directory:

$ cat SRR098026.fastq

This will print out all of the contents of the SRR098026.fastq to the screen.

Class Exercise 2

  1. Print out the contents of the /group/rbaygrp/eve198-genomics/'yourdirectory'/Week2/untrimmed_fastq/SRR097977.fastq file. What is the last line of the file?
  2. From your home directory, and without changing directories, use one short command to print the contents of all of the files in the group/rbaygrp/eve198-genomics/yourdirectory/week2/untrimmed_fastq directory.

cat is a terrific program, but when the file is really big, it can be annoying to use. The program, less, is useful for this case. less opens the file as read only, and lets you navigate through it. The navigation commands are identical to the man program.

Enter the following command:

$ less SRR097977.fastq

Some navigation commands in less:

key action
Space to go forward
b to go backward
g to go to the beginning
G to go to the end
q to quit

less also gives you a way of searching through files. Use the “/” key to begin a search. Enter the word you would like to search for and press enter. The screen will jump to the next location where that word is found.

Shortcut: If you hit “/” then “enter”, less will repeat the previous search. less searches from the current location and works its way forward. Scroll up a couple lines on your terminal to verify you are at the beginning of the file. Note, if you are at the end of the file and search for the sequence “CAA”, less will not find it. You either need to go to the beginning of the file (by typing g) and search again using / or you can use ? to search backwards in the same way you used / previously.

For instance, let’s search forward for the sequence TTTTT in our file. You can see that we go right to that sequence, what it looks like, and where it is in the file. If you continue to type / and hit return, you will move forward to the next instance of this sequence motif. If you instead type ? and hit return, you will search backwards and move up the file to previous examples of this motif.

Class Exercise 3

What are the next three nucleotides (characters) after the first instance of the sequence quoted above?

Remember, the man program actually uses less internally and therefore uses the same commands, so you can search documentation using “/” as well!

There’s another way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.

The commands are head and tail and they let you look at the beginning and end of a file, respectively.

$ head SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
$ tail SRR098026.fastq
+SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
#!##!#################!!!!!!!######
@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
+SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
B!;?!A=5922:##########!!!!!!!######
@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
+SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
A!@B!BBB@ABAB#########!!!!!!!######

The -n option to either of these commands can be used to print the first or last n lines of a file.

$ head -n 1 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
$ tail -n 1 SRR098026.fastq
A!@B!BBB@ABAB#########!!!!!!!######

We can view the first complete read in one of the files in our dataset by using head to look at the first four lines.

$ head -n 4 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

All but one of the nucleotides in this read are unknown (N). This is a pretty bad read!

Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the line above, the quality score line is:

!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

The # character and each of the ! characters represent the encoded quality for an individual nucleotide. The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding, Illumina version 1.8 onwards.

Here is a link showing what those different symbols mean for quality scores: https://help.basespace.illumina.com/files-used-by-basespace/quality-scores

Each character is assigned a quality score between 0 and 42 as shown in the chart below.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40..                          

Each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base calling algorithm and dependent on how much signal was captured for the base incorporation.

Looking back at our read:

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

we can now see that the quality of each of the Ns is 0 and the quality of the only nucleotide call (C) is also very poor (# = a quality score of 2). This is indeed a very bad read.

3.9 Group Work Activity- Examining a Fastq File

Navigate to eve198-genomics and copy the ‘CCGPMC004_M0D060025C_S150_L003_R1_001.fastq.gz’ file in the week2_activity directory to your individual directory.

cd .. #go back out of your individual directory
cd week2_activity
cp *fastqc.gz ../`yourdirectory`

Next you will need to unzip the fastq file. To do this we will use the command “gunzip”. This unzips a gzipped file. “gzip” zips a file again. It will take a second since this is a larger file than our previous examples. It is actually an urchin sample from a recent genomics project Rachael and I worked on!

gunzip CCGPMC004_M0D060025C_S150_L003_R1_001.fastq.gz

Then answer the following questions and submit them to the “Week 2: Examining Files” assignment on canvas. I would not recommend using ‘less’ or ‘cat’ on this file due to its size.

  1. Write out the first two lines of CCGPMC004_M0D060025C_S150_L003_R1_001.fastq
  2. Remember that line 4 in a fastq file shows the quality (phred) scores. If you look at line 4 and reference the score table from the link and figures below. What is the most common symbol, Q-score and error probability for most of our nucleotide calls?

https://help.basespace.illumina.com/files-used-by-basespace/quality-scores QC Error Table Score Table Score Table

3.10 Key Points

  • The /, ~, and .. characters represent important navigational shortcuts.

  • The history command and the up arrow on your keyboard can be used to repeat recently used commands.

  • You can view file contents using less, cat, head or tail.

Class Exercise Solutions

Exercise 1: Solution

  1. ls /usr/bin/c*
  2. ls /usr/bin/*a*
  3. ls /usr/bin/*o
    Bonus: ls /usr/bin/*[ac]*

Exercise 2: Solution

  1. The last line of the file is C:CCC::CCCCCCCC<8?6A:C28C<608'&&&,'$.
  2. cat ./~/group/rbaygrp/eve198-genomics/yourdirectory/week2/untrimmed_fastq/*

Exercise 3: Solution

CAC