4 Week 3: Awk and how to download files

For this week, we’ll continue to use the data that you downloaded last week. If you need to download it again, use the wget command below to download the tar file.

cd /home/margeno/
wget https://raw.githubusercontent.com/BayLab/MarineGenomicsData/main/week2.tar.gz

Use tar to decompress and extract the file:


tar -xzvf week2.tar.gz

This will create a week2 directory in your MarineGenomics directory (it will also create the MarineGenomics directory if you don’t already have one).
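The tar options used above are x (extract), z (decompress with gzip), v (list each file as it is processed), and f (read from the named file). As a minimal sketch, the commands below build a small throwaway archive and list its contents with t before extracting it. The names demo_week and notes.txt are made up for illustration and are not part of the course data.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny demo archive (hypothetical names, not course data).
mkdir demo_week
echo "hello" > demo_week/notes.txt
tar -czf demo_week.tar.gz demo_week

# -t lists the contents of the archive without extracting anything.
listing=$(tar -tzf demo_week.tar.gz)
echo "$listing"

# Remove the original directory, then let -x recreate it from the archive.
rm -r demo_week
tar -xzf demo_week.tar.gz
extracted=$(cat demo_week/notes.txt)
echo "$extracted"
```

Listing an archive first is a cheap way to see which directories it will create before you extract it.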

4.1 AWK

Awk is a fast and versatile pattern-matching programming language. It can do the same tasks as sed, grep, cat, and wc, and a lot more (see https://www.gnu.org/software/gawk/manual/gawk.html). Awk deserves a full class of its own, so this section is only meant to make you aware that the program exists.

Let’s see how awk can behave like wc.

$ cd /home/margeno/MarineGenomics/week2/
$ ls 

TableS2_QTL_Bay_2017.txt  sra_metadata  untrimmed_fastq

This table, ~/MarineGenomics/week2/TableS2_QTL_Bay_2017.txt, is from the Bay et al. 2017 publication, and we will use it as our example file for this section.

We can look inside the file by using cat or awk

$ awk '{print $0}' TableS2_QTL_Bay_2017.txt

The instructions are enclosed in single quotes.

This command has the same output as cat: it prints each line of the example file TableS2_QTL_Bay_2017.txt.

The structure of the instruction is the following:

  • curly braces surround the set of instructions
  • print is the instruction that sends its arguments to the terminal
  • $0 is a variable that means “the content of the current line”

As you can see, the file contains a table.

Trait   n   LOD Chr Position (cM)   Nearest SNP 
mate choice 200 4.5 14  22.43   chrXIV:1713227 
mate choice     200 4.61    21  8   chrXXI:9373717 
discriminant function   200 4.83    12  17  chrXII:7504339 
discriminant function   200 4.23    14  8.1 chrXIV:4632223 
PC2 200 4.04    4   30.76   chrIV:11367975 
PC2 200 6.67    7   47  chrVII:26448674 
centroid size   200 6.97    9   47.8    chrIX:19745222 
x2* 200 3.93    7   60  chrUn:29400087 
y2* 200 9.99    4   32  chrIV:11367975 
x3  200 4.45    1   32.3    chrI:15145305 
x4  200 5.13    16  30.9    chrXVI:12111717 
x5* 200 4.54    15  6   chrXV:505537 
y5  200 4.21    4   24.9    chrIV:15721538 
x6  200 3.96    16  29.5    chrXVI:13588796 
y6* 200 4.14    9   30.2    chrIX:18942598 
y15*    200 5.3 2   27  chrII:19324477 
x16 200 5.49    7   60  chrUn:29400087 
x17     200 4.92    1   32.8    chrI:14261764 
Table S2. Significant QTL loci for mate choice and morphology
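To check that awk '{print $0}' really reproduces what cat prints, here is a small sketch that compares the two on a demo file. The demo file holds only a few shortened rows in the same tab-separated layout as the table; it is an assumption for illustration, not the course file.

```shell
# Make a small tab-separated demo file (shortened rows, for illustration only).
demo=$(mktemp)
printf 'Trait\tn\tLOD\n' > "$demo"
printf 'mate choice\t200\t4.5\n' >> "$demo"
printf 'PC2\t200\t4.04\n' >> "$demo"

# $0 is the whole current line, so this prints the file unchanged, like cat.
awk_out=$(awk '{print $0}' "$demo")
cat_out=$(cat "$demo")

if [ "$awk_out" = "$cat_out" ]; then
    same="yes"
else
    same="no"
fi
echo "awk and cat agree: $same"

# print with no arguments also prints $0, so this is a shorter equivalent.
short_out=$(awk '{print}' "$demo")
```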

Now let’s use awk to count the lines of a file, similarly to what wc -l would do.

As you probably remember, -l is an option that asks for the number of lines only.

However, wc -l counts the newline characters in the file. If the last line does not end with a newline character, the result will be the actual number of lines minus one.

$ wc -l TableS2_QTL_Bay_2017.txt
19 TableS2_QTL_Bay_2017.txt

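A workaround is to use awk. Awk is a command line program that takes as input a set of instructions and one or more files. The instructions are executed on each line of the input file(s). NR is a built-in variable that holds the number of lines read so far, so the last value printed is the total number of lines.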

$ awk '{print NR;}' TableS2_QTL_Bay_2017.txt | tail -1
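The difference between the two counts is easiest to see on a file you build yourself. The sketch below writes three lines with no newline after the last one (the file contents are made up), then counts them both ways. It also uses an END block, which runs once after the last line has been read, as an alternative to piping into tail -1.

```shell
# Create a three-line file whose last line has NO trailing newline.
demo=$(mktemp)
printf 'line one\nline two\nline three' > "$demo"

# wc -l counts newline characters, so it reports one fewer than the number of lines.
# tr strips the padding spaces that some versions of wc add.
wc_count=$(wc -l < "$demo" | tr -d ' ')

# awk counts every line it reads; in an END block NR holds the final total.
awk_count=$(awk 'END {print NR}' "$demo")

echo "wc -l: $wc_count"
echo "awk:   $awk_count"
```

Here wc -l reports 2 while awk reports 3, which is the off-by-one described above.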

Awk can also search within a file like grep can. Let’s see if there are any significant QTL loci on the chromosome “chrXIV”:

$ awk '/chrXIV/' TableS2_QTL_Bay_2017.txt

This chromosome has two significant QTL loci: one for mate choice and one for discriminant function.
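As a small sketch of the grep-like behaviour, the commands below copy three rows of the table into a demo file, compare awk with grep, and count the matching lines. The counter variable n is just a name chosen for this example.

```shell
# Three rows copied from the table into a tab-separated demo file.
demo=$(mktemp)
printf 'mate choice\t200\t4.5\t14\t22.43\tchrXIV:1713227\n' > "$demo"
printf 'PC2\t200\t4.04\t4\t30.76\tchrIV:11367975\n' >> "$demo"
printf 'discriminant function\t200\t4.23\t14\t8.1\tchrXIV:4632223\n' >> "$demo"

# A /pattern/ with no instructions prints every matching line, like grep.
awk_hits=$(awk '/chrXIV/' "$demo")
grep_hits=$(grep 'chrXIV' "$demo")

# Count the matches; n+0 prints 0 instead of an empty string when nothing matches.
hit_count=$(awk '/chrXIV/ {n++} END {print n+0}' "$demo")

echo "$awk_hits"
echo "matching lines: $hit_count"
```

Note that the chrIV row is not printed: the pattern has to match the full text chrXIV.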

 

When to use awk?

  • for search and replace in large files (it’s fast!)
  • when manipulating multiple large files
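Search and replace has not been shown yet, so here is a minimal sketch using awk's built-in gsub function on two rows from the table. The replacement label chrUnplaced is made up for illustration.

```shell
# Two rows copied from the table into a tab-separated demo file.
demo=$(mktemp)
printf 'x2*\t200\t3.93\t7\t60\tchrUn:29400087\n' > "$demo"
printf 'x3\t200\t4.45\t1\t32.3\tchrI:15145305\n' >> "$demo"

# gsub(regex, replacement) replaces every match in the current line ($0);
# print then outputs the edited line. The input file itself is not changed.
replaced=$(awk '{gsub(/chrUn/, "chrUnplaced"); print}' "$demo")
echo "$replaced"
```

To keep the result, redirect the output to a new file rather than writing over the input.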

4.2 Moving and Downloading Data

Below we’ll show you some commands to download data onto your instance, or to move data between your computer and the cloud.

4.3 Getting data from the cloud

There are two programs that will download data from a remote server to your local (or remote) machine: wget and curl. They were designed to do slightly different tasks by default, so you’ll need to give the programs somewhat different options to get the same behaviour, but they are mostly interchangeable.

  • wget is short for “world wide web get”, and its basic function is to download web pages or data at a web address.

  • cURL is a pun: it is meant to be read as “see URL”, and its basic function is to display web pages or data at a web address.

Which one you need to use mostly depends on your operating system, as most computers will only have one or the other installed by default.

Today we will use wget to download some data from Ensembl.

Exercise

Before we can start our download, we need to know whether we’re using curl or wget.

To see which program you have, type:

$ which curl
$ which wget

which is a command that searches the directories listed in your PATH for a program and tells you which folder it is installed in. If it can’t find the program you asked for, it returns nothing, i.e. gives you no results.

If wget is installed, you’ll get output like the following (a Mac will typically have curl but not wget by default):

$ which wget
/usr/bin/wget

Once you know which program you have, you can download the file. With wget, the commands are:

 

$ cd
$ wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt

Let’s check that the file from Ensembl downloaded:

ls species_EnsemblBacteria.txt

It did!
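The download step above assumed wget. As a sketch for machines that only have curl, the two helper functions below (pick_downloader and fetch are hypothetical names invented for this example) pick whichever program is available. curl prints to the terminal by default, so it needs -O to save the data under the remote file name, and -L to follow redirects.

```shell
# Print the name of the first downloader found on this machine.
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget"
    elif command -v curl >/dev/null 2>&1; then
        echo "curl"
    else
        return 1
    fi
}

# Download a URL into the current directory, keeping the remote file name.
fetch() {
    case "$(pick_downloader)" in
        wget) wget "$1" ;;
        curl) curl -L -O "$1" ;;   # -O saves to a file instead of printing
        *)    echo "neither wget nor curl found" >&2; return 1 ;;
    esac
}

downloader=$(pick_downloader) || downloader="none"
echo "Using: $downloader"

# Example call (commented out so that pasting this block does not start a download):
# fetch ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
```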

4.4 Downloading files from GitHub

GitHub is a useful place to store data files and scripts, and it is widely used by researchers in many different fields, including genomics. There are a few useful tricks for transferring files from GitHub to your own terminal.

There are two main ways to transfer files from GitHub:

  • use git clone to download an entire repository (directory)
  • use wget to download a single file

If you’re interested in getting a repository and all of its contents you can use git clone. This can be useful if you’re interested in using data files and the scripts that come with them.

First navigate to your home directory with cd (running cd with no argument takes you to your home directory).

Then use git clone to download the repository reallycoolrepo:


cd

git clone https://github.com/SerenaCaplins/reallycoolrepo.git

This should make a new directory called reallycoolrepo. Let’s ls in this directory to see what’s in it.

ls reallycoolrepo/

files forloop.sh MarineGenomics.txt README.md

We have three files and one directory here. You can view the MarineGenomics.txt file with cat:


cat reallycoolrepo/MarineGenomics.txt

  __  __                  _                     _____                                      _              
 |  \/  |                (_)                   / ____|                                    (_)             
 | \  / |   __ _   _ __   _   _ __     ___    | |  __    ___   _ __     ___    _ __ ___    _    ___   ___ 
 | |\/| |  / _` | | '__| | | | '_ \   / _ \   | | |_ |  / _ \ | '_ \   / _ \  | '_ ` _ \  | |  / __| / __|
 | |  | | | (_| | | |    | | | | | | |  __/   | |__| | |  __/ | | | | | (_) | | | | | | | | | | (__  \__ \
 |_|  |_|  \__,_| |_|    |_| |_| |_|  \___|    \_____|  \___| |_| |_|  \___/  |_| |_| |_| |_|  \___| |___/
                                                                                                          

Pretty cool huh?

Using git clone to get an entire repository can be useful, but often we’re just interested in getting a single file. We’ve already learned how to get files using wget, but this isn’t as straightforward on GitHub. To illustrate where it goes wrong, let’s navigate to a repository that contains a file we’re interested in.

Copy and paste this link into your browser:

https://github.com/BayLab/MarineGenomicsData

This is the repository where we have been storing all of the data for the class. We typically download a single tar file each week instead of cloning the whole repository all at once (this allows us to make changes to each week without having to download the whole repo every week, which would also overwrite your files).

Say you wanted to get the week10.tar.gz file.

There are a few ways to get the file path, but perhaps the easiest is to click on the file and copy the path that shows up in your browser.

Week3_gitpath.jpg

https://github.com/BayLab/MarineGenomicsData/blob/main/week10.tar.gz

Seems fine, right? Let’s try to download this into our home directory with wget.

wget https://github.com/BayLab/MarineGenomicsData/blob/main/week10.tar.gz

Now let’s try and untar it


tar -xvzf week10.tar.gz

This prints an error message:


gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

What went wrong? We get a clue if we use less to view the file. You normally wouldn’t use less to view a tar.gz file, but in this case it will tell us something useful. (Useful tip: on many systems less can also display .gz files!)


less week10.tar.gz

You will see something like this:


<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">



  <link crossorigin="anonymous" media="all" integrity="sha512-Xvl7qd6ZFq6aBrViMpY+7UKRL79QzxxYG1kyELGe/sH4sV3eCks8DDXxa3WolACcKPac42eqrfe6m0jazyAIPQ==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-5ef97ba9de9916ae9a06b56232963eed.css" />
  <link crossorigin="anonymous" media="all" integrity="sha512-24GJDHWJro3USSMV5JFy5QbE8eCNYG61UucNp7vJMTaeJMrBy6FLiLFgX9jXnWlddv2VRu/rTLIkxzuRDF9ZVA==" rel="stylesheet" href="https://github.githubassets.com/assets/colors-v2-db81890c7589ae8dd4492315e49172e5.css" />
    <link crossorigin="anonymous" media="all" integrity="sha512-rcBopHrwspQORpXVLihZMP22sFwuIo3fL1DyFo5aXwWnV5FzV/nlAGnX/36fI9GQVc2VN7MiIT34RMCwq8jemg==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-adc068a47af0b2940e4695d52e285930.css" />

This is not what should be in the file. We’re seeing HTML code like what you would see for a website. This is because the link we got from GitHub points to the HTML page that displays the file, not to the actual file itself.
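You can also catch this problem before running tar. As a sketch, gzip -t tests whether a file is valid gzip data and exits with an error if it is not. The two files below are made-up stand-ins: one is real gzip data and the other is an HTML line saved under a .tar.gz name. check_gzip is a hypothetical helper name. Note that gzip -t only checks the gzip layer, not whether a tar archive is inside.

```shell
workdir=$(mktemp -d)

# A real gzip file, and an HTML page saved under a misleading .tar.gz name.
echo "real data" | gzip > "$workdir/good.tar.gz"
echo '<!DOCTYPE html>' > "$workdir/bad.tar.gz"

# gzip -t exits non-zero when the file is not valid gzip data.
check_gzip() {
    if gzip -t "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "not gzip"
    fi
}

good_result=$(check_gzip "$workdir/good.tar.gz")
bad_result=$(check_gzip "$workdir/bad.tar.gz")
echo "good.tar.gz: $good_result"
echo "bad.tar.gz:  $bad_result"

# Peeking at the first line shows what was actually downloaded.
first_line=$(head -n 1 "$workdir/bad.tar.gz")
echo "$first_line"
```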

Let’s remove this week10.tar.gz file


rm week10.tar.gz

GitHub displays the contents of repositories as HTML pages, so to use wget we need the path to the actual file itself.

You can get it by copying the raw file path, which you will find by clicking view raw, or by control-clicking (Mac) or right-clicking (PC) the download tab and selecting copy link address.

Week3_rawgitpath.jpg

Let’s try this again with our new link. It should have the word raw somewhere in the file path:


wget https://github.com/BayLab/MarineGenomicsData/raw/main/week10.tar.gz

Now untar it


tar -xzvf week10.tar.gz

You should see something like this, which tells us that it worked!!

MarineGenomicsData/Week10/
MarineGenomicsData/Week10/candidate_fastas.fa

You can remove this as we won’t be using it later in the course

rm -r MarineGenomicsData

So we need to find the raw file path to use wget on any single file that we want from GitHub.
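Comparing the two links used above, the only difference is that /blob/ became /raw/. As a sketch based on that one observation (it is an assumption that this holds for other files, and to_raw_url is a hypothetical helper name), the swap can be done with sed:

```shell
# Turn a GitHub "blob" page URL into the matching "raw" URL by replacing
# the first /blob/ in the path with /raw/.
to_raw_url() {
    printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

page_url="https://github.com/BayLab/MarineGenomicsData/blob/main/week10.tar.gz"
raw_url=$(to_raw_url "$page_url")
echo "$raw_url"
```

The printed link matches the raw link that was passed to wget above. If in doubt, copy the raw link from the browser instead.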

4.5 Final bash wrap-up

We’ve covered a lot of ground so far in the last 2 and a half weeks! It’s a good time to review the commands we’ve learned and the skills we’re starting to develop.

In week 1, we showed you:

  • how to access Jetstream, a cloud computing resource
  • how to navigate the terminal with bash/UNIX commands such as ls, cd, mv, mkdir, and cp
  • the difference between full and relative file paths:
      ◦ full path example: /home/margeno/MarineGenomics/week2/README.txt
      ◦ relative path from the MarineGenomics directory: week2/README.txt
  • how to use Tab to autofill commands and file paths
  • the man command to see full parameters for bash commands

In week 2, we covered:

  • how to view files using less, cat, head, tail
  • how to view and modify file and directory permissions using chmod
  • how to use wildcards like * to view directory contents
  • how to oh so cautiously use rm to permanently delete a file
  • how to use grep to search a file and >> to append search results to a file
  • how to write a script using a text editor (nano in our case)
  • how to execute a script from a saved file with bash or by making it an executable program with chmod
  • how to write for loops

Finally, in week 3, we learned:

  • how to use Awk to view, count, and search the contents of a file
  • how to move and download data

At this point, if you haven’t already, it’s a good time to make a cheat sheet of the commands we’ve learned to keep by your computer so you can reference them at any time.

There are several very good bash/Unix cheat sheets available online. Here are links to a few of them:

https://cheatography.com/gregcheater/cheat-sheets/bash/

https://cheatography.com/davechild/cheat-sheets/linux-command-line/

https://www.loggly.com/blog/the-essential-cheat-sheet-for-linux-admins/?utm_source=LinkInPDF&utm_medium=social-media&utm_campaign=SocialPush