6 Week 5- The R environment


This lesson is modified from materials of the STEMinist_R lessons produced by several UC Davis graduate students, which can be found here. The lessons were shortened to fit into two 75-minute sessions.

These materials are evenly divided between live coding examples performed by the instructor and exercises performed by the students.

This class will take place with students typing directly into an R script for the exercises, all of which can be found in the Week 5 semester file here.

You can download the R files for this week via wget in the terminal with the following link:

wget https://raw.githubusercontent.com/BayLab/MarineGenomicsData/main/week5_semester.tar.gz

This is a compressed file, which can be uncompressed via:

tar -xzvf week5_semester.tar.gz

You can now open R and load in the R_Day_1_Lesson.R file. This is the script that we will work out of for the rest of the week. You can see it contains many commented sections that begin with a #. The # allows you to add comments to your code, explaining what you are doing for each line of code. Commenting code is very important! It explains to someone else what your code does, and can even be useful when you revisit your own code after a few weeks/months/years. Be nice to your future self: comment your code.

The next section contains the commented out code and the script that is run in R in a format that is more easily readable on a website.

6.1 Lesson 1: Orientation to R

R can be used for basic arithmetic:

5+10+23

#> [1] 38

It can also store values in variables:

You can assign an object using an assignment operator <- or =.

number<-10
numbers<-c(10, 11, 12, 14, 16)

You can see your assigned object by typing the name you gave it.

number

#> [1] 10

numbers

#> [1] 10 11 12 14 16

Objects can be numbers or characters:

cat<-"meow"
dog<-"woof"

We can use colons to get sequences of numbers:

n<-1:100

Vectors can also include characters (in quotes). c() stands for concatenate, aka link things together!

animals<-c("woof", "meow", "hiss", "baa")
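As a small sketch of what "linking things together" means, c() can also join vectors that already exist. The object names below are made up for this illustration and are not part of the lesson script. The last lines show that a vector holds a single type of data, so mixing numbers and characters turns everything into characters.

```r
# hypothetical example vectors, not part of the lesson script
pets <- c("woof", "meow")
farm <- c("baa", "moo")

# c() links the two vectors end to end
all_sounds <- c(pets, farm)
all_sounds

# a vector holds one type of data, so a number mixed
# with a character is converted to a character
mixed <- c(1, "meow")
class(mixed)
```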

6.2 Manipulating a vector object

We can get summaries of vectors with summary()

summary(n)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00   25.75   50.50   50.50   75.25  100.00

We can see how long a vector is with length()

length(n)

#> [1] 100

You can use square brackets [] to get parts of vectors.

n[50]

#> [1] 50
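Square brackets accept more than a single position. The sketch below recreates n so that it runs on its own, then pulls out a range, a few chosen positions, and everything except the first element.

```r
# recreate the vector from above so this block runs on its own
n <- 1:100

# a range of positions
first_five <- n[1:5]
first_five

# several specific positions, chosen with c()
picked <- n[c(2, 50, 100)]
picked

# a negative index drops that position
without_first <- n[-1]
length(without_first)
```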

6.3 Operations act on each element of a vector:

# +2
numbers+2

#> [1] 12 13 14 16 18

# *2
numbers*2

#> [1] 20 22 24 28 32

# mean
mean(numbers)

#> [1] 12.6

# ^2
numbers^2

#> [1] 100 121 144 196 256

# sum
sum(numbers)

#> [1] 63

6.4 Operations can also work with two vectors:

#define a new object y
y<-numbers*2
# numbers + y
numbers + y

#> [1] 30 33 36 42 48

# numbers * y
numbers * y

#> [1] 200 242 288 392 512
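To make the pairing explicit, here is a minimal sketch showing that the first element of the result comes from the first elements of each vector. The vector called short is a made-up example: when one vector is shorter, R reuses (recycles) its values to match the longer one, which is also why adding a single number to a vector works.

```r
# recreate the objects from above so this block runs on its own
numbers <- c(10, 11, 12, 14, 16)
y <- numbers * 2

# vectors are combined position by position
both <- numbers + y
both[1] == numbers[1] + y[1]

# a hypothetical shorter vector is recycled: 1, 2, 1, 2
short <- c(1, 2)
recycled <- c(10, 11, 12, 14) + short
recycled
```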

6.5 A few tips below for working with objects:

We can keep track of what objects R is using, with the functions ls() and objects()

ls()

#>   [1] "a"                        "adj.p.val1"               "adj.p.val2"               "adj.p.val3"              
#>   [5] "adj.p.val4"               "adj.p.val5"               "admix.props"              "animals"                 
#>   [9] "areaCircle"               "babynames"                "beag"                     "beag_allIND"             
#>  [13] "beag_allIND_final"        "cali"                     "candidates"               "candidates.1"            
#>  [17] "candidates.2"             "candidates.3"             "candidates.4"             "candidates.5"            
#>  [21] "cat"                      "ChickWeight"              "chlomean"                 "conStruct.data"          
#>  [25] "conStruct.results"        "cov"                      "data"                     "data.block"              
#>  [29] "data_to_plot"             "df1"                      "df2"                      "df3"                     
#>  [33] "df4"                      "dog"                      "e"                        "environ"                 
#>  [37] "fst"                      "g"                        "G"                        "gen"                     
#>  [41] "gen_allIND"               "geno"                     "genos"                    "geo_dist"                
#>  [45] "i"                        "il"                       "iris"                     "L"                       
#>  [49] "lambda1"                  "lambda2"                  "lambda3"                  "lambda4"                 
#>  [53] "lambda5"                  "listy"                    "livability"               "lrt"                     
#>  [57] "lrt_filt"                 "lrt_rando"                "meta"                     "meta.path"               
#>  [61] "msleep"                   "my.colors"                "my.new"                   "my.run"                  
#>  [65] "my_colors"                "my_list"                  "n"                        "names"                   
#>  [69] "ne.pacific"               "number"                   "numbers"                  "OF"                      
#>  [73] "outliers"                 "P1"                       "ph"                       "pheno_chr6"              
#>  [77] "precip_seasonality"       "precip_w.crop"            "precip_wettest"           "project"                 
#>  [81] "q"                        "qval"                     "rando_filt"               "salinity"                
#>  [85] "sea_cuc_geno"             "sea_cuc_lfmm"             "setosa1.petallength"      "setosa1.petalwidth"      
#>  [89] "setosa1area2"             "sites"                    "sites_environ"            "sites_environ_matrix"    
#>  [93] "sites_environ_matrix_nas" "squares"                  "sst_max"                  "sst_max.crop"            
#>  [97] "sst_mean"                 "state.abb"                "state.area"               "state.center"            
#> [101] "state.division"           "state.name"               "state.region"             "state.x77"               
#> [105] "states"                   "states_standardized"      "subgen"                   "submeta"                 
#> [109] "subpops"                  "subsleep"                 "tG"                       "tr_msleep"               
#> [113] "vcf.path"                 "w"                        "x"                        "y"                       
#> [117] "zs1"                      "zs2"                      "zs3"                      "zs4"                     
#> [121] "zs5"

objects() # returns the same result as ls(); the two functions do the same thing

#>   [1] "a"                        "adj.p.val1"               "adj.p.val2"               "adj.p.val3"              
#>   [5] "adj.p.val4"               "adj.p.val5"               "admix.props"              "animals"                 
#>   [9] "areaCircle"               "babynames"                "beag"                     "beag_allIND"             
#>  [13] "beag_allIND_final"        "cali"                     "candidates"               "candidates.1"            
#>  [17] "candidates.2"             "candidates.3"             "candidates.4"             "candidates.5"            
#>  [21] "cat"                      "ChickWeight"              "chlomean"                 "conStruct.data"          
#>  [25] "conStruct.results"        "cov"                      "data"                     "data.block"              
#>  [29] "data_to_plot"             "df1"                      "df2"                      "df3"                     
#>  [33] "df4"                      "dog"                      "e"                        "environ"                 
#>  [37] "fst"                      "g"                        "G"                        "gen"                     
#>  [41] "gen_allIND"               "geno"                     "genos"                    "geo_dist"                
#>  [45] "i"                        "il"                       "iris"                     "L"                       
#>  [49] "lambda1"                  "lambda2"                  "lambda3"                  "lambda4"                 
#>  [53] "lambda5"                  "listy"                    "livability"               "lrt"                     
#>  [57] "lrt_filt"                 "lrt_rando"                "meta"                     "meta.path"               
#>  [61] "msleep"                   "my.colors"                "my.new"                   "my.run"                  
#>  [65] "my_colors"                "my_list"                  "n"                        "names"                   
#>  [69] "ne.pacific"               "number"                   "numbers"                  "OF"                      
#>  [73] "outliers"                 "P1"                       "ph"                       "pheno_chr6"              
#>  [77] "precip_seasonality"       "precip_w.crop"            "precip_wettest"           "project"                 
#>  [81] "q"                        "qval"                     "rando_filt"               "salinity"                
#>  [85] "sea_cuc_geno"             "sea_cuc_lfmm"             "setosa1.petallength"      "setosa1.petalwidth"      
#>  [89] "setosa1area2"             "sites"                    "sites_environ"            "sites_environ_matrix"    
#>  [93] "sites_environ_matrix_nas" "squares"                  "sst_max"                  "sst_max.crop"            
#>  [97] "sst_mean"                 "state.abb"                "state.area"               "state.center"            
#> [101] "state.division"           "state.name"               "state.region"             "state.x77"               
#> [105] "states"                   "states_standardized"      "subgen"                   "submeta"                 
#> [109] "subpops"                  "subsleep"                 "tG"                       "tr_msleep"               
#> [113] "vcf.path"                 "w"                        "x"                        "y"                       
#> [117] "zs1"                      "zs2"                      "zs3"                      "zs4"                     
#> [121] "zs5"

# how to get help for a function; you can also write help(ls)
?ls
# you can get rid of objects you don't want
rm(numbers)
# and make sure it got rid of them
ls()

#>   [1] "a"                        "adj.p.val1"               "adj.p.val2"               "adj.p.val3"              
#>   [5] "adj.p.val4"               "adj.p.val5"               "admix.props"              "animals"                 
#>   [9] "areaCircle"               "babynames"                "beag"                     "beag_allIND"             
#>  [13] "beag_allIND_final"        "cali"                     "candidates"               "candidates.1"            
#>  [17] "candidates.2"             "candidates.3"             "candidates.4"             "candidates.5"            
#>  [21] "cat"                      "ChickWeight"              "chlomean"                 "conStruct.data"          
#>  [25] "conStruct.results"        "cov"                      "data"                     "data.block"              
#>  [29] "data_to_plot"             "df1"                      "df2"                      "df3"                     
#>  [33] "df4"                      "dog"                      "e"                        "environ"                 
#>  [37] "fst"                      "g"                        "G"                        "gen"                     
#>  [41] "gen_allIND"               "geno"                     "genos"                    "geo_dist"                
#>  [45] "i"                        "il"                       "iris"                     "L"                       
#>  [49] "lambda1"                  "lambda2"                  "lambda3"                  "lambda4"                 
#>  [53] "lambda5"                  "listy"                    "livability"               "lrt"                     
#>  [57] "lrt_filt"                 "lrt_rando"                "meta"                     "meta.path"               
#>  [61] "msleep"                   "my.colors"                "my.new"                   "my.run"                  
#>  [65] "my_colors"                "my_list"                  "n"                        "names"                   
#>  [69] "ne.pacific"               "number"                   "OF"                       "outliers"                
#>  [73] "P1"                       "ph"                       "pheno_chr6"               "precip_seasonality"      
#>  [77] "precip_w.crop"            "precip_wettest"           "project"                  "q"                       
#>  [81] "qval"                     "rando_filt"               "salinity"                 "sea_cuc_geno"            
#>  [85] "sea_cuc_lfmm"             "setosa1.petallength"      "setosa1.petalwidth"       "setosa1area2"            
#>  [89] "sites"                    "sites_environ"            "sites_environ_matrix"     "sites_environ_matrix_nas"
#>  [93] "squares"                  "sst_max"                  "sst_max.crop"             "sst_mean"                
#>  [97] "state.abb"                "state.area"               "state.center"             "state.division"          
#> [101] "state.name"               "state.region"             "state.x77"                "states"                  
#> [105] "states_standardized"      "subgen"                   "submeta"                  "subpops"                 
#> [109] "subsleep"                 "tG"                       "tr_msleep"                "vcf.path"                
#> [113] "w"                        "x"                        "y"                        "zs1"                     
#> [117] "zs2"                      "zs3"                      "zs4"                      "zs5"

6.6 EXERCISE 1.1

Open RStudio and perform an arithmetic calculation in the command line.

Solution

#this can be whatever you decide to do!
5*134

#> [1] 670

Create a numeric vector in the command line containing:

the numbers 2, 9, 3, 8, and 3 and assign this vector to a global variable x.

Perform arithmetic with x.

Convince yourself R works as a calculator, and knows order of operations.

Multiply x by 10, and save the result as a new object named y

Calculate the difference in the sum of the x vector and the sum of the y vector

Solution

x <- c(2, 9, 3, 8, 3)
x * 20

#> [1]  40 180  60 160  60

x + 4 * 24

#> [1]  98 105  99 104  99

y <- x * 10
sum(x) - sum(y)

#> [1] -225

Call the help files for the functions ls() and rm()

What are the arguments for the ls() function?

What does the ‘sorted’ argument do?

Solution

?ls
#From the help file: sorted is a logical indicating if the resulting character should be sorted alphabetically. Note that this part of ls() may take most of the time.

6.7 1.2 Characterizing a dataframe

We’ll now move from working with objects and vectors to working with dataframes:

Here are a few useful functions:
- install.packages()
- library()
- data()
- str()
- dim()
- colnames() and rownames()
- class()
- as.factor()
- as.numeric()
- unique()
- t()
- max(), min(), mean() and summary()

We’re going to use data on sleep patterns in mammals. This requires installing a package (ggplot2) and loading the data.

Install the package ggplot2. This only has to be done once and after installation we should then comment out the command to install the package with a #.

#install.packages("ggplot2")
#load the package
library(ggplot2)

#> Need help? Try Stackoverflow: https://stackoverflow.com/tags/ggplot2

#> 
#> Attaching package: 'ggplot2'

#> The following object is masked _by_ '.GlobalEnv':
#> 
#>     msleep

Load the data (it’s called msleep).

data("msleep")

There are many functions in R that allow us to get an idea of what the data looks like. For example, what are its dimensions (how many rows and columns)?

# head() -look at the beginning of the data file
# tail() -look at the end of the data file
head(msleep)

#> # A tibble: 6 × 11
#>   name                       genus      vore  order        conservation sleep_total sleep…¹ sleep…² awake  brainwt  bodywt
#>   <chr>                      <chr>      <chr> <chr>        <chr>              <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>
#> 1 Cheetah                    Acinonyx   carni Carnivora    lc                  12.1    NA    NA      11.9 NA        50    
#> 2 Owl monkey                 Aotus      omni  Primates     <NA>                17       1.8  NA       7    0.0155    0.48 
#> 3 Mountain beaver            Aplodontia herbi Rodentia     nt                  14.4     2.4  NA       9.6 NA         1.35 
#> 4 Greater short-tailed shrew Blarina    omni  Soricomorpha lc                  14.9     2.3   0.133   9.1  0.00029   0.019
#> 5 Cow                        Bos        herbi Artiodactyla domesticated         4       0.7   0.667  20    0.423   600    
#> 6 Three-toed sloth           Bradypus   herbi Pilosa       <NA>                14.4     2.2   0.767   9.6 NA         3.85 
#> # … with abbreviated variable names ¹sleep_rem, ²sleep_cycle

tail(msleep)

#> # A tibble: 6 × 11
#>   name                 genus    vore  order        conservation sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
#>   <chr>                <chr>    <chr> <chr>        <chr>              <dbl>     <dbl>       <dbl> <dbl>   <dbl>   <dbl>
#> 1 Tenrec               Tenrec   omni  Afrosoricida <NA>                15.6       2.3      NA       8.4  0.0026   0.9  
#> 2 Tree shrew           Tupaia   omni  Scandentia   <NA>                 8.9       2.6       0.233  15.1  0.0025   0.104
#> 3 Bottle-nosed dolphin Tursiops carni Cetacea      <NA>                 5.2      NA        NA      18.8 NA      173.   
#> 4 Genet                Genetta  carni Carnivora    <NA>                 6.3       1.3      NA      17.7  0.0175   2    
#> 5 Arctic fox           Vulpes   carni Carnivora    <NA>                12.5      NA        NA      11.5  0.0445   3.38 
#> 6 Red fox              Vulpes   carni Carnivora    <NA>                 9.8       2.4       0.35   14.2  0.0504   4.23

# str()
str(msleep)

#> tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
#>  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
#>  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
#>  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
#>  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
#>  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
#>  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
#>  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
#>  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
#>  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
#>  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

dim(), ncol(), nrow() - dimensions, number of columns, number of rows

colnames(), rownames() - column names, row names
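These functions can be tried on a small stand-in data frame. The sketch below builds one by hand from three of the msleep rows printed above, so it runs without ggplot2. The name sleep_small is made up for this example.

```r
# small stand-in data frame (hypothetical name), values copied
# from the first rows of msleep shown above
sleep_small <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  vore        = c("carni", "omni", "herbi"),
  sleep_total = c(12.1, 17, 14.4)
)

dim(sleep_small)       # number of rows, then number of columns
nrow(sleep_small)      # number of rows
ncol(sleep_small)      # number of columns
colnames(sleep_small)  # column names
rownames(sleep_small)  # row names default to "1", "2", "3"
```

Running the same calls on msleep itself should report 83 rows and 11 columns, matching the str() output above.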

RStudio also allows us to just look into the data file with View()

6.8 How to access parts of the data:

We can also look at a single column at a time. There are three ways to access this: $, [,#] or [,“a”].

Quick Tip: Think about “rc cola” or “remote control car” to remember that [5,] means fifth row and [,5] means fifth column!
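The row-then-column order can be checked on a small hand-built data frame. This is a sketch using the made-up name sleep_small, with values copied from the msleep rows printed earlier.

```r
# small stand-in data frame (hypothetical name)
sleep_small <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  vore        = c("carni", "omni", "herbi"),
  sleep_total = c(12.1, 17, 14.4)
)

second_row   <- sleep_small[2, ]   # second row, all columns
third_column <- sleep_small[, 3]   # third column, all rows
one_value    <- sleep_small[2, 3]  # second row, third column

second_row
third_column
one_value
```

With a plain data.frame like this one, a single column selected with [, 3] comes back as a vector. msleep is a tibble, which keeps a one-column table instead.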

Each way has its own advantages:

msleep[,3]

#> # A tibble: 83 × 1
#>    vore 
#>    <chr>
#>  1 carni
#>  2 omni 
#>  3 herbi
#>  4 omni 
#>  5 herbi
#>  6 herbi
#>  7 carni
#>  8 <NA> 
#>  9 carni
#> 10 herbi
#> # … with 73 more rows

msleep[, "vore"]

#> # A tibble: 83 × 1
#>    vore 
#>    <chr>
#>  1 carni
#>  2 omni 
#>  3 herbi
#>  4 omni 
#>  5 herbi
#>  6 herbi
#>  7 carni
#>  8 <NA> 
#>  9 carni
#> 10 herbi
#> # … with 73 more rows

msleep$vore

#>  [1] "carni"   "omni"    "herbi"   "omni"    "herbi"   "herbi"   "carni"   NA        "carni"   "herbi"   "herbi"  
#> [12] "herbi"   "omni"    "herbi"   "omni"    "omni"    "omni"    "carni"   "herbi"   "omni"    "herbi"   "insecti"
#> [23] "herbi"   "herbi"   "omni"    "omni"    "herbi"   "carni"   "omni"    "herbi"   "carni"   "carni"   "herbi"  
#> [34] "omni"    "herbi"   "herbi"   "carni"   "omni"    "herbi"   "herbi"   "herbi"   "herbi"   "insecti" "herbi"  
#> [45] "carni"   "herbi"   "carni"   "herbi"   "herbi"   "omni"    "carni"   "carni"   "carni"   "omni"    NA       
#> [56] "omni"    NA        NA        "carni"   "carni"   "herbi"   "insecti" NA        "herbi"   "omni"    "omni"   
#> [67] "insecti" "herbi"   NA        "herbi"   "herbi"   "herbi"   NA        "omni"    "insecti" "herbi"   "herbi"  
#> [78] "omni"    "omni"    "carni"   "carni"   "carni"   "carni"

Sometimes it is useful to know what class() the column is:

class(msleep$vore)

#> [1] "character"

class(msleep$sleep_total)

#> [1] "numeric"

We can also look at a single row at a time. There are two ways to do this: (1) by giving the row number in square brackets after the name of the dataframe, name[#,], and (2) by calling the name of the row (if your rows have names), name["a",]. msleep has no row names, so the second example below instead selects the row whose name column matches a value.

msleep[43,]

#> # A tibble: 1 × 11
#>   name             genus  vore    order      conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
#>   <chr>            <chr>  <chr>   <chr>      <chr>              <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
#> 1 Little brown bat Myotis insecti Chiroptera <NA>                19.9         2         0.2   4.1 0.00025   0.01

msleep[msleep$name == "Mountain beaver",]

#> # A tibble: 1 × 11
#>   name            genus      vore  order    conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
#>   <chr>           <chr>      <chr> <chr>    <chr>              <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
#> 1 Mountain beaver Aplodontia herbi Rodentia nt                  14.4       2.4          NA   9.6      NA   1.35
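Both ideas can be sketched on a small hand-built data frame (sleep_small is a made-up name, with values copied from the msleep rows printed earlier): selecting rows where a condition is TRUE, and selecting a row by name once row names have been set.

```r
# small stand-in data frame (hypothetical name)
sleep_small <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  vore        = c("carni", "omni", "herbi"),
  sleep_total = c(12.1, 17, 14.4)
)

# keep the rows where the condition on a column is TRUE
long_sleepers <- sleep_small[sleep_small$sleep_total > 14, ]
long_sleepers

# give the rows names, then select one row by its name
rownames(sleep_small) <- sleep_small$name
cheetah_row <- sleep_small["Cheetah", ]
cheetah_row
```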

We can select more than one row or column at a time:

 # see two columns
msleep[,c(1, 6)]

#> # A tibble: 83 × 2
#>    name                       sleep_total
#>    <chr>                            <dbl>
#>  1 Cheetah                           12.1
#>  2 Owl monkey                        17  
#>  3 Mountain beaver                   14.4
#>  4 Greater short-tailed shrew        14.9
#>  5 Cow                                4  
#>  6 Three-toed sloth                  14.4
#>  7 Northern fur seal                  8.7
#>  8 Vesper mouse                       7  
#>  9 Dog                               10.1
#> 10 Roe deer                           3  
#> # … with 73 more rows

 # and make a new data frame from these subsets
subsleep<-msleep[,c(1, 6)]

But what if we actually care about how many unique things are in a column?

 # unique()
unique(msleep[, "order"])

#> # A tibble: 19 × 1
#>    order          
#>    <chr>          
#>  1 Carnivora      
#>  2 Primates       
#>  3 Rodentia       
#>  4 Soricomorpha   
#>  5 Artiodactyla   
#>  6 Pilosa         
#>  7 Cingulata      
#>  8 Hyracoidea     
#>  9 Didelphimorphia
#> 10 Proboscidea    
#> 11 Chiroptera     
#> 12 Perissodactyla 
#> 13 Erinaceomorpha 
#> 14 Cetacea        
#> 15 Lagomorpha     
#> 16 Diprotodontia  
#> 17 Monotremata    
#> 18 Afrosoricida   
#> 19 Scandentia

 # table()
table(msleep$order)

#> 
#>    Afrosoricida    Artiodactyla       Carnivora         Cetacea      Chiroptera       Cingulata Didelphimorphia 
#>               1               6              12               3               2               2               2 
#>   Diprotodontia  Erinaceomorpha      Hyracoidea      Lagomorpha     Monotremata  Perissodactyla          Pilosa 
#>               2               2               3               1               1               3               1 
#>        Primates     Proboscidea        Rodentia      Scandentia    Soricomorpha 
#>              12               2              22               1               5

 # levels(), if class is factor (and if not we can make it a factor)
levels(as.factor(msleep$order))

#>  [1] "Afrosoricida"    "Artiodactyla"    "Carnivora"       "Cetacea"         "Chiroptera"      "Cingulata"      
#>  [7] "Didelphimorphia" "Diprotodontia"   "Erinaceomorpha"  "Hyracoidea"      "Lagomorpha"      "Monotremata"    
#> [13] "Perissodactyla"  "Pilosa"          "Primates"        "Proboscidea"     "Rodentia"        "Scandentia"     
#> [19] "Soricomorpha"
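To get a count rather than a list, unique() can be wrapped in length(). The vector below is a short made-up example, not the real msleep column, so that the block runs on its own.

```r
# hypothetical vector of taxonomic orders
orders <- c("Carnivora", "Primates", "Rodentia", "Carnivora",
            "Rodentia", "Rodentia", "Pilosa")

# how many different values are there?
n_orders <- length(unique(orders))
n_orders

# how often does each value occur?
order_counts <- table(orders)
order_counts

# pull out the count for one value by name
order_counts[["Rodentia"]]
```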

6.9 Data Manipulation

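If your data is arranged in a way that isn’t useful to you (for example, rows where you want columns), you can transpose it with t(). Note that this often changes the class of the data!

t() returns a matrix, and every element of a matrix must have the same type of data, so a mix of character and numeric columns is converted to character: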

 # t()
tr_msleep<-t(msleep)
str(tr_msleep)

#>  chr [1:11, 1:83] "Cheetah" "Acinonyx" "carni" "Carnivora" "lc" "12.1" NA NA "11.90" NA "  50.000" "Owl monkey" ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:11] "name" "genus" "vore" "order" ...
#>   ..$ : NULL
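One way around the conversion to character, sketched here on a small hand-built data frame with the made-up name sleep_small, is to transpose only the numeric columns.

```r
# small stand-in data frame (hypothetical name)
sleep_small <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  sleep_total = c(12.1, 17, 14.4),
  awake       = c(11.9, 7, 9.6)
)

# transposing everything: the character column forces
# every value to become a character
t_all <- t(sleep_small)
typeof(t_all)

# transposing only the numeric columns keeps them numeric
t_num <- t(sleep_small[, c("sleep_total", "awake")])
typeof(t_num)
dim(t_num)
```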

It’s important to know the class of data if you want to manipulate it. For example, you can’t add characters. msleep contains several different types of data.

Some common classes are: factors, numeric, integers, characters, logical
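A quick sketch of what each of these classes looks like, using single made-up values rather than the msleep data:

```r
class(12.1)              # numeric
class(5L)                # integer (the L marks a whole number)
class("carni")           # character
class(TRUE)              # logical
class(factor("carni"))   # factor

# a number stored as text must be converted before doing math
text_number <- "12.1"
converted <- as.numeric(text_number) + 1
converted
```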

 # class()
class(msleep)

#> [1] "tbl_df"     "tbl"        "data.frame"

 # str()
str(msleep)

#> tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
#>  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
#>  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
#>  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
#>  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
#>  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
#>  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
#>  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
#>  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
#>  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
#>  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
#>  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

Often we want to summarize data. There are many ways of doing this in R:

 # calculate mean() of a column
mean(msleep$sleep_total)

#> [1] 10.43373

 # max()
max(msleep$sleep_total)

#> [1] 19.9

 # min()
min(msleep$sleep_total)

#> [1] 1.9

 # summary()
summary(msleep$sleep_total)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.90    7.85   10.10   10.43   13.75   19.90
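These summaries work directly on sleep_total because it has no missing values. Columns such as sleep_rem contain NA, and there mean(), max() and min() return NA unless told to skip the missing values. The sketch below uses the first ten sleep_rem values shown in the str() output above, stored in a made-up vector called rem.

```r
# first ten sleep_rem values from the str(msleep) output above
rem <- c(NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA)

# any NA in the input makes the result NA
mean(rem)

# na.rm = TRUE removes the missing values first
rem_mean <- mean(rem, na.rm = TRUE)
rem_mean

# count how many values are missing
n_missing <- sum(is.na(rem))
n_missing
```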

Sometimes, the values we care about aren’t provided in a data set. When this happens, we can create a new column that contains the values we’re interested in:

  # what if what we cared about was our sleep_total/sleep_rem ratio?
  # add a sleep_total/sleep_rem ratio column to our msleep dataframe with $
msleep$total_rem<-msleep$sleep_total/msleep$sleep_rem
  # look at our dataframe again. It now contains 12 columns, one of them being the one we just created.
head(msleep)

#> # A tibble: 6 × 12
#>   name                       genus      vore  order        conser…¹ sleep…² sleep…³ sleep…⁴ awake  brainwt  bodywt total…⁵
#>   <chr>                      <chr>      <chr> <chr>        <chr>      <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
#> 1 Cheetah                    Acinonyx   carni Carnivora    lc          12.1    NA    NA      11.9 NA        50       NA   
#> 2 Owl monkey                 Aotus      omni  Primates     <NA>        17       1.8  NA       7    0.0155    0.48     9.44
#> 3 Mountain beaver            Aplodontia herbi Rodentia     nt          14.4     2.4  NA       9.6 NA         1.35     6   
#> 4 Greater short-tailed shrew Blarina    omni  Soricomorpha lc          14.9     2.3   0.133   9.1  0.00029   0.019    6.48
#> 5 Cow                        Bos        herbi Artiodactyla domesti…     4       0.7   0.667  20    0.423   600        5.71
#> 6 Three-toed sloth           Bradypus   herbi Pilosa       <NA>        14.4     2.2   0.767   9.6 NA         3.85     6.55
#> # … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle, ⁵total_rem

6.10 EXERCISE 1.2

Reminder of those useful commands: dataframename[row , col], str(), dim(), nrow(), unique(), length(), rownames(), summary(), min(), max(), mean(), range(), levels(), factor(), as.factor(), class(), ncol(), table(), sum(), quantile(), var()

We’ll use the built-in ‘iris’ dataset. The command data(iris) loads the ‘iris’ dataset. You can view more information about this dataset with help(iris) or ?iris

How many rows are in the dataset?

Solution

data(iris)
nrow(iris)

#> [1] 150

What are three distinct ways to figure this out?

Solution

# nrow(iris)
# str(iris)
# dim(iris)

How many species of flowers are in the dataset?

Solution

levels(iris$Species)

#> [1] "setosa"     "versicolor" "virginica"

What class is iris?

Solution

class(iris)

#> [1] "data.frame"

How many columns does this data frame have? What are their names?

Solution

colnames(iris)

#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

What class did R assign to each column?

Solution

str(iris)

#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Assign the first flower’s petal’s width and length to new objects called setosa1.petalwidth and setosa1.petallength

Solution

# Petal.Length is column 3 and Petal.Width is column 4 of iris
setosa1.petalwidth<-iris[1,4]
setosa1.petallength<-iris[1,3]

Calculate the approximate area of the petal of the first flower, setosa1 (assume petal area can be approximated by a rectangle).

Solution

#using our premade objects
setosa1area2<-setosa1.petalwidth*setosa1.petallength

Calculate the petal area of each flower in the iris dataset and assign this to a new column named PetalArea.

Solution

iris$PetalArea<-iris$Petal.Length*iris$Petal.Width

What is the maximum sepal length of the irises?

Solution

max(iris$Sepal.Length)

#> [1] 7.9

What is the average sepal length among all flowers in the dataset?

Solution

mean(iris$Sepal.Length)

#> [1] 5.843333

How about the minimum and median sepal length?

Solution

min(iris$Sepal.Length)

#> [1] 4.3

median(iris$Sepal.Length)

#> [1] 5.8

6.11 1.3 Subsetting datasets & logicals

A few useful commands: equals ==, does not equal !=, greater than >, less than <, and &, and a pipe |, which indicates “or”.

Reminder: there are two assignment operators in R, <- and a single equals sign =. Which one you use really depends on how you learned to use R; for simple assignments like the ones in this lesson they are equivalent.

Logical conditions vs. assignment operators:
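A minimal sketch of the difference, using a made-up object called sleep_hours: a single = (or <-) stores a value, while a double == asks a question and returns TRUE or FALSE without changing anything.

```r
# assignment: store the value 5 (hypothetical object name)
sleep_hours <- 5

# comparison: is the stored value equal to 5? to 6?
is_five <- sleep_hours == 5
is_six  <- sleep_hours == 6
is_five
is_six

# the comparisons did not change the stored value
sleep_hours
```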

Logical values of TRUE and FALSE are special in R. What class is a logical value?

TRUE

#> [1] TRUE

FALSE

#> [1] FALSE

# what class is a logical value?
class(TRUE)

#> [1] "logical"

In arithmetic, logical values are treated as 0 for FALSE and 1 for TRUE, which means you can do math with them!

TRUE + 1

#> [1] 2

FALSE + 1

#> [1] 1

sum(c(TRUE,TRUE,FALSE,FALSE))

#> [1] 2

!TRUE

#> [1] FALSE

!c(TRUE,TRUE,FALSE,FALSE)

#> [1] FALSE FALSE  TRUE  TRUE

Logicals will be the output of various tests:

1 == 1

#> [1] TRUE

1 == 2

#> [1] FALSE

 # does not equal
1 != 1

#> [1] FALSE

1 != 2

#> [1] TRUE

 # greater than
1 > 1

#> [1] FALSE

1 >= 1

#> [1] TRUE

 # less than
1 < 3

#> [1] TRUE

 # combining logical conditions with and (&), or(|)
1 == 1 & 2 == 2

#> [1] TRUE

1 == 1 & 1 == 2

#> [1] FALSE

1 == 1 | 1 == 2

#> [1] TRUE

 # we can take the opposite of a logical by using !
!TRUE

#> [1] FALSE

This is very useful because we can use logicals to query a data frame or vector.

 # Which numbers in 1:10 are greater than 3?
1:10 > 3

#>  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

 # How many numbers in 1:10 are greater than 3?
sum(1:10 > 3)

#> [1] 7
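
A logical vector can also go inside square brackets to pull out the matching values. A minimal sketch, using a small made-up vector (nums is hypothetical, not part of the lesson data):

```r
nums <- c(12, 3, 8, 1, 20)

# keep only the elements where the condition is TRUE
nums[nums > 5]
#> [1] 12  8 20

# which() returns the positions of the TRUE values instead of the values themselves
which(nums > 5)
#> [1] 1 3 5
```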

# in our msleep data frame, which species have total sleep greater than 18 hours?
# reload the msleep data with library(ggplot2) and data(msleep) if you need to
msleep[,"sleep_total"]>18

#>       sleep_total
#>  [1,]       FALSE
#>  [2,]       FALSE
#>  [3,]       FALSE
#>  [4,]       FALSE
#>  [5,]       FALSE
#>  [6,]       FALSE
#>  [7,]       FALSE
#>  [8,]       FALSE
#>  [9,]       FALSE
#> [10,]       FALSE
#> [11,]       FALSE
#> [12,]       FALSE
#> [13,]       FALSE
#> [14,]       FALSE
#> [15,]       FALSE
#> [16,]       FALSE
#> [17,]       FALSE
#> [18,]       FALSE
#> [19,]       FALSE
#> [20,]       FALSE
#> [21,]       FALSE
#> [22,]        TRUE
#> [23,]       FALSE
#> [24,]       FALSE
#> [25,]       FALSE
#> [26,]       FALSE
#> [27,]       FALSE
#> [28,]       FALSE
#> [29,]       FALSE
#> [30,]       FALSE
#> [31,]       FALSE
#> [32,]       FALSE
#> [33,]       FALSE
#> [34,]       FALSE
#> [35,]       FALSE
#> [36,]       FALSE
#> [37,]        TRUE
#> [38,]       FALSE
#> [39,]       FALSE
#> [40,]       FALSE
#> [41,]       FALSE
#> [42,]       FALSE
#> [43,]        TRUE
#> [44,]       FALSE
#> [45,]       FALSE
#> [46,]       FALSE
#> [47,]       FALSE
#> [48,]       FALSE
#> [49,]       FALSE
#> [50,]       FALSE
#> [51,]       FALSE
#> [52,]       FALSE
#> [53,]       FALSE
#> [54,]       FALSE
#> [55,]       FALSE
#> [56,]       FALSE
#> [57,]       FALSE
#> [58,]       FALSE
#> [59,]       FALSE
#> [60,]       FALSE
#> [61,]       FALSE
#> [62,]        TRUE
#> [63,]       FALSE
#> [64,]       FALSE
#> [65,]       FALSE
#> [66,]       FALSE
#> [67,]       FALSE
#> [68,]       FALSE
#> [69,]       FALSE
#> [70,]       FALSE
#> [71,]       FALSE
#> [72,]       FALSE
#> [73,]       FALSE
#> [74,]       FALSE
#> [75,]       FALSE
#> [76,]       FALSE
#> [77,]       FALSE
#> [78,]       FALSE
#> [79,]       FALSE
#> [80,]       FALSE
#> [81,]       FALSE
#> [82,]       FALSE
#> [83,]       FALSE

 # Using which() to identify which rows match the logical values (TRUE) and length to count how many species there are
which(msleep[,"sleep_total"]>18)  #22 37 43 62 --> the rows that contain organisms that sleep more than 18 hrs

#> [1] 22 37 43 62

length(which(msleep[,"sleep_total"]>18)) #4 --> number of species that sleep more than 18 hrs

#> [1] 4

 # which four species are these?
msleep[which(msleep[,"sleep_total"]>18),]

#> # A tibble: 4 × 12
#>   name                 genus      vore    order           conservat…¹ sleep…² sleep…³ sleep…⁴ awake brainwt bodywt total…⁵
#>   <chr>                <chr>      <chr>   <chr>           <chr>         <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>   <dbl>
#> 1 Big brown bat        Eptesicus  insecti Chiroptera      lc             19.7     3.9   0.117   4.3  3  e-4  0.023    5.05
#> 2 Thick-tailed opposum Lutreolina carni   Didelphimorphia lc             19.4     6.6  NA       4.6 NA       0.37     2.94
#> 3 Little brown bat     Myotis     insecti Chiroptera      <NA>           19.9     2     0.2     4.1  2.5e-4  0.01     9.95
#> 4 Giant armadillo      Priodontes insecti Cingulata       en             18.1     6.1  NA       5.9  8.1e-2 60        2.97
#> # … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle, ⁵total_rem

# what if we only want to see the bats that sleep more than 18 hours per 24 hour period?
msleep[which(msleep[,"sleep_total"]>18 & msleep[,"order"] == "Chiroptera"),]

#> # A tibble: 2 × 12
#>   name             genus     vore    order      conservation sleep_total sleep_rem sleep_cy…¹ awake brainwt bodywt total…²
#>   <chr>            <chr>     <chr>   <chr>      <chr>              <dbl>     <dbl>      <dbl> <dbl>   <dbl>  <dbl>   <dbl>
#> 1 Big brown bat    Eptesicus insecti Chiroptera lc                  19.7       3.9      0.117   4.3 0.0003   0.023    5.05
#> 2 Little brown bat Myotis    insecti Chiroptera <NA>                19.9       2        0.2     4.1 0.00025  0.01     9.95
#> # … with abbreviated variable names ¹sleep_cycle, ²total_rem
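
One reason to wrap a condition in which() is that it ignores missing values. The sketch below uses a small made-up data frame (toy and its values are hypothetical, and it is a plain data.frame rather than a tibble like msleep) to show the difference when a measurement is NA:

```r
# toy data frame (hypothetical values) with one missing sleep measurement
toy <- data.frame(name = c("bat", "cow", "cat", "owl"),
                  sleep_total = c(19.7, 4.0, NA, 18.5))

# a bare logical index keeps the NA as a row filled with NAs
by_logical <- toy[toy[,"sleep_total"] > 18, ]
nrow(by_logical)
#> [1] 3

# which() drops the NA and returns only the positions that are TRUE
by_which <- toy[which(toy[,"sleep_total"] > 18), ]
nrow(by_which)
#> [1] 2
```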

6.12 EXERCISE 1.3 indexing by logical statements

A few useful commands: “==”, “!=”, “>”, “<”, “&”, “|”, sum(), which(), table(), !

Create your own logical vector with three TRUEs and three FALSEs

Solution

a = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
a ## let's print to screen and make sure it is stored in this variable

#> [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

Produce a vector of the index numbers of the TRUE values

Solution

which(a)  ## which gives you the index of TRUE values automatically

#> [1] 1 2 5

which(a == TRUE)   ## but sometimes it's reassuring to state exactly what you're doing

#> [1] 1 2 5

Produce a second vector containing the index numbers of the FALSE values

Solution

which(!a)

#> [1] 3 4 6

which(a == FALSE)

#> [1] 3 4 6

Go back to the iris dataset, which can be loaded with data(iris).

How many irises have sepals shorter than 5.5 cm?

Solution

data(iris)  ## this reloads the data set in case you've closed R since using iris
sum(iris[,'Sepal.Length']<5.5)  ## remember TRUE's are 1 and FALSE's are 0

#> [1] 52

length(which(iris[,'Sepal.Length']<5.5))  ## here, which() will only return the index of TRUE values, so we're counting how many there are

#> [1] 52

Which iris individual has the largest petal length? What is the width of its petal?

Solution

max(iris[,'Petal.Length'])  ## this gives us the length of the longest petal

#> [1] 6.9

which(iris[,'Petal.Length'] == max(iris[,'Petal.Length']))  ## this gives us the index of the individual with the longest petal

#> [1] 119

iris[,'Petal.Width'][which(iris[,'Petal.Length'] == max(iris[,'Petal.Length']))] ## now we're subsetting the Petal.Width column by the index of the individual with the longest petal

#> [1] 2.3

## another way to do this would be to use the index of the individual with the longest petal to pick rows, and the Petal.Width name to pick columns and subset the entire data frame
iris[which(iris[,'Petal.Length'] == max(iris[,'Petal.Length'])) , 'Petal.Width']

#> [1] 2.3
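
As a shorter alternative sketch (which.max() is base R, but it is not used in the original lesson), the position of the largest value can be found in a single call. Note that which.max() returns only the first position if several values tie for the maximum, so the which(... == max(...)) form is safer when ties matter.

```r
# which.max() returns the position of the (first) largest value
longest <- which.max(iris[,'Petal.Length'])
longest
#> [1] 119

# use that position as the row index to look up the petal width
iris[longest, 'Petal.Width']
#> [1] 2.3
```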

How many of the irises in this dataset belong to the species versicolor?

Solution

sum(iris[,'Species']=='versicolor')

#> [1] 50

table(iris[,'Species']) ## this gets us all three species

#> 
#>     setosa versicolor  virginica 
#>         50         50         50

How many irises have petals longer than 6cm?

Solution

sum(iris[,'Petal.Length'] > 6)

#> [1] 9

Create a vector of species names for each iris with sepals longer than 6cm.

Solution

iris[,'Species'][iris[,'Sepal.Length']>6]

#>  [1] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
#> [11] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
#> [21] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [31] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [41] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [51] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [61] virginica 
#> Levels: setosa versicolor virginica

iris[iris[,'Sepal.Length']>6, 'Species'] ## alternatively, we can put the logical vector in the row part, and Species in the column part, to get a vector back

#>  [1] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
#> [11] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
#> [21] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [31] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [41] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [51] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
#> [61] virginica 
#> Levels: setosa versicolor virginica

How many irises have sepals shorter than 5cm, but wider than 3cm?

Solution

sum( iris[,'Sepal.Length'] < 5 & iris[,'Sepal.Width'] > 3 )

#> [1] 13

How many irises have petals narrower than 0.2cm or shorter than 1.5cm?

Solution

sum( iris[,'Petal.Width'] < 0.2 | iris[,'Petal.Length'] < 1.5 )

#> [1] 26

What is the average width of setosa iris sepals that are longer than 5cm?

Solution

mean( iris[,'Sepal.Width'][iris[,'Sepal.Length'] > 5 & iris[,'Species']=='setosa']) ## combine both conditions with & into one logical vector, then use it to subset iris[,'Sepal.Width']

#> [1] 3.713636

mean( iris[iris[,'Sepal.Length'] > 5 & iris[,'Species']=='setosa', 'Sepal.Width']) ## again, we can alternatively subset using the logical vector in the row position

#> [1] 3.713636
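
When two conditions apply to the same rows, it is safest to combine them with & into one logical vector before subsetting. A minimal sketch with toy vectors (hypothetical values, not taken from iris) shows why: once a vector has been shortened by a first subset, a second full-length logical vector no longer lines up with it.

```r
# toy vectors (hypothetical values), one entry per flower
width   <- c(3.0, 3.5, 3.9, 2.8)
len     <- c(4.8, 5.2, 5.4, 6.0)
species <- c("setosa", "setosa", "setosa", "versicolor")

# one combined logical vector: positions line up with the original data
combined <- mean(width[len > 5 & species == "setosa"])
combined
#> [1] 3.7

# chained subsets: the second index is applied to an already shortened vector,
# so the versicolor flower slips into the average
chained <- mean(width[len > 5][species == "setosa"])
chained
#> [1] 3.4
```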

What is the difference between the longest and shortest petal lengths of the species virginica?

Solution

max(iris[,'Petal.Length'][iris[,'Species']=='virginica']) - min(iris[,'Petal.Length'][iris[,'Species']=='virginica'])

#> [1] 2.4
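
The same difference can be written a little more compactly. As a sketch (range() and diff() are base R functions that are not used in the original lesson, and virginica_petals is a name chosen here for illustration), storing the subset once avoids repeating the condition:

```r
# store the subset once so the condition is not repeated
virginica_petals <- iris[,'Petal.Length'][iris[,'Species']=='virginica']

# range() returns the minimum and maximum together
range(virginica_petals)
#> [1] 4.5 6.9

# diff() subtracts the first value from the second
diff(range(virginica_petals))
#> [1] 2.4
```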

What proportion of flowers in the dataset have petals wider than 1cm?

Solution

sum(iris[,'Petal.Width'] > 1 ) / nrow(iris) ## here, we're counting up how many are wider than 1 cm, and dividing by the total number of flowers to get a proportion

#> [1] 0.62
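
Because TRUE counts as 1 and FALSE as 0, the average of a logical vector is the proportion of TRUE values. A short sketch of this shortcut (an alternative to the solution above, not part of the original script):

```r
# mean() of a logical vector gives the proportion of TRUE values
prop_wide <- mean(iris[,'Petal.Width'] > 1)
prop_wide
#> [1] 0.62
```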

Create a new column within your data frame, called sepalCategory, and set all values equal to ‘long’. Subset the short values of this column and set them to ‘short’ (short sepals are those less than 5.5 cm). How many plants with short sepals are there? How many long?

Solution

# new column, with every value set to 'long'
iris[,'sepalCategory'] = 'long'  ## this sets every entry in the column equal to 'long'
# relabel the short sepals (< 5.5 cm)
iris[,'sepalCategory'][iris[,'Sepal.Length']<5.5] = 'short'  ## this sets only those entries that match our condition to 'short'
# how many plants with short sepals are there? How many long?
table(iris[,'sepalCategory'])

#> 
#>  long short 
#>    98    52
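
The two-step labeling can also be done in one line. As a sketch (ifelse() is a base R function that the original lesson does not use here), the first argument is the test, the second is the value where the test is TRUE, and the third is the value where it is FALSE:

```r
# label each flower in one step: 'short' where the test is TRUE, 'long' elsewhere
iris[,'sepalCategory'] <- ifelse(iris[,'Sepal.Length'] < 5.5, 'short', 'long')

# the counts match the two-step approach
category_counts <- table(iris[,'sepalCategory'])
category_counts
```

The two-step version makes the subsetting logic explicit, which is the point of this exercise; ifelse() is mainly a convenience once that logic is familiar.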