6 Week 5- Welcome to the R environment!

This lesson is modified from materials compiled by Serena Caplins from the STEMinist_R lessons, which were produced by several UC Davis graduate students and can be found here. The lecture slides that introduce R in more depth are here: Week 5 Slides

6.1 Main Objectives

Learn how to use RStudio through Farm OnDemand
Understand the different parts of the RStudio window
Learn how to work with objects in R and general R operations
Learn how to manipulate data and explore a dataframe

6.2 Orientation to R

First, let’s navigate to RStudio on Farm OnDemand! Launch your session with the information shown below: give yourself 2 for memory and make sure your conda environment is set to r-4.4.2.

Fill out this information

After we start it up we will want to create a new file called “Week5-IntroR.R” to type our scripts for today.

To do this, click File, then New File, then R Script. Name your file with your name-Week5-introR

Putting a # before a sentence turns it into a comment, so you can write notes alongside your code. Without a #, R will think what you are writing needs to be run as a command! I have used comments in this example to write what the script is for.

The top left is the script, the bottom left is the console where code and outputs appear, and the top right is your environment window.

R can be used for basic arithmetic. Type this into your script and run it (Ctrl+Enter, or Cmd+Return on a Mac):

#adding prompt
5+10+23

## [1] 38

It can also store values in variables:

You can assign an object using an assignment operator <- or =.

#storing variables
number<-10

numbers<-c(10, 11, 12, 14, 16)

You can see your assigned object by typing the name you gave it.

#assigned objects
number

## [1] 10

numbers

## [1] 10 11 12 14 16

Objects can be numbers or characters:

#objects as characters
cat<-"meow"
dog<-"woof"

We can use colons to get sequences of numbers:

#sequence of numbers
n<-1:100
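
The colon always counts up by 1. As a small sketch beyond what we typed above, the base R function seq() lets us choose the step size or the number of values instead. The object names below are made up for illustration.

```r
# colon operator: every whole number from 1 to 10
one_to_ten <- 1:10

# seq() with by= lets us pick the step size (here, every other number)
odds <- seq(from = 1, to = 9, by = 2)

# seq() with length.out= asks for a set number of evenly spaced values
quarters <- seq(from = 0, to = 1, length.out = 5)

one_to_ten
odds       # 1 3 5 7 9
quarters   # 0.00 0.25 0.50 0.75 1.00
```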

Data Structures include:

Vector
Lists
Matrices
Factor
Data frame

Vectors can also include characters (in quotes): c()=concatenate, aka link things together!

#link things together with this code
animals<-c("woof", "meow", "hiss", "baa")

6.3 Manipulating a vector object

We can get summaries of vectors with summary()

#summarize what you have done so far
summary(n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   25.75   50.50   50.50   75.25  100.00

We can see how long a vector is with length()

#how long is vector n
length(n)

## [1] 100

You can use square brackets [] to get parts of vectors.

#what is the 50th entry in our vector n?
n[50]

## [1] 50
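
Square brackets can pull out more than one entry at a time. Here is a short sketch using a made-up vector called heights (not part of our lesson data) to show a few common patterns.

```r
# a small example vector (values made up for illustration)
heights <- c(150, 162, 171, 180, 193)

heights[2]          # a single entry: 162
heights[c(1, 5)]    # several entries at once: 150 193
heights[2:4]        # a range of entries: 162 171 180
heights[-1]         # a negative index drops that entry: 162 171 180 193
```

Combining c() or a colon sequence with [] is the same idea we will use later to pick several columns of a dataframe.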

Class Exercise

What is the 2nd entry in our vector animals?

Create a new vector with the following code for different field sites to answer questions 2 and 3:

site_code <- c("BB","ShC","BH","Pes","StC","MR","PV","KH","CM","RP","Car","SR","RM","Dav","JB","Mal","PB","PA","Leu","VD","A")

Without counting, which site is the 10th entry?

How many sites are there total?

6.4 Operations act on each element of a vector:

# +2
numbers+2

## [1] 12 13 14 16 18

# *2
numbers*2

## [1] 20 22 24 28 32

# mean
mean(numbers)

## [1] 12.6

# ^2
numbers^2

## [1] 100 121 144 196 256

# sum
sum(numbers)

## [1] 63

6.5 Operations can also work with two vectors:

#define a new object y
y<-numbers*2

# numbers + y
numbers + y

## [1] 30 33 36 42 48

# numbers * y
numbers * y

## [1] 200 242 288 392 512
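
The examples above work entry by entry because both vectors have the same length. As a hedged side note, when the lengths differ R reuses ("recycles") the shorter vector, which can hide mistakes. The vectors below are made up to show both cases.

```r
# two made-up vectors of the same length
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b   # first with first, second with second, ...: 11 22 33
b / a   # 10 10 10

# different lengths: the shorter vector is reused from the start
long  <- c(1, 2, 3, 4)
short <- c(10, 20)
long + short   # 1+10, 2+20, 3+10, 4+20: 11 22 13 24

# checking lengths first is a simple way to avoid surprises
length(a) == length(b)   # TRUE
```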

6.6 A few tips below for working with objects:

We can keep track of what objects R is using, with the functions ls() and objects()

ls()

## [1] "animals"   "cat"       "dog"       "n"         "number"    "numbers"  
## [7] "site_code" "y"

objects() # objects() is another name for ls(), so it returns the same result

## [1] "animals"   "cat"       "dog"       "n"         "number"    "numbers"  
## [7] "site_code" "y"

This is where those objects show up when you type ls(): it returns a list of what is present in our R environment

# how to get help for a function; you can also write help()
?ls

# you can get rid of objects you don't want

rm(numbers)

# and make sure it got rid of them
ls()

## [1] "animals"   "cat"       "dog"       "n"         "number"    "site_code"
## [7] "y"

After removal

Call the help files for the functions ls() and rm().
- What are the arguments for the ls() function?
- What does the ‘sorted’ argument do?

From the help file: sorted is a logical indicating if the resulting character should be sorted alphabetically. Note that this is part of ls() may take most of the time.
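
If you want to see a function's arguments without reading the whole help page, base R's args() prints them. The sketch below also tries the sorted argument on two throwaway objects (zebra and apple are made-up names).

```r
# print the arguments of ls() without opening the help page
args(ls)

# two throwaway objects so we have something to list
zebra <- 1
apple <- 2

# sorted = TRUE is the default: names come back in alphabetical order
ls()

# sorted = FALSE skips the sorting step, so the order is not guaranteed
ls(sorted = FALSE)

# tidy up when we are done
rm(zebra, apple)
```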

6.7 Characterizing a dataframe

We’ll now move from working with objects and vectors to working with dataframes:

Here are a few useful functions! I will go over each as we introduce them throughout the lesson today:
- install.packages()
- library()
- data()
- str()
- dim()
- colnames() and rownames()
- class()
- as.factor()
- as.numeric()
- unique()
- t()
- max(), min(), mean() and summary()

We’re going to use data on sleep patterns in mammals. This requires installing a package (ggplot2) and loading the data

Install the package ggplot2. This only has to be done once and after installation we should then comment out the command to install the package with a #.

#install.packages("ggplot2")

#load the package

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.1

Load the data (it’s called msleep). This dataset includes information about mammal sleep times and weights that was taken from a study by V. M. Savage and G. B. West: “A quantitative, theoretical framework for understanding mammalian sleep.” Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.

The data includes name (common name), genus, vore (carnivore, omnivore, etc), order, conservation (status), sleep_total (total amount of sleep in hours), sleep_rem (rem sleep in hours), sleep_cycle (length of sleep cycle, in hours), awake (amount of time spent awake, in hours), brainwt (brain weight in kilograms) and bodywt (body weight in kilograms)

data("msleep")

There are many functions in R that allow us to get an idea of what the data looks like. For example, what are its dimensions (how many rows and columns)?

# head() -look at the beginning of the data file
# tail() -look at the end of the data file

head(msleep)

## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

tail(msleep)

## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Tenrec  Tenr… omni  Afro… <NA>                15.6       2.3      NA       8.4
## 2 Tree s… Tupa… omni  Scan… <NA>                 8.9       2.6       0.233  15.1
## 3 Bottle… Turs… carni Ceta… <NA>                 5.2      NA        NA      18.8
## 4 Genet   Gene… carni Carn… <NA>                 6.3       1.3      NA      17.7
## 5 Arctic… Vulp… carni Carn… <NA>                12.5      NA        NA      11.5
## 6 Red fox Vulp… carni Carn… <NA>                 9.8       2.4       0.35   14.2
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

# str()
str(msleep)

## tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
##  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
##  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
##  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

dim(), ncol(), nrow() - dimensions, number of columns, number of rows
colnames(), rownames() - column names, row names
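
To see these functions in action without giving away anything about msleep, here is a minimal sketch on a tiny made-up data frame called pets (the names and values are invented for practice).

```r
# a tiny made-up data frame, just for practice (not part of msleep)
pets <- data.frame(
  name   = c("Biscuit", "Mochi", "Pepper", "Clover"),
  sound  = c("woof", "meow", "meow", "baa"),
  weight = c(22.5, 4.1, 3.8, 61.0)
)

dim(pets)       # rows first, then columns: 4 3
nrow(pets)      # 4
ncol(pets)      # 3
colnames(pets)  # "name" "sound" "weight"
rownames(pets)  # "1" "2" "3" "4" (the default row names)
```

Try the same five functions on msleep and compare what you get with the str() output above.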

RStudio also allows us to just look into the data file with View(). Try to look at the msleep data using View(msleep)

6.8 How to access parts of the data:

We can also look at a single column at a time. There are three ways to access this: $, [,#] or [,“a”].

Think about “Remote Control car” to remember that [5,] means fifth row and [,5] means fifth column! Rows are listed first and columns are listed second.

Each way has its own advantages! In the code below, the first subsets the third column by position, so you need to know where your data of interest is. The second subsets the vore column by name. The third, with $, prints all of the data from the vore column in your console window as a plain vector.

msleep[,3]

## # A tibble: 83 × 1
##    vore 
##    <chr>
##  1 carni
##  2 omni 
##  3 herbi
##  4 omni 
##  5 herbi
##  6 herbi
##  7 carni
##  8 <NA> 
##  9 carni
## 10 herbi
## # ℹ 73 more rows

msleep[, "vore"]

## # A tibble: 83 × 1
##    vore 
##    <chr>
##  1 carni
##  2 omni 
##  3 herbi
##  4 omni 
##  5 herbi
##  6 herbi
##  7 carni
##  8 <NA> 
##  9 carni
## 10 herbi
## # ℹ 73 more rows

msleep$vore

##  [1] "carni"   "omni"    "herbi"   "omni"    "herbi"   "herbi"   "carni"  
##  [8] NA        "carni"   "herbi"   "herbi"   "herbi"   "omni"    "herbi"  
## [15] "omni"    "omni"    "omni"    "carni"   "herbi"   "omni"    "herbi"  
## [22] "insecti" "herbi"   "herbi"   "omni"    "omni"    "herbi"   "carni"  
## [29] "omni"    "herbi"   "carni"   "carni"   "herbi"   "omni"    "herbi"  
## [36] "herbi"   "carni"   "omni"    "herbi"   "herbi"   "herbi"   "herbi"  
## [43] "insecti" "herbi"   "carni"   "herbi"   "carni"   "herbi"   "herbi"  
## [50] "omni"    "carni"   "carni"   "carni"   "omni"    NA        "omni"   
## [57] NA        NA        "carni"   "carni"   "herbi"   "insecti" NA       
## [64] "herbi"   "omni"    "omni"    "insecti" "herbi"   NA        "herbi"  
## [71] "herbi"   "herbi"   NA        "omni"    "insecti" "herbi"   "herbi"  
## [78] "omni"    "omni"    "carni"   "carni"   "carni"   "carni"

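If you wanted to save these as objects, you need to add an arrow and a new name for that object. They should all contain the same values! Note that the first two are saved as one-column tibbles, while the $ version is saved as a character vector.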

column3<-msleep[,3]
voreonly<-msleep[, "vore"]
vores<-msleep$vore

head(column3) #do this or View() for all of your new objects!

## # A tibble: 6 × 1
##   vore 
##   <chr>
## 1 carni
## 2 omni 
## 3 herbi
## 4 omni 
## 5 herbi
## 6 herbi

It’s important to know the class of data if you want to manipulate it. For example, you can’t add characters. msleep contains several different types of data. We see with str() that some columns are character and others are numeric.

Data Types/Classes:

Character (names)
Numeric (numbers)
Logical (T/F)
Integer (2L for example)
Complex (imaginary #s)
Raw (not really used)

class(msleep$vore) #character!

## [1] "character"

class(msleep$sleep_total) #numeric!

## [1] "numeric"
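
Two functions from our list, as.numeric() and as.factor(), change the class of an object. The sketch below uses short made-up vectors rather than msleep so you can see the class before and after.

```r
# numbers that were stored as text (made-up values)
hours_text <- c("12.1", "17", "14.4")
class(hours_text)          # "character"

# convert to numeric so we can do math with them
hours <- as.numeric(hours_text)
class(hours)               # "numeric"
mean(hours)                # 14.5

# convert a character vector to a factor (a set of categories)
diet <- as.factor(c("carni", "omni", "herbi", "omni"))
class(diet)                # "factor"
levels(diet)               # "carni" "herbi" "omni"
```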

We can also look at a single row at a time. There are two ways to access this:

by indicating the row number in square brackets next to the name of the dataframe name[#,]
by calling the actual name of the row (if your rows have names) name["a",]. msleep has no row names, so below we instead ask for the row whose name column equals "Mountain beaver".

msleep[43,]

## # A tibble: 1 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Little… Myot… inse… Chir… <NA>                19.9         2         0.2   4.1
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

msleep[msleep$name == "Mountain beaver",]

## # A tibble: 1 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Mounta… Aplo… herbi Rode… nt                  14.4       2.4          NA   9.6
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Mountain Beaver by Coke Smith https://www.nps.gov/articles/000/mapping-mountain-beavers-in-point-reyes-a-collaboration-between-the-national-park-service-and-uc-berkeley.htm
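
The msleep$name == "Mountain beaver" part of that command is a test that gives TRUE or FALSE for every row. Here is a minimal sketch of the same idea on a small data frame called sleepers, built by hand from a few of the msleep values printed in this lesson (the object name is made up).

```r
# a few rows copied by hand from the msleep output in this lesson
sleepers <- data.frame(
  name        = c("Cow", "Dog", "Cheetah", "Red fox"),
  vore        = c("herbi", "carni", "carni", "carni"),
  sleep_total = c(4.0, 10.1, 12.1, 9.8)
)

# the comparison gives one TRUE/FALSE per row
sleepers$name == "Dog"                 # FALSE TRUE FALSE FALSE

# put that TRUE/FALSE vector in the row position to keep matching rows
dog_row <- sleepers[sleepers$name == "Dog", ]
dog_row

# any test works, e.g. animals that sleep more than 10 hours
long_sleepers <- sleepers[sleepers$sleep_total > 10, ]
long_sleepers$name                     # "Dog" "Cheetah"
```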

We can select more than one row or column at a time:

 # see two columns

msleep[,c(1, 6)]

## # A tibble: 83 × 2
##    name                       sleep_total
##    <chr>                            <dbl>
##  1 Cheetah                           12.1
##  2 Owl monkey                        17  
##  3 Mountain beaver                   14.4
##  4 Greater short-tailed shrew        14.9
##  5 Cow                                4  
##  6 Three-toed sloth                  14.4
##  7 Northern fur seal                  8.7
##  8 Vesper mouse                       7  
##  9 Dog                               10.1
## 10 Roe deer                           3  
## # ℹ 73 more rows

 # and make a new data frame from these subsets

subsleep<-msleep[,c(1, 6)]

But what if we actually care about how many unique things are in a column?

 # unique()
unique(msleep[, "order"])

## # A tibble: 19 × 1
##    order          
##    <chr>          
##  1 Carnivora      
##  2 Primates       
##  3 Rodentia       
##  4 Soricomorpha   
##  5 Artiodactyla   
##  6 Pilosa         
##  7 Cingulata      
##  8 Hyracoidea     
##  9 Didelphimorphia
## 10 Proboscidea    
## 11 Chiroptera     
## 12 Perissodactyla 
## 13 Erinaceomorpha 
## 14 Cetacea        
## 15 Lagomorpha     
## 16 Diprotodontia  
## 17 Monotremata    
## 18 Afrosoricida   
## 19 Scandentia

 # table()
table(msleep$order)

## 
##    Afrosoricida    Artiodactyla       Carnivora         Cetacea      Chiroptera 
##               1               6              12               3               2 
##       Cingulata Didelphimorphia   Diprotodontia  Erinaceomorpha      Hyracoidea 
##               2               2               2               2               3 
##      Lagomorpha     Monotremata  Perissodactyla          Pilosa        Primates 
##               1               1               3               1              12 
##     Proboscidea        Rodentia      Scandentia    Soricomorpha 
##               2              22               1               5

 # levels(), if class is factor (and if not we can make it a factor), shows the categories in the order R stores them
levels(as.factor(msleep$order))

##  [1] "Afrosoricida"    "Artiodactyla"    "Carnivora"       "Cetacea"        
##  [5] "Chiroptera"      "Cingulata"       "Didelphimorphia" "Diprotodontia"  
##  [9] "Erinaceomorpha"  "Hyracoidea"      "Lagomorpha"      "Monotremata"    
## [13] "Perissodactyla"  "Pilosa"          "Primates"        "Proboscidea"    
## [17] "Rodentia"        "Scandentia"      "Soricomorpha"
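
If we only want the number of unique things rather than the list itself, we can wrap unique() in length(). A short sketch on a made-up vector of orders:

```r
# a made-up vector with repeated values
orders <- c("Carnivora", "Primates", "Rodentia",
            "Primates", "Rodentia", "Rodentia")

unique(orders)            # "Carnivora" "Primates" "Rodentia"
length(unique(orders))    # how many different orders: 3

# table() counts how often each value appears
counts <- table(orders)
counts[["Rodentia"]]      # 3

# sort the counts so the most common value comes first
sort(counts, decreasing = TRUE)
```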

6.9 Data Manipulation

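If your data is oriented in a way that isn’t useful to you (rows where you want columns), you can transpose it with t(). Note that this often changes the class of each column!

In R, all entries within a column must have the same type. t() returns a matrix, and a matrix can hold only one type, so here every value, including the numbers, is converted to character: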

 # t()

tr_msleep<-t(msleep)

str(tr_msleep)

##  chr [1:11, 1:83] "Cheetah" "Acinonyx" "carni" "Carnivora" "lc" "12.1" NA ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:11] "name" "genus" "vore" "order" ...
##   ..$ : NULL
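
To watch the class change happen on something small, here is a sketch with a two-row made-up data frame called measurements. It also shows one way to get the numbers back with as.numeric().

```r
# a tiny data frame with one character and one numeric column
measurements <- data.frame(
  name  = c("Cow", "Dog"),
  sleep = c(4.0, 10.1)
)
class(measurements$sleep)    # "numeric"

# transpose: columns become rows, and the result is a matrix
flipped <- t(measurements)
is.matrix(flipped)           # TRUE
typeof(flipped)              # "character" (the numbers are now text)

# pull the sleep row back out and convert it to numbers again
sleep_again <- as.numeric(flipped["sleep", ])
class(sleep_again)           # "numeric"
```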

Often we want to summarize data. There are many ways of doing this in R:

 # calculate mean() of a column
mean(msleep$sleep_total)

## [1] 10.43373

 # max()
max(msleep$sleep_total)

## [1] 19.9

 # min()
min(msleep$sleep_total)

## [1] 1.9

 # summary()
summary(msleep$sleep_total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.90    7.85   10.10   10.43   13.75   19.90
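
One thing to watch for: sleep_total has no missing values, but columns like sleep_rem contain NA, and mean(), max() and min() return NA when any value is missing. A minimal sketch using the first six sleep_rem values from the head(msleep) output above (the object name rem is made up):

```r
# first six sleep_rem values, copied from the head(msleep) output
rem <- c(NA, 1.8, 2.4, 2.3, 0.7, 2.2)

mean(rem)                 # NA, because one value is missing
mean(rem, na.rm = TRUE)   # drop the NA first: 1.88
max(rem, na.rm = TRUE)    # 2.4

# count how many values are missing
sum(is.na(rem))           # 1
```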

Sometimes, the values we care about aren’t provided in a data set. When this happens, we can create a new column that contains the values we’re interested in:

  # what if what we cared about was our sleep_total/sleep_rem ratio?
  # add a sleep_total/sleep_rem ratio column to our msleep dataframe with $
msleep$total_rem<-msleep$sleep_total/msleep$sleep_rem

  # look at our dataframe again. It now contains 12 columns, one of them being the one we just created.
head(msleep)

## # A tibble: 6 × 12
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, total_rem <dbl>
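
The same $ pattern works for any calculation. Here is a small sketch on a made-up data frame called sleep_small, filled by hand with three rows of msleep values printed earlier; the new column names are invented for illustration.

```r
# three rows copied by hand from the msleep output in this lesson
sleep_small <- data.frame(
  name        = c("Owl monkey", "Cow", "Red fox"),
  sleep_total = c(17, 4, 9.8),
  awake       = c(7, 20, 14.2)
)

# new column: fraction of a 24-hour day spent asleep
sleep_small$prop_asleep <- sleep_small$sleep_total / 24

# new column from two existing columns: hours asleep plus hours awake
sleep_small$day_length <- sleep_small$sleep_total + sleep_small$awake

ncol(sleep_small)         # we started with 3 columns and now have 5
sleep_small$day_length    # 24 24 24
```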

6.10 Group Work Activity: practice exploring a dataframe

Iris Diagram for Reference

We’ll use the built-in ‘iris’ dataset. The command data(iris) loads the ‘iris’ dataset. You can view more information about this dataset with help(iris) or ?iris

This dataset was published by Ronald Fisher in his 1936 paper: “The use of multiple measurements in taxonomic problems”. It has three plant species (setosa, virginica, versicolor) and four morphological traits measured for each sample in centimeters: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

It is important to acknowledge that the field of genetics has been built on eugenics, and Fisher was a prominent geneticist and eugenicist. More information about this can be accessed here: https://www.ucl.ac.uk/biosciences/gee/ucl-centre-computational-biology/ronald-aylmer-fisher-1890-1962

Include copied text from your R script that is relevant to the questions below in your submission and answers to the following questions. Remember you can work in groups but everything is submitted individually. Go back to the iris dataset, which can be loaded with data(iris) and submit code and answers for the following questions:

How many rows are in the dataset? What are three distinct ways to figure this out?

How many species of flowers are in the dataset? What class is ‘iris’ ?

How many columns does this data frame have? What are their names?

What class did R assign to each column?

Now for writing your own commands. Assign the first flower’s petal width and petal length to new objects called setosa1.petalwidth and setosa1.petallength. Then calculate the approximate area of the petal of the first flower, setosa1 (assume petal area can be approximated by a rectangle).

Next, calculate the petal area of each flower in the iris dataset and assign this to a new column named PetalArea.

What is the maximum sepal length of the irises? What about the minimum, median and average sepal length among all flowers in the dataset?

Reminder of those useful commands: dataframename[row , col], str(), dim(), nrow(), ncol(), unique(), length(), rownames(), summary(), min(), max(), mean(), range(), levels(), factor(), as.factor(), class(), table(), sum(), quantile(), var()

6.11 Key Points

Functions such as install.packages() and library() help us install and load packages (and the data that comes with them), while other functions such as str(), dim(), and unique() help us investigate dataframes
To look at our data we can use several commands, including View() (with a capital V), or data$columnofinterest if you only want to look at one variable.
Manipulating data in R can be extremely helpful in the analysis stage, and we can summarize different variables of interest with mean(), min(), max(), and summary()

Class Exercise Solution


animals[2] “meow”

site_code[10] “RP”

length(site_code) 21 sample sites (summary(site_code) also reports the length)