Data Preparation with DPLYR

class: left, middle, inverse, title-slide

.title[
# Data Preparation with DPLYR
]

---

# Data Preparation

# Packages for Today
Today we are going to be mainly using the dplyr package and the readr package. Since they are both contained within the tidyverse package, we will just load the tidyverse package.

```r
#Run install.packages if you have never installed the tidyverse on your machine.
#install.packages("tidyverse") 
library(tidyverse)

# OR JUST

library(dplyr)
library(readr)
library(stringr)
library(tidyr)
```

---

Read in the rodent sightings dataset. This dataset is different from the previous class as it looks at individual rodent sightings instead of looking at an aggregated level where each row is the average for an entire species. We are going to be using the `read_csv()` package today, which will read in the data with some added functionality above `read.csv()`.

```r
rodents <- read_csv('rodent_surveys.csv')

# OR 
# rodents <- read.csv('rodent_surveys.csv')
# rodents <- tibble(rodents)
```

---

## dplyr

dplyr is a package in R that allows you to work with and manipulate your data. It will allow us to focus in on the variables (columns) of interest and the observations (rows) of interest.

## The Pipe

The pipe is an operator in R that allows you to chain together functions in dplyr. In the past, to use multiple functions you would have to nest your functions inside of each other, but using the pipe allows you to chain the functions together in a readable/reproducible format. The pipe character is `%>%` and essentially means "then" You can create the character using `CTRL+SHIFT+M` or `CMD+SHIFT+M`

The pipe essentially takes whatever is before the pipe as input for whatever is after the pipe.

```r
# Normal way to view your dataset
View(rodents)

# Using the pipe, the rodents dataframe is passed on to the View function
rodents %>% 
  View()
```
---

```r
# The head function looks at the top of the dataframe
head(rodents)
```

```
## # A tibble: 6 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 OL      exclosure <NA>     NA    NA
## 2 1978-01-08 <NA>    exclosure <NA>     NA    NA
## 3 1978-01-08 PF      exclosure M        15     7
## 4 1978-01-08 OT      control   <NA>     NA    NA
## 5 1978-01-08 DM      control   M        36    40
## 6 1978-01-08 DM      control   F        34    37
```

```r
# When you have a function with multiple arguments, the first argument will
# be taken from whatever is before the pipe operator.
head(rodents, 3)
```

```
## # A tibble: 3 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 OL      exclosure <NA>     NA    NA
## 2 1978-01-08 <NA>    exclosure <NA>     NA    NA
## 3 1978-01-08 PF      exclosure M        15     7
```

```r
rodents %>% 
  head(3)
```

```r
rodents %>% 
  head(5) %>% 
  tail(1)
```

```
## # A tibble: 1 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 DM      control   M        36    40
```

```r
# The pipe is useful for cases like this, where you have three nested functions
head(tail(head(rodents, 36), 15), 3)
```

```
## # A tibble: 3 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 DS      control   M        53   136
## 2 1978-01-08 OT      exclosure <NA>     NA    NA
## 3 1978-01-08 OT      exclosure <NA>     NA    NA
```

```r
rodents %>% 
  head(36) %>% 
  tail(15) %>% 
  head(3)
```

## distinct/count

The distinct function will provide for you the distinct values of a column, while count provides you with both the distinct values of a column and then number of times each value shows up. The following example investigates the different species (`species`) in the `rodents` dataframe:

```r
rodents %>% 
  distinct(species)
```

```
## # A tibble: 21 × 1
##    species
##    <chr>  
##  1 OL     
##  2 <NA>   
##  3 PF     
##  4 OT     
##  5 DM     
##  6 DS     
##  7 RM     
##  8 DO     
##  9 PM     
## 10 PP     
## # … with 11 more rows
```

```r
#Pipe into nrow() to get the number of distinct species (other functions also exist to do something similar)
rodents %>% 
  distinct(species) %>% 
  nrow()
```

```
## [1] 21
```

---

```r
rodents %>% 
  count(species)
```

```
## # A tibble: 21 × 2
##    species     n
##    <chr>   <int>
##  1 BA        211
##  2 DM       7207
##  3 DO       1597
##  4 DS        569
##  5 OL        521
##  6 OT       1593
##  7 PB       4095
##  8 PE        853
##  9 PF       1082
## 10 PH         17
## # … with 11 more rows
```
---

## Exercise 
Using the `rodents` dataframe:

1. Using the pipe, get the summary of the dataframe

2. How many distinct types of treatment values are in the `rodents` dataset?

3. How many times does `PB` come up in `species`?

---

```r
#1.
rodents %>% 
  summary()
```

```
##    censusdate           species         
##  Min.   :1978-01-08   Length:28364      
##  1st Qu.:1994-03-11   Class :character  
##  Median :2003-12-20   Mode  :character  
##  Mean   :2002-06-25                     
##  3rd Qu.:2012-02-19                     
##  Max.   :2022-02-26                     
##                                         
##   treatment             sex                 hfl       
##  Length:28364       Length:28364       Min.   : 6.00  
##  Class :character   Class :character   1st Qu.:21.00  
##  Mode  :character   Mode  :character   Median :24.00  
##                                        Mean   :26.55  
##                                        3rd Qu.:35.00  
##                                        Max.   :55.00  
##                                        NA's   :715    
##       wgt        
##  Min.   :  4.00  
##  1st Qu.: 17.00  
##  Median : 26.00  
##  Mean   : 32.54  
##  3rd Qu.: 43.00  
##  Max.   :259.00  
##  NA's   :675
```

```r
#2.
rodents %>% 
  distinct(treatment) %>% 
  nrow()
```

```
## [1] 3
```

```r
  #3

#3. 
rodents %>% 
  count(species)
```

```
## # A tibble: 21 × 2
##    species     n
##    <chr>   <int>
##  1 BA        211
##  2 DM       7207
##  3 DO       1597
##  4 DS        569
##  5 OL        521
##  6 OT       1593
##  7 PB       4095
##  8 PE        853
##  9 PF       1082
## 10 PH         17
## # … with 11 more rows
```

```r
  #4095
```

---

## Arrange

The `arrange()` function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the `desc()` function to arrange by descending.

The following code would get the number of times each species shows up in the `rodents` dataset:

```r
rodents %>% 
  count(species) %>% 
  arrange(n)
```

```
## # A tibble: 21 × 2
##    species     n
##    <chr>   <int>
##  1 PH         17
##  2 SO         23
##  3 RF         30
##  4 PL         35
##  5 RO         40
##  6 PI         55
##  7 SF        194
##  8 BA        211
##  9 SH        215
## 10 <NA>      399
## # … with 11 more rows
```

---

```r
# Since the default is ascending order, we are not getting the results that are probably useful, so let's use the desc() function
rodents %>% 
  count(species) %>% 
  arrange(desc(n))
```

```
## # A tibble: 21 × 2
##    species     n
##    <chr>   <int>
##  1 PP       7270
##  2 DM       7207
##  3 PB       4095
##  4 RM       1812
##  5 DO       1597
##  6 OT       1593
##  7 PF       1082
##  8 PE        853
##  9 DS        569
## 10 PM        546
## # … with 11 more rows
```

```r
# Alternatively you can use the "-" to denote descending
rodents %>% 
  count(species) %>% 
  arrange(-n)
```

---

## filter

If you want to filter **rows** of the data where some condition is true, use the `filter()` function.

1. The first argument is the data frame you want to filter, e.g. `filter(mydata, ...`.
2. The second argument is a condition you must satisfy, e.g. `filter(clean, variable == "levelA")`.

- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than, greater than or equal to
- `<`, `<=`: Less than, less than or equal to

If you want to satisfy *all* of multiple conditions, you can use the "and" operator, `&`.

The "or" operator `|` (the pipe character, usually shift-backslash) will return a subset that meet *any* of the conditions.

```r
#Keeps those rodents that have a weight greater than 60
rodents %>% 
  filter(wgt > 60) %>% head(n = 5)
```

```
## # A tibble: 5 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 DS      control   F        47   117
## 2 1978-01-08 DS      control   M        49   132
## 3 1978-01-08 DS      control   F        50   126
## 4 1978-01-08 DS      control   F        51   122
## 5 1978-01-08 DS      control   F        48   116
```

```r
# Keeps those rodents that weight less than or 
# equal to 10
rodents %>% 
  filter(wgt <= 10)  %>% head(n = 5)
```

```
## # A tibble: 5 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 PF      exclosure M        15     7
## 2 1978-03-11 PF      exclosure M        15     8
## 3 1978-06-08 PF      exclosure M        15     8
## 4 1978-07-07 PF      exclosure M        16     8
## 5 1978-07-07 PP      exclosure F        23    10
```

---

```r
#Keeps those rodents that have a wgt greater than 60 who have been treated as control
rodents %>% 
  filter(wgt > 60 & treatment == "control")  %>%
  head(n = 5)
```

```r
#Count the number of rodents fitting the conditions above
rodents %>% 
  filter(wgt > 60 & treatment == "control") %>%
  count()
```

```
## # A tibble: 1 × 1
##       n
##   <int>
## 1   868
```

```r
#Keeps Species that contain a capital P
rodents %>% 
  filter(str_detect(species, "P"))  %>% head(n = 5)
```

```
## # A tibble: 5 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 PF      exclosure M        15     7
## 2 1978-02-18 PF      control   <NA>     NA    NA
## 3 1978-03-11 PF      exclosure M        15     8
## 4 1978-03-11 PF      exclosure F        NA    NA
## 5 1978-03-11 PF      control   <NA>     NA    NA
```

---

```r
# The filter function is helpful for dropping missing values
rodents %>% 
  filter(!is.na(hfl))  %>% head(n = 5)
```

```
## # A tibble: 5 × 6
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 PF      exclosure M        15     7
## 2 1978-01-08 DM      control   M        36    40
## 3 1978-01-08 DM      control   F        34    37
## 4 1978-01-08 DS      control   F        47   117
## 5 1978-01-08 DM      control   M        36    40
```

```r
# drop_na works in a similar manner, but be careful
rodents %>% 
  drop_na(hfl) %>% head(n = 5)
```

```r
 # This would drop any observation that has a single missing value
rodents %>% 
  drop_na() %>% head(n = 5)
```

---

```r
# You can also save the results of a pipeline. 
# Here I want to get the top 3 biggest rodents in weight
big_rodents <- rodents %>% 
  drop_na(species) %>% 
  arrange(-wgt) %>% 
  head(3)

big_rodents %>% View()
```

### the %in% operator

The %in% operator is a useful tool when you have a series of conditions that you want to check. Instead of asking whether cond1 or cond2 or cond3 .... are met, you can use the %in% operator to check them all at once.

```r
specs <- c("SF", "DS", "PB")
specs
```

```
## [1] "SF" "DS" "PB"
```

```r
#This code will look through the rodents$species column and will only 
#return the values that match any value in the specs vector.
rodents %>% 
  filter(species %in% specs) %>% 
  count(species)
```

```
## # A tibble: 3 × 2
##   species     n
##   <chr>   <int>
## 1 DS        569
## 2 PB       4095
## 3 SF        194
```

---

## select()

Whereas the `filter()` function allows you to return only certain _rows_ matching a condition, the `select()` function returns only certain _columns_. The first argument is the data, and subsequent arguments are the columns you want.

```r
# You can simply list the column names you want to keep 
rodents %>% 
  select(species)
```

```
## # A tibble: 28,364 × 1
##    species
##    <chr>  
##  1 OL     
##  2 <NA>   
##  3 PF     
##  4 OT     
##  5 DM     
##  6 DM     
##  7 DS     
##  8 DM     
##  9 DS     
## 10 OL     
## # … with 28,354 more rows
```

```r
# Multiple columns can be separated by a comma
rodents %>% 
  select(censusdate, species)
```

```
## # A tibble: 28,364 × 2
##    censusdate species
##    <date>     <chr>  
##  1 1978-01-08 OL     
##  2 1978-01-08 <NA>   
##  3 1978-01-08 PF     
##  4 1978-01-08 OT     
##  5 1978-01-08 DM     
##  6 1978-01-08 DM     
##  7 1978-01-08 DS     
##  8 1978-01-08 DM     
##  9 1978-01-08 DS     
## 10 1978-01-08 OL     
## # … with 28,354 more rows
```

---

```r
# Or you can use the - sign to tell dplyr the columns you want to drop
rodents %>% 
  select(-species, -treatment) %>% 
  arrange(desc(censusdate))
```

```
## # A tibble: 28,364 × 4
##    censusdate sex     hfl   wgt
##    <date>     <chr> <dbl> <dbl>
##  1 2022-02-26 M        36    46
##  2 2022-02-26 M        36    44
##  3 2022-02-26 M        37    51
##  4 2022-02-26 M        21    16
##  5 2022-02-26 M        21    19
##  6 2022-02-26 F        38    44
##  7 2022-02-26 F        21    18
##  8 2022-02-26 F        21    14
##  9 2022-02-26 F        35    28
## 10 2022-02-26 F        21    14
## # … with 28,354 more rows
```

```r
# You can also use the select function to rename variables
rodents %>% 
  select(species_name = species, date = censusdate, hfl, wgt)
```

```
## # A tibble: 28,364 × 4
##    species_name date         hfl   wgt
##    <chr>        <date>     <dbl> <dbl>
##  1 OL           1978-01-08    NA    NA
##  2 <NA>         1978-01-08    NA    NA
##  3 PF           1978-01-08    15     7
##  4 OT           1978-01-08    NA    NA
##  5 DM           1978-01-08    36    40
##  6 DM           1978-01-08    34    37
##  7 DS           1978-01-08    47   117
##  8 DM           1978-01-08    36    40
##  9 DS           1978-01-08    49   132
## 10 OL           1978-01-08    NA    NA
## # … with 28,354 more rows
```

```r
# There is also a rename() function that does something similar
```
---

### starts_with(), ends_with(), contains()

The `starts_with`, `ends_with` and `contains` functions provide very useful tools for dropping/keeping several variables at once without having to list each and every column you want to keep. The function will return columns that either start with a specific string of text, ends with a certain string of text, or contain a certain string of text.

```r
rodents %>% 
  select(starts_with("s"))
```

```
## # A tibble: 28,364 × 2
##    species sex  
##    <chr>   <chr>
##  1 OL      <NA> 
##  2 <NA>    <NA> 
##  3 PF      M    
##  4 OT      <NA> 
##  5 DM      M    
##  6 DM      F    
##  7 DS      F    
##  8 DM      M    
##  9 DS      M    
## 10 OL      <NA> 
## # … with 28,354 more rows
```

```r
rodents %>% 
  select(-ends_with("e"), -ends_with("t"))
```

```
## # A tibble: 28,364 × 3
##    species sex     hfl
##    <chr>   <chr> <dbl>
##  1 OL      <NA>     NA
##  2 <NA>    <NA>     NA
##  3 PF      M        15
##  4 OT      <NA>     NA
##  5 DM      M        36
##  6 DM      F        34
##  7 DS      F        47
##  8 DM      M        36
##  9 DS      M        49
## 10 OL      <NA>     NA
## # … with 28,354 more rows
```

```r
rodents %>% 
  select(contains("m"))
```

```
## # A tibble: 28,364 × 1
##    treatment
##    <chr>    
##  1 exclosure
##  2 exclosure
##  3 exclosure
##  4 control  
##  5 control  
##  6 control  
##  7 control  
##  8 control  
##  9 control  
## 10 control  
## # … with 28,354 more rows
```

---

```r
# You can also create a vector of column names you want to keep 
rodents %>% 
  names()
```

```
## [1] "censusdate" "species"    "treatment"  "sex"       
## [5] "hfl"        "wgt"
```

```r
vars <- c( "species", "treatment", "hfl", "wgt")

rodents %>% 
  select(all_of(vars))
```

```
## # A tibble: 28,364 × 4
##    species treatment   hfl   wgt
##    <chr>   <chr>     <dbl> <dbl>
##  1 OL      exclosure    NA    NA
##  2 <NA>    exclosure    NA    NA
##  3 PF      exclosure    15     7
##  4 OT      control      NA    NA
##  5 DM      control      36    40
##  6 DM      control      34    37
##  7 DS      control      47   117
##  8 DM      control      36    40
##  9 DS      control      49   132
## 10 OL      control      NA    NA
## # … with 28,354 more rows
```

```r
rodents %>% 
  select(-vars)
```

```
## # A tibble: 28,364 × 2
##    censusdate sex  
##    <date>     <chr>
##  1 1978-01-08 <NA> 
##  2 1978-01-08 <NA> 
##  3 1978-01-08 M    
##  4 1978-01-08 <NA> 
##  5 1978-01-08 M    
##  6 1978-01-08 F    
##  7 1978-01-08 F    
##  8 1978-01-08 M    
##  9 1978-01-08 M    
## 10 1978-01-08 <NA> 
## # … with 28,354 more rows
```
---

However, you can go ahead and combine `filter` and `select` to subset your data even further.

```r
rodents %>% 
  filter(species %in% specs) %>% 
  select(vars)
```

```
## # A tibble: 4,858 × 4
##    species treatment   hfl   wgt
##    <chr>   <chr>     <dbl> <dbl>
##  1 DS      control      47   117
##  2 DS      control      49   132
##  3 DS      control      50   126
##  4 DS      control      51   122
##  5 DS      control      48   116
##  6 DS      control      50   133
##  7 DS      control      53   136
##  8 DS      control      49    NA
##  9 DS      control      NA    NA
## 10 DS      control      49   124
## # … with 4,848 more rows
```

## Exercise

1. Using the `rodents` dataset, how many rodents have a hindfood length greater than 52.  Answer should be 32. `hint: use filter and nrow()`

2. Use filter to find the male rodents with a weight greater than 85, followed by select to show the species, censusdate, and weight for these observations, arranged in descending order by weight

3. If I tried to arrange the answer from question 2 by sex type instead of weight, would that work? Why or why not?

---

```r
#1. 
rodents %>% 
  filter(hfl > 52) %>% 
  nrow()
```

```
## [1] 32
```

```r
#2.
rodents %>% 
  filter(wgt > 85 & sex == "M") %>% 
  select(species, censusdate, wgt) %>% 
  arrange(-wgt) %>% 
  drop_na(species) 
```

```
## # A tibble: 307 × 3
##    species censusdate   wgt
##    <chr>   <date>     <dbl>
##  1 DS      1981-11-22   170
##  2 DS      1980-03-21   160
##  3 DS      1979-10-24   156
##  4 DS      1981-12-30   155
##  5 DS      1981-09-29   153
##  6 DS      1982-02-22   150
##  7 DS      1980-08-13   149
##  8 DS      1980-10-11   149
##  9 DS      1980-11-08   149
## 10 DS      1981-01-11   148
## # … with 297 more rows
```

```r
### May also want to only see those with an actual species value

#3. 
#It Would not work because sex is no longer a part of the dataset at this point
```

---

## summarize()

The `summarize()` function summarizes multiple values to a single value. On its own the `summarize()` function doesn't seem to be all that useful. The dplyr package provides a few convenience functions called `n()` and `n_distinct()` that tell you the number of observations or the number of distinct values of a particular variable.

Notice that summarize takes a data frame and returns a data frame. In this case it's a 1x1 data frame with a single row and a single column. The name of the column, by default is whatever the expression was used to summarize the data. This usually isn't pretty, and if we wanted to work with this resulting data frame later on, we'd want to name that returned value something easier to deal with.

```r
rodents %>% 
  summarize(mean(hfl))
```

```
## # A tibble: 1 × 1
##   `mean(hfl)`
##         <dbl>
## 1          NA
```

```r
#Gotta watch out for nas. Use na.rm = TRUE to run the calculation without nas being involved.
rodents %>% 
  summarize(mean(hfl, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   `mean(hfl, na.rm = TRUE)`
##                       <dbl>
## 1                      26.5
```

---

```r
#The default column names do not look very aesthetically pleasing, let's take care of that.
rodents %>% 
  summarize(hfl_mean = mean(hfl, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   hfl_mean
##      <dbl>
## 1     26.5
```

```r
#Again, summarize alone is not entirely useful
rodents %>% 
  summarize(n())
```

```
## # A tibble: 1 × 1
##   `n()`
##   <int>
## 1 28364
```

## group_by()

We saw that `summarize()` isn't that useful on its own. Neither is `group_by()` All this does is takes an existing data frame and coverts it into a grouped data frame where operations are performed by group.

```r
rodents %>% 
  group_by(species) %>% head()
```

```
## # A tibble: 6 × 6
## # Groups:   species [5]
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 OL      exclosure <NA>     NA    NA
## 2 1978-01-08 <NA>    exclosure <NA>     NA    NA
## 3 1978-01-08 PF      exclosure M        15     7
## 4 1978-01-08 OT      control   <NA>     NA    NA
## 5 1978-01-08 DM      control   M        36    40
## 6 1978-01-08 DM      control   F        34    37
```

```r
rodents %>% 
  group_by(species, sex) %>% head()
```

```
## # A tibble: 6 × 6
## # Groups:   species, sex [6]
##   censusdate species treatment sex     hfl   wgt
##   <date>     <chr>   <chr>     <chr> <dbl> <dbl>
## 1 1978-01-08 OL      exclosure <NA>     NA    NA
## 2 1978-01-08 <NA>    exclosure <NA>     NA    NA
## 3 1978-01-08 PF      exclosure M        15     7
## 4 1978-01-08 OT      control   <NA>     NA    NA
## 5 1978-01-08 DM      control   M        36    40
## 6 1978-01-08 DM      control   F        34    37
```

---

The real power comes in where `group_by()` and `summarize()` are used together. First, write the `group_by()` statement. Then pipe the result to a call to `summarize()`.

```r
#Let's summarize the mean hindfood length of the different species
rodents %>% 
  group_by(species) %>% 
  summarize(hfl = mean(hfl, na.rm = TRUE))
```

```
## # A tibble: 21 × 2
##    species   hfl
##    <chr>   <dbl>
##  1 BA       13.1
##  2 DM       35.9
##  3 DO       35.5
##  4 DS       49.7
##  5 OL       20.3
##  6 OT       20.0
##  7 PB       26.0
##  8 PE       19.9
##  9 PF       15.4
## 10 PH       24.9
## # … with 11 more rows
```

```r
rodents %>% 
  group_by(species) %>% 
  summarize(hfl = mean(hfl, na.rm = TRUE)) %>% 
  arrange(-hfl)
```

```
## # A tibble: 21 × 2
##    species   hfl
##    <chr>   <dbl>
##  1 DS       49.7
##  2 DM       35.9
##  3 DO       35.5
##  4 <NA>     32.2
##  5 SH       28.7
##  6 PB       26.0
##  7 SO       25.9
##  8 SF       25.8
##  9 PH       24.9
## 10 PI       21.8
## # … with 11 more rows
```

---

```r
# Filter/select are often used in conjunction with summarize
rodents %>% 
  drop_na(species) %>% 
  filter(sex == "F") %>% 
  group_by(species, sex) %>% 
  summarize(hfl = mean(hfl, na.rm = TRUE)) %>% 
  arrange(-hfl)
```

```
## # A tibble: 20 × 3
## # Groups:   species [20]
##    species sex     hfl
##    <chr>   <chr> <dbl>
##  1 DS      F      49.2
##  2 DM      F      35.6
##  3 DO      F      35.3
##  4 SH      F      28.8
##  5 SO      F      25.9
##  6 PB      F      25.8
##  7 SF      F      25.7
##  8 PH      F      24.6
##  9 PI      F      21.6
## 10 PP      F      21.3
## 11 PM      F      20.2
## 12 OL      F      20.1
## 13 OT      F      20.0
## 14 PE      F      20.0
## 15 PL      F      19.3
## 16 RF      F      17.6
## 17 RM      F      16.3
## 18 RO      F      15.5
## 19 PF      F      15.4
## 20 BA      F      13.2
```

---
## MUTATE ##

The `mutate` function allows you to add or modify variables to your dataset.

```r
# Let us calculate the hindfoot length to weight ratio

rodents %>% 
  mutate(hfl_wgt = (hfl/ wgt)) %>% 
  select(hfl, wgt, hfl_wgt) %>% head()
```

```
## # A tibble: 6 × 3
##     hfl   wgt hfl_wgt
##   <dbl> <dbl>   <dbl>
## 1    NA    NA  NA    
## 2    NA    NA  NA    
## 3    15     7   2.14 
## 4    NA    NA  NA    
## 5    36    40   0.9  
## 6    34    37   0.919
```

```r
# This new variable is only temporary, let's save a new object with this column
rodents_2 <- rodents %>% 
  mutate(hfl_wgt = (hfl/ wgt)) %>% head()
```

---

```r
# We can now look at our new dataset with our new variable.
rodents_2 %>% 
  group_by(species) %>%
  drop_na(species) %>% 
  summarize(hfl_wgt = mean(hfl_wgt, na.rm = TRUE)) %>% arrange(-hfl_wgt) %>% 
  select(species, hfl_wgt) 
```

```
## # A tibble: 4 × 2
##   species hfl_wgt
##   <chr>     <dbl>
## 1 PF        2.14 
## 2 DM        0.909
## 3 OL      NaN    
## 4 OT      NaN
```

```r
# mutate with if_else
  #if_else statements allow you to conditionally make changes to your data. 
  #if the condition is true, then the first option will be selected, if not, the 
  #second option will be selected.
rodents_2 %>% 
  mutate(control = ifelse(treatment == "control", 1, 0)) %>% select(treatment, control) %>% head()
```

```
## # A tibble: 6 × 2
##   treatment control
##   <chr>       <dbl>
## 1 exclosure       0
## 2 exclosure       0
## 3 exclosure       0
## 4 control         1
## 5 control         1
## 6 control         1
```

---

```r
##case_when
  # the case_when function allows you to complete multiple if_else statements in a single group of commands
rodents_2 <- rodents_2 %>%
  mutate(hfl_cat = case_when(
    hfl <  17 ~ "Very Small",
    hfl >= 17 & hfl < 22 ~ "Small",
    hfl >= 22 & hfl < 32 ~ "Medium",
    hfl >= 32 ~ "Large"
    )
  ) %>% head()
rodents_2 %>% 
  drop_na(hfl_cat) %>% 
  count(hfl_cat)
```

```
## # A tibble: 2 × 2
##   hfl_cat        n
##   <chr>      <int>
## 1 Large          2
## 2 Very Small     1
```
---

## Writing out data

Just like you can read in data from read_csv(), you can also write out a csv file. Let's create a new file for our `rodents_2`` data.

```r
#Include the name of the object to export (rodents_2) and the name of the file
write_csv(rodents_2, "rodents_2.csv")
```