class: left, middle, inverse, title-slide .title[ # Data Preparation with DPLYR ] --- # Data Preparation # Packages for Today Today we are going to be mainly using the dplyr package and the readr package. Since they are both contained within the tidyverse package, we will just load the tidyverse package. ```r #Run install.packages if you have never installed the tidyverse on your machine. #install.packages("tidyverse") library(tidyverse) # OR JUST library(dplyr) library(readr) library(stringr) library(tidyr) ``` --- Read in the rodent sightings dataset. This dataset is different from the previous class as it looks at individual rodent sightings instead of looking at an aggregated level where each row is the average for an entire species. We are going to be using the `read_csv()` package today, which will read in the data with some added functionality above `read.csv()`. ```r rodents <- read_csv('rodent_surveys.csv') # OR # rodents <- read.csv('rodent_surveys.csv') # rodents <- tibble(rodents) ``` --- ## dplyr dplyr is a package in R that allows you to work with and manipulate your data. It will allow us to focus in on the variables (columns) of interest and the observations (rows) of interest. ## The Pipe The pipe is an operator in R that allows you to chain together functions in dplyr. In the past, to use multiple functions you would have to nest your functions inside of each other, but using the pipe allows you to chain the functions together in a readable/reproducible format. The pipe character is `%>%` and essentially means "then" You can create the character using `CTRL+SHIFT+M` or `CMD+SHIFT+M` The pipe essentially takes whatever is before the pipe as input for whatever is after the pipe. ```r # Normal way to view your dataset View(rodents) # Using the pipe, the rodents dataframe is passed on to the View function rodents %>% View() ``` --- ```r # The head function looks at the top of the dataframe head(rodents) ``` ``` ## # A tibble: 6 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 OL exclosure <NA> NA NA ## 2 1978-01-08 <NA> exclosure <NA> NA NA ## 3 1978-01-08 PF exclosure M 15 7 ## 4 1978-01-08 OT control <NA> NA NA ## 5 1978-01-08 DM control M 36 40 ## 6 1978-01-08 DM control F 34 37 ``` ```r # When you have a function with multiple arguments, the first argument will # be taken from whatever is before the pipe operator. head(rodents, 3) ``` ``` ## # A tibble: 3 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 OL exclosure <NA> NA NA ## 2 1978-01-08 <NA> exclosure <NA> NA NA ## 3 1978-01-08 PF exclosure M 15 7 ``` ```r rodents %>% head(3) ``` ``` ## # A tibble: 3 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 OL exclosure <NA> NA NA ## 2 1978-01-08 <NA> exclosure <NA> NA NA ## 3 1978-01-08 PF exclosure M 15 7 ``` --- ```r rodents %>% head(5) %>% tail(1) ``` ``` ## # A tibble: 1 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 DM control M 36 40 ``` ```r # The pipe is useful for cases like this, where you have three nested functions head(tail(head(rodents, 36), 15), 3) ``` ``` ## # A tibble: 3 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 DS control M 53 136 ## 2 1978-01-08 OT exclosure <NA> NA NA ## 3 1978-01-08 OT exclosure <NA> NA NA ``` ```r rodents %>% head(36) %>% tail(15) %>% head(3) ``` ``` ## # A tibble: 3 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 DS control M 53 136 ## 2 1978-01-08 OT exclosure <NA> NA NA ## 3 1978-01-08 OT exclosure <NA> NA NA ``` --- ## distinct/count The distinct function will provide for you the distinct values of a column, while count provides you with both the distinct values of a column and then number of times each value shows up. The following example investigates the different species (`species`) in the `rodents` dataframe: ```r rodents %>% distinct(species) ``` ``` ## # A tibble: 21 × 1 ## species ## <chr> ## 1 OL ## 2 <NA> ## 3 PF ## 4 OT ## 5 DM ## 6 DS ## 7 RM ## 8 DO ## 9 PM ## 10 PP ## # … with 11 more rows ``` ```r #Pipe into nrow() to get the number of distinct species (other functions also exist to do something similar) rodents %>% distinct(species) %>% nrow() ``` ``` ## [1] 21 ``` --- ```r rodents %>% count(species) ``` ``` ## # A tibble: 21 × 2 ## species n ## <chr> <int> ## 1 BA 211 ## 2 DM 7207 ## 3 DO 1597 ## 4 DS 569 ## 5 OL 521 ## 6 OT 1593 ## 7 PB 4095 ## 8 PE 853 ## 9 PF 1082 ## 10 PH 17 ## # … with 11 more rows ``` --- ## Exercise Using the `rodents` dataframe: 1. Using the pipe, get the summary of the dataframe 2. How many distinct types of treatment values are in the `rodents` dataset? 3. How many times does `PB` come up in `species`? --- ```r #1. rodents %>% summary() ``` ``` ## censusdate species ## Min. :1978-01-08 Length:28364 ## 1st Qu.:1994-03-11 Class :character ## Median :2003-12-20 Mode :character ## Mean :2002-06-25 ## 3rd Qu.:2012-02-19 ## Max. :2022-02-26 ## ## treatment sex hfl ## Length:28364 Length:28364 Min. : 6.00 ## Class :character Class :character 1st Qu.:21.00 ## Mode :character Mode :character Median :24.00 ## Mean :26.55 ## 3rd Qu.:35.00 ## Max. :55.00 ## NA's :715 ## wgt ## Min. : 4.00 ## 1st Qu.: 17.00 ## Median : 26.00 ## Mean : 32.54 ## 3rd Qu.: 43.00 ## Max. :259.00 ## NA's :675 ``` ```r #2. rodents %>% distinct(treatment) %>% nrow() ``` ``` ## [1] 3 ``` ```r #3 #3. rodents %>% count(species) ``` ``` ## # A tibble: 21 × 2 ## species n ## <chr> <int> ## 1 BA 211 ## 2 DM 7207 ## 3 DO 1597 ## 4 DS 569 ## 5 OL 521 ## 6 OT 1593 ## 7 PB 4095 ## 8 PE 853 ## 9 PF 1082 ## 10 PH 17 ## # … with 11 more rows ``` ```r #4095 ``` --- ## Arrange The `arrange()` function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the `desc()` function to arrange by descending. The following code would get the number of times each species shows up in the `rodents` dataset: ```r rodents %>% count(species) %>% arrange(n) ``` ``` ## # A tibble: 21 × 2 ## species n ## <chr> <int> ## 1 PH 17 ## 2 SO 23 ## 3 RF 30 ## 4 PL 35 ## 5 RO 40 ## 6 PI 55 ## 7 SF 194 ## 8 BA 211 ## 9 SH 215 ## 10 <NA> 399 ## # … with 11 more rows ``` --- ```r # Since the default is ascending order, we are not getting the results that are probably useful, so let's use the desc() function rodents %>% count(species) %>% arrange(desc(n)) ``` ``` ## # A tibble: 21 × 2 ## species n ## <chr> <int> ## 1 PP 7270 ## 2 DM 7207 ## 3 PB 4095 ## 4 RM 1812 ## 5 DO 1597 ## 6 OT 1593 ## 7 PF 1082 ## 8 PE 853 ## 9 DS 569 ## 10 PM 546 ## # … with 11 more rows ``` ```r # Alternatively you can use the "-" to denote descending rodents %>% count(species) %>% arrange(-n) ``` ``` ## # A tibble: 21 × 2 ## species n ## <chr> <int> ## 1 PP 7270 ## 2 DM 7207 ## 3 PB 4095 ## 4 RM 1812 ## 5 DO 1597 ## 6 OT 1593 ## 7 PF 1082 ## 8 PE 853 ## 9 DS 569 ## 10 PM 546 ## # … with 11 more rows ``` --- ## filter If you want to filter **rows** of the data where some condition is true, use the `filter()` function. 1. The first argument is the data frame you want to filter, e.g. `filter(mydata, ...`. 2. The second argument is a condition you must satisfy, e.g. `filter(clean, variable == "levelA")`. - `==`: Equal to - `!=`: Not equal to - `>`, `>=`: Greater than, greater than or equal to - `<`, `<=`: Less than, less than or equal to If you want to satisfy *all* of multiple conditions, you can use the "and" operator, `&`. The "or" operator `|` (the pipe character, usually shift-backslash) will return a subset that meet *any* of the conditions. ```r #Keeps those rodents that have a weight greater than 60 rodents %>% filter(wgt > 60) %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 DS control F 47 117 ## 2 1978-01-08 DS control M 49 132 ## 3 1978-01-08 DS control F 50 126 ## 4 1978-01-08 DS control F 51 122 ## 5 1978-01-08 DS control F 48 116 ``` ```r # Keeps those rodents that weight less than or # equal to 10 rodents %>% filter(wgt <= 10) %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 PF exclosure M 15 7 ## 2 1978-03-11 PF exclosure M 15 8 ## 3 1978-06-08 PF exclosure M 15 8 ## 4 1978-07-07 PF exclosure M 16 8 ## 5 1978-07-07 PP exclosure F 23 10 ``` --- ```r #Keeps those rodents that have a wgt greater than 60 who have been treated as control rodents %>% filter(wgt > 60 & treatment == "control") %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 DS control F 47 117 ## 2 1978-01-08 DS control M 49 132 ## 3 1978-01-08 DS control F 50 126 ## 4 1978-01-08 DS control F 51 122 ## 5 1978-01-08 DS control F 48 116 ``` ```r #Count the number of rodents fitting the conditions above rodents %>% filter(wgt > 60 & treatment == "control") %>% count() ``` ``` ## # A tibble: 1 × 1 ## n ## <int> ## 1 868 ``` ```r #Keeps Species that contain a capital P rodents %>% filter(str_detect(species, "P")) %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 PF exclosure M 15 7 ## 2 1978-02-18 PF control <NA> NA NA ## 3 1978-03-11 PF exclosure M 15 8 ## 4 1978-03-11 PF exclosure F NA NA ## 5 1978-03-11 PF control <NA> NA NA ``` --- ```r # The filter function is helpful for dropping missing values rodents %>% filter(!is.na(hfl)) %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 PF exclosure M 15 7 ## 2 1978-01-08 DM control M 36 40 ## 3 1978-01-08 DM control F 34 37 ## 4 1978-01-08 DS control F 47 117 ## 5 1978-01-08 DM control M 36 40 ``` ```r # drop_na works in a similar manner, but be careful rodents %>% drop_na(hfl) %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 PF exclosure M 15 7 ## 2 1978-01-08 DM control M 36 40 ## 3 1978-01-08 DM control F 34 37 ## 4 1978-01-08 DS control F 47 117 ## 5 1978-01-08 DM control M 36 40 ``` ```r # This would drop any observation that has a single missing value rodents %>% drop_na() %>% head(n = 5) ``` ``` ## # A tibble: 5 × 6 ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 PF exclosure M 15 7 ## 2 1978-01-08 DM control M 36 40 ## 3 1978-01-08 DM control F 34 37 ## 4 1978-01-08 DS control F 47 117 ## 5 1978-01-08 DM control M 36 40 ``` --- ```r # You can also save the results of a pipeline. # Here I want to get the top 3 biggest rodents in weight big_rodents <- rodents %>% drop_na(species) %>% arrange(-wgt) %>% head(3) big_rodents %>% View() ``` ### the %in% operator The %in% operator is a useful tool when you have a series of conditions that you want to check. Instead of asking whether cond1 or cond2 or cond3 .... are met, you can use the %in% operator to check them all at once. ```r specs <- c("SF", "DS", "PB") specs ``` ``` ## [1] "SF" "DS" "PB" ``` ```r #This code will look through the rodents$species column and will only #return the values that match any value in the specs vector. rodents %>% filter(species %in% specs) %>% count(species) ``` ``` ## # A tibble: 3 × 2 ## species n ## <chr> <int> ## 1 DS 569 ## 2 PB 4095 ## 3 SF 194 ``` --- ## select() Whereas the `filter()` function allows you to return only certain _rows_ matching a condition, the `select()` function returns only certain _columns_. The first argument is the data, and subsequent arguments are the columns you want. ```r # You can simply list the column names you want to keep rodents %>% select(species) ``` ``` ## # A tibble: 28,364 × 1 ## species ## <chr> ## 1 OL ## 2 <NA> ## 3 PF ## 4 OT ## 5 DM ## 6 DM ## 7 DS ## 8 DM ## 9 DS ## 10 OL ## # … with 28,354 more rows ``` ```r # Multiple columns can be separated by a comma rodents %>% select(censusdate, species) ``` ``` ## # A tibble: 28,364 × 2 ## censusdate species ## <date> <chr> ## 1 1978-01-08 OL ## 2 1978-01-08 <NA> ## 3 1978-01-08 PF ## 4 1978-01-08 OT ## 5 1978-01-08 DM ## 6 1978-01-08 DM ## 7 1978-01-08 DS ## 8 1978-01-08 DM ## 9 1978-01-08 DS ## 10 1978-01-08 OL ## # … with 28,354 more rows ``` --- ```r # Or you can use the - sign to tell dplyr the columns you want to drop rodents %>% select(-species, -treatment) %>% arrange(desc(censusdate)) ``` ``` ## # A tibble: 28,364 × 4 ## censusdate sex hfl wgt ## <date> <chr> <dbl> <dbl> ## 1 2022-02-26 M 36 46 ## 2 2022-02-26 M 36 44 ## 3 2022-02-26 M 37 51 ## 4 2022-02-26 M 21 16 ## 5 2022-02-26 M 21 19 ## 6 2022-02-26 F 38 44 ## 7 2022-02-26 F 21 18 ## 8 2022-02-26 F 21 14 ## 9 2022-02-26 F 35 28 ## 10 2022-02-26 F 21 14 ## # … with 28,354 more rows ``` ```r # You can also use the select function to rename variables rodents %>% select(species_name = species, date = censusdate, hfl, wgt) ``` ``` ## # A tibble: 28,364 × 4 ## species_name date hfl wgt ## <chr> <date> <dbl> <dbl> ## 1 OL 1978-01-08 NA NA ## 2 <NA> 1978-01-08 NA NA ## 3 PF 1978-01-08 15 7 ## 4 OT 1978-01-08 NA NA ## 5 DM 1978-01-08 36 40 ## 6 DM 1978-01-08 34 37 ## 7 DS 1978-01-08 47 117 ## 8 DM 1978-01-08 36 40 ## 9 DS 1978-01-08 49 132 ## 10 OL 1978-01-08 NA NA ## # … with 28,354 more rows ``` ```r # There is also a rename() function that does something similar ``` --- ### starts_with(), ends_with(), contains() The `starts_with`, `ends_with` and `contains` functions provide very useful tools for dropping/keeping several variables at once without having to list each and every column you want to keep. The function will return columns that either start with a specific string of text, ends with a certain string of text, or contain a certain string of text. ```r rodents %>% select(starts_with("s")) ``` ``` ## # A tibble: 28,364 × 2 ## species sex ## <chr> <chr> ## 1 OL <NA> ## 2 <NA> <NA> ## 3 PF M ## 4 OT <NA> ## 5 DM M ## 6 DM F ## 7 DS F ## 8 DM M ## 9 DS M ## 10 OL <NA> ## # … with 28,354 more rows ``` ```r rodents %>% select(-ends_with("e"), -ends_with("t")) ``` ``` ## # A tibble: 28,364 × 3 ## species sex hfl ## <chr> <chr> <dbl> ## 1 OL <NA> NA ## 2 <NA> <NA> NA ## 3 PF M 15 ## 4 OT <NA> NA ## 5 DM M 36 ## 6 DM F 34 ## 7 DS F 47 ## 8 DM M 36 ## 9 DS M 49 ## 10 OL <NA> NA ## # … with 28,354 more rows ``` ```r rodents %>% select(contains("m")) ``` ``` ## # A tibble: 28,364 × 1 ## treatment ## <chr> ## 1 exclosure ## 2 exclosure ## 3 exclosure ## 4 control ## 5 control ## 6 control ## 7 control ## 8 control ## 9 control ## 10 control ## # … with 28,354 more rows ``` --- ```r # You can also create a vector of column names you want to keep rodents %>% names() ``` ``` ## [1] "censusdate" "species" "treatment" "sex" ## [5] "hfl" "wgt" ``` ```r vars <- c( "species", "treatment", "hfl", "wgt") rodents %>% select(all_of(vars)) ``` ``` ## # A tibble: 28,364 × 4 ## species treatment hfl wgt ## <chr> <chr> <dbl> <dbl> ## 1 OL exclosure NA NA ## 2 <NA> exclosure NA NA ## 3 PF exclosure 15 7 ## 4 OT control NA NA ## 5 DM control 36 40 ## 6 DM control 34 37 ## 7 DS control 47 117 ## 8 DM control 36 40 ## 9 DS control 49 132 ## 10 OL control NA NA ## # … with 28,354 more rows ``` ```r rodents %>% select(-vars) ``` ``` ## # A tibble: 28,364 × 2 ## censusdate sex ## <date> <chr> ## 1 1978-01-08 <NA> ## 2 1978-01-08 <NA> ## 3 1978-01-08 M ## 4 1978-01-08 <NA> ## 5 1978-01-08 M ## 6 1978-01-08 F ## 7 1978-01-08 F ## 8 1978-01-08 M ## 9 1978-01-08 M ## 10 1978-01-08 <NA> ## # … with 28,354 more rows ``` --- However, you can go ahead and combine `filter` and `select` to subset your data even further. ```r rodents %>% filter(species %in% specs) %>% select(vars) ``` ``` ## # A tibble: 4,858 × 4 ## species treatment hfl wgt ## <chr> <chr> <dbl> <dbl> ## 1 DS control 47 117 ## 2 DS control 49 132 ## 3 DS control 50 126 ## 4 DS control 51 122 ## 5 DS control 48 116 ## 6 DS control 50 133 ## 7 DS control 53 136 ## 8 DS control 49 NA ## 9 DS control NA NA ## 10 DS control 49 124 ## # … with 4,848 more rows ``` ## Exercise 1. Using the `rodents` dataset, how many rodents have a hindfood length greater than 52. Answer should be 32. `hint: use filter and nrow()` 2. Use filter to find the male rodents with a weight greater than 85, followed by select to show the species, censusdate, and weight for these observations, arranged in descending order by weight 3. If I tried to arrange the answer from question 2 by sex type instead of weight, would that work? Why or why not? --- ```r #1. rodents %>% filter(hfl > 52) %>% nrow() ``` ``` ## [1] 32 ``` ```r #2. rodents %>% filter(wgt > 85 & sex == "M") %>% select(species, censusdate, wgt) %>% arrange(-wgt) %>% drop_na(species) ``` ``` ## # A tibble: 307 × 3 ## species censusdate wgt ## <chr> <date> <dbl> ## 1 DS 1981-11-22 170 ## 2 DS 1980-03-21 160 ## 3 DS 1979-10-24 156 ## 4 DS 1981-12-30 155 ## 5 DS 1981-09-29 153 ## 6 DS 1982-02-22 150 ## 7 DS 1980-08-13 149 ## 8 DS 1980-10-11 149 ## 9 DS 1980-11-08 149 ## 10 DS 1981-01-11 148 ## # … with 297 more rows ``` ```r ### May also want to only see those with an actual species value #3. #It Would not work because sex is no longer a part of the dataset at this point ``` --- ## summarize() The `summarize()` function summarizes multiple values to a single value. On its own the `summarize()` function doesn't seem to be all that useful. The dplyr package provides a few convenience functions called `n()` and `n_distinct()` that tell you the number of observations or the number of distinct values of a particular variable. Notice that summarize takes a data frame and returns a data frame. In this case it's a 1x1 data frame with a single row and a single column. The name of the column, by default is whatever the expression was used to summarize the data. This usually isn't pretty, and if we wanted to work with this resulting data frame later on, we'd want to name that returned value something easier to deal with. ```r rodents %>% summarize(mean(hfl)) ``` ``` ## # A tibble: 1 × 1 ## `mean(hfl)` ## <dbl> ## 1 NA ``` ```r #Gotta watch out for nas. Use na.rm = TRUE to run the calculation without nas being involved. rodents %>% summarize(mean(hfl, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## `mean(hfl, na.rm = TRUE)` ## <dbl> ## 1 26.5 ``` --- ```r #The default column names do not look very aesthetically pleasing, let's take care of that. rodents %>% summarize(hfl_mean = mean(hfl, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## hfl_mean ## <dbl> ## 1 26.5 ``` ```r #Again, summarize alone is not entirely useful rodents %>% summarize(n()) ``` ``` ## # A tibble: 1 × 1 ## `n()` ## <int> ## 1 28364 ``` ## group_by() We saw that `summarize()` isn't that useful on its own. Neither is `group_by()` All this does is takes an existing data frame and coverts it into a grouped data frame where operations are performed by group. ```r rodents %>% group_by(species) %>% head() ``` ``` ## # A tibble: 6 × 6 ## # Groups: species [5] ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 OL exclosure <NA> NA NA ## 2 1978-01-08 <NA> exclosure <NA> NA NA ## 3 1978-01-08 PF exclosure M 15 7 ## 4 1978-01-08 OT control <NA> NA NA ## 5 1978-01-08 DM control M 36 40 ## 6 1978-01-08 DM control F 34 37 ``` ```r rodents %>% group_by(species, sex) %>% head() ``` ``` ## # A tibble: 6 × 6 ## # Groups: species, sex [6] ## censusdate species treatment sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 OL exclosure <NA> NA NA ## 2 1978-01-08 <NA> exclosure <NA> NA NA ## 3 1978-01-08 PF exclosure M 15 7 ## 4 1978-01-08 OT control <NA> NA NA ## 5 1978-01-08 DM control M 36 40 ## 6 1978-01-08 DM control F 34 37 ``` --- The real power comes in where `group_by()` and `summarize()` are used together. First, write the `group_by()` statement. Then pipe the result to a call to `summarize()`. ```r #Let's summarize the mean hindfood length of the different species rodents %>% group_by(species) %>% summarize(hfl = mean(hfl, na.rm = TRUE)) ``` ``` ## # A tibble: 21 × 2 ## species hfl ## <chr> <dbl> ## 1 BA 13.1 ## 2 DM 35.9 ## 3 DO 35.5 ## 4 DS 49.7 ## 5 OL 20.3 ## 6 OT 20.0 ## 7 PB 26.0 ## 8 PE 19.9 ## 9 PF 15.4 ## 10 PH 24.9 ## # … with 11 more rows ``` ```r rodents %>% group_by(species) %>% summarize(hfl = mean(hfl, na.rm = TRUE)) %>% arrange(-hfl) ``` ``` ## # A tibble: 21 × 2 ## species hfl ## <chr> <dbl> ## 1 DS 49.7 ## 2 DM 35.9 ## 3 DO 35.5 ## 4 <NA> 32.2 ## 5 SH 28.7 ## 6 PB 26.0 ## 7 SO 25.9 ## 8 SF 25.8 ## 9 PH 24.9 ## 10 PI 21.8 ## # … with 11 more rows ``` --- ```r # Filter/select are often used in conjunction with summarize rodents %>% drop_na(species) %>% filter(sex == "F") %>% group_by(species, sex) %>% summarize(hfl = mean(hfl, na.rm = TRUE)) %>% arrange(-hfl) ``` ``` ## # A tibble: 20 × 3 ## # Groups: species [20] ## species sex hfl ## <chr> <chr> <dbl> ## 1 DS F 49.2 ## 2 DM F 35.6 ## 3 DO F 35.3 ## 4 SH F 28.8 ## 5 SO F 25.9 ## 6 PB F 25.8 ## 7 SF F 25.7 ## 8 PH F 24.6 ## 9 PI F 21.6 ## 10 PP F 21.3 ## 11 PM F 20.2 ## 12 OL F 20.1 ## 13 OT F 20.0 ## 14 PE F 20.0 ## 15 PL F 19.3 ## 16 RF F 17.6 ## 17 RM F 16.3 ## 18 RO F 15.5 ## 19 PF F 15.4 ## 20 BA F 13.2 ``` --- ## MUTATE ## The `mutate` function allows you to add or modify variables to your dataset. ```r # Let us calculate the hindfoot length to weight ratio rodents %>% mutate(hfl_wgt = (hfl/ wgt)) %>% select(hfl, wgt, hfl_wgt) %>% head() ``` ``` ## # A tibble: 6 × 3 ## hfl wgt hfl_wgt ## <dbl> <dbl> <dbl> ## 1 NA NA NA ## 2 NA NA NA ## 3 15 7 2.14 ## 4 NA NA NA ## 5 36 40 0.9 ## 6 34 37 0.919 ``` ```r # This new variable is only temporary, let's save a new object with this column rodents_2 <- rodents %>% mutate(hfl_wgt = (hfl/ wgt)) %>% head() ``` --- ```r # We can now look at our new dataset with our new variable. rodents_2 %>% group_by(species) %>% drop_na(species) %>% summarize(hfl_wgt = mean(hfl_wgt, na.rm = TRUE)) %>% arrange(-hfl_wgt) %>% select(species, hfl_wgt) ``` ``` ## # A tibble: 4 × 2 ## species hfl_wgt ## <chr> <dbl> ## 1 PF 2.14 ## 2 DM 0.909 ## 3 OL NaN ## 4 OT NaN ``` ```r # mutate with if_else #if_else statements allow you to conditionally make changes to your data. #if the condition is true, then the first option will be selected, if not, the #second option will be selected. rodents_2 %>% mutate(control = ifelse(treatment == "control", 1, 0)) %>% select(treatment, control) %>% head() ``` ``` ## # A tibble: 6 × 2 ## treatment control ## <chr> <dbl> ## 1 exclosure 0 ## 2 exclosure 0 ## 3 exclosure 0 ## 4 control 1 ## 5 control 1 ## 6 control 1 ``` --- ```r ##case_when # the case_when function allows you to complete multiple if_else statements in a single group of commands rodents_2 <- rodents_2 %>% mutate(hfl_cat = case_when( hfl < 17 ~ "Very Small", hfl >= 17 & hfl < 22 ~ "Small", hfl >= 22 & hfl < 32 ~ "Medium", hfl >= 32 ~ "Large" ) ) %>% head() rodents_2 %>% drop_na(hfl_cat) %>% count(hfl_cat) ``` ``` ## # A tibble: 2 × 2 ## hfl_cat n ## <chr> <int> ## 1 Large 2 ## 2 Very Small 1 ``` --- ## Writing out data Just like you can read in data from read_csv(), you can also write out a csv file. Let's create a new file for our `rodents_2`` data. ```r #Include the name of the object to export (rodents_2) and the name of the file write_csv(rodents_2, "rodents_2.csv") ```