Data Wrangling in R

class: left, middle, inverse, title-slide

.title[
# Data Wrangling in R
]

---

Data analysis involves a large amount of [janitorwork](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) -- munging and cleaning data to facilitate downstream data analysis. In fact, data scientists say that around [80%](https://www.infoworld.com/article/3228245/data-science/the-80-20-data-science-dilemma.html) of their time is taken up by data cleaning tasks compared to just 20% for the actual analyses.

Our goal will be to produce "tidy data" that we can then use to derive some insights. [Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) is defined as:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

To transform the data into tidy data will take different steps depending on the nature of the untidyness. Hadley Wickham, who works for RStudio, says "tidy datasets are all alike but every messy dataset is messy in its own way."

**Recommended Resources** 
1. The [**_R for Data Science_ book**](http://r4ds.had.co.nz/tibbles.html) is a fabulous resource for learning to do data science in R.

2. There are cheatsheets available on the [RStudio website](https://www.rstudio.com/resources/cheatsheets/) for **tidyr**, **dplyr**, and **stringr**, among others. They are excellent quick reference guides for what we will learn today.
---

### Set up data

We need to load the readr, dplyr, tidyr, and stringr packages. All of these packages are contained in the tidyverse megapackage. Let's load those packages now - hopefully you have already installed them.

```r
#install.packages("tidyverse")
# Load packages
library(tidyverse)
```

The purpose of our class today is to figure out the relationship between rodent weights in the Arizona Desert and temperature/precipitation over time. The problem is that we do not have a single dataset with all of the variables that we will need. To get this dataset, we need to combine multiple disparate datasets.

```r
## desert_rodents contains summed up information for various 
## rodent species. 
species <- read_csv("desert_rodents.csv") %>% 
  select(species:granivore) %>% 
  mutate(species = if_else(is.na(species), "NA", species))

## The surveys data contains individual rodent information
## We will be getting this data from the tidy tuesday github
## site directly
surveys <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-02/surveys.csv') %>% 
  select(censusdate, treatment, species, sex, hfl, wgt) %>% 
  mutate(species = if_else(is.na(species), "NA", species))

## This weather data is taken from the Portal project and 
## grabs mean temperatures and precipitation over time.
weather <- read_csv("weather_data.csv")
```
---

```r
# Display the data
glimpse(species)
```

```
## Rows: 21
## Columns: 4
## $ species        <chr> "BA", "PB", "PH", "PI", "PP", "DM", "DO", "DS", "NA", "…
## $ scientificname <chr> "Baiomys taylori", "Chaetodipus baileyi", "Chaetodipus …
## $ commonname     <chr> "Northern pygmy mouse", "Bailey's pocket mouse", "Hispi…
## $ granivore      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0…
```

```r
glimpse(surveys)
```

```
## Rows: 28,364
## Columns: 6
## $ censusdate <date> 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08…
## $ treatment  <chr> "exclosure", "exclosure", "exclosure", "control", "control"…
## $ species    <chr> "OL", "NA", "PF", "OT", "DM", "DM", "DS", "DM", "DS", "OL",…
## $ sex        <chr> NA, NA, "M", NA, "M", "F", "F", "M", "M", NA, NA, "M", "F",…
## $ hfl        <dbl> NA, NA, 15, NA, 36, 34, 47, 36, 49, NA, NA, 38, 36, 50, 51,…
## $ wgt        <dbl> NA, NA, 7, NA, 40, 37, 117, 40, 132, NA, NA, 44, 40, 126, 1…
```

```r
glimpse(weather)
```

```
## Rows: 44
## Columns: 25
## $ year              <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988…
## $ meantemp_Jan      <dbl> 5.512903, 4.335484, 4.545161, 2.712903, 4.167742, 1.…
## $ meantemp_Feb      <dbl> 7.189655, 8.528571, 7.400000, 7.488889, 7.582759, 5.…
## $ meantemp_Mar      <dbl> 10.267742, 9.835484, 12.400000, 10.409677, 13.693548…
## $ meantemp_Apr      <dbl> 15.91667, 18.54333, 18.62000, 13.27333, 17.13000, 17…
## $ meantemp_May      <dbl> 21.78710, 22.47667, 22.27419, 21.66774, 25.83871, 22…
## $ meantemp_Jun      <dbl> 29.87333, 28.63333, 28.18333, 27.11333, 24.44333, 27…
## $ meantemp_Jul      <dbl> 25.52581, 23.97742, 25.81290, 25.49355, 23.81290, 25…
## $ meantemp_Aug      <dbl> 23.58710, 23.89355, 23.75806, 24.00968, 22.45484, 24…
## $ meantemp_Sep      <dbl> 19.50000, 19.29333, 21.26333, 21.23448, 20.24667, 18…
## $ meantemp_Oct      <dbl> 11.86774, 13.51290, 14.61935, 13.85806, 11.64516, 13…
## $ meantemp_Nov      <dbl> 5.400000, 7.443333, 6.993333, 7.296667, 6.200000, 7.…
## $ meantemp_Dec      <dbl> 3.6096774, 4.7806452, 2.0290323, 5.4838710, 4.412903…
## $ precipitation_Jan <dbl> 15.240, 30.226, 25.654, 6.364, 25.700, 22.104, 2.032…
## $ precipitation_Feb <dbl> 30.226, 10.668, 10.668, 24.892, 0.000, 35.814, 24.89…
## $ precipitation_Mar <dbl> 5.080, 34.036, 21.590, 32.766, 0.000, 9.398, 33.288,…
## $ precipitation_Apr <dbl> 0.508, 10.414, 4.572, 9.398, 3.556, 26.162, 0.000, 1…
## $ precipitation_May <dbl> 0.508, 3.048, 10.668, 0.254, 2.540, 0.000, 5.842, 9.…
## $ precipitation_Jun <dbl> 10.922, 77.724, 4.318, 0.000, 21.336, 38.608, 13.462…
## $ precipitation_Jul <dbl> 60.706, 81.280, 33.020, 88.900, 39.878, 93.980, 68.3…
## $ precipitation_Aug <dbl> 54.102, 15.494, 72.644, 4.572, 78.486, 81.280, 62.99…
## $ precipitation_Sep <dbl> 58.674, 56.134, 66.802, 128.868, 23.368, 71.120, 40.…
## $ precipitation_Oct <dbl> 1.778, 25.146, 0.000, 89.408, 55.880, 106.426, 62.73…
## $ precipitation_Nov <dbl> 14.478, 1.016, 67.056, 45.720, 18.796, 12.192, 6.096…
## $ precipitation_Dec <dbl> 2.286, 33.528, 140.678, 27.432, 78.740, 7.620, 11.17…
```
---
## Separate & Unite

```r
surveys %>% head(6)
```

```
## # A tibble: 6 × 6
##   censusdate treatment species sex     hfl   wgt
##   <date>     <chr>     <chr>   <chr> <dbl> <dbl>
## 1 1978-01-08 exclosure OL      <NA>     NA    NA
## 2 1978-01-08 exclosure NA      <NA>     NA    NA
## 3 1978-01-08 exclosure PF      M        15     7
## 4 1978-01-08 control   OT      <NA>     NA    NA
## 5 1978-01-08 control   DM      M        36    40
## 6 1978-01-08 control   DM      F        34    37
```

In looking at the output of the `surveys` dataset, we have some work to do to tidy the data for analysis.

One of the first problems I see is that there are multiple pieces of data encoded into the column called `censusdate`. We can work with date data using the `lubridate` package, but we are going to try it out a different way. To do this, we luckily have data separated neatly by `-`. Let's use the `separate` function from the tidyr package to create new columns from the original `censusdate` column.
---
# Separate

The `separate` function is used to create multiple columns from a single column. For the `separate` function you simply need the dataset, the column you want to separate into multiple columns, the name of the new columns, and the means by which the column is separated.

```r
#?separate
#censusdate has 3 pieces of information 
#(year-month-day)

surveys %>% 
  separate(col = censusdate, into = c("year", "month", "day"), sep = "-") %>% head(6)
```

```
## # A tibble: 6 × 8
##   year  month day   treatment species sex     hfl   wgt
##   <chr> <chr> <chr> <chr>     <chr>   <chr> <dbl> <dbl>
## 1 1978  01    08    exclosure OL      <NA>     NA    NA
## 2 1978  01    08    exclosure NA      <NA>     NA    NA
## 3 1978  01    08    exclosure PF      M        15     7
## 4 1978  01    08    control   OT      <NA>     NA    NA
## 5 1978  01    08    control   DM      M        36    40
## 6 1978  01    08    control   DM      F        34    37
```

---

Notice that our original variable `censusdate` no longer appears. This change may be what we want, but so far it is only in the console. The original dataframe is still unchanged.

```r
surveys %>% head(6)
```

Let's keep the change created by the `separate()` function by saving the new dataset and also by keeping the original variable

```r
## We are also going to use the remove = FALSE option to keep the 
## censusdate column
surveys_edit <- surveys %>% 
  separate(col = censusdate, into = c("year", "month", "day"), sep = "-", remove = FALSE)

#Make sure to look at the variable types after a separate, 
#they can often show everything as character/string/factor, 
#when you would expect it to be numeric, ie: month, day, year here.
surveys_edit %>% head(6)
```

```
## # A tibble: 6 × 9
##   censusdate year  month day   treatment species sex     hfl   wgt
##   <date>     <chr> <chr> <chr> <chr>     <chr>   <chr> <dbl> <dbl>
## 1 1978-01-08 1978  01    08    exclosure OL      <NA>     NA    NA
## 2 1978-01-08 1978  01    08    exclosure NA      <NA>     NA    NA
## 3 1978-01-08 1978  01    08    exclosure PF      M        15     7
## 4 1978-01-08 1978  01    08    control   OT      <NA>     NA    NA
## 5 1978-01-08 1978  01    08    control   DM      M        36    40
## 6 1978-01-08 1978  01    08    control   DM      F        34    37
```

---

If the separator was not as neat as this, you can input any [regular expression](https://en.wikipedia.org/wiki/Regular_expression) into the separator argument.

## Resources for Regular Expressions

For a nice cheatsheet for writing regular expressions in R, see a [Regex cheatsheet](http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf). Jenny Bryan has created a nice website tutorial for learning to use [Regular Expressions in R](http://stat545.com/block022_regular-expression.html).

```r
new_df <- tibble(a = c("Education#SEP#52", "Business#SEP#43"))
new_df
```

```
## # A tibble: 2 × 1
##   a               
##   <chr>           
## 1 Education#SEP#52
## 2 Business#SEP#43
```

```r
separate(data = new_df, col = a, into = c("Field", "Age"), sep = "#SEP#")
```

```
## # A tibble: 2 × 2
##   Field     Age  
##   <chr>     <chr>
## 1 Education 52   
## 2 Business  43
```

```r
new_df <- tibble(a = c("Education#SEP#52", "Business#IFJ#43"))

separate(data = new_df, col = a, into = c("Field", "Age"), sep = "#[A-Z]{3}#")
```

```
## # A tibble: 2 × 2
##   Field     Age  
##   <chr>     <chr>
## 1 Education 52   
## 2 Business  43
```
---

# Unite

Unite is essentially the opposite of separate. It acts much like paste would work, where it combines two (or more) separate columns into one while separating with a specific value

```r
# Let's pretend we want to unite the date columns back together in 
# a different format
surveys_edit %>% 
  unite(col = date, c("month", "day", "year"), sep = "/", remove = FALSE) %>% 
  select(date, month, day, year)
```

```
## # A tibble: 28,364 × 4
##    date       month day   year 
##    <chr>      <chr> <chr> <chr>
##  1 01/08/1978 01    08    1978 
##  2 01/08/1978 01    08    1978 
##  3 01/08/1978 01    08    1978 
##  4 01/08/1978 01    08    1978 
##  5 01/08/1978 01    08    1978 
##  6 01/08/1978 01    08    1978 
##  7 01/08/1978 01    08    1978 
##  8 01/08/1978 01    08    1978 
##  9 01/08/1978 01    08    1978 
## 10 01/08/1978 01    08    1978 
## # … with 28,354 more rows
```

---

### Exercise ###

1. Use Separate to separate the a column in `new_df` into separate columns.

2. Reunite the columns and use a separator of your choice.

```r
new_df <- tibble(a = c("112 Prince Henry Street, Charlottesville, VA", "5123 Miller Street, Richmond, VA"))
new_df
```

```
## # A tibble: 2 × 1
##   a                                           
##   <chr>                                       
## 1 112 Prince Henry Street, Charlottesville, VA
## 2 5123 Miller Street, Richmond, VA
```

```r
#1. 
n <- new_df %>% separate(col = a, into = c("street", "city", "state"), sep = ", ")
n
```

```
## # A tibble: 2 × 3
##   street                  city            state
##   <chr>                   <chr>           <chr>
## 1 112 Prince Henry Street Charlottesville VA   
## 2 5123 Miller Street      Richmond        VA
```

```r
#2. 
unite(n, new_var, c("street", "city", "state"), sep = "--##--")
```

```
## # A tibble: 2 × 1
##   new_var                                             
##   <chr>                                               
## 1 112 Prince Henry Street--##--Charlottesville--##--VA
## 2 5123 Miller Street--##--Richmond--##--VA
```
---
# Reshaping - pivot_wider & pivot_longer

The next problem we will tackle is reshaping the dataframe. Notice that we have several columns that seem to be representing the month and type of weather reading at a year level. At times, we want the dataset to be set up like this, but to be able to combine it to our other datasets, we will need to reshape.

```r
names(weather)
```

```
##  [1] "year"              "meantemp_Jan"      "meantemp_Feb"     
##  [4] "meantemp_Mar"      "meantemp_Apr"      "meantemp_May"     
##  [7] "meantemp_Jun"      "meantemp_Jul"      "meantemp_Aug"     
## [10] "meantemp_Sep"      "meantemp_Oct"      "meantemp_Nov"     
## [13] "meantemp_Dec"      "precipitation_Jan" "precipitation_Feb"
## [16] "precipitation_Mar" "precipitation_Apr" "precipitation_May"
## [19] "precipitation_Jun" "precipitation_Jul" "precipitation_Aug"
## [22] "precipitation_Sep" "precipitation_Oct" "precipitation_Nov"
## [25] "precipitation_Dec"
```

---

The columns here represent a month and reading value. This may be useful at times, but to match up to our previous datasets, we want our data to be at the year/month level instead of the year level. Our `weather` dataset is currently in a wide format, where we have more variables and fewer observations. We want to turn this dataset into more of a long or longitudinal dataset, where you have many observations and few columns/variables. This set up is nice for looking at time series or repeated measures.

We will use the `pivot_longer` function changes a dataset from `wide` to `long`.
Conversely the `pivot_wider` function to change our dataset from `long` to `wide`.

`pivot_longer()` is the inverse of `pivot_wider`. With `pivot_longer()` you tell R the columns that you want to gather and then the name for the new column you are creating. Notice here the `:` operator which basically means "through". The `weather` dataset is an example of a wide dataset. So here we are grabbing all variables from `meantemp_Jan` through `precipitation_Dec`.

```r
#?pivot_longer
weather_long <- weather %>% 
  pivot_longer(cols = meantemp_Jan:precipitation_Dec)  
```
---

After the dataframe, `pivot_wider()` needs 2 other arguments. The first one is where you are getting your new variable names from. Since you will be creating several variables, R needs a way to tell how to name them. The second argument, below tells us where to get the values for the new variables.

```r
weather_final <- weather_long %>% 
  separate(name, into = c("reading_type", "month"), sep = "_") %>% 
  pivot_wider(names_from = "reading_type", values_from = "value")

weather_long %>% 
  separate(name, into = c("reading_type", "month"), sep = "_") %>% 
  pivot_wider(names_from = "reading_type", values_from = "value", names_prefix = "test_") %>% head(6)
```

```
## # A tibble: 6 × 4
##    year month test_meantemp test_precipitation
##   <dbl> <chr>         <dbl>              <dbl>
## 1  1980 Jan            5.51             15.2  
## 2  1980 Feb            7.19             30.2  
## 3  1980 Mar           10.3               5.08 
## 4  1980 Apr           15.9               0.508
## 5  1980 May           21.8               0.508
## 6  1980 Jun           29.9              10.9
```

```r
weather_long %>% 
  separate(name, into = c("reading_type", "month"), sep = "_") %>% 
  pivot_wider(names_from = "reading_type", values_from = "value", names_glue= "pre_{reading_type}_post") %>% head(6)
```

```
## # A tibble: 6 × 4
##    year month pre_meantemp_post pre_precipitation_post
##   <dbl> <chr>             <dbl>                  <dbl>
## 1  1980 Jan                5.51                 15.2  
## 2  1980 Feb                7.19                 30.2  
## 3  1980 Mar               10.3                   5.08 
## 4  1980 Apr               15.9                   0.508
## 5  1980 May               21.8                   0.508
## 6  1980 Jun               29.9                  10.9
```

--- 
### Exercise ###

Using `new_df` below, create a new dataset that only has one observation per patient and contains both Systolic and Diastolic readings.

```r
# Run these lines to create new_df
a <- c(1, 1, 2, 2)
b <- c("Sys", "Dia", "Sys", "Dia")
c <- c(125, 75, 134, 85)

new_df <- tibble(patient = a, BP = b,reading = c)
new_df
```

```
## # A tibble: 4 × 3
##   patient BP    reading
##     <dbl> <chr>   <dbl>
## 1       1 Sys       125
## 2       1 Dia        75
## 3       2 Sys       134
## 4       2 Dia        85
```

```r
new_df %>% pivot_wider(names_from = BP, names_glue = "BP_{BP}", values_from = reading)
```

```
## # A tibble: 2 × 3
##   patient BP_Sys BP_Dia
##     <dbl>  <dbl>  <dbl>
## 1       1    125     75
## 2       2    134     85
```

```r
new_df %>% pivot_wider(names_from = BP, names_prefix = "BP_", values_from = reading)
```

```
## # A tibble: 2 × 3
##   patient BP_Sys BP_Dia
##     <dbl>  <dbl>  <dbl>
## 1       1    125     75
## 2       2    134     85
```

---

### Merge Datasets Together

I want to do a bit of prework to create dummy datasets that can be used to test out our join/merge functions.

```r
x <- surveys_edit %>%
  select(species, hfl, wgt) %>% 
  head(10) %>% 
  arrange(species) %>% 
  mutate(data = "x")

y <- species %>% 
  select(species, commonname) %>% 
  head(10) %>%  
  arrange(species) %>% 
  mutate(data = "y")

x
```

```
## # A tibble: 10 × 4
##    species   hfl   wgt data 
##    <chr>   <dbl> <dbl> <chr>
##  1 DM         36    40 x    
##  2 DM         34    37 x    
##  3 DM         36    40 x    
##  4 DS         47   117 x    
##  5 DS         49   132 x    
##  6 NA         NA    NA x    
##  7 OL         NA    NA x    
##  8 OL         NA    NA x    
##  9 OT         NA    NA x    
## 10 PF         15     7 x
```

```r
y
```

```
## # A tibble: 10 × 3
##    species commonname                 data 
##    <chr>   <chr>                      <chr>
##  1 BA      Northern pygmy mouse       y    
##  2 DM      Merriam's kangaroo rat     y    
##  3 DO      Ord's kangaroo rat         y    
##  4 DS      Banner-tailed kangaroo rat y    
##  5 NA      White-throated woodrat     y    
##  6 OL      Northern Grasshopper Mouse y    
##  7 PB      Bailey's pocket mouse      y    
##  8 PH      Hispid pocket mouse        y    
##  9 PI      Rock pocket mouse          y    
## 10 PP      Desert pocket mouse        y
```

---

There are several types of joins (merges) that we can use. The one you use depends upon your specific needs.

inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

```r
inner_join(x, y, by = c("species")) %>% head()
```

```
## # A tibble: 6 × 6
##   species   hfl   wgt data.x commonname                 data.y
##   <chr>   <dbl> <dbl> <chr>  <chr>                      <chr> 
## 1 DM         36    40 x      Merriam's kangaroo rat     y     
## 2 DM         34    37 x      Merriam's kangaroo rat     y     
## 3 DM         36    40 x      Merriam's kangaroo rat     y     
## 4 DS         47   117 x      Banner-tailed kangaroo rat y     
## 5 DS         49   132 x      Banner-tailed kangaroo rat y     
## 6 NA         NA    NA x      White-throated woodrat     y
```

left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

```r
left_join(x, y, by = c("species")) %>% head()
```

---

right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

```r
right_join(x, y, by = c("species")) %>% head()
```

full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

```r
full_join(x, y, by = c("species")) %>% head()
```

---

So now, let us try and merge together our datasets.

Let's try an inner join

```r
# Notice that surveys_edit has a unit of observation at the sighting level (each
# individual sighting gets a row).

# Let's try combining the two rodent datasets
inner_join(species, surveys_edit, by = c("species"))  %>% head()
```

```
## # A tibble: 6 × 12
##   species scientifi…¹ commo…² grani…³ censusdate year  month day   treat…⁴ sex  
##   <chr>   <chr>       <chr>     <dbl> <date>     <chr> <chr> <chr> <chr>   <chr>
## 1 BA      Baiomys ta… Northe…       1 1989-11-04 1989  11    04    exclos… M    
## 2 BA      Baiomys ta… Northe…       1 1989-11-04 1989  11    04    exclos… M    
## 3 BA      Baiomys ta… Northe…       1 1989-12-04 1989  12    04    exclos… M    
## 4 BA      Baiomys ta… Northe…       1 1990-04-24 1990  04    24    exclos… M    
## 5 BA      Baiomys ta… Northe…       1 1990-05-23 1990  05    23    exclos… F    
## 6 BA      Baiomys ta… Northe…       1 1990-06-21 1990  06    21    exclos… F    
## # … with 2 more variables: hfl <dbl>, wgt <dbl>, and abbreviated variable names
## #   ¹scientificname, ²commonname, ³granivore, ⁴treatment
```

```r
# Let's save this
species_final <- inner_join(species, surveys_edit, by = c("species"))
```

Combining weather to rodent data.

Let's first convert the month to an abbreviation and year to numeric in the `species_final` dataset and then join it to our weather_final data by year and month.

```r
weather_rodents <- species_final %>% mutate(month = month.abb[as.numeric(month)], 
                         year = as.numeric(year)) %>% 
  inner_join(weather_final, by = c("year", "month"))
```

---

## Other ways to combine variables and cases

The `join` functions have some intelligence to their actions. However, you can use a couple of `bind` functions to simply add variables or cases to existing datasets.

Use `bind_cols()` to paste tables beside each other as they are. 
Use `bind_rows()` to paste tables below each other as they are.

```r
a <- weather_rodents[1:3,]
b <- weather_rodents[800:802,]

bind_rows(a, b)  %>% head()
```

```
## # A tibble: 6 × 14
##   species scientifi…¹ commo…² grani…³ censusdate  year month day   treat…⁴ sex  
##   <chr>   <chr>       <chr>     <dbl> <date>     <dbl> <chr> <chr> <chr>   <chr>
## 1 BA      Baiomys ta… Northe…       1 1989-11-04  1989 Nov   04    exclos… M    
## 2 BA      Baiomys ta… Northe…       1 1989-11-04  1989 Nov   04    exclos… M    
## 3 BA      Baiomys ta… Northe…       1 1989-12-04  1989 Dec   04    exclos… M    
## 4 PB      Chaetodipu… Bailey…       1 2000-07-22  2000 Jul   22    control M    
## 5 PB      Chaetodipu… Bailey…       1 2000-07-22  2000 Jul   22    control F    
## 6 PB      Chaetodipu… Bailey…       1 2000-07-22  2000 Jul   22    control M    
## # … with 4 more variables: hfl <dbl>, wgt <dbl>, meantemp <dbl>,
## #   precipitation <dbl>, and abbreviated variable names ¹scientificname,
## #   ²commonname, ³granivore, ⁴treatment
```

```r
c <- weather_rodents[, 2]
d <- weather_rodents[, 8]

bind_cols(c, d) %>% head()
```

```
## # A tibble: 6 × 2
##   scientificname  day  
##   <chr>           <chr>
## 1 Baiomys taylori 04   
## 2 Baiomys taylori 04   
## 3 Baiomys taylori 04   
## 4 Baiomys taylori 24   
## 5 Baiomys taylori 23   
## 6 Baiomys taylori 21
```

---

## Visualizing our new dataset

Let's explore our new `weather_rodents` dataset.

Is the weight of rodents and temperature in the desert related?

Let's check at a specific species of rodent.

```r
weather_rodents %>% 
  filter(species == "DO") %>% 
  ggplot() +
  geom_point(aes(meantemp, wgt))
```

```r
# Unsure at this point. May need to work through this in 
# another way.
```
---

Before moving forward I want to aggregate our data to be per year and species. The `summarize` and `mutate` function have a series of helper functions that allow you to make changes to specific variable types or variable groups. Below, we are going to use the `across` function that will allow us to walk across the columns that are numeric and find the mean for each of these columns.

```r
final_data <- weather_rodents %>% 
  group_by(species, year) %>% 
  summarize(across(
    .cols = where(is.numeric),
    .fns = list(mean), na.rm = TRUE)
  ) %>% 
  arrange(species, year)
```

Here I am renaming a bunch of column names that were renamed above by using the `str_replace` function.

```r
names(final_data) <- str_replace(names(final_data), "_1", "")
```
---
Let us try and see if we can better visualize the data we have.

```r
final_data %>% 
  ggplot() +
  geom_point(aes(wgt, meantemp))
```

Seems like no real relationship here.
---

Let's visually look at this data using a new geom type, `geom_line`. The line graph works well on longitudinal/panel data.

```r
final_data %>% 
  filter(species == "DO") %>% 
  ggplot(aes(year, wgt)) +
  geom_point() + geom_line() 
```

<img src="datawrangling_slides_files/figure-html/unnamed-chunk-28-1.svg" width="50%" height="50%" style="display: block; margin: auto;" />
---
This works well, but I would like to be able to look across different species and compare. To be able to label each species, I want to use the `ggrepel` package which allows you to label portions of your graph.

```r
#install.packages("ggrepel")
library(ggrepel)

specs <- c("DO", "OT", "PE")
final_data %>% 
  filter(year > 2000 & species %in% specs) %>% 
  group_by(year, species) %>% 
  mutate(species_label = if_else(year == 2022, paste0(species), "")) %>%
  ggplot(aes(year, wgt, color= species)) +
  geom_point()  + geom_line() + geom_text_repel(aes(label = species_label)) + theme_classic() + 
  theme(legend.position = "none") 
```

<img src="datawrangling_slides_files/figure-html/unnamed-chunk-29-1.svg" width="50%" height="50%" style="display: block; margin: auto;" />
---
Finally, I want to show the relationship between rodent weight and mean temperature. To do this, I am adding a secondary axis to be able to put both of these variables on the plot at the same time.

```r
final_data  %>% 
    filter(species %in% specs) %>% 
  group_by(year, species) %>% 
  ggplot(aes(x=year, group = species)) +
  geom_line(aes(y=wgt, color = species), linewidth = 1) + 
  geom_line(aes(y=meantemp, group = 1), color = "black", linewidth = 1) +
  scale_y_continuous(
    # Features of the first axis
    name = "Weight of the rodent (g)",
    # Add a second axis and specify its features
    sec.axis = sec_axis(~., name="Temperature (C)")
  ) + theme_classic() + 
  theme(axis.line.y.right = element_line(color = "black"), 
        axis.ticks.y.right = element_line(color = "black"),
        axis.text.y.right = element_text(color = "black"),
        axis.title.y.right = element_text(color = "black"),
        axis.line.y.left = element_line(color = "blue"), 
        axis.ticks.y.left = element_line(color = "blue"),
        axis.text.y.left = element_text(color = "blue"),
        axis.title.y.left = element_text(color = "blue"))
```

Unclear conclusions here unfortunately.
---
# Extra Information to help clean data

The final piece I want to look at are similar to the `across` function and they allow you to make changes to many different variables at one time instead of having to make changes to each individual column. The `_at` series of functions allows you to specify the names or location of columns that you want to make changes to. The `_if` series of functions allows you to make changes to columns based off of a specific criteria (numeric, character, etc.). The `_all` series of functions will make changes across all of the variables in your datasets. These functions can be used in conjuction with our earlier dplyr functions like `mutate`, `select`, and `summarize` to make sweeping and efficient changes to your dataset.

```r
vars <- c("hfl", "wgt", "meantemp", "precipitation")
final_data %>% 
  group_by(year, granivore) %>% 
  mutate_at(vars, c(mean, sd), na.rm = TRUE) %>% 
  select_if(is.numeric) %>% head()
```

```
## # A tibble: 6 × 14
## # Groups:   year, granivore [6]
##    year granivore   hfl   wgt meantemp precipi…¹ hfl_fn1 wgt_fn1 meant…² preci…³
##   <dbl>     <dbl> <dbl> <dbl>    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  1989         1  13    7       10.0      0.762    25.2    32.4    14.4   15.8 
## 2  1990         1  13.9  8.29    17.4     22.0      25.6    33.1    15.9   21.6 
## 3  1991         1  12.8  9.6     13.5     40.8      21.4    21.5    16.3   39.4 
## 4  1992         1  12    7.83     7.89    28.3      25.2    33.5    17.7   35.2 
## 5  2007         1  13.3  8.57    14.4     11.4      22.2    23.2    17.5   14.3 
## 6  2008         1  13.9  9.13    20.2      1.04     24.7    31.3    18.1    9.21
## # … with 4 more variables: hfl_fn2 <dbl>, wgt_fn2 <dbl>, meantemp_fn2 <dbl>,
## #   precipitation_fn2 <dbl>, and abbreviated variable names ¹precipitation,
## #   ²meantemp_fn1, ³precipitation_fn1
```

---

```r
final_data %>% 
  group_by(year, granivore) %>% 
  summarize_at(vars, c(mean, sd), na.rm = TRUE) %>% 
  select_if(is.numeric) %>% head()
```

```
## # A tibble: 6 × 10
## # Groups:   year [3]
##    year graniv…¹ hfl_fn1 wgt_fn1 meant…² preci…³ hfl_fn2 wgt_fn2 meant…⁴ preci…⁵
##   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  1980        0    24.3    63.4    13.1    17.5    7.02    60.7    3.65    8.94
## 2  1980        1    27.6    37.1    15.1    22.8   12.7     38.6    4.10    5.97
## 3  1981        0    24.9    70.7    14.0    37.0    6.78    72.4    3.09    7.54
## 4  1981        1    27.2    35.6    14.5    40.2   12.3     38.1    4.02   10.8 
## 5  1982        0    24.3    70.2    17.1    33.5    7.29    69.0    2.54    4.88
## 6  1982        1    25.6    31.4    17.5    33.0   11.8     31.7    3.29    7.28
## # … with abbreviated variable names ¹granivore, ²meantemp_fn1,
## #   ³precipitation_fn1, ⁴meantemp_fn2, ⁵precipitation_fn2
```

---

```r
my_function <- function(x) { 
  paste0("new ", x)
}

species_final %>% 
  ungroup() %>% 
  mutate_if(is.character, my_function) %>% head()
```

```
## # A tibble: 6 × 12
##   species scientifi…¹ commo…² grani…³ censusdate year  month day   treat…⁴ sex  
##   <chr>   <chr>       <chr>     <dbl> <date>     <chr> <chr> <chr> <chr>   <chr>
## 1 new BA  new Baiomy… new No…       1 1989-11-04 new … new … new … new ex… new M
## 2 new BA  new Baiomy… new No…       1 1989-11-04 new … new … new … new ex… new M
## 3 new BA  new Baiomy… new No…       1 1989-12-04 new … new … new … new ex… new M
## 4 new BA  new Baiomy… new No…       1 1990-04-24 new … new … new … new ex… new M
## 5 new BA  new Baiomy… new No…       1 1990-05-23 new … new … new … new ex… new F
## 6 new BA  new Baiomy… new No…       1 1990-06-21 new … new … new … new ex… new F
## # … with 2 more variables: hfl <dbl>, wgt <dbl>, and abbreviated variable names
## #   ¹scientificname, ²commonname, ³granivore, ⁴treatment
```

```r
final_data %>% 
  select(3:5) %>% 
  mutate_all(my_function) %>% head()
```

```
## # A tibble: 6 × 4
## # Groups:   species [1]
##   species granivore hfl                  wgt                 
##   <chr>   <chr>     <chr>                <chr>               
## 1 BA      new 1     new 13               new 7               
## 2 BA      new 1     new 13.8571428571429 new 8.28571428571429
## 3 BA      new 1     new 12.85            new 9.6             
## 4 BA      new 1     new 12               new 7.83333333333333
## 5 BA      new 1     new 13.2857142857143 new 8.57142857142857
## 6 BA      new 1     new 13.875           new 9.13333333333333
```

---

```r
replace_function <- function(x) {
  ifelse(is.na(x), -1, x)
}

surveys_edit %>% 
  select(7:9) %>% 
  mutate_all(replace_function)
```

```
## # A tibble: 28,364 × 3
##    sex     hfl   wgt
##    <chr> <dbl> <dbl>
##  1 -1       -1    -1
##  2 -1       -1    -1
##  3 M        15     7
##  4 -1       -1    -1
##  5 M        36    40
##  6 F        34    37
##  7 F        47   117
##  8 M        36    40
##  9 M        49   132
## 10 -1       -1    -1
## # … with 28,354 more rows
```

---
### Creating Multiple Plots

Finally, when doing a research study, you often want to look at many graphs of variables/observations. This could take a long time if you do it one by one. If, in this case, we want to look at a graph for each individual species of rodent, we would have to create 21 separate graphs, which would be many lines of code. Or, you can use a function or a loop to programmatically create and save graphs. Functions would probably be a better way to do this, but I am going to show how to use loops in R, given our time.

```r
final_data %>% 
  filter(species == "DO") %>% 
  ggplot(aes(year, wgt)) +
  geom_point() + geom_line() +
  labs(title = "DO")
```

---

```r
# This for loop will walk through each object in the specs vector and
# put it in the print() function one at a time. 
for (i in specs) { 
  print(i)
}

# The same concept applies here, except it is more complicated. But,
# essentially each time you see `i` it is being replaced with 
# the current object in the specs vector.
for (i in specs) { 
  temp <- final_data %>% 
    filter(species == i) %>% 
    ggplot(aes(year, wgt)) +
    geom_point() + geom_line() +
    labs(title = i)
  print(temp)
  ggsave(filename = paste0(i, ".png"), width = 8, height = 8)
}
```

```r
cols <- c("wgt", "precipitation", "hfl", "meantemp")
## Going over columns can be a bit more complicated, since you have to 
## use the .date[[i]] notation in order to reference the column 
## within the aes call.
for (i in cols) { 
  temp <- final_data %>% 
    filter(species == "DO") %>% 
    ggplot(aes(year, .data[[i]])) +
    geom_point() + geom_line() +
    labs(title = i)
  print(temp)
  ggsave(filename = paste0(i, ".png"), width = 8, height = 8)
}
```