class: left, middle, inverse, title-slide .title[ # Data Wrangling in R ] --- Data analysis involves a large amount of [janitorwork](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) -- munging and cleaning data to facilitate downstream data analysis. In fact, data scientists say that around [80%](https://www.infoworld.com/article/3228245/data-science/the-80-20-data-science-dilemma.html) of their time is taken up by data cleaning tasks compared to just 20% for the actual analyses. Our goal will be to produce "tidy data" that we can then use to derive some insights. [Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) is defined as: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. To transform the data into tidy data will take different steps depending on the nature of the untidyness. Hadley Wickham, who works for RStudio, says "tidy datasets are all alike but every messy dataset is messy in its own way." **Recommended Resources** 1. The [**_R for Data Science_ book**](http://r4ds.had.co.nz/tibbles.html) is a fabulous resource for learning to do data science in R. 2. There are cheatsheets available on the [RStudio website](https://www.rstudio.com/resources/cheatsheets/) for **tidyr**, **dplyr**, and **stringr**, among others. They are excellent quick reference guides for what we will learn today. --- ### Set up data We need to load the readr, dplyr, tidyr, and stringr packages. All of these packages are contained in the tidyverse megapackage. Let's load those packages now - hopefully you have already installed them. ```r #install.packages("tidyverse") # Load packages library(tidyverse) ``` The purpose of our class today is to figure out the relationship between rodent weights in the Arizona Desert and temperature/precipitation over time. The problem is that we do not have a single dataset with all of the variables that we will need. To get this dataset, we need to combine multiple disparate datasets. ```r ## desert_rodents contains summed up information for various ## rodent species. species <- read_csv("desert_rodents.csv") %>% select(species:granivore) %>% mutate(species = if_else(is.na(species), "NA", species)) ## The surveys data contains individual rodent information ## We will be getting this data from the tidy tuesday github ## site directly surveys <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-02/surveys.csv') %>% select(censusdate, treatment, species, sex, hfl, wgt) %>% mutate(species = if_else(is.na(species), "NA", species)) ## This weather data is taken from the Portal project and ## grabs mean temperatures and precipitation over time. weather <- read_csv("weather_data.csv") ``` --- ```r # Display the data glimpse(species) ``` ``` ## Rows: 21 ## Columns: 4 ## $ species <chr> "BA", "PB", "PH", "PI", "PP", "DM", "DO", "DS", "NA", "… ## $ scientificname <chr> "Baiomys taylori", "Chaetodipus baileyi", "Chaetodipus … ## $ commonname <chr> "Northern pygmy mouse", "Bailey's pocket mouse", "Hispi… ## $ granivore <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0… ``` ```r glimpse(surveys) ``` ``` ## Rows: 28,364 ## Columns: 6 ## $ censusdate <date> 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08… ## $ treatment <chr> "exclosure", "exclosure", "exclosure", "control", "control"… ## $ species <chr> "OL", "NA", "PF", "OT", "DM", "DM", "DS", "DM", "DS", "OL",… ## $ sex <chr> NA, NA, "M", NA, "M", "F", "F", "M", "M", NA, NA, "M", "F",… ## $ hfl <dbl> NA, NA, 15, NA, 36, 34, 47, 36, 49, NA, NA, 38, 36, 50, 51,… ## $ wgt <dbl> NA, NA, 7, NA, 40, 37, 117, 40, 132, NA, NA, 44, 40, 126, 1… ``` ```r glimpse(weather) ``` ``` ## Rows: 44 ## Columns: 25 ## $ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988… ## $ meantemp_Jan <dbl> 5.512903, 4.335484, 4.545161, 2.712903, 4.167742, 1.… ## $ meantemp_Feb <dbl> 7.189655, 8.528571, 7.400000, 7.488889, 7.582759, 5.… ## $ meantemp_Mar <dbl> 10.267742, 9.835484, 12.400000, 10.409677, 13.693548… ## $ meantemp_Apr <dbl> 15.91667, 18.54333, 18.62000, 13.27333, 17.13000, 17… ## $ meantemp_May <dbl> 21.78710, 22.47667, 22.27419, 21.66774, 25.83871, 22… ## $ meantemp_Jun <dbl> 29.87333, 28.63333, 28.18333, 27.11333, 24.44333, 27… ## $ meantemp_Jul <dbl> 25.52581, 23.97742, 25.81290, 25.49355, 23.81290, 25… ## $ meantemp_Aug <dbl> 23.58710, 23.89355, 23.75806, 24.00968, 22.45484, 24… ## $ meantemp_Sep <dbl> 19.50000, 19.29333, 21.26333, 21.23448, 20.24667, 18… ## $ meantemp_Oct <dbl> 11.86774, 13.51290, 14.61935, 13.85806, 11.64516, 13… ## $ meantemp_Nov <dbl> 5.400000, 7.443333, 6.993333, 7.296667, 6.200000, 7.… ## $ meantemp_Dec <dbl> 3.6096774, 4.7806452, 2.0290323, 5.4838710, 4.412903… ## $ precipitation_Jan <dbl> 15.240, 30.226, 25.654, 6.364, 25.700, 22.104, 2.032… ## $ precipitation_Feb <dbl> 30.226, 10.668, 10.668, 24.892, 0.000, 35.814, 24.89… ## $ precipitation_Mar <dbl> 5.080, 34.036, 21.590, 32.766, 0.000, 9.398, 33.288,… ## $ precipitation_Apr <dbl> 0.508, 10.414, 4.572, 9.398, 3.556, 26.162, 0.000, 1… ## $ precipitation_May <dbl> 0.508, 3.048, 10.668, 0.254, 2.540, 0.000, 5.842, 9.… ## $ precipitation_Jun <dbl> 10.922, 77.724, 4.318, 0.000, 21.336, 38.608, 13.462… ## $ precipitation_Jul <dbl> 60.706, 81.280, 33.020, 88.900, 39.878, 93.980, 68.3… ## $ precipitation_Aug <dbl> 54.102, 15.494, 72.644, 4.572, 78.486, 81.280, 62.99… ## $ precipitation_Sep <dbl> 58.674, 56.134, 66.802, 128.868, 23.368, 71.120, 40.… ## $ precipitation_Oct <dbl> 1.778, 25.146, 0.000, 89.408, 55.880, 106.426, 62.73… ## $ precipitation_Nov <dbl> 14.478, 1.016, 67.056, 45.720, 18.796, 12.192, 6.096… ## $ precipitation_Dec <dbl> 2.286, 33.528, 140.678, 27.432, 78.740, 7.620, 11.17… ``` --- ## Separate & Unite ```r surveys %>% head(6) ``` ``` ## # A tibble: 6 × 6 ## censusdate treatment species sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 exclosure OL <NA> NA NA ## 2 1978-01-08 exclosure NA <NA> NA NA ## 3 1978-01-08 exclosure PF M 15 7 ## 4 1978-01-08 control OT <NA> NA NA ## 5 1978-01-08 control DM M 36 40 ## 6 1978-01-08 control DM F 34 37 ``` In looking at the output of the `surveys` dataset, we have some work to do to tidy the data for analysis. One of the first problems I see is that there are multiple pieces of data encoded into the column called `censusdate`. We can work with date data using the `lubridate` package, but we are going to try it out a different way. To do this, we luckily have data separated neatly by `-`. Let's use the `separate` function from the tidyr package to create new columns from the original `censusdate` column. --- # Separate The `separate` function is used to create multiple columns from a single column. For the `separate` function you simply need the dataset, the column you want to separate into multiple columns, the name of the new columns, and the means by which the column is separated. ```r #?separate #censusdate has 3 pieces of information #(year-month-day) surveys %>% separate(col = censusdate, into = c("year", "month", "day"), sep = "-") %>% head(6) ``` ``` ## # A tibble: 6 × 8 ## year month day treatment species sex hfl wgt ## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978 01 08 exclosure OL <NA> NA NA ## 2 1978 01 08 exclosure NA <NA> NA NA ## 3 1978 01 08 exclosure PF M 15 7 ## 4 1978 01 08 control OT <NA> NA NA ## 5 1978 01 08 control DM M 36 40 ## 6 1978 01 08 control DM F 34 37 ``` --- Notice that our original variable `censusdate` no longer appears. This change may be what we want, but so far it is only in the console. The original dataframe is still unchanged. ```r surveys %>% head(6) ``` ``` ## # A tibble: 6 × 6 ## censusdate treatment species sex hfl wgt ## <date> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 exclosure OL <NA> NA NA ## 2 1978-01-08 exclosure NA <NA> NA NA ## 3 1978-01-08 exclosure PF M 15 7 ## 4 1978-01-08 control OT <NA> NA NA ## 5 1978-01-08 control DM M 36 40 ## 6 1978-01-08 control DM F 34 37 ``` Let's keep the change created by the `separate()` function by saving the new dataset and also by keeping the original variable ```r ## We are also going to use the remove = FALSE option to keep the ## censusdate column surveys_edit <- surveys %>% separate(col = censusdate, into = c("year", "month", "day"), sep = "-", remove = FALSE) #Make sure to look at the variable types after a separate, #they can often show everything as character/string/factor, #when you would expect it to be numeric, ie: month, day, year here. surveys_edit %>% head(6) ``` ``` ## # A tibble: 6 × 9 ## censusdate year month day treatment species sex hfl wgt ## <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 1978-01-08 1978 01 08 exclosure OL <NA> NA NA ## 2 1978-01-08 1978 01 08 exclosure NA <NA> NA NA ## 3 1978-01-08 1978 01 08 exclosure PF M 15 7 ## 4 1978-01-08 1978 01 08 control OT <NA> NA NA ## 5 1978-01-08 1978 01 08 control DM M 36 40 ## 6 1978-01-08 1978 01 08 control DM F 34 37 ``` --- If the separator was not as neat as this, you can input any [regular expression](https://en.wikipedia.org/wiki/Regular_expression) into the separator argument. ## Resources for Regular Expressions For a nice cheatsheet for writing regular expressions in R, see a [Regex cheatsheet](http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf). Jenny Bryan has created a nice website tutorial for learning to use [Regular Expressions in R](http://stat545.com/block022_regular-expression.html). ```r new_df <- tibble(a = c("Education#SEP#52", "Business#SEP#43")) new_df ``` ``` ## # A tibble: 2 × 1 ## a ## <chr> ## 1 Education#SEP#52 ## 2 Business#SEP#43 ``` ```r separate(data = new_df, col = a, into = c("Field", "Age"), sep = "#SEP#") ``` ``` ## # A tibble: 2 × 2 ## Field Age ## <chr> <chr> ## 1 Education 52 ## 2 Business 43 ``` ```r new_df <- tibble(a = c("Education#SEP#52", "Business#IFJ#43")) separate(data = new_df, col = a, into = c("Field", "Age"), sep = "#[A-Z]{3}#") ``` ``` ## # A tibble: 2 × 2 ## Field Age ## <chr> <chr> ## 1 Education 52 ## 2 Business 43 ``` --- # Unite Unite is essentially the opposite of separate. It acts much like paste would work, where it combines two (or more) separate columns into one while separating with a specific value ```r # Let's pretend we want to unite the date columns back together in # a different format surveys_edit %>% unite(col = date, c("month", "day", "year"), sep = "/", remove = FALSE) %>% select(date, month, day, year) ``` ``` ## # A tibble: 28,364 × 4 ## date month day year ## <chr> <chr> <chr> <chr> ## 1 01/08/1978 01 08 1978 ## 2 01/08/1978 01 08 1978 ## 3 01/08/1978 01 08 1978 ## 4 01/08/1978 01 08 1978 ## 5 01/08/1978 01 08 1978 ## 6 01/08/1978 01 08 1978 ## 7 01/08/1978 01 08 1978 ## 8 01/08/1978 01 08 1978 ## 9 01/08/1978 01 08 1978 ## 10 01/08/1978 01 08 1978 ## # … with 28,354 more rows ``` --- ### Exercise ### 1. Use Separate to separate the a column in `new_df` into separate columns. 2. Reunite the columns and use a separator of your choice. ```r new_df <- tibble(a = c("112 Prince Henry Street, Charlottesville, VA", "5123 Miller Street, Richmond, VA")) new_df ``` ``` ## # A tibble: 2 × 1 ## a ## <chr> ## 1 112 Prince Henry Street, Charlottesville, VA ## 2 5123 Miller Street, Richmond, VA ``` ```r #1. n <- new_df %>% separate(col = a, into = c("street", "city", "state"), sep = ", ") n ``` ``` ## # A tibble: 2 × 3 ## street city state ## <chr> <chr> <chr> ## 1 112 Prince Henry Street Charlottesville VA ## 2 5123 Miller Street Richmond VA ``` ```r #2. unite(n, new_var, c("street", "city", "state"), sep = "--##--") ``` ``` ## # A tibble: 2 × 1 ## new_var ## <chr> ## 1 112 Prince Henry Street--##--Charlottesville--##--VA ## 2 5123 Miller Street--##--Richmond--##--VA ``` --- # Reshaping - pivot_wider & pivot_longer The next problem we will tackle is reshaping the dataframe. Notice that we have several columns that seem to be representing the month and type of weather reading at a year level. At times, we want the dataset to be set up like this, but to be able to combine it to our other datasets, we will need to reshape. ```r names(weather) ``` ``` ## [1] "year" "meantemp_Jan" "meantemp_Feb" ## [4] "meantemp_Mar" "meantemp_Apr" "meantemp_May" ## [7] "meantemp_Jun" "meantemp_Jul" "meantemp_Aug" ## [10] "meantemp_Sep" "meantemp_Oct" "meantemp_Nov" ## [13] "meantemp_Dec" "precipitation_Jan" "precipitation_Feb" ## [16] "precipitation_Mar" "precipitation_Apr" "precipitation_May" ## [19] "precipitation_Jun" "precipitation_Jul" "precipitation_Aug" ## [22] "precipitation_Sep" "precipitation_Oct" "precipitation_Nov" ## [25] "precipitation_Dec" ``` --- The columns here represent a month and reading value. This may be useful at times, but to match up to our previous datasets, we want our data to be at the year/month level instead of the year level. Our `weather` dataset is currently in a wide format, where we have more variables and fewer observations. We want to turn this dataset into more of a long or longitudinal dataset, where you have many observations and few columns/variables. This set up is nice for looking at time series or repeated measures. We will use the `pivot_longer` function changes a dataset from `wide` to `long`. Conversely the `pivot_wider` function to change our dataset from `long` to `wide`. `pivot_longer()` is the inverse of `pivot_wider`. With `pivot_longer()` you tell R the columns that you want to gather and then the name for the new column you are creating. Notice here the `:` operator which basically means "through". The `weather` dataset is an example of a wide dataset. So here we are grabbing all variables from `meantemp_Jan` through `precipitation_Dec`. ```r #?pivot_longer weather_long <- weather %>% pivot_longer(cols = meantemp_Jan:precipitation_Dec) ``` --- After the dataframe, `pivot_wider()` needs 2 other arguments. The first one is where you are getting your new variable names from. Since you will be creating several variables, R needs a way to tell how to name them. The second argument, below tells us where to get the values for the new variables. ```r weather_final <- weather_long %>% separate(name, into = c("reading_type", "month"), sep = "_") %>% pivot_wider(names_from = "reading_type", values_from = "value") weather_long %>% separate(name, into = c("reading_type", "month"), sep = "_") %>% pivot_wider(names_from = "reading_type", values_from = "value", names_prefix = "test_") %>% head(6) ``` ``` ## # A tibble: 6 × 4 ## year month test_meantemp test_precipitation ## <dbl> <chr> <dbl> <dbl> ## 1 1980 Jan 5.51 15.2 ## 2 1980 Feb 7.19 30.2 ## 3 1980 Mar 10.3 5.08 ## 4 1980 Apr 15.9 0.508 ## 5 1980 May 21.8 0.508 ## 6 1980 Jun 29.9 10.9 ``` ```r weather_long %>% separate(name, into = c("reading_type", "month"), sep = "_") %>% pivot_wider(names_from = "reading_type", values_from = "value", names_glue= "pre_{reading_type}_post") %>% head(6) ``` ``` ## # A tibble: 6 × 4 ## year month pre_meantemp_post pre_precipitation_post ## <dbl> <chr> <dbl> <dbl> ## 1 1980 Jan 5.51 15.2 ## 2 1980 Feb 7.19 30.2 ## 3 1980 Mar 10.3 5.08 ## 4 1980 Apr 15.9 0.508 ## 5 1980 May 21.8 0.508 ## 6 1980 Jun 29.9 10.9 ``` --- ### Exercise ### Using `new_df` below, create a new dataset that only has one observation per patient and contains both Systolic and Diastolic readings. ```r # Run these lines to create new_df a <- c(1, 1, 2, 2) b <- c("Sys", "Dia", "Sys", "Dia") c <- c(125, 75, 134, 85) new_df <- tibble(patient = a, BP = b,reading = c) new_df ``` ``` ## # A tibble: 4 × 3 ## patient BP reading ## <dbl> <chr> <dbl> ## 1 1 Sys 125 ## 2 1 Dia 75 ## 3 2 Sys 134 ## 4 2 Dia 85 ``` ```r new_df %>% pivot_wider(names_from = BP, names_glue = "BP_{BP}", values_from = reading) ``` ``` ## # A tibble: 2 × 3 ## patient BP_Sys BP_Dia ## <dbl> <dbl> <dbl> ## 1 1 125 75 ## 2 2 134 85 ``` ```r new_df %>% pivot_wider(names_from = BP, names_prefix = "BP_", values_from = reading) ``` ``` ## # A tibble: 2 × 3 ## patient BP_Sys BP_Dia ## <dbl> <dbl> <dbl> ## 1 1 125 75 ## 2 2 134 85 ``` --- ### Merge Datasets Together I want to do a bit of prework to create dummy datasets that can be used to test out our join/merge functions. ```r x <- surveys_edit %>% select(species, hfl, wgt) %>% head(10) %>% arrange(species) %>% mutate(data = "x") y <- species %>% select(species, commonname) %>% head(10) %>% arrange(species) %>% mutate(data = "y") x ``` ``` ## # A tibble: 10 × 4 ## species hfl wgt data ## <chr> <dbl> <dbl> <chr> ## 1 DM 36 40 x ## 2 DM 34 37 x ## 3 DM 36 40 x ## 4 DS 47 117 x ## 5 DS 49 132 x ## 6 NA NA NA x ## 7 OL NA NA x ## 8 OL NA NA x ## 9 OT NA NA x ## 10 PF 15 7 x ``` ```r y ``` ``` ## # A tibble: 10 × 3 ## species commonname data ## <chr> <chr> <chr> ## 1 BA Northern pygmy mouse y ## 2 DM Merriam's kangaroo rat y ## 3 DO Ord's kangaroo rat y ## 4 DS Banner-tailed kangaroo rat y ## 5 NA White-throated woodrat y ## 6 OL Northern Grasshopper Mouse y ## 7 PB Bailey's pocket mouse y ## 8 PH Hispid pocket mouse y ## 9 PI Rock pocket mouse y ## 10 PP Desert pocket mouse y ``` --- There are several types of joins (merges) that we can use. The one you use depends upon your specific needs. inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. ```r inner_join(x, y, by = c("species")) %>% head() ``` ``` ## # A tibble: 6 × 6 ## species hfl wgt data.x commonname data.y ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 DM 36 40 x Merriam's kangaroo rat y ## 2 DM 34 37 x Merriam's kangaroo rat y ## 3 DM 36 40 x Merriam's kangaroo rat y ## 4 DS 47 117 x Banner-tailed kangaroo rat y ## 5 DS 49 132 x Banner-tailed kangaroo rat y ## 6 NA NA NA x White-throated woodrat y ``` left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. ```r left_join(x, y, by = c("species")) %>% head() ``` ``` ## # A tibble: 6 × 6 ## species hfl wgt data.x commonname data.y ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 DM 36 40 x Merriam's kangaroo rat y ## 2 DM 34 37 x Merriam's kangaroo rat y ## 3 DM 36 40 x Merriam's kangaroo rat y ## 4 DS 47 117 x Banner-tailed kangaroo rat y ## 5 DS 49 132 x Banner-tailed kangaroo rat y ## 6 NA NA NA x White-throated woodrat y ``` --- right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. ```r right_join(x, y, by = c("species")) %>% head() ``` ``` ## # A tibble: 6 × 6 ## species hfl wgt data.x commonname data.y ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 DM 36 40 x Merriam's kangaroo rat y ## 2 DM 34 37 x Merriam's kangaroo rat y ## 3 DM 36 40 x Merriam's kangaroo rat y ## 4 DS 47 117 x Banner-tailed kangaroo rat y ## 5 DS 49 132 x Banner-tailed kangaroo rat y ## 6 NA NA NA x White-throated woodrat y ``` full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. ```r full_join(x, y, by = c("species")) %>% head() ``` ``` ## # A tibble: 6 × 6 ## species hfl wgt data.x commonname data.y ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 DM 36 40 x Merriam's kangaroo rat y ## 2 DM 34 37 x Merriam's kangaroo rat y ## 3 DM 36 40 x Merriam's kangaroo rat y ## 4 DS 47 117 x Banner-tailed kangaroo rat y ## 5 DS 49 132 x Banner-tailed kangaroo rat y ## 6 NA NA NA x White-throated woodrat y ``` --- So now, let us try and merge together our datasets. Let's try an inner join ```r # Notice that surveys_edit has a unit of observation at the sighting level (each # individual sighting gets a row). # Let's try combining the two rodent datasets inner_join(species, surveys_edit, by = c("species")) %>% head() ``` ``` ## # A tibble: 6 × 12 ## species scientifi…¹ commo…² grani…³ censusdate year month day treat…⁴ sex ## <chr> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr> ## 1 BA Baiomys ta… Northe… 1 1989-11-04 1989 11 04 exclos… M ## 2 BA Baiomys ta… Northe… 1 1989-11-04 1989 11 04 exclos… M ## 3 BA Baiomys ta… Northe… 1 1989-12-04 1989 12 04 exclos… M ## 4 BA Baiomys ta… Northe… 1 1990-04-24 1990 04 24 exclos… M ## 5 BA Baiomys ta… Northe… 1 1990-05-23 1990 05 23 exclos… F ## 6 BA Baiomys ta… Northe… 1 1990-06-21 1990 06 21 exclos… F ## # … with 2 more variables: hfl <dbl>, wgt <dbl>, and abbreviated variable names ## # ¹scientificname, ²commonname, ³granivore, ⁴treatment ``` ```r # Let's save this species_final <- inner_join(species, surveys_edit, by = c("species")) ``` Combining weather to rodent data. Let's first convert the month to an abbreviation and year to numeric in the `species_final` dataset and then join it to our weather_final data by year and month. ```r weather_rodents <- species_final %>% mutate(month = month.abb[as.numeric(month)], year = as.numeric(year)) %>% inner_join(weather_final, by = c("year", "month")) ``` --- ## Other ways to combine variables and cases The `join` functions have some intelligence to their actions. However, you can use a couple of `bind` functions to simply add variables or cases to existing datasets. Use `bind_cols()` to paste tables beside each other as they are. Use `bind_rows()` to paste tables below each other as they are. ```r a <- weather_rodents[1:3,] b <- weather_rodents[800:802,] bind_rows(a, b) %>% head() ``` ``` ## # A tibble: 6 × 14 ## species scientifi…¹ commo…² grani…³ censusdate year month day treat…⁴ sex ## <chr> <chr> <chr> <dbl> <date> <dbl> <chr> <chr> <chr> <chr> ## 1 BA Baiomys ta… Northe… 1 1989-11-04 1989 Nov 04 exclos… M ## 2 BA Baiomys ta… Northe… 1 1989-11-04 1989 Nov 04 exclos… M ## 3 BA Baiomys ta… Northe… 1 1989-12-04 1989 Dec 04 exclos… M ## 4 PB Chaetodipu… Bailey… 1 2000-07-22 2000 Jul 22 control M ## 5 PB Chaetodipu… Bailey… 1 2000-07-22 2000 Jul 22 control F ## 6 PB Chaetodipu… Bailey… 1 2000-07-22 2000 Jul 22 control M ## # … with 4 more variables: hfl <dbl>, wgt <dbl>, meantemp <dbl>, ## # precipitation <dbl>, and abbreviated variable names ¹scientificname, ## # ²commonname, ³granivore, ⁴treatment ``` ```r c <- weather_rodents[, 2] d <- weather_rodents[, 8] bind_cols(c, d) %>% head() ``` ``` ## # A tibble: 6 × 2 ## scientificname day ## <chr> <chr> ## 1 Baiomys taylori 04 ## 2 Baiomys taylori 04 ## 3 Baiomys taylori 04 ## 4 Baiomys taylori 24 ## 5 Baiomys taylori 23 ## 6 Baiomys taylori 21 ``` --- ## Visualizing our new dataset Let's explore our new `weather_rodents` dataset. Is the weight of rodents and temperature in the desert related? Let's check at a specific species of rodent. ```r weather_rodents %>% filter(species == "DO") %>% ggplot() + geom_point(aes(meantemp, wgt)) ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-24-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> ```r # Unsure at this point. May need to work through this in # another way. ``` --- Before moving forward I want to aggregate our data to be per year and species. The `summarize` and `mutate` function have a series of helper functions that allow you to make changes to specific variable types or variable groups. Below, we are going to use the `across` function that will allow us to walk across the columns that are numeric and find the mean for each of these columns. ```r final_data <- weather_rodents %>% group_by(species, year) %>% summarize(across( .cols = where(is.numeric), .fns = list(mean), na.rm = TRUE) ) %>% arrange(species, year) ``` Here I am renaming a bunch of column names that were renamed above by using the `str_replace` function. ```r names(final_data) <- str_replace(names(final_data), "_1", "") ``` --- Let us try and see if we can better visualize the data we have. ```r final_data %>% ggplot() + geom_point(aes(wgt, meantemp)) ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-27-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> Seems like no real relationship here. --- Let's visually look at this data using a new geom type, `geom_line`. The line graph works well on longitudinal/panel data. ```r final_data %>% filter(species == "DO") %>% ggplot(aes(year, wgt)) + geom_point() + geom_line() ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-28-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> --- This works well, but I would like to be able to look across different species and compare. To be able to label each species, I want to use the `ggrepel` package which allows you to label portions of your graph. ```r #install.packages("ggrepel") library(ggrepel) specs <- c("DO", "OT", "PE") final_data %>% filter(year > 2000 & species %in% specs) %>% group_by(year, species) %>% mutate(species_label = if_else(year == 2022, paste0(species), "")) %>% ggplot(aes(year, wgt, color= species)) + geom_point() + geom_line() + geom_text_repel(aes(label = species_label)) + theme_classic() + theme(legend.position = "none") ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-29-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> --- Finally, I want to show the relationship between rodent weight and mean temperature. To do this, I am adding a secondary axis to be able to put both of these variables on the plot at the same time. ```r final_data %>% filter(species %in% specs) %>% group_by(year, species) %>% ggplot(aes(x=year, group = species)) + geom_line(aes(y=wgt, color = species), linewidth = 1) + geom_line(aes(y=meantemp, group = 1), color = "black", linewidth = 1) + scale_y_continuous( # Features of the first axis name = "Weight of the rodent (g)", # Add a second axis and specify its features sec.axis = sec_axis(~., name="Temperature (C)") ) + theme_classic() + theme(axis.line.y.right = element_line(color = "black"), axis.ticks.y.right = element_line(color = "black"), axis.text.y.right = element_text(color = "black"), axis.title.y.right = element_text(color = "black"), axis.line.y.left = element_line(color = "blue"), axis.ticks.y.left = element_line(color = "blue"), axis.text.y.left = element_text(color = "blue"), axis.title.y.left = element_text(color = "blue")) ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-30-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> Unclear conclusions here unfortunately. --- # Extra Information to help clean data The final piece I want to look at are similar to the `across` function and they allow you to make changes to many different variables at one time instead of having to make changes to each individual column. The `_at` series of functions allows you to specify the names or location of columns that you want to make changes to. The `_if` series of functions allows you to make changes to columns based off of a specific criteria (numeric, character, etc.). The `_all` series of functions will make changes across all of the variables in your datasets. These functions can be used in conjuction with our earlier dplyr functions like `mutate`, `select`, and `summarize` to make sweeping and efficient changes to your dataset. ```r vars <- c("hfl", "wgt", "meantemp", "precipitation") final_data %>% group_by(year, granivore) %>% mutate_at(vars, c(mean, sd), na.rm = TRUE) %>% select_if(is.numeric) %>% head() ``` ``` ## # A tibble: 6 × 14 ## # Groups: year, granivore [6] ## year granivore hfl wgt meantemp precipi…¹ hfl_fn1 wgt_fn1 meant…² preci…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1989 1 13 7 10.0 0.762 25.2 32.4 14.4 15.8 ## 2 1990 1 13.9 8.29 17.4 22.0 25.6 33.1 15.9 21.6 ## 3 1991 1 12.8 9.6 13.5 40.8 21.4 21.5 16.3 39.4 ## 4 1992 1 12 7.83 7.89 28.3 25.2 33.5 17.7 35.2 ## 5 2007 1 13.3 8.57 14.4 11.4 22.2 23.2 17.5 14.3 ## 6 2008 1 13.9 9.13 20.2 1.04 24.7 31.3 18.1 9.21 ## # … with 4 more variables: hfl_fn2 <dbl>, wgt_fn2 <dbl>, meantemp_fn2 <dbl>, ## # precipitation_fn2 <dbl>, and abbreviated variable names ¹precipitation, ## # ²meantemp_fn1, ³precipitation_fn1 ``` --- ```r final_data %>% group_by(year, granivore) %>% summarize_at(vars, c(mean, sd), na.rm = TRUE) %>% select_if(is.numeric) %>% head() ``` ``` ## # A tibble: 6 × 10 ## # Groups: year [3] ## year graniv…¹ hfl_fn1 wgt_fn1 meant…² preci…³ hfl_fn2 wgt_fn2 meant…⁴ preci…⁵ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1980 0 24.3 63.4 13.1 17.5 7.02 60.7 3.65 8.94 ## 2 1980 1 27.6 37.1 15.1 22.8 12.7 38.6 4.10 5.97 ## 3 1981 0 24.9 70.7 14.0 37.0 6.78 72.4 3.09 7.54 ## 4 1981 1 27.2 35.6 14.5 40.2 12.3 38.1 4.02 10.8 ## 5 1982 0 24.3 70.2 17.1 33.5 7.29 69.0 2.54 4.88 ## 6 1982 1 25.6 31.4 17.5 33.0 11.8 31.7 3.29 7.28 ## # … with abbreviated variable names ¹granivore, ²meantemp_fn1, ## # ³precipitation_fn1, ⁴meantemp_fn2, ⁵precipitation_fn2 ``` --- ```r my_function <- function(x) { paste0("new ", x) } species_final %>% ungroup() %>% mutate_if(is.character, my_function) %>% head() ``` ``` ## # A tibble: 6 × 12 ## species scientifi…¹ commo…² grani…³ censusdate year month day treat…⁴ sex ## <chr> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr> ## 1 new BA new Baiomy… new No… 1 1989-11-04 new … new … new … new ex… new M ## 2 new BA new Baiomy… new No… 1 1989-11-04 new … new … new … new ex… new M ## 3 new BA new Baiomy… new No… 1 1989-12-04 new … new … new … new ex… new M ## 4 new BA new Baiomy… new No… 1 1990-04-24 new … new … new … new ex… new M ## 5 new BA new Baiomy… new No… 1 1990-05-23 new … new … new … new ex… new F ## 6 new BA new Baiomy… new No… 1 1990-06-21 new … new … new … new ex… new F ## # … with 2 more variables: hfl <dbl>, wgt <dbl>, and abbreviated variable names ## # ¹scientificname, ²commonname, ³granivore, ⁴treatment ``` ```r final_data %>% select(3:5) %>% mutate_all(my_function) %>% head() ``` ``` ## # A tibble: 6 × 4 ## # Groups: species [1] ## species granivore hfl wgt ## <chr> <chr> <chr> <chr> ## 1 BA new 1 new 13 new 7 ## 2 BA new 1 new 13.8571428571429 new 8.28571428571429 ## 3 BA new 1 new 12.85 new 9.6 ## 4 BA new 1 new 12 new 7.83333333333333 ## 5 BA new 1 new 13.2857142857143 new 8.57142857142857 ## 6 BA new 1 new 13.875 new 9.13333333333333 ``` --- ```r replace_function <- function(x) { ifelse(is.na(x), -1, x) } surveys_edit %>% select(7:9) %>% mutate_all(replace_function) ``` ``` ## # A tibble: 28,364 × 3 ## sex hfl wgt ## <chr> <dbl> <dbl> ## 1 -1 -1 -1 ## 2 -1 -1 -1 ## 3 M 15 7 ## 4 -1 -1 -1 ## 5 M 36 40 ## 6 F 34 37 ## 7 F 47 117 ## 8 M 36 40 ## 9 M 49 132 ## 10 -1 -1 -1 ## # … with 28,354 more rows ``` --- ### Creating Multiple Plots Finally, when doing a research study, you often want to look at many graphs of variables/observations. This could take a long time if you do it one by one. If, in this case, we want to look at a graph for each individual species of rodent, we would have to create 21 separate graphs, which would be many lines of code. Or, you can use a function or a loop to programmatically create and save graphs. Functions would probably be a better way to do this, but I am going to show how to use loops in R, given our time. ```r final_data %>% filter(species == "DO") %>% ggplot(aes(year, wgt)) + geom_point() + geom_line() + labs(title = "DO") ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-35-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /> --- ```r # This for loop will walk through each object in the specs vector and # put it in the print() function one at a time. for (i in specs) { print(i) } # The same concept applies here, except it is more complicated. But, # essentially each time you see `i` it is being replaced with # the current object in the specs vector. for (i in specs) { temp <- final_data %>% filter(species == i) %>% ggplot(aes(year, wgt)) + geom_point() + geom_line() + labs(title = i) print(temp) ggsave(filename = paste0(i, ".png"), width = 8, height = 8) } ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-36-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /><img src="datawrangling_slides_files/figure-html/unnamed-chunk-36-2.svg" width="50%" height="50%" style="display: block; margin: auto;" /><img src="datawrangling_slides_files/figure-html/unnamed-chunk-36-3.svg" width="50%" height="50%" style="display: block; margin: auto;" /> ```r cols <- c("wgt", "precipitation", "hfl", "meantemp") ## Going over columns can be a bit more complicated, since you have to ## use the .date[[i]] notation in order to reference the column ## within the aes call. for (i in cols) { temp <- final_data %>% filter(species == "DO") %>% ggplot(aes(year, .data[[i]])) + geom_point() + geom_line() + labs(title = i) print(temp) ggsave(filename = paste0(i, ".png"), width = 8, height = 8) } ``` <img src="datawrangling_slides_files/figure-html/unnamed-chunk-37-1.svg" width="50%" height="50%" style="display: block; margin: auto;" /><img src="datawrangling_slides_files/figure-html/unnamed-chunk-37-2.svg" width="50%" height="50%" style="display: block; margin: auto;" /><img src="datawrangling_slides_files/figure-html/unnamed-chunk-37-3.svg" width="50%" height="50%" style="display: block; margin: auto;" /><img src="datawrangling_slides_files/figure-html/unnamed-chunk-37-4.svg" width="50%" height="50%" style="display: block; margin: auto;" />