class: left, middle, inverse, title-slide .title[ # Data Visualization with ggplot2 ] --- # Data Viz in R This lesson will cover fundamental concepts for creating effective data visualization and will introduce tools and techniques for visualizing data using R. We will review fundamental concepts for visually displaying quantitative information, such as using series of small multiples, avoiding "chart-junk," and maximizing the data-ink ratio. We will cover the grammar of graphics (geoms, aesthetics, and faceting) using the ggplot2 package to create plots layer-by-layer. # Set up For today's lesson we will need packages in the tidyverse. Today we will be using the `dplyr`, `ggplot`, and `readr` packages. We will also be using a few other speciality packages, `ggmosaic` and `ggthemes`. ```r # Load up the tidyverse library(tidyverse) #install.packages("ggmosaic") #install.packages("ggthemes") library(ggmosaic) library(ggthemes) ``` --- # Read in data We are going to use the rodents dataset to look at the differences between three different species of rodents. Use the `read_csv` package to read in the data. ```r rodents <- read_csv("rodent_surveys.csv") %>% drop_na(sex) %>% filter(species == "DM" | species == "PE" | species == "PB") %>% mutate(name = case_when( species == "DM" ~ "Merriam's Kangaroo Rat", species == "PB" ~ "Bailey's Pocket Mouse", species == "PE" ~ "Cactus Mouse" )) %>% mutate(year = as.numeric(str_sub(censusdate, 1, 4))) ``` Let's use the `glimpse` function to look at the data that we read in. ```r glimpse(rodents) ``` ``` ## Rows: 12,091 ## Columns: 8 ## $ censusdate [3m[38;5;246m<date>[39m[23m 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08, 1978-01-08, 1978-0… ## $ species [3m[38;5;246m<chr>[39m[23m "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "… ## $ treatment [3m[38;5;246m<chr>[39m[23m "control", "control", "control", "control", "control", "control", "control", "c… ## $ sex [3m[38;5;246m<chr>[39m[23m "M", "F", "M", "M", "F", "F", "F", "M", "F", "M", "M", "M", "F", "M", "M", "M",… ## $ hfl [3m[38;5;246m<dbl>[39m[23m 36, 34, 36, 38, 36, 36, 37, 36, 34, 37, 36, 38, 39, 35, 38, 38, 35, 37, 35, NA,… ## $ wgt [3m[38;5;246m<dbl>[39m[23m 40, 37, 40, 44, 40, 38, 39, 46, 42, 39, 44, 48, 44, 39, 49, 50, 40, 46, 48, NA,… ## $ name [3m[38;5;246m<chr>[39m[23m "Merriam's Kangaroo Rat", "Merriam's Kangaroo Rat", "Merriam's Kangaroo Rat", "… ## $ year [3m[38;5;246m<dbl>[39m[23m 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1… ``` --- # Basics of ggplot2 **ggplot2** is a widely used R package that extends R's visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We'll learn about what all these things mean shortly. _Where does the "gg" in ggplot2 come from?_ The **ggplot2** package provides an R implementation of Leland Wilkinson's *Grammar of Graphics* (1999). The *Grammar of Graphics* allows you to think beyond the garden variety plot types (e.g. scatterplot, barplot) and the consider the components that make up a plot or graphic, such as how data are represented on the plot (as lines, points, etc.), how variables are mapped to coordinates or plotting shape or color, what transformation or statistical summary is required, and so on. Specifically, **ggplot2** allows you to build a plot layer-by-layer by specifying: - a **geom**, which specifies how the data are represented on the plot (points, lines, bars, etc.), - **aesthetics** that map variables in the data to axes on the plot or to plotting size, shape, color, etc., - **facets**, which we've already seen above, that allow the data to be divided into chunks on the basis of other categorical or continuous variables and the same plot drawn for each chunk. --- # Plotting bivariate data: continuous X and continuous Y The `ggplot` function has two required arguments: the *data* used for creating the plot, and an *aesthetic* mapping to describe how variables in said data are mapped to things we can see on the plot. First, let's lay out the plot. If we want to plot a continuous Y variable by a continuous X variable we're probably most interested in a scatter plot. Here, we're telling ggplot that we want to use the `rodents` dataset, and the aesthetic mapping will map `hfl` onto the x-axis and `wgt` onto the y-axis. Remember that the variable names are case sensitive! ```r # The ggplot function creates an initial canvas ggplot(data = rodents, aes(x = hfl, y = wgt)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-4-1.png)<!-- --> --- When we do that we get a blank canvas with no data showing. That's because all we've done is laid out a two-dimensional plot specifying what goes on the x and y axes, but we haven't told it what kind of geometric object to plot. The obvious choice here is a point. Check out [docs.ggplot2.org](http://docs.ggplot2.org/) to see what kind of geoms are available. Add `geom_point()` ```r ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-5-1.png)<!-- --> --- Let's add some color by changing all the points to be blue. By looking at the `geom_point` function in the help menu, you can see that their are several arguments/options that you can change within the geom. This is true for every geom that we will explore today and others that you might explore on your own. The `colors()` function will provide you all of the colors that are initially available in R. ```r # colors() # ?geom_point() # Here the color = "blue" argument will make every point blue. This argument will behave differently depending on the type of geom you are using. ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(color = "blue") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- Now let's change the shape of the points. `shape = 8` is an *. ![ggplot shape codes](ggplot_shapes.png) ```r # We can combine multiple arguments by using a comma ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(color = "blue", shape = 3) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-7-1.png)<!-- --> --- Instead of coloring by a static value, let's color by rodent name (`name`). Remember that _any time_ you want to map a variable onto the plot, it must be in a call to `aes()` ```r # Notice that color is no longer a single value (ie: blue), but the value of a given category. This will now create a different color point for each value found in a category. However, the shape argument is still outside of the aes call, making every point the same shape. ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(aes(color = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-8-1.png)<!-- --> --- Let's add another layer showing a smoothed trend line. (and let's go back to the normal geom_point shape). Remember, that each new layer will go on top of the previous one, meaning that the smoothed line will go on top of the points. ```r # Use a + sign to connect together ggplot elements such as a geom. ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(aes(color = name)) + geom_smooth() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-9-1.png)<!-- --> --- Notice that `geom_smooth()` is fitting the trend line using a gam Model. We can read more about that by calling the help menu for geom_smooth ```r # ?geom_smooth ``` Look at the Arguments section and read the bit about the method argument. We can specify a method, but if we allow the `method = "auto"` then it will fit loess for < 1000 data points and general additive models gam for >= 1000 Let's change the method to be "lm" to fit a straight line using the `method` argument ```r ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(aes(color = name)) + geom_smooth(method = "lm") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-11-1.png)<!-- --> --- Let's fit a smoothed line (allowing the default method = "auto") separately for each rodent species ```r # Notice that we have aesthetic mappings (aes) in both the point geom and the smooth geom. ggplot(data = rodents, aes(x = hfl, y = wgt)) + geom_point(aes(color = name)) + geom_smooth(aes(color = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-12-1.png)<!-- --> --- In the above code we now have three calls to `aes()`. One in the canvas layer, one in the `geom_point()` layer, and one in the `geom_smooth()` layer. If we want ALL the layers to be colored by name, we can just specify `color = name` in the top canvas layer and all the subsequent layers will inherit that aesthetic mapping just like they inherit the x and y from the canvas layer. ```r # I've added linewidth = 2 to the geom_smooth() function to create a thicker fitted line. ggplot(data = rodents, aes(x = hfl, y = wgt, color = name)) + geom_point() + geom_smooth(linewidth = 2, se = FALSE) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-13-1.png)<!-- --> --- # Time for an exercise Create a scatter plot looking at the relationship between year and wgt Color the points by rodent name Add a smoothed line that fits all of the points ```r rodents %>% ggplot(aes(year, wgt)) + geom_point(aes(color = name)) + geom_smooth(se = FALSE) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-14-1.png)<!-- --> --- # Facets Facets display subsets of the data in different panels. There are a couple ways to do this, but `facet_wrap()` tries to sensibly wrap a series of facets into a 2-dimensional grid of small multiples. The facet functions take a formula specifying which variables to facet by. If you have a look at the help for `?facet_wrap()` you'll see that we can control how the wrapping is laid out. Along the way to teaching you about faceting, I'll show a strategy to build up a plot by saving layers you like to an object. This way, we can avoid having to copy-paste long lines of code. Let's start with a simple plot of `hfl` x and `wgt` on y. Let's create the canvas and then save that to object called `p`. ```r p <- ggplot(rodents, aes(x = hfl, y = wgt)) p + geom_point() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-15-1.png)<!-- --> --- Add `geom_point()` colored green and semi-transparent (alpha) ```r # The alpha argument provides a level of transparency (1 = opaque, 0 = completely transparent). # This is very useful when building several layers of geoms on top of each other. p + geom_point(color = "green", alpha = .5) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-16-1.png)<!-- --> --- We'll facet by `name` to show each name in its own plot. Notice that we use the `~` in front of `name`. `facet_wrap` uses a formula as input, and the `~` is used when creating a formula. ```r # You could put another category in front of the ~ to create factorial graphs p + geom_point(color = "green", alpha = .5) + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-17-1.png)<!-- --> --- ```r p + geom_point(color = "green", alpha = .5) + facet_wrap(sex~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- Let's add transparency to `geom_point()` using the argument `alpha`, add a `geom_smooth()` layer colored by `name` and remove the standard error bars with `se = FALSE`. ```r p + geom_point(color = "green", alpha = .5) + geom_smooth(aes(color = name), se = FALSE) + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-19-1.png)<!-- --> --- In these plots, the facet and color are redundant mappings of genre. We typically avoid redundant mappings, but there are reasons to do it (ie readability for color-blind) and in this case, I like the added color. ```r p + geom_point(aes(color = name), alpha = .5) + geom_smooth(aes(color = name), se = FALSE) + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-20-1.png)<!-- --> --- # Exercise 2 Create the same plot as exercise 1 (year vs weight), but color and facet wrap by name and turn off standard error bars. ```r rodents %>% ggplot(aes(year, wgt, color = name)) + geom_point() + geom_smooth() + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-21-1.png)<!-- --> --- # Plotting bivariate data: categorical X and continuous Y With the last example we examined the relationship between a continuous Y variable against a continuous X variable. A scatter plot was the obvious kind of data visualization. But what if we wanted to visualize a continuous Y variable against a categorical X variable? One example would be `wgt` and `name` ```r p <- rodents %>% ggplot(aes(x = name, y = wgt)) p ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-22-1.png)<!-- --> --- Add points to the canvas ```r p + geom_point() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-23-1.png)<!-- --> --- That's not terribly useful. There's a big overplotting problem, indicating that many points may be overlapping each other.. What if we spread things out by adding a little bit of horizontal noise (aka "jitter") to the data. ```r # The jitter in geom_jitter randomly moves the points horizontally. If you run this plot multiple times, you will notice that the points will not be in the same place after each run. p + geom_jitter() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-24-1.png)<!-- --> --- Note that the little bit of horizontal noise that's added to the jitter is random. If you run that command over and over again, each time it will look slightly different. The idea is to visualize the density at each vertical position, and spreading out the points horizontally allows you to do that. Because there is still overplotting, let's add some transparency by setting the `alpha=` value for the jitter. ```r p + geom_jitter(alpha = .5) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-25-1.png)<!-- --> --- Probably a more common visualization is to show a box plot: ```r p + geom_boxplot() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-26-1.png)<!-- --> --- Let's show the summary and the raw data. ```r p + geom_jitter() + geom_boxplot() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-27-1.png)<!-- --> --- ```r p + geom_jitter() + geom_boxplot(outlier.color = NA) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-28-1.png)<!-- --> --- Here, I added the jitter layer then added the boxplot layer, so the boxplot is plotted on top of the jitter layer. There's another geom that's useful here, called a violin plot. This plot does not color the outliers and provides the shape of the overall distribution across the different categories. ```r p + geom_violin() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-29-1.png)<!-- --> --- ```r p + geom_violin() + geom_jitter() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-30-1.png)<!-- --> --- One issue with box plots is that they always order the x axis alphabetically, unless otherwise told to do so. Most plots are much more effective if the axes are sorted in a more meaningful, informative way. To do that, we'll have to go back to our basic build of the plot again and use the `reorder()` function in our original aesthetic mapping. `reorder()` takes the first variable, which is some categorical variable, and ordering it by the level of the mean of the second variable, which is a continuous variable. It looks like this. ```r # ?reorder p <- rodents %>% drop_na(name, wgt) %>% ### The reorder function has issues with missings ggplot(aes(x = reorder(name, wgt, FUN = median), y = wgt)) p ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-31-1.png)<!-- --> --- ```r #Notice how color works with boxplots p + geom_boxplot(aes(color = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-32-1.png)<!-- --> --- ```r #We may want to use fill instead p + geom_boxplot(aes(fill = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-33-1.png)<!-- --> --- ```r #If you wanted to break down by another factor p + geom_boxplot(aes(fill = as.factor(sex))) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-34-1.png)<!-- --> --- # Plotting univariate continuous data What if we just wanted to visualize distribution of a single continuous variable? A histogram is the usual go-to visualization. Here we only have one aesthetic mapping instead of two. ```r # Let's investigate the distribution of the weight variable # set this as the canvas p <- rodents %>% ggplot(aes(wgt)) p + geom_histogram() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-35-1.png)<!-- --> --- We change the number of bins that we have in our histograms. Generally, you want to do this utilizing prior knowledge or research. ```r # We can either use the bins option to specify the number of bins or you can use the bin.width option to specify the width of the bins. p + geom_histogram(bins = 50) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-36-1.png)<!-- --> --- ```r p + geom_histogram(bins = 10) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-37-1.png)<!-- --> --- ```r p + geom_histogram(binwidth = 1) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-38-1.png)<!-- --> --- We can also color the bars by our `name` variable as we did before. ```r p + geom_histogram(aes(color = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-39-1.png)<!-- --> --- Well, that is not exactly what I meant. I wanted the inside of the bars colored. That argument is fill instead of color, just like with box plots. ```r p + geom_histogram(aes(fill = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-40-1.png)<!-- --> --- If you look at the help for `?geom_histogram` you'll see that by default it stacks overlapping data, so if two rodents have the same weight, one is plotted on the bottom and the second is plotted stacked on top of the first. This is a misleading visualization for that reason. Let's change the position argument so that each month has a baseline of 0. ```r # Here we use position = "identity" p + geom_histogram(aes(fill = name), position = "identity") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-41-1.png)<!-- --> --- That's better, but still not great. Let's use `alpha` to change the transparancy ```r p + geom_histogram(aes(fill = name), position = "identity", alpha = .3) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-42-1.png)<!-- --> --- Maybe better, but let's try faceting instead. ```r p + geom_histogram(aes(fill = name), position = "identity") + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-43-1.png)<!-- --> Great! --- Let's start back at p and plot a smoothed density curve instead of a histogram. ```r p + geom_density() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-44-1.png)<!-- --> --- ```r p + geom_histogram() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-45-1.png)<!-- --> --- Now color that by the `name` ```r p + geom_density(aes(color = name)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-46-1.png)<!-- --> --- This could work for a couple of groups, but with 3 or more, it can get a bit busy. Let's facet ```r p + geom_density(aes(color = name)) + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-47-1.png)<!-- --> --- Change color to fill and you will see that instead of the outline being colored, the whole curve is filled ```r p + geom_density(aes(fill = name)) + facet_wrap(~name) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-48-1.png)<!-- --> --- # Saving plots There are a few ways to save ggplots. The quickest way, that works in an interactive session, is to use the `ggsave()` function. You give it a file name and by default it saves the last plot that was printed to the screen. ```r ggsave(file = "rodents_weight.png") ``` Where did it go? You will see that it went directly into your project folder. Typically, I would prefer to have a directory structure that has my code, images, results, etc. all separated out. --- In a script, the best way to do it is to pass `ggsave()` the object containing the plot that is meant to be saved. We can also adjust things like the width, height, and resolution. `ggsave()` also recognizes the name of the file extension and saves the appropriate kind of file. Let's save a PDF. ```r # I am first going to create a single column of graphs so that each graph in the facet wrap gets its own row pfinal <- p + geom_density(aes(fill = name)) + facet_wrap(~name, ncol = 1) pfinal ggsave(pfinal, file = "rodents_weight.pdf", width = 7, height = 12) ``` --- # Publication ready graphics Let's keep using our density plot and add some useful aesthetic elements. ```r # create the canvas # add the layers and save them onto p p <- rodents %>% ggplot(aes(wgt)) + geom_density(aes(fill = name)) + facet_wrap(~name) p ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-51-1.png)<!-- --> --- Let's add proper axes labels using labs(). Keep adding to create a title, subtitle, and better legend title (color). ```r # save the labs onto p p <- p + labs(x = "Weight (g)", y = "Density of Values", title = "Rodent Weights", subtitle = "For Three Species of Rodents", fill = "Rodent Name") p ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-52-1.png)<!-- --> --- The author of ggplot2, Hadley Wickham has chosen the "gray" theme as the default, but there are others that automatically make some aesthetic choices for you. ```r p + theme_bw() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-53-1.png)<!-- --> --- ```r p + theme_classic() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-54-1.png)<!-- --> --- ## Creating your own themes If you wanted to, you can also customize your own plots using the `theme` function. Within this function there are thousands of combinations that you can choose to aesthetically change your graphs, ranging from the color of your titles to the shape of the outline of the legend. I tend to use the a built-in theme, generally `theme_classic` and then make a few additions from there. I will only give a small example of how to do this, but to cover all of the things you could do in this function would take days/weeks. ```r #?theme mytheme <- theme(axis.title.x = element_text(color = "red"), panel.background = element_rect(color = "orange", linetype = "longdash"), plot.background = element_rect(fill = "skyblue"), text = element_text(face = "bold"), legend.position="none" ) p + theme_classic() + mytheme ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-55-1.png)<!-- --> --- There are packages out there that fit your plot to given themes (from TV shows, journals, etc). One that I find useful is the `ggthemes` package ```r p + theme_fivethirtyeight() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-56-1.png)<!-- --> --- ```r p + theme_economist() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-57-1.png)<!-- --> --- ```r p + theme_excel_new() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-58-1.png)<!-- --> --- Sometimes you may want to go outside of the colors that ggplot provides for you when filling in histograms, coloring points, etc. Fortunately, plenty of tools are provided to customize these color sets. ```r #colors() p + scale_fill_colorblind() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-59-1.png)<!-- --> --- ```r p + scale_fill_viridis_d() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-60-1.png)<!-- --> --- ```r p + scale_fill_brewer(palette = "Greens") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-61-1.png)<!-- --> --- ```r p + scale_fill_brewer(palette = "Set1") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-62-1.png)<!-- --> --- ```r #?scale_fill_brewer() p + scale_fill_manual(values = c("purple", "black", "gold")) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-63-1.png)<!-- --> --- ```r rodents %>% ggplot(aes(hfl, wgt, color = hfl)) + geom_point(size = 4) + scale_color_gradient(low = "blue", high = "red") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-64-1.png)<!-- --> --- ## Plotting Multiple Datasets Oftentimes you may need to combine datasets into one plot. This can be done by resetting the data option within the `geom_` functions. ```r ## This dataframe will create a new column saying ## whether a rodent is a mouse or rat rodents_extra <- read_csv("desert_rodents.csv") %>% mutate(rodent_type =if_else(str_detect(commonname, "rat"), "Rat", "Mouse")) # Let's create an object that only holds observations # for a specific species fav_rodent <- rodents_extra %>% filter(commonname == "Merriam's kangaroo rat") ``` --- ```r #The following code creates a somewhat complex plot. ggplot() + geom_boxplot(data = rodents_extra, aes(x = rodent_type, y = meanhfl, fill = rodent_type)) + geom_point(data = fav_rodent, aes(x = rodent_type, y = meanhfl), color = "blue", size = 5) + theme_classic() + labs(y = "Mean Hindfoot Length", x = "Rodent Type", fill = "Rodent Type") ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-66-1.png)<!-- --> --- ## Multiple Categorical Variables One of the more difficult type of plots to create is one that has multiple categorical variables. There are not too many visualizations that stand out and that are better than a simple table. One plot that works for categorical variables is a mosaic plot. A mosaic plot shows the proportion of the overall data that each value in a category takes up. ```r ## MOSAIC PLOTS # ?ggmosaic::geom_mosaic() rodents %>% ggplot() + geom_mosaic(aes(x = product(name, treatment))) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-67-1.png)<!-- --> --- ```r rodents %>% ggplot() + geom_mosaic(aes(x = product(name, treatment), fill = sex)) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-68-1.png)<!-- --> --- ## Alternative to a box plot Box plots are used quite often, but they are not always the best visualization for the job. They illustrate the median and IQR of the data, while most often we are worried about the mean and the spread of the data. The following plot shows the spread of the data, provides summary information for the data, and also keeps the raw data. ```r # plot results *A better way rodents %>% group_by(name) %>% drop_na(wgt) %>% summarize(sd = sd(wgt), wgt = mean(wgt)) %>% mutate(lowerBound = wgt - sd, upperBound = wgt + sd) %>% ggplot(aes(name, wgt, color = name)) + geom_errorbar(aes(ymin = lowerBound, ymax = upperBound), size = 2, width = .3) + geom_point(size = 6) + geom_jitter(data = rodents, aes(name, wgt), alpha = .1) ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-69-1.png)<!-- --> --- ## Bar Charts ```r rodents %>% filter(year > 2019) %>% mutate(year = as.factor(year)) %>% group_by(name, year) %>% summarize(wgt_mean = mean(wgt, na.rm = TRUE), sd = sd(wgt, na.rm = TRUE))%>% ggplot(aes(wgt_mean, fill = year, y = name)) + geom_bar(alpha = .5, position = "stack", stat = "identity") + coord_flip() ``` ![](data_viz_slides_files/figure-html/unnamed-chunk-70-1.png)<!-- --> --- The website below shows a whole bunch of different ways that you can use ggplot: [Extensions to ggplot2] (https://exts.ggplot2.tidyverse.org/gallery/) # Learn More ggplot2 - Chapters 3, 7 and 28 in [R 4 Data Science](http://r4ds.had.co.nz/) - [R Graphics Cookbook](http://www.cookbook-r.com/Graphs/) - [viridis color scale](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) - [ggthemes package](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) - [Color in R](http://www.sthda.com/english/wiki/colors-in-r)