class: center, middle, inverse, title-slide # Day Seven-Eight: Grammar of Graphics ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- --- # Quiz 1 Posted .pull-left[ **What does it cover?** * tracing data provenance * reviewing data dictionaries and metadata * determining the unit of observation and variables * data fundamentals * types of variables * data object types in `R` * error resolution and problem-solving * fixing errors in code * reviewing help pages ] .pull-right[ **How should I prepare?** * review lectures slides and lecture notes, especially on the above topics * review problem sets, especially those you found challenging * visit office hours or schedule a time to meet with me on Moodle * visit drop-in tutoring at Spinelli Center with specific topics you'd like help on ] --- class: center, middle # What were some takeaways from class last Monday? --- # For Today * review of Monday's reading * grammar of graphics * cartesian plots in ggplot * visualization context --- # Elements of data graphics * visual cues * coordinate system * scale * context > Framework drawn from: Yau, Nathan. 2013. *Data Points: Visualization That Means Something*. 1st edition. Indianapolis, IN: Wiley. --- # Visual Cues Depending on the plots we design, the values in our dataset may be represented by combinations of the following visual cues: * Where is the data *positioned* on the plot? * What is the *length* of shapes on the plot? * How large is the *angle* between vectors? * In what *direction* are slopes on the plot? * What *symbols* appear on the plot? * How much *area* do shapes take up on a plot? * How intense is the *color* presented on the plot? --- # Scale * Linear * Numeric values are evenly spaced on axis. * Logarithmic * Numeric interval are spaced by a factor of the base of the logarithm. * Categorical * Categorical values are discretely placed on axis. * Ordinal * Categorical values are ordered on axis. * Percent * Percentages of a whole are evenly spaced on axis. * Time * Date/time values are placed on axis in years, months, days, hours, etc. --- # Examples .pull-left[ <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-2-1.png" width="360" /> <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-3-1.png" width="360" /> ] .pull-right[ <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-4-1.png" width="360" /> <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-5-1.png" width="360" /> ] --- # Context * Titles * A descriptive title is used to introduce the plot. * Labels * Axes and points are labeled to indicate what data is represented on the plot. * Legends * The meaning of varying colors, sizes, and shapes are represented in a legend. * Captions * Further detail about the plot is provided in explanatory text. --- class: center, middle # Today, and in the upcoming weeks, we are going to focus on creating Cartesian plots in `R`. --- # What visual cues, scales, and context are represented on this plot? <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-6-1.png" width="720" /> --- # What visual cues, scales, and context are represented on this plot? <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-7-1.png" width="720" /> --- # Let's create the following data frame to motivate today's lecture. > This dataset comes from [Pioneer Valley Data](https://pioneervalleydata.org/data-download-page/) and documents estimates of population characteristics for each municipality in the Pioneer Valley. ```r library(tidyverse) pioneer_valley_census_data <- read.csv("https://raw.githubusercontent.com/SDS-192-Intro/SDS-192-public-website/main/slides/datasets/pioneer_valley_census.csv") hampshire_census_data <- pioneer_valley_census_data %>% filter(COUNTY == "Hampshire") ``` --- # Data Ethics Questions 1. What assumptions and commitments informed the design of this dataset? * Definitions for certain census categories like "Unemployed" or "Poverty" may exclude certain people. * Shifting categorizations for race, gender, and other demographics may not represent how people identify. 2. Who has had a say in data collection and analysis regarding this dataset? Who has been excluded? * US Census Bureau and other governing bodies * Activists and social movements 3. What are the benefits and harms of this dataset, and how are they distributed amongst diverse social groups? * Concerns about surveillance or mistreatment amongst those counted. * Concerns about under-representation amongst those uncounted. --- # ggplot * Most of the plots we create in this course will rely on a package called `ggplot` * `ggplot` is included in the Tidyverse, which you installed on Monday. We can load `ggplot` by calling `library(ggplot2)` * Load `ggplot` in your environment. ```r library(ggplot2) ``` --- # Basic Formula `ggplot()` functions * To create plots, we can call the `ggplot()` function. * The first argument in the `ggplot()` function is the dataset we'd like to reference to create a plot. * The second argument is called `mapping` and indicates the variables/columns from the dataset we'd like to plot. We wrap a list of these variables in `aes()` * In a Cartesian plot, we must supply the variables/columns that will appear on the axes (via `x = ` and `y = `) --- ```r ggplot(data = hampshire_census_data, aes(x = COMMUNITY, y = CEN_EARLYED)) ``` <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-10-1.png" width="720" /> --- class: center, middle # When we supply a nominal variable in our dataset to the x-axis mapping, what will the scale be? --- class: center, middle # When we supply a continuous variable in our dataset to the x-axis mapping, what will the scale be? > If you were confused regarding these questions, you should study up on Data Fundamentals before starting Quiz 1! --- # Where's the data? * In the previous plot, we told `R` *what* variables to plot, but we didn't indicate *how* to plot them. * What visual marks should appear on the graph? * To do this, we need to add a *geom function* to our `ggplot` call, indicating the type of plot we'd like to create. Examples: * Bar plots: `geom_bar()` * Scatterplots: `geom_point()` * We append this function, along with additional functions for styling the plot, using a `+` sign. --- ```r ggplot(data = hampshire_census_data, aes(x = COMMUNITY, y = CEN_EARLYED)) + geom_col() + coord_flip() # Flipping the x and y coordinates here makes the labels more legible. ``` <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-11-1.png" width="720" /> --- # Adding Context to Plots * What context should *always* be included on a plot? * Unit of Observation * Variables Represented * Filters * Geographic Scope * Temporal Scope * We can add this context via titles and labels, using the `labs()` function. > Hint: This **will** make an appearance on your next quiz! --- ```r ggplot(data = hampshire_census_data, aes(x = COMMUNITY, y = CEN_EARLYED)) + geom_col() + coord_flip() + # Flipping the x and y coordinates here makes the labels more legible. labs(title = "Hampshire County Early Education Enrollment Rates, 2018", x = "Enrollment Rate for 3-4 yr old", y = "Municipality in Hampshire County, MA") ``` <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-12-1.png" width="720" /> --- ```r # Adjust the Scale ggplot(data = pioneer_valley_census_data, aes(x = COUNTY,y = CEN_WORKERS)) + geom_point() + coord_flip() + scale_y_log10() + labs(title = "Number of Workers Age 16+ in Pioneer Valley, MA Municipalities, 2018", x = "County", y = "Workers Age 16+") ``` <img src="Day7-8-GrammarGraphics_files/figure-html/unnamed-chunk-13-1.png" width="720" /> --- # For Friday * Reading on Visualization Conventions * Quiz 1 posted