class: center, middle, inverse, title-slide # Day Seventeen: Aggregating Data ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- # A few reminders! * Be sure to send me Slack messages vs. email. Feel free to follow-up if I haven't responded in 24 hours. ...but this only applies for weekdays! I try to focus on self-care on weekends. * Ask questions via our `#questions` channel whenever possible * Office hours on Mondays in McConnell 212; Wednesdays virtual * Extension policy for quizzes --- # Extra Credit * If you did not get full credit on the following three standards: * Tracing Data Provenance * Data Fundamentals * Error Resolution and Problem-Solving > I'm providing an extra credit opportunity to earn up to one point back per standard. Design a multiple choice quiz question that would demonstrate a student's understanding of that standard, using either the dataset from quiz 1 or quiz 2. Post it to the assignment in Moodle. Be sure to reprex() any code you produce for your question, and that there are five possible answers. Indicate the correct answer when submitting. --- # Preparing for Quiz 3 * From last quiz: Univariate Plotting, Visualization Aesthetics, Visualization Context * Multivariate Plotting (how to interpret boxplots/how to interpret different types of stacked barplots/how to effectively perceive differences across categories) * GitHub (GitHub vocabulary/how to resolve errors and conflicts) * RMarkdown (Markdown syntax/code and output presentation) --- # Tips for Quiz 3 * Be sure you understand *why* you may have not received full credit on questions in previous quizzes. Questions will be similar! * Review the data documentation before you start the quiz. * Know the dataset's unit of observation before you start the quiz. * Read the question carefully (e.g. true/not true; select four, etc.) * Whenever you can, copy and paste code to test it out in RStudio! Each question has everything you need to run the code in a code chunk. * Pay attention to my hints! Whenever there is risk that something might *trick* students, I've tried to provide a hint to steer you clear of that. --- # Data Challenge Results | Student | Plot | Award | Rationale | | --- | ------ | --- | ------------ | | Maggie Needham | Sleep Loss and Violent Crime by County | Best Caption | not only summarized but also helped us interpret the data responsibly | | Rachael An | Smoking vs. Life Expectancy | Most Effective Aesthetics | both color and position were used effectively in this plot | | Maggie McCoy | Correlation between Adult Smoking, Insufficient Sleep, and Frequent Mental Distress by County | Most Eye-Opening | shows a significant correlation between household income and mental distress | | Tseegi Nyamdorj | Smoking vs. Life Expectancy | Most Effective Connection to Peer-Reviewed Research | described an article linking sleep insufficiency to quality of life concerns | | Sonnet Lockheardt | Correlation of Median Household Income and Premature Mortality by County | Most news-ready | Provided a clear, effective, and straight-forward visual | | Claire Kelly | Thanks for content warning Claire! | Most thought-provoking | | --- # For today * Summarizing * Grouping and summarizing * Mutating data * Grouping and mutating * Ungrouping --- # Today's data * Hospital Cost Reports, 2018 ```r library(tidyverse) hospitals <- read.csv("https://raw.githubusercontent.com/SDS-192-Intro/SDS-192-public-website/main/slides/datasets/Hospital_Cost_Report_2018.csv") ``` --- class: center, middle # Data wrangling is a process for transforming a dataset from its original form into a more relevant form. --- class: center, middle # **Subsetting** data involves selecting relevant variables and observations for analysis. --- class: center, middle # **Aggregating** data involves compiling and summarizing data. --- # Six "verbs" for data wrangling * `select()` * `filter()` * `arrange()` * `summarize()` * `group_by()` * `mutate()` --- class: center, middle # Today, we are going to perform a data analysis of beds available at different MA hospitals. How do you anticipate we go about doing this? What assumptions must we make? --- # How many times does each hospital appear in this dataset? ```r length(unique(hospitals$Provider.CCN)) == nrow(hospitals) ``` ``` ## [1] FALSE ``` > Why might this be happening? --- class: center, middle # How are we going to deal with this issue? For the next few moments, talk with a neighbor about how you might approach this. Your approach can be high-level at first and the move towards technically specific. --- # `group_by()` * `group_by()` groups observations with a shared value in a variable * Grouping only changes the metadata of a data frame; we combine `group_by()` with other functions to transform the data frame * Values remain in groups unless we `ungroup()` it. This is important if we intend to run further operations on the resulting data. ```r hospitals <- hospitals %>% group_by(Provider.CCN) %>% filter(Fiscal.Year.End.Date == max(Fiscal.Year.End.Date)) %>% ungroup() ``` --- # Now let's filter to MA ```r hospitals_ma <- hospitals %>% filter(State.Code == "MA") ``` > Why am I storing my data in a new variable name? --- # Summary functions ![](https://d33wubrfki0l68.cloudfront.net/2b0d818e37178ef348ca8250a6112a2b13072e88/a810c/images/cheatsheets/summary.png) > Examples? --- # `summarize()` * `summarize()` computes a value across a vector of values and stores it in a new data frame ![](http://ohi-science.org/data-science-training/img/rstudio-cheatsheet-summarise.png) > How is this different than only applying a summary function to a vector? --- # `summarize()` * `summarize()` computes a value across a vector of values and stores it in a new data frame ```r hospitals_ma %>% summarize(total_beds = sum(Number.of.Beds)) ``` ``` ## # A tibble: 1 × 1 ## total_beds ## <int> ## 1 NA ``` > Why did we get an `NA` value? --- # `summarize()` * `summarize()` computes a value across a vector of values and stores it in a new data frame * `sum(is.na())` computes the total `NA` values in the vector ```r hospitals_ma %>% summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), total_beds_na = sum(is.na(Number.of.Beds))) ``` ``` ## # A tibble: 1 × 2 ## total_beds total_beds_na ## <int> <int> ## 1 18804 2 ``` > What if I wanted to know the percentage of rows with `NA` values in the `Number.of.Beds` column? --- # `summarize()` * `summarize()` computes a value across a vector of values and stores it in a new data frame * `n()` computes the total observations in the vector ```r hospitals_ma %>% summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), percent_beds_na = sum(is.na(Number.of.Beds)) / n() * 100) ``` ``` ## # A tibble: 1 × 2 ## total_beds percent_beds_na ## <int> <dbl> ## 1 18804 2.04 ``` > What if I wanted to know the percentage of rows with `NA` values in the `Number.of.Beds` column? --- class: center, middle # But there are very different types of hospitals represented in this dataset. Maybe I want to know the number of beds for each type of hospital. --- # `group_by()` %>% `summarize()` * `group_by()` groups observations with a shared value in a variable * When we combine `group_by()` and `summarize()` we can perform operations *within* groups ![](https://d33wubrfki0l68.cloudfront.net/902a3182822e6fb555f131a4e6d110272d2e242f/93d29/images/cheatsheets/group_summary.png) --- # `group_by()` %>% `summarize()` * `group_by()` groups observations with a shared value in a variable * When we combine `group_by()` and `summarize()` we can perform operations *within* groups .pull-left[ ```r hospitals_ma %>% group_by(CCN.Facility.Type) %>% summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), total_hopsitals = n(), percent_beds_na = sum(is.na(Number.of.Beds)) / n() * 100) ``` ] .pull-right[ ``` ## # A tibble: 7 × 4 ## CCN.Facility.Type total_beds total_hopsitals percent_beds_na ## <chr> <int> <int> <dbl> ## 1 CAH 74 3 0 ## 2 CH 527 4 50 ## 3 LTCH 2416 11 0 ## 4 PH 1704 17 0 ## 5 RH 752 6 0 ## 6 RNMHC 20 1 0 ## 7 STH 13311 56 0 ``` ] --- # `group_by()` %>% `summarize()` * `group_by()` groups observations with a shared value in a variable * When we combine `group_by()` and `summarize()` we can perform operations *within* groups ```r hospitals_ma %>% group_by(CCN.Facility.Type, Rural.Versus.Urban) %>% summarize(total_beds = sum(Number.of.Beds, na.rm = TRUE), total_hopsitals = n(), percent_beds_na = sum(is.na(Number.of.Beds)) / n() * 100) ``` --- ``` ## `summarise()` has grouped output by 'CCN.Facility.Type'. You can override using ## the `.groups` argument. ``` ``` ## # A tibble: 10 × 5 ## # Groups: CCN.Facility.Type [7] ## CCN.Facility.Type Rural.Versus.Ur… total_beds total_hopsitals percent_beds_na ## <chr> <chr> <int> <int> <dbl> ## 1 CAH R 25 1 0 ## 2 CAH U 49 2 0 ## 3 CH U 527 2 0 ## 4 CH <NA> 0 2 100 ## 5 LTCH U 2416 11 0 ## 6 PH U 1704 17 0 ## 7 RH U 752 6 0 ## 8 RNMHC U 20 1 0 ## 9 STH R 1163 5 0 ## 10 STH U 12148 51 0 ``` --- # `mutate()` * `mutate()` creates a new variable (column) in a data frame and fills values according to criteria we provide ![](https://d33wubrfki0l68.cloudfront.net/baeec2f985b2ec36388ba4d18960eebe59911ff3/bce5f/images/cheatsheets/mutate.png) --- # `mutate()` * `mutate()` creates a new variable (column) in a data frame and fills values according to criteria we provide > What would the following lines of code do? ```r hospitals_ma %>% mutate(Address = paste(Street.Address, City, State.Code, Zip.Code)) hospitals_ma %>% mutate(Hospital.Name.Lower = tolower(Hospital.Name)) hospitals_ma %>% mutate(Percentage.State.Beds = Number.of.Beds/sum(Number.of.Beds, na.rm = TRUE) * 100) ``` --- # `mutate()` ```r hospitals_ma %>% mutate(Address = paste(Street.Address, City, State.Code, Zip.Code)) %>% select(Street.Address, City, Address) %>% head() ``` ``` ## # A tibble: 6 × 3 ## Street.Address City Address ## <chr> <chr> <chr> ## 1 ONE HOSPITAL ROAD OAK BLUFFS ONE HOSPITAL ROAD OAK BLUFFS MA 2557 ## 2 2033 MAIN STREET ATHOL 2033 MAIN STREET ATHOL MA 1331 ## 3 450 BROOKLINE AVE BOSTON 450 BROOKLINE AVE BOSTON MA 2215 ## 4 145 WARD HILL AVENUE BRADFORD 145 WARD HILL AVENUE BRADFORD MA 1835 ## 5 222 STATE STREET LUDLOW 222 STATE STREET LUDLOW MA 1056 ## 6 1200 CENTRE STREET ROSLINDALE 1200 CENTRE STREET ROSLINDALE MA 2127 ``` --- # `mutate()` ```r hospitals_ma %>% mutate(Hospital.Name.Lower = tolower(Hospital.Name)) %>% select(Hospital.Name, Hospital.Name.Lower) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Hospital.Name Hospital.Name.Lower ## <chr> <chr> ## 1 MARTHAS VINEYARD HOSPITAL marthas vineyard hospital ## 2 ATHOL MEMORIAL HOSPITAL athol memorial hospital ## 3 DANA-FARBER CANCER INSTITUTE dana-farber cancer institute ## 4 WHITTIER HOSPITAL-BRADFORD whittier hospital-bradford ## 5 HEALTHSOUTH REHABILITATION HOSPITAL healthsouth rehabilitation hospital ## 6 HEBREW REHABILITATION CENTER hebrew rehabilitation center ``` --- # `mutate()` ```r hospitals_ma %>% mutate(Percentage.State.Beds = Number.of.Beds/sum(Number.of.Beds, na.rm = TRUE) * 100) %>% select(Number.of.Beds, Percentage.State.Beds) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Number.of.Beds Percentage.State.Beds ## <int> <dbl> ## 1 25 0.133 ## 2 25 0.133 ## 3 30 0.160 ## 4 60 0.319 ## 5 53 0.282 ## 6 667 3.55 ``` --- # `group_by()` %>% `mutate()` > How do you expect this to be different than `group_by()` %>% `summarize()`? Turn to a neighbor to discuss. > Hint: Check out the slides with the images for what `summarize()` and `mutate()` do to data frames. --- # `group_by()` %>% `mutate()` > What do you expect this line of code to do? ```r hospitals_ma %>% group_by(City) %>% mutate(Total.Beds.City = sum(Number.of.Beds)) ``` --- # `group_by()` %>% `mutate()` ```r hospitals_ma %>% group_by(City) %>% mutate(Total.Beds.City = sum(Number.of.Beds)) %>% arrange(desc(City)) %>% select(Provider.CCN, City, Total.Beds.City) %>% head() ``` ``` ## # A tibble: 6 × 3 ## # Groups: City [2] ## Provider.CCN City Total.Beds.City ## <int> <chr> <int> ## 1 224032 WORCESTER 1373 ## 2 223029 WORCESTER 1373 ## 3 220176 WORCESTER 1373 ## 4 220163 WORCESTER 1373 ## 5 220062 WORCESTER 1373 ## 6 223026 WOBURN 210 ``` --- # Why doesn't this return the city with the maximum total beds? ```r hospitals_ma %>% group_by(City) %>% mutate(Total.Beds.City = sum(Number.of.Beds)) %>% filter(Total.Beds.City == max(Total.Beds.City)) %>% select(City, Total.Beds.City) ``` ``` ## # A tibble: 80 × 2 ## # Groups: City [67] ## City Total.Beds.City ## <chr> <int> ## 1 OAK BLUFFS 25 ## 2 ATHOL 25 ## 3 BRADFORD 60 ## 4 LUDLOW 53 ## 5 ROSLINDALE 667 ## 6 WESTBOROUGH 208 ## 7 HAVERHILL 71 ## 8 TAUNTON 138 ## 9 TEWKSBURY 540 ## 10 JAMAICA PLAIN 407 ## # … with 70 more rows ``` --- # `ungroup()` * Once we've `group_by()` the data frame will remain grouped until we `ungroup()` it * All subsequent functions before `ungroup()` are applied within groups .pull-left[ ```r hospitals_ma %>% group_by(City) %>% mutate(Total.Beds.City = sum(Number.of.Beds, na.rm = TRUE)) %>% ungroup() %>% filter(Total.Beds.City == max(Total.Beds.City)) %>% select(City, County, Total.Beds.City) ``` > Why does Boston appear so many times here? How could I have approached this differently? ] .pull-right[ ``` ## # A tibble: 14 × 3 ## City County Total.Beds.City ## <chr> <chr> <int> ## 1 BOSTON "SUFFOLK" 4300 ## 2 BOSTON "" 4300 ## 3 BOSTON "SUFFOLK" 4300 ## 4 BOSTON "MIDDLESEX" 4300 ## 5 BOSTON "SUFFOLK" 4300 ## 6 BOSTON "NORFOLK" 4300 ## 7 BOSTON "SUFFOLK" 4300 ## 8 BOSTON "SUFFOLK" 4300 ## 9 BOSTON "SUFFOLK" 4300 ## 10 BOSTON "SUFFOLK" 4300 ## 11 BOSTON "SUFFOLK" 4300 ## 12 BOSTON "SUFFOLK" 4300 ## 13 BOSTON "SUFFORK" 4300 ## 14 BOSTON "SUFFOLK" 4300 ``` ] --- # An alternative approach > What do I lose here? ```r hospitals_ma %>% group_by(City) %>% summarize(Total.Beds.City = sum(Number.of.Beds, na.rm = TRUE)) %>% ungroup() %>% filter(Total.Beds.City == max(Total.Beds.City)) ``` ``` ## # A tibble: 1 × 2 ## City Total.Beds.City ## <chr> <int> ## 1 BOSTON 4300 ```