class: center, middle, inverse, title-slide # Day Sixteen: Subsetting Data ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- # For today * Piping Functions * `select()` * `filter()` * `arrange()` --- # Today's data * Hospital Cost Reports, 2018 ```r library(tidyverse) hospitals <- read.csv("https://raw.githubusercontent.com/SDS-192-Intro/SDS-192-public-website/main/slides/datasets/Hospital_Cost_Report_2018.csv") ``` --- class: center, middle # Data wrangling is a process for transforming a dataset from its original form into a more relevant form. --- # What is that funny symbol I keep seeing in labs? .pull-left[ * Up until now, we have been nesting arguments in functions. * Let's say I had the vector: `class1$shirt_colors` * `length(unique( class1$shirt_colors))` ] .pull-right[ * The pipe (`%>%`) is imported when we call `library(tidyverse)` * The pipe offers a way to string arguments together consecutively. * Reads more like a sentence. * `class1$shirt_colors %>% unique() %>% length()` ] --- # Six "verbs" for data wrangling * `select()` * `filter()` * `arrange()` * `summarize()` * `group_by()` * `mutate()` --- # `select()` * `select()` enables us to select variables (columns) of interest. ![](http://ohi-science.org/data-science-training/img/rstudio-cheatsheet-select.png) --- # `select()` * `select()` enables us to select variables (columns) of interest. ```r hospitals %>% select(Provider.CCN, State.Code, Number.of.Beds) %>% head() ``` ``` ## Provider.CCN State.Code Number.of.Beds ## 1 10032 AL 34 ## 2 250042 MS 181 ## 3 440235 TN 25 ## 4 50523 CA 145 ## 5 50305 CA 337 ## 6 50043 CA 296 ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. * `:` enables us to select consecutive variables (columns) of interest. ```r hospitals %>% select(Provider.CCN, State.Code:County) %>% head() ``` ``` ## Provider.CCN State.Code Zip.Code County ## 1 10032 AL 36278 RANDOLPH ## 2 250042 MS 38614 COAHOMA ## 3 440235 TN 37841 SCOTT ## 4 50523 CA 94509- CONTRA COSTA ## 5 50305 CA 94705- ALAMEDA ## 6 50043 CA 94609- ALAMEDA ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. * `:` enables us to select consecutive variables (columns) of interest. ```r hospitals %>% select(Provider.CCN, State.Code:County) %>% head() ``` ``` ## Provider.CCN State.Code Zip.Code County ## 1 10032 AL 36278 RANDOLPH ## 2 250042 MS 38614 COAHOMA ## 3 440235 TN 37841 SCOTT ## 4 50523 CA 94509- CONTRA COSTA ## 5 50305 CA 94705- ALAMEDA ## 6 50043 CA 94609- ALAMEDA ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. * `!` enables us to negate selection of certain variables (columns). * `last_col()` serves as a reference to the last column. ```r hospitals %>% select(!(Hospital.Name:last_col())) %>% head() ``` ``` ## rpt_rec_num Provider.CCN ## 1 623132 10032 ## 2 628158 250042 ## 3 628833 440235 ## 4 631016 50523 ## 5 631094 50305 ## 6 631292 50043 ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. * `starts_with()` is a helper function that enables us to select columns by matching string patterns at the start of their names ```r hospitals %>% select(Provider.CCN, starts_with("Fiscal")) %>% head() ``` ``` ## Provider.CCN Fiscal.Year.Begin.Date Fiscal.Year.End.Date ## 1 10032 2017-09-30 2017-11-12 ## 2 250042 2017-10-31 2017-12-30 ## 3 440235 2017-10-10 2017-12-30 ## 4 50523 2017-12-31 2018-02-27 ## 5 50305 2017-12-31 2018-02-27 ## 6 50043 2017-12-31 2018-02-27 ``` --- # `select()` * `select()` enables us to select variables (columns) of interest. * `ends_with()` is a helper function that enables us to select columns by matching string patterns at the end of their names ```r hospitals %>% select(Provider.CCN, ends_with("Improvements")) %>% head() ``` ``` ## Provider.CCN Land.Improvements Leasehold.Improvements ## 1 10032 ## 2 250042 ## 3 440235 ## 4 50523 $9,762,451 $3,077 ## 5 50305 $6,040,540 $4,268,998 ## 6 50043 $6,040,540 $6,352,542 ``` --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. ![](http://ohi-science.org/data-science-training/img/rstudio-cheatsheet-filter.png) --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * == checks for equivalence and != checks for inequivalence ```r hospitals %>% filter(City == "Springfield") %>% select(Provider.CCN, City, State.Code) ``` ``` ## [1] Provider.CCN City State.Code ## <0 rows> (or 0-length row.names) ``` > Review the data. Why does this return 0 rows? --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * == checks for equivalence and != checks for inequivalence ```r hospitals %>% filter(City == "SPRINGFIELD") %>% select(Provider.CCN, City, State.Code) ``` ``` ## Provider.CCN City State.Code ## 1 223303 SPRINGFIELD MA ## 2 111306 SPRINGFIELD GA ## 3 144021 SPRINGFIELD IL ## 4 364040 SPRINGFIELD OH ## 5 264024 SPRINGFIELD MO ## 6 263032 SPRINGFIELD MO ## 7 241352 SPRINGFIELD MN ## 8 220077 SPRINGFIELD MA ## 9 140148 SPRINGFIELD IL ## 10 260040 SPRINGFIELD MO ## 11 260065 SPRINGFIELD MO ## 12 262017 SPRINGFIELD MO ## 13 61311 SPRINGFIELD CO ## 14 380020 SPRINGFIELD OR ## 15 220066 SPRINGFIELD MA ## 16 380102 SPRINGFIELD OR ## 17 360086 SPRINGFIELD OH ## 18 140053 SPRINGFIELD IL ## 19 222046 SPRINGFIELD MA ## 20 471306 SPRINGFIELD VT ## 21 440065 SPRINGFIELD TN ## 22 142014 SPRINGFIELD IL ## 23 360355 SPRINGFIELD OH ``` > What do we need to do now? --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * == checks for equivalence and `&` returns rows that meet **both** criteria ```r hospitals %>% filter(City == "SPRINGFIELD" & State.Code == "Massachusetts") %>% select(Provider.CCN, City, State.Code) ``` ``` ## [1] Provider.CCN City State.Code ## <0 rows> (or 0-length row.names) ``` > Review the data. Why does this return 0 rows? --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * == checks for equivalence and `|` returns rows that meet **either** criteria ```r hospitals %>% filter(Rural.Versus.Urban == "R" | CCN.Facility.Type == "CAH") %>% select(Hospital.Name, City, State.Code) %>% head() ``` ``` ## Hospital.Name City State.Code ## 1 WEDOWEE HOSPITAL WEDOWEE AL ## 2 NORTHWEST MS MEDICAL CENTER CLARKSDALE MS ## 3 BIG SOUTH FORK MEDICAL CENTER ONEIDA TN ## 4 ETMC JACKSONVILLE JACKSONVILLE TX ## 5 TROY REGIONAL MEDICAL CENTER TROY AL ## 6 MONROE REGIONAL HOSPITAL ABERDEEN MS ``` > CAH refers to a "critical access hospital" --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * \>, <, >=, <= checks whether values are greater/less/equal to than other values ```r hospitals %>% filter(Number.of.Beds > 1500) %>% select(Provider.CCN, City, State.Code, Number.of.Beds) ``` ``` ## Provider.CCN City State.Code Number.of.Beds ## 1 450388 SAN ANTONIO TX 1560 ## 2 100007 ORLANDO FL 2753 ## 3 330101 NEW YORK NY 2272 ``` --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * `%in%` checks for values that match at least one value in a vector of values ```r hospitals %>% filter(CCN.Facility.Type %in% c("STH", "LTCH")) %>% select(Hospital.Name, City, State.Code) %>% head() ``` ``` ## Hospital.Name City State.Code ## 1 WEDOWEE HOSPITAL WEDOWEE AL ## 2 NORTHWEST MS MEDICAL CENTER CLARKSDALE MS ## 3 BIG SOUTH FORK MEDICAL CENTER ONEIDA TN ## 4 SUTTER DELTA MEDICAL CENTER ANTIOCH CA ## 5 ALTA BATES MEDICAL CENTER BERKELEY CA ## 6 SUMMIT MEDICAL CENTER OAKLAND CA ``` --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * `is.na()` checks for `NA` values and `!is.na()` checks for non-`NA` values ```r hospitals %>% filter(is.na(Number.of.Beds)) %>% select(Hospital.Name, City, State.Code) %>% nrow() ``` ``` ## [1] 83 ``` --- # `filter()` * `filter()` subsets observations (rows) according to a certain criteria that we provide. * `str_detect()` checks for values with strings that match a pattern we provide ```r hospitals %>% filter(str_detect(Hospital.Name, "SHRINERS")) %>% select(Hospital.Name, City, State.Code) %>% head() ``` ``` ## Hospital.Name City State.Code ## 1 SHRINERS HOSPITAL FOR CHILDREN ST LOUIS MO ## 2 SHRINERS HOSPITALS CINCINNATI OH ## 3 SHRINERS HOSPITAL FOR CHILDREN CHICAGO IL ## 4 SHRINERS HOSPITAL FOR CHILDREN MINNEAPOLIS MN ## 5 THE SHRINERS HOSPITAL FOR CHILDREN SPRINGFIELD MA ## 6 THE SHRINERS HOSPITAL FOR CHILDREN BOSTON MA ``` --- `arrange()` * `arrange()` sorts rows according to values in a column * Defaults to sorting from smallest to largest (numeric) or first character to last character (character). ```r hospitals %>% filter(City == "SPRINGFIELD" & State.Code == "MA") %>% arrange(Number.of.Beds) %>% select(Hospital.Name, Number.of.Beds) ``` ``` ## Hospital.Name Number.of.Beds ## 1 VIBRA HOSPITAL OF WESTERN MASSACHUSE 174 ## 2 THE MERCY HOSPITAL 237 ## 3 BAYSTATE MEDICAL CENTER 696 ## 4 THE SHRINERS HOSPITAL FOR CHILDREN NA ``` --- `arrange()` * `arrange()` sorts rows according to values in a column * `desc()` sorts from largest to smallest (numeric) or last character to first character (character). ```r hospitals %>% filter(City == "SPRINGFIELD" & State.Code == "MA") %>% arrange(desc(Number.of.Beds)) %>% select(Hospital.Name, Number.of.Beds) ``` ``` ## Hospital.Name Number.of.Beds ## 1 BAYSTATE MEDICAL CENTER 696 ## 2 THE MERCY HOSPITAL 237 ## 3 VIBRA HOSPITAL OF WESTERN MASSACHUSE 174 ## 4 THE SHRINERS HOSPITAL FOR CHILDREN NA ``` --- class: center, middle # Quiz Review --- # What is the y label? ```r hospitals %>% ggplot(aes(x = CCN.Facility.Type)) + geom_bar() + coord_flip() + labs(x = "Facility Type", y = "Number of ______") ``` ![](Day16-Subsetting_files/figure-html/unnamed-chunk-19-1.png)<!-- --> --- # What is the y label? ```r hospitals %>% ggplot(aes(x = Number.of.Beds)) + geom_histogram(binwidth = 100) + labs(x = "", y = "Number of ______") ``` ![](Day16-Subsetting_files/figure-html/unnamed-chunk-20-1.png)<!-- --> --- class: center, middle # Other than `CCN.Facility.Type` what other variables would be appropriate to apply to the x aesthetic when calling `geom_bar()`? --- # Which of visual cues are represented on this plot? Overplotting? Palette? ```r ggplot(hospitals, aes(x = CCN.Facility.Type, Number.of.Beds, col = Rural.Versus.Urban)) + geom_point() + coord_flip() ``` ![](Day16-Subsetting_files/figure-html/unnamed-chunk-21-1.png)<!-- -->