class: center, middle, inverse, title-slide # Day Twenty-Two: Pivoting Data ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- # Reminders * Quiz 3 review first 15 minutes of office hours on Wednesday * Quiz 4 will be posted Wednesday --- # For Today * Tidy data * Pivot longer * Recoding values * Pivot wider * Renaming columns --- # How would you produce this plot? ![](Day22-Pivoting_files/figure-html/unnamed-chunk-1-1.png)<!-- --> --- # Observations vs. Variables vs. Values .pull-left[ * Observations refer to individual units or cases of the data being collected. * Variables describe something about an observation. * Values refer to the actual value associated with a variable for a given observation. ] .pull-right[ ![](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png) Grolemund, Garrett, and Hadley Wickham. n.d. R for Data Science. Accessed March 31, 2019. https://r4ds.had.co.nz/. ] --- # What is tidy data? * Every row is an observation. * Every column is a variable. * Every cell contains a single value. --- # Is this tidy? > What variables are displayed on this plot? ``` ## Date Nrthmptn_AQI NYC_AQI Bstn_AQI ## 1 03/19/2022 70 72 43 ## 2 03/18/2022 69 60 59 ``` --- # What will it look like when tidy? ```r df %>% pivot_longer(-Date, names_to = "City", values_to = "AQI") %>% mutate(City = str_replace(City, "_AQI", "")) ``` ``` ## # A tibble: 6 × 3 ## Date City AQI ## <chr> <chr> <dbl> ## 1 03/19/2022 Nrthmptn 70 ## 2 03/19/2022 NYC 72 ## 3 03/19/2022 Bstn 43 ## 4 03/18/2022 Nrthmptn 69 ## 5 03/18/2022 NYC 60 ## 6 03/18/2022 Bstn 59 ``` --- # Pivoting Longer * We use `pivot_longer()` to pivot a datasets from wider to longer format: * `pivot_longer()` takes the following arguments: 1. `cols = `: Identify a series of columns to pivot * The names of those columns will become repeated rows in the pivoted data frame, and the values in those columns will be stored in a new column. 2. `names_to = `: Identify a name for the column where the column names will be store 3. `values_to = `: Identify a name for the column were the values associated with those names will be stored 4. Various arguments to support transformations to names --- # Example ``` ## Date Nrthmptn_AQI NYC_AQI Bstn_AQI ## 1 03/19/2022 70 72 43 ## 2 03/18/2022 69 60 59 ``` .pull-left[ ```r df %>% pivot_longer(cols = -Date, names_to = "City", values_to = "AQI") %>% mutate(City = str_replace(City, "_AQI", "")) ``` ] .pull-right[ ``` ## # A tibble: 6 × 3 ## Date City AQI ## <chr> <chr> <dbl> ## 1 03/19/2022 Nrthmptn 70 ## 2 03/19/2022 NYC 72 ## 3 03/19/2022 Bstn 43 ## 4 03/18/2022 Nrthmptn 69 ## 5 03/18/2022 NYC 60 ## 6 03/18/2022 Bstn 59 ``` ] --- # Recoding Values * `case_when()`: allows us to assign new values to a cell when a certain condition is met * Vectorized 'if/else' statement * Argument format: CONDITION ~ REPLACEMENT VALUE WHEN CONDITION MET .pull-left[ ```r df %>% pivot_longer(cols = -Date, names_to = "City", values_to = "AQI") %>% mutate(City = str_replace(City, "_AQI", "")) %>% mutate(City = case_when( City == "Nrthmptn" ~ "Northampton", City == "NYC" ~ "New York City", City == "Bstn" ~ "Boston" )) %>% select(Date, City) ``` ] .pull-right[ ``` ## # A tibble: 6 × 2 ## Date City ## <chr> <chr> ## 1 03/19/2022 Northampton ## 2 03/19/2022 New York City ## 3 03/19/2022 Boston ## 4 03/18/2022 Northampton ## 5 03/18/2022 New York City ## 6 03/18/2022 Boston ``` ] --- # Pivoting Wider > Note: I use this far less often than `pivot_longer()` * We use `pivot_wider()` to pivot a datasets from longer to wider format: * `pivot_wider()` takes the following arguments: 1. `names_from = `: Identify the column to get the new column names from 2. `values_from = `: Identify the column to get the cell values from 3. Various arguments to support transformations to names --- # Example .pull-left[ ``` ## # A tibble: 6 × 3 ## Date City AQI ## <chr> <chr> <dbl> ## 1 03/19/2022 Nrthmptn 70 ## 2 03/19/2022 NYC 72 ## 3 03/19/2022 Bstn 43 ## 4 03/18/2022 Nrthmptn 69 ## 5 03/18/2022 NYC 60 ## 6 03/18/2022 Bstn 59 ``` ] .pull-right[ ```r df %>% pivot_wider(names_from = "Date", values_from = "AQI", names_repair = make.names) ``` ] ``` ## # A tibble: 3 × 3 ## City X03.19.2022 X03.18.2022 ## <chr> <dbl> <dbl> ## 1 Nrthmptn 70 69 ## 2 NYC 72 60 ## 3 Bstn 43 59 ``` --- # Renaming Columns > Use `rename(NEW_COLUMN_NAME = EXISTING_COLUMN_NAME)` ```r df %>% pivot_wider(names_from = "Date", values_from = "AQI", names_repair = make.names) %>% rename(Mar19 = X03.19.2022, Mar18 = X03.18.2022) ``` ``` ## # A tibble: 3 × 3 ## City Mar19 Mar18 ## <chr> <dbl> <dbl> ## 1 Nrthmptn 70 69 ## 2 NYC 72 60 ## 3 Bstn 43 59 ``` --- # CT School Attendance Example ```r library(tidyverse) ct_school_attendance <- read_csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") head(ct_school_attendance) ``` ``` ## # A tibble: 6 × 12 ## reportingdistrictcode reportingdistric… category studentgroup studentcount_cu… ## <chr> <chr> <chr> <chr> <dbl> ## 1 00000CT Connecticut <NA> All Students 494920 ## 2 00000CT Connecticut Homeles… Students Ex… 1801 ## 3 00000CT Connecticut Student… Students Wi… 78084 ## 4 00000CT Connecticut Free/Re… Free Meal E… 167481 ## 5 00000CT Connecticut Free/Re… Reduced Pri… 29573 ## 6 00000CT Connecticut Free/Re… Free/Reduce… 197054 ## # … with 7 more variables: attrate_ytd <dbl>, studentcount_202021 <dbl>, ## # attrate_202021 <dbl>, studentcount_201920 <dbl>, attrate_201920 <dbl>, ## # reportingperiod <dttm>, date_update <dttm> ``` --- # CT School Attendance Example ```r library(tidyverse) ct_school_attendance <- read_csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") %>% filter(studentgroup == "All Students" & reportingdistrictname != "Connecticut") head(ct_school_attendance) ``` ``` ## # A tibble: 6 × 12 ## reportingdistrictcode reportingdistric… category studentgroup studentcount_cu… ## <chr> <chr> <chr> <chr> <dbl> ## 1 0010011 Andover School D… <NA> All Students 162 ## 2 0020011 Ansonia School D… <NA> All Students 2156 ## 3 0030011 Ashford School D… <NA> All Students 344 ## 4 0040011 Avon School Dist… <NA> All Students 3056 ## 5 0050011 Barkhamsted Scho… <NA> All Students 199 ## 6 0070011 Berlin School Di… <NA> All Students 2605 ## # … with 7 more variables: attrate_ytd <dbl>, studentcount_202021 <dbl>, ## # attrate_202021 <dbl>, studentcount_201920 <dbl>, attrate_201920 <dbl>, ## # reportingperiod <dttm>, date_update <dttm> ``` --- ```r ct_school_attendance %>% select(reportingdistrictname, starts_with("attrate_")) %>% pivot_longer(cols = starts_with("attrate_"), names_to = "school_year", values_to = "attendance_rates", names_prefix = "attrate_") %>% head() ``` ``` ## # A tibble: 6 × 3 ## reportingdistrictname school_year attendance_rates ## <chr> <chr> <dbl> ## 1 Andover School District ytd 0.945 ## 2 Andover School District 202021 0.968 ## 3 Andover School District 201920 0.950 ## 4 Ansonia School District ytd 0.897 ## 5 Ansonia School District 202021 0.892 ## 6 Ansonia School District 201920 0.941 ``` --- ```r ct_school_attendance_pivoted <- ct_school_attendance %>% select(reportingdistrictname, starts_with("attrate_")) %>% pivot_longer(cols = starts_with("attrate_"), names_to = "school_year", values_to = "attendance_rates", names_prefix = "attrate_") head(ct_school_attendance_pivoted) ``` ``` ## # A tibble: 6 × 3 ## reportingdistrictname school_year attendance_rates ## <chr> <chr> <dbl> ## 1 Andover School District ytd 0.945 ## 2 Andover School District 202021 0.968 ## 3 Andover School District 201920 0.950 ## 4 Ansonia School District ytd 0.897 ## 5 Ansonia School District 202021 0.892 ## 6 Ansonia School District 201920 0.941 ``` --- ```r ct_school_attendance_pivoted <- ct_school_attendance %>% select(reportingdistrictname, starts_with("attrate_")) %>% pivot_longer(cols = starts_with("attrate_"), names_to = "school_year", values_to = "attendance_rates", names_prefix = "attrate_") %>% mutate(school_year = case_when( school_year == "ytd" ~ "Year to Date", school_year == "202021" ~ "2020-2021", school_year == "201920" ~ "2019-2020")) ``` --- ```r ct_school_attendance_pivoted %>% ggplot(aes(x = attendance_rates)) + geom_histogram(binwidth = 0.05, color = "white") + facet_wrap(vars(school_year)) + labs(x = "Attendance Rates", y = "Count of Schools in CT") ``` ![](Day22-Pivoting_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- ```r library(tidyverse) ct_school_attendance <- read_csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") %>% filter(reportingdistrictname != "Connecticut") %>% select(reportingdistrictname, studentgroup, attrate_202021) head(ct_school_attendance) ``` ``` ## # A tibble: 6 × 3 ## reportingdistrictname studentgroup attrate_202021 ## <chr> <chr> <dbl> ## 1 Andover School District All Students 0.968 ## 2 Andover School District Students With Disabilities NA ## 3 Andover School District Free/Reduced Price Meal Eligible 0.941 ## 4 Andover School District White 0.970 ## 5 Andover School District Students Without High Needs 0.975 ## 6 Andover School District Students With High Needs 0.951 ``` --- ```r ct_school_attendance %>% pivot_wider(names_from = studentgroup, values_from = attrate_202021, names_repair = make.names) %>% head() ``` ``` ## # A tibble: 6 × 14 ## reportingdistrictname All.Students Students.With.D… Free.Reduced.Pr… White ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Andover School District 0.968 NA 0.941 0.970 ## 2 Ansonia School District 0.892 0.869 0.874 0.922 ## 3 Ashford School District 0.955 0.957 0.931 0.960 ## 4 Avon School District 0.962 0.931 0.920 0.962 ## 5 Barkhamsted School Distr… 0.970 0.958 0.963 0.969 ## 6 Berlin School District 0.954 0.925 0.925 0.96 ## # … with 9 more variables: Students.Without.High.Needs <dbl>, ## # Students.With.High.Needs <dbl>, Free.Meal.Eligible <dbl>, ## # Reduced.Price.Meal.Eligible <dbl>, English.Learners <dbl>, ## # All.other.races <dbl>, Black.or.African.American <dbl>, ## # Hispanic.Latino.of.any.race <dbl>, Students.Experiencing.Homelessness <dbl> ```