class: center, middle, inverse, title-slide # Day Nineteen: Importing Data ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- # Reminders * Lab solutions posted on Perusall * Extra credit due Wednesday * Mini-Project due Wednesday * Quiz 2 review first 15 minutes of office hours on Wednesday * Quiz 3 due Friday --- # For Today * File paths * CSVs and other files types * Reading files locally and from the Web * Data cleaning strategies --- class:small # What is a working directory? ```r getwd() ``` ``` ## [1] "/Users/lpoirier_1/Documents/GitHub/R_Projects/SDS-192-public-website/slides" ``` ```r normalizePath(".") ``` ``` ## [1] "/Users/lpoirier_1/Documents/GitHub/R_Projects/SDS-192-public-website/slides" ``` ```r normalizePath("..") ``` ``` ## [1] "/Users/lpoirier_1/Documents/GitHub/R_Projects/SDS-192-public-website" ``` ```r normalizePath("~") ``` ``` ## [1] "/Users/lpoirier_1" ``` --- # `R` Projects * Associated with an `R` working directory * Have their own workspace and history * Can be associated with version control repository (e.g. Git) --- # What is a CSV? .pull-left[ * stands for comma-separated values * separates rows by a return/enter * separates columns by a comma * often contains a header row * open format (not tied to proprietary software) ] .pull-right[ dog_name, date, score Madison, 1/2/1990, 5 Skip, 3/4/1990, 4 Cooper, 4/5/1990, 6 ] --- # Read Lines of Data from a CSV ```r library(readr) pioneer_valley_census <- read_lines("datasets/pioneer_valley_census.csv") head(pioneer_valley_census) ``` ``` ## [1] "LEVEL_CD_NAME,STATE,COUNTY,COMMUNITY,YEAR,TIME_TYPE,CEN_POP,CEN_WHITE,CEN_HISPANIC,CEN_BLACK,CEN_ASIAN,CEN_MARRHOU,CEN_SINGPARHOU,CEN_HOUSEHOLDS,CEN_POP_U18,CEN_CHILD_POV,CEN_PAINC,CEN_RETINC,CEN_SOCSECINC,CEN_MEDHHINC,CEN_POOR,CEN_POVRATE,CEN_EARLYED,EDUCATT_COLLEGE,EDUCATT_HS,CEN_WORKERS,CEN_PUBLICTRANS" ## [2] "Municipality,MA,Franklin,Ashfield,2018,5-Year-Estimates,1539.00,1481.00,35.00,2.00,6.00,358.00,35.00,675.00,203.00,0.10,3.00,105.00,234.00,70956.00,130.00,0.10,0.70,0.50,1.00,944.00,39.00" ## [3] "Municipality,MA,Franklin,Bernardston,2018,5-Year-Estimates,2051.00,2028.00,0.00,3.00,10.00,450.00,73.00,879.00,418.00,0.10,21.00,222.00,335.00,64940.00,174.00,0.10,0.40,0.20,1.00,1087.00,14.00" ## [4] "Municipality,MA,Franklin,Buckland,2018,5-Year-Estimates,1908.00,1760.00,5.00,0.00,75.00,372.00,82.00,858.00,310.00,0.20,0.00,210.00,325.00,53716.00,194.00,0.10,1.00,0.40,0.90,1133.00,98.00" ## [5] "Municipality,MA,Franklin,Charlemont,2018,5-Year-Estimates,1130.00,1071.00,16.00,0.00,14.00,217.00,46.00,467.00,194.00,0.20,22.00,112.00,206.00,57159.00,155.00,0.10,1.00,0.30,0.90,528.00,35.00" ## [6] "Municipality,MA,Franklin,Colrain,2018,5-Year-Estimates,1800.00,1744.00,10.00,13.00,5.00,373.00,49.00,738.00,329.00,0.30,18.00,158.00,305.00,56149.00,257.00,0.10,0.50,0.30,0.90,990.00,44.00" ``` --- # Importing Data from a CSV ```r library(dplyr) library(readr) hospital_costs_2018 <- read_csv("datasets/Hospital_Cost_Report_2018.csv") hospital_costs_2018 %>% select(c(1:2)) %>% head() ``` ``` ## # A tibble: 6 × 2 ## rpt_rec_num `Provider CCN` ## <dbl> <dbl> ## 1 623132 10032 ## 2 628158 250042 ## 3 628833 440235 ## 4 631016 50523 ## 5 631094 50305 ## 6 631292 50043 ``` --- # Relative Paths * Absolute paths list all folders from your `root` folder to the location of the file * Avoid this! With different file systems, this may work on your computer, but not when you share with classmates or me * Relative paths direct to the desired file from your *current location* * Move up a directory: `../` * Move up two directories: `../` * Move up a directory and into `datasets` folders: `../datasets/` --- # Reading Data in from Excel > What do we need to note about the following dataset? ![](img/excel.png) --- # Reading Data in from Excel ```r library(readr) grad_rates <- readxl::read_excel("datasets/gradrates.xlsx", skip = 1) grad_rates %>% select(1) %>% head() ``` ``` ## # A tibble: 6 × 1 ## `District Name` ## <chr> ## 1 Abby Kelley Foster Charter Public (District) ## 2 Abington ## 3 Academy Of the Pacific Rim Charter Public (District) ## 4 Acton-Boxborough ## 5 Advanced Math and Science Academy Charter (District) ## 6 Agawam ``` --- # Reading Data in from URL ```r library(readr) hospital_costs_2018 <- read_csv("https://raw.githubusercontent.com/SDS-192-Intro/SDS-192-public-website/main/slides/datasets/Hospital_Cost_Report_2018.csv") hospital_costs_2018 %>% select(c(1:2)) %>% head() ``` ``` ## # A tibble: 6 × 2 ## rpt_rec_num `Provider CCN` ## <dbl> <dbl> ## 1 623132 10032 ## 2 628158 250042 ## 3 628833 440235 ## 4 631016 50523 ## 5 631094 50305 ## 6 631292 50043 ``` --- # Importing Tips * Data should be rectangular! * No merged columns/rows * Values in every cell * Avoid Excel and other proprietary formats * Avoid spaces in column names if possible --- # Speaking of which... ```r library(readr) hospital_costs_2018 <- read_csv("datasets/Hospital_Cost_Report_2018.csv", name_repair = make.names) hospital_costs_2018 %>% select(c(1:3)) %>% head() ``` ``` ## # A tibble: 6 × 3 ## rpt_rec_num Provider.CCN Hospital.Name ## <dbl> <dbl> <chr> ## 1 623132 10032 WEDOWEE HOSPITAL ## 2 628158 250042 NORTHWEST MS MEDICAL CENTER ## 3 628833 440235 BIG SOUTH FORK MEDICAL CENTER ## 4 631016 50523 SUTTER DELTA MEDICAL CENTER ## 5 631094 50305 ALTA BATES MEDICAL CENTER ## 6 631292 50043 SUMMIT MEDICAL CENTER ``` --- # Repairing Names * If spaces in column name, must refer to in backticks: \`Column name with spaces\` * With `name_repair = make.names`, `R` will convert each punctuation mark and space to a period --- # Cleaning up data types > What do you notice about the following output? ```r library(readr) hospital_costs_2018 <- read_csv("datasets/Hospital_Cost_Report_2018.csv", name_repair = make.names) hospital_costs_2018 %>% select(Total.Salaries..adjusted.:Contract.Labor) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Total.Salaries..adjusted. Contract.Labor ## <chr> <chr> ## 1 $526,707 <NA> ## 2 $3,013,251 $60,272 ## 3 $1,009,269 <NA> ## 4 $14,212,723 $350,340 ## 5 $37,400,149 $173,424 ## 6 $38,470,054 $78,054 ``` --- # `mutate()` for data cleaning * We can use the `mutate()` data wrangling function to fix some issues with columns in our data frames * Remember that `mutate()` creates a new column based on criteria we provide * We can use `mutate()` to overwrite columsn with cleaned up data --- # Cleaning up data types ```r library(stringr) # Package for string manipulation hospital_costs_2018 <- hospital_costs_2018 %>% mutate(Total.Salaries..adjusted. = str_replace_all(Total.Salaries..adjusted., "[$,]", "")) %>% mutate(Total.Salaries..adjusted. = as.numeric(Total.Salaries..adjusted.)) hospital_costs_2018 %>% select(Total.Salaries..adjusted.:Contract.Labor) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Total.Salaries..adjusted. Contract.Labor ## <dbl> <chr> ## 1 526707 <NA> ## 2 3013251 $60,272 ## 3 1009269 <NA> ## 4 14212723 $350,340 ## 5 37400149 $173,424 ## 6 38470054 $78,054 ``` --- # Do I really have to do this for every column with these issues in the dataset?! * No. * Reference for mutating multiple columns [here](https://dplyr.tidyverse.org/reference/mutate_all.html) --- # Parsing Dates * Dates can be converted to a date format using the `lubridate` package * Step 1: Check how dates are formatted * Step 2: Find corresponding conversion code on `lubridate` cheatsheet .pull-left[ ![](img/dates.png) ] .pull-right[ ![](img/lubridate.png) ] --- # What format is this date? ```r prison_boundaries <- read_csv("https://raw.githubusercontent.com/SDS-237-Data-Ethnography-Spring-22/group-project/main/checkpoints/datasets/Prison_Boundaries.csv") prison_boundaries %>% select(SOURCEDATE, VAL_DATE) %>% head() ``` ``` ## # A tibble: 6 × 2 ## SOURCEDATE VAL_DATE ## <chr> <chr> ## 1 2018/04/30 00:00:00 2020/02/27 00:00:00 ## 2 2018/04/30 00:00:00 2020/02/11 00:00:00 ## 3 2018/03/01 00:00:00 2020/03/04 00:00:00 ## 4 2018/03/23 00:00:00 2020/01/02 00:00:00 ## 5 2018/03/23 00:00:00 2020/01/02 00:00:00 ## 6 2018/04/12 00:00:00 2020/02/20 00:00:00 ``` --- # Parsing Dates ```r library(lubridate) # Package for working with dates prison_boundaries <- prison_boundaries %>% mutate(SOURCEDATE = ymd_hms(SOURCEDATE), VAL_DATE = ymd_hms(VAL_DATE)) prison_boundaries %>% select(SOURCEDATE,VAL_DATE) %>% head() ``` ``` ## # A tibble: 6 × 2 ## SOURCEDATE VAL_DATE ## <dttm> <dttm> ## 1 2018-04-30 00:00:00 2020-02-27 00:00:00 ## 2 2018-04-30 00:00:00 2020-02-11 00:00:00 ## 3 2018-03-01 00:00:00 2020-03-04 00:00:00 ## 4 2018-03-23 00:00:00 2020-01-02 00:00:00 ## 5 2018-03-23 00:00:00 2020-01-02 00:00:00 ## 6 2018-04-12 00:00:00 2020-02-20 00:00:00 ``` --- # Converting to `NA` ```r prison_boundaries %>% select(POPULATION, CAPACITY) %>% head() ``` ``` ## # A tibble: 6 × 2 ## POPULATION CAPACITY ## <dbl> <dbl> ## 1 438 498 ## 2 -999 24 ## 3 96 144 ## 4 27 36 ## 5 2 5 ## 6 79 70 ``` --- # Converting to `NA` ```r prison_boundaries <- prison_boundaries %>% mutate(POPULATION = na_if(POPULATION, -999), CAPACITY = na_if(CAPACITY, -999)) prison_boundaries %>% select(POPULATION, CAPACITY) %>% head() ``` ``` ## # A tibble: 6 × 2 ## POPULATION CAPACITY ## <dbl> <dbl> ## 1 438 498 ## 2 NA 24 ## 3 96 144 ## 4 27 36 ## 5 2 5 ## 6 79 70 ```