Cleaning Data

SDS 192: Introduction to Data Science

Professor Lindsay Poirier

For Today

  • Mutating to reformat vectors
  • Working with NA columns
  • Practice cleaning a dataset

Whenever formatting columns, we will use mutate to overwrite a variable with a new cleaned up variable.

library(tidyverse)
prisons <- read_csv("../data/Prison_Boundaries.csv")

Converting Types

  • as.character(), as.numeric(), as.logical() all convert a variable from an original type to a new type
typeof(prisons$COUNTYFIPS)
[1] "character"
prisons <- 
  prisons |> 
  mutate(COUNTYFIPS = as.numeric(COUNTYFIPS))
typeof(prisons$COUNTYFIPS)
[1] "double"

Parsing Dates

  • Dates can be converted to a date format using the lubridate package
    • Step 1: Check how dates are formatted
    • Step 2: Find corresponding conversion code on lubridate cheatsheet

Setting Dates

  • ymd_hms() will take a date formatted as year, month, day, hour, minute, second and convert it to a date time format
prisons |> 
  select(NAME, SOURCEDATE) |> 
  head(3)
library(lubridate)

prisons <- 
  prisons |> 
  mutate(SOURCEDATE = ymd_hms(SOURCEDATE))

Setting NA values

  • na_if() will take a variable and set specified values to NA
prisons |> 
  select(NAME, POPULATION) |> 
  head(3)
sum(is.na(prisons$POPULATION))
[1] 0
prisons <- 
  prisons |> 
  mutate(POPULATION = na_if(POPULATION, -999))
prisons |> 
  select(NAME, POPULATION) |> 
  head(3)
sum(is.na(prisons$POPULATION))
[1] 4047

Replacing Strings

  • str_replace() will take a variable and replace an existing string with a new string
prisons |> 
  select(NAME, ADDRESS) |> 
  head(3)
prisons <- 
  prisons |> 
  mutate(ADDRESS = str_replace(ADDRESS, 
                               "DR", 
                               "DRIVE"))

Removing Strings

  • str_replace() will take a variable and replace an existing string with a new string
prisons |> select(NAME, WEBSITE) |> head(3)
prisons <- 
  prisons |> 
  mutate(WEBSITE = str_replace(WEBSITE, 
                               "http://", 
                               ""))

Conditionals

  • case_when() allows us to set values when conditions are met
prisons |> 
  select(SECURELVL) |> 
  distinct()
prisons <- 
  prisons |> 
  mutate(JUVENILE = 
           case_when(
             SECURELVL == "JUVENILE" ~ "Juvenile",
             TRUE ~ "Not Juvenile")) 

Identifying Prisons in ECHO