class: center, middle, inverse, title-slide # Day Twenty-Three: Joining Data ## SDS 192: Introduction to Data Science ###
Lindsay Poirier
Statistical & Data Sciences
, Smith College
###
Spring 2022
--- # Reminders * Mini-project 2 due next Wednesday. * Quiz 4 will be posted today and due at 5PM next Wednesday. * Be sure to start early; there are no extensions for quizzes. --- # For Today * Review of Monday's lab * Joining datasets --- # Load the following two data frames into RStudio. > What do you notice about them? Why might they be useful to consider together? ```r library(tidyverse) ct_school_attendance <- read.csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") %>% filter(studentgroup == "All Students" & reportingdistrictname != "Connecticut") ct_school_learning_model <- read.csv("https://data.ct.gov/resource/5q7h-u2ac.csv?$where=update_date=%272020-09-17%27") ``` --- # Joining Datasets * Up until this point, we have been working with data in single tables. * Sometimes related data can be spread across multiple tables that we wish to bring together for different kinds of analysis. --- # Example: North American Bird Breeding Survey .pull-left[ * Annual count of bird populations at select locations along transportation routes in North America. Data tables include: * Bird counts along routes during certain observation periods * Information about each route * Weather conditions at certain observation periods ] .pull-right[ ![](https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/styles/side_image/public/thumbnails/image/EVGRTrendMap_Lutmerding.jpg?itok=AarLqN51) ] --- # Example: EPA Facility Registry Service .pull-left[ * Records data about every facility regulated by different EPA programs (e.g. Clean Air Act, Safe Drinking Water Act, etc). Data tables include: * Location and contact information about each facility * Results Safe Drinking Water Act inspections, including violations and fines for the facility ] .pull-right[ ![](https://cdn.pixabay.com/photo/2016/07/06/17/25/chemical-1500882_1280.jpg) ] --- # Keys * To join data tables, we need a column shared across the tables that we can join on. * Often this is some form of a unique key. .pull-left[ ``` ## reportingdistrictcode ## 1 0010011 ## 2 0020011 ## 3 0030011 ## 4 0040011 ## 5 0050011 ## 6 0070011 ``` ] .pull-right[ ``` ## district_code ## 1 2190012 ## 2 80011 ## 3 10011 ## 4 20011 ## 5 30011 ## 6 40011 ``` ] > What might be an issue with joining on this key? --- # Keys * Keys must be formatted the same way and of the same type. .pull-left[ ```r ct_school_attendance <- ct_school_attendance %>% mutate(reportingdistrictcode = as.integer(reportingdistrictcode)) ct_school_attendance %>% select(reportingdistrictcode) %>% head() ``` ``` ## reportingdistrictcode ## 1 10011 ## 2 20011 ## 3 30011 ## 4 40011 ## 5 50011 ## 6 70011 ``` ] .pull-right[ ```r ct_school_learning_model %>% select(district_code) %>% head() ``` ``` ## district_code ## 1 2190012 ## 2 80011 ## 3 10011 ## 4 20011 ## 5 30011 ## 6 40011 ``` ] --- # Join ```r ct_school_attendance %>% inner_join(ct_school_learning_model, by = c("reportingdistrictcode" = "district_code")) %>% rename(fully_remote = percent_students_fully_remote) %>% #to fit on slide select(reportingdistrictcode, attrate_202021, fully_remote) %>% head() ``` ``` ## reportingdistrictcode attrate_202021 fully_remote ## 1 10011 0.9680 13% ## 2 20011 0.8923 32% ## 3 30011 0.9550 30% ## 4 40011 0.9621 19% ## 5 50011 0.9698 12% ## 6 70011 0.9542 13% ``` --- # Different Kinds of Joins * `inner_join()` * `left_join()` * `right_join()` * `full_join()` * `anti_join()` --- # Inner Join ![](https://d33wubrfki0l68.cloudfront.net/3abea0b730526c3f053a3838953c35a0ccbe8980/7f29b/diagrams/join-inner.png) --- # Outer Joins <img src="https://d33wubrfki0l68.cloudfront.net/9c12ca9e12ed26a7c5d2aa08e36d2ac4fb593f1e/79980/diagrams/join-outer.png" width="350" /> --- # One-to-many relationships ![](https://d33wubrfki0l68.cloudfront.net/6faac3e996263827cb57fc5803df6192541a9a4b/c7d74/diagrams/join-one-to-many.png) --- # Many-to-many relationships ![](https://d33wubrfki0l68.cloudfront.net/d37530bbf7749f48c02684013ae72b2996b07e25/37510/diagrams/join-many-to-many.png) --- # Anti-Join * Filters to entries in the left table that are not in the right table .pull-left[ ```r ct_school_attendance %>% anti_join(ct_school_learning_model, by = c("reportingdistrictcode" = "district_code")) %>% select(reportingdistrictcode) ``` ``` ## reportingdistrictcode ## 1 2310018 ## 2 3360015 ## 3 3470015 ``` ] .pull-right[ ```r ct_school_learning_model %>% anti_join(ct_school_attendance, by = c("district_code" = "reportingdistrictcode")) %>% select(district_code) %>% head() ``` ``` ## district_code ## 1 46821 ## 2 46921 ## 3 100161 ## 4 170561 ## 5 170821 ## 6 190161 ``` ]