Joining Tables

SDS 192: Introduction to Data Science

Professor Lindsay Poirier

Create Two Data Tables

  1. Separate the index card cutouts into two data tables - one with three rows and one with four rows.
  2. Each person in your group fill out the first table with:
  1. Your Smith ID
  2. Your first name
  3. Your last name
  1. One person from your group rotate clockwise to the next group
  2. Fill out the second table with:
  1. Your Smith ID
  2. Today’s date
  3. The mood you are in
  1. The person that rotated, fill out a second row in the second table with:
  1. Your Smith ID
  2. Yesterday’s date
  3. The mood you were in yesterday

Joining Datasets

  • Up until this point, we have been working with data in single tables.
  • Sometimes related data can be spread across multiple tables that we wish to bring together for different kinds of analysis.

Keys

  • A column shared across the tables that we can join on.

Different Kinds of Joins

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • anti_join()

Inner Join

Outer Joins

One-to-many relationships

Many-to-many relationships

Civilian Complaint Review Board

The New York City Civilian Complaint Review Board (CCRB) is an independent agency. It is empowered to receive, investigate, mediate, hear, make findings, and recommend action on complaints against New York City police officers alleging the use of excessive or unnecessary force, abuse of authority, discourtesy, or the use of offensive language. The Board’s investigative staff, composed entirely of civilian employees, conducts investigations in an impartial fashion. The Board forwards its findings to the police commissioner. (https://www.nyc.gov/site/ccrb/about/about.page)

Let’s Say I Have Two Data Frames

… and I want to create this plot

Aggregating the Data

sqf_2024 <- sqf_2024 |>
  group_by(STOP_LOCATION_PRECINCT) |>
  summarize(TOTAL_STOPS = n())

ccrd_2024 <- ccrd_2024 |>
  group_by(precinct_of_incident_occurrence) |>
  summarize(TOTAL_REPORTS = n())

sqf_2024 |> head()
ccrd_2024 |> head()

Align Keys

ccrd_2024 <- ccrd_2024 |>
  mutate(precinct_of_incident_occurrence = as.numeric(precinct_of_incident_occurrence))
  
sqf_2024 <- sqf_2024 |>
  mutate(STOP_LOCATION_PRECINCT = as.numeric(STOP_LOCATION_PRECINCT))

Join Datasets

sqf_2024 |>
  left_join(ccrd_2024, by = c("STOP_LOCATION_PRECINCT" = "precinct_of_incident_occurrence"))

What if I were to look at the same data for only December 31, 2024?

Different Join Types

sqf_12_31_2024 |>
  inner_join(ccrd_12_31_2024, by = c("STOP_LOCATION_PRECINCT" = "precinct_of_incident_occurrence"))
sqf_12_31_2024 |>
  left_join(ccrd_12_31_2024, by = c("STOP_LOCATION_PRECINCT" = "precinct_of_incident_occurrence"))
sqf_12_31_2024 |>
  right_join(ccrd_12_31_2024, by = c("STOP_LOCATION_PRECINCT" = "precinct_of_incident_occurrence"))
sqf_12_31_2024 |>
  full_join(ccrd_12_31_2024, by = c("STOP_LOCATION_PRECINCT" = "precinct_of_incident_occurrence"))