class: center, middle, inverse, title-slide

# Day Twenty-Five: Acquiring Data from APIs
## SDS 192: Introduction to Data Science
### Lindsay Poirier
### Statistical & Data Sciences, Smith College
### Spring 2022
---

# For today

1. Code styling
2. API queries

---

# Code styling guidelines

* Styling makes code much easier to read
* Why should we care?
* [Tidyverse Style Guide](https://style.tidyverse.org/index.html)

---

# File and object names

* File names should be meaningful
* File names shouldn't have spaces
* File names should use only letters, numbers, `_`, and `-`

---

# Variable Names

* Variable names should be meaningful (e.g. not `data2`)
* Variable names should be lowercase
* Variable names should separate words with `_` (i.e. snake case)
* Avoid using names of common functions (e.g. `sum` is not a great variable name)

---

# Spacing

* Add spaces after commas but not before (e.g. `(3, 2)`)
* No spacing around parentheses for function calls (e.g. not this: `paste ( "a", "b" ) `)
* Add spaces around operators like `+`, `<-`, and `==` (e.g. `2 + 3` and `var == 3`)
* ...but no space when using `$` to identify columns
* Add spaces before and after `%>%`

---

# New Lines

Place each piped function call on a new indented line:

```r
df %>%
  group_by(var) %>%
  summarize(count = n())
```

If function arguments don't fit on one line, then put each on its own line:

```r
df %>%
  summarize(
    count = n(),
    count_missing = sum(is.na(var)),
    percent_missing = sum(is.na(var)) / n()
  )
```

---

# Commenting

* The myth of "self-documenting" code
* Context is critical to any coding practice!
* Comments should indicate:
  * why a developer has opted to take a certain approach,
  * how they came to know to take that approach,
  * assumptions that they made in the process,
  * shortcomings of the approach, and
  * steps that might be taken to improve the code.
* Outline your thought process for other developers and your future self

**Any code copied/modified from another source should be attributed to the individuals that wrote it via in-line comments.
A link should be provided to the original code.**

---

# API

* Stands for Application Programming Interface
* Allows programmers or other systems (or users) to communicate with an online data service
* Data providers (other programmers) expose part of the data service built around their databases
* This is called an *endpoint*
* Providers also publish documentation about how to communicate with the endpoint
* Users build URLs or HTTP requests to ask the endpoint for computer-readable data

---

# HTTP and GET Requests

* Hypertext Transfer Protocol (HTTP) is what enables communication between browsers and the servers hosting web pages
* GET requests enable us to access a resource from a server (they only retrieve data; they don't change it on the server)
* Entering [https://smith.edu](https://smith.edu/) into a browser issues a GET request to access the home page of the Smith website

---

# API Calls

.pull-left[
* Sends an HTTP request URI for a certain resource to a server
* URI includes parameters about what data we wish to receive and in what format (e.g.
all colleges in MA in CSV format)
* Servers send that information back via an HTTP response
]

.pull-right[
![](https://www.seobility.net/en/wiki/images/thumb/f/f1/Rest-API.png/900px-Rest-API.png)

> Figure: REST API - Author: Seobility - License: [CC BY-SA 4.0](https://www.seobility.net/en/wiki/Creative_Commons_License_BY-SA_4.0)]

---

# API Keys

* Many services require you to request and reference an API key before accessing data from their API
* Allows systems to track abuse of the service and sometimes limit requests
* Usually free
* API key gets included in the call

---

# Motivating Example: NYC 311 Service Requests

![](img/nyc311.png)

---

# Constructing a Query

Base URL is the API endpoint: [https://data.cityofnewyork.us/resource/erm2-nwe9.csv](https://data.cityofnewyork.us/resource/erm2-nwe9.csv)

![](img/endpoint.png)

---

# Filtering

[https://data.cityofnewyork.us/resource/erm2-nwe9.csv](https://data.cityofnewyork.us/resource/erm2-nwe9.csv)

* Filters are appended after a `?`
* Multiple filters are combined with `&`
* `$limit=` limits the number of rows downloaded to a certain number

[https://data.cityofnewyork.us/resource/erm2-nwe9.csv?unique_key=10693408](https://data.cityofnewyork.us/resource/erm2-nwe9.csv?unique_key=10693408)

[https://data.cityofnewyork.us/resource/erm2-nwe9.json?complaint_type=Obstruction&$limit=100](https://data.cityofnewyork.us/resource/erm2-nwe9.json?complaint_type=Obstruction&$limit=100)

---

# API Documentation

* Indicates how to sign up for an API key
* Indicates possible output formats (e.g. CSV, JSON, XML, etc.)
* Lists field names and descriptions
* Provides example API calls
* Outlines error messages and solutions

> [https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9](https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9)

---

# Spaces and Special Characters

URLs can only contain a limited set of characters, so spaces and other special characters (e.g. reserved punctuation or non-ASCII characters) are replaced with percent-encoded codes that servers do recognize:

* space ` `: %20
* `!`: %21
* `"`: %22
* `%`: %25
* `'`: %27
* `-`: %2D

There are many resources online for identifying these.

---

# Example Request with Special Characters

[https://data.cityofnewyork.us/resource/erm2-nwe9.csv?complaint_type=Noise%20%2D%20Commercial&$limit=200](https://data.cityofnewyork.us/resource/erm2-nwe9.csv?complaint_type=Noise%20%2D%20Commercial&$limit=200)

![](img/url.png)

---

# Reading API Output into R

* When API data can be output as a CSV, the URL can be passed directly to `read_csv()`

```r
library(readr)
nyc_recent_noise <- read_csv("https://data.cityofnewyork.us/resource/erm2-nwe9.csv?complaint_type=Noise%20%2D%20Commercial&$limit=200")
head(nyc_recent_noise)
```

```
## # A tibble: 6 × 41
##   unique_key created_date        closed_date         agency agency_name
##        <dbl> <dttm>              <dttm>              <chr>  <chr>
## 1   29141294 2014-10-25 00:39:25 2014-10-25 04:48:29 NYPD   New York City Polic…
## 2   29148351 2014-10-26 00:39:56 2014-10-26 15:31:41 NYPD   New York City Polic…
## 3   45750796 2020-03-04 00:17:57 2020-03-04 01:36:10 NYPD   New York City Polic…
## 4   45751284 2020-03-04 20:53:05 2020-03-05 01:33:57 NYPD   New York City Polic…
## 5   29238705 2014-11-08 01:34:10 2014-11-08 23:50:53 NYPD   New York City Polic…
## 6   45828733 2020-03-08 01:29:49 2020-03-08 03:06:49 NYPD   New York City Polic…
## # … with 36 more variables: complaint_type <chr>, descriptor <chr>,
## #   location_type <chr>, incident_zip <dbl>, incident_address <chr>,
## #   street_name <chr>, cross_street_1 <chr>, cross_street_2 <chr>,
## #   intersection_street_1 <chr>, intersection_street_2 <chr>,
## #   address_type <chr>, city <chr>, landmark <chr>, facility_type <chr>,
## #   status <chr>, due_date <dttm>, resolution_description <chr>,
## #   resolution_action_updated_date <dttm>, community_board <chr>, bbl <dbl>, …
```
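
---

# Building the Query URL in R

The filtering and percent-encoding steps from the earlier slides can also be done in code instead of by hand. The sketch below is one possible approach, not the method used in the slides: `build_query_url()` is a hypothetical helper written for illustration, and it assumes that only the filter *values* need encoding (names such as `$limit` are passed through unchanged). It relies on base R's `utils::URLencode()`.

```r
# Hypothetical helper (not from the slides or any package): combine an
# endpoint with a named list of filters, percent-encoding each value.
build_query_url <- function(endpoint, filters) {
  # Encode every filter value; reserved = TRUE also encodes characters
  # such as `&` and `=` that would otherwise break the query string.
  encoded_values <- vapply(
    filters,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )

  # Join name=value pairs with `&` and append them after a `?`.
  query <- paste0(names(filters), "=", encoded_values, collapse = "&")
  paste0(endpoint, "?", query)
}

noise_url <- build_query_url(
  "https://data.cityofnewyork.us/resource/erm2-nwe9.csv",
  list(complaint_type = "Noise - Commercial", `$limit` = 200)
)
noise_url

# The resulting URL could then be passed to read_csv(), e.g.:
# nyc_noise <- readr::read_csv(noise_url)
```

Note that `URLencode()` leaves `-` as-is because a hyphen is already allowed in URLs; `-` and `%2D` are equivalent in a request.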