# A tibble: 5 × 3
player assists points
<chr> <dbl> <dbl>
1 A 6 12
2 B 7 19
3 C 14 7
4 D 4 6
5 E 5 10
But sometimes getting data stored on the web is more complicated, or you are accessing massive databases that would take forever to read the whole thing into R.
When downloading data from NYC Open Data and other sources, you can use a special set of commands to filter and manipulate the data before downloading it to your local computer. These commands, unlike ones you’d write in an R script, are actually written in the URL itself. This kind of command is called an “API Call”— API stands for “Application programming interface” and is just a special system and language for users to communicate with the back end data servers. API Calls to APIs use a special syntax, which will be included in an API’s documentation. For NYC’s Open data, that syntax is very similar to a popular language called SQL or Structured Query Language.
Some APIs will require registration of a “token” which will be an input in your URL. We’ll go over those with an example using the census API.
Writing API Calls to NYC Open Data
Using API Calls is particularly useful when you’re dealing with a **huge dataset** that would otherwise be a hassle to download in full. For this example, we’re going to use the HPD’s Housing Maintinance Code Violations dataset on NYC Open Data, which has ~7 million rows.
Grabbing the “API Endpoint”
Let’s first grab the beginning part of the URL that we’re going to use to write our API Call. This is called the “API Endpoint” and can be found by clicking the “API” tab on the NYC Open Data page for the dataset you’re working with.
Before copying the URL, make sure you set the data format toggle from “JSON” to “CSV”, as that is the format we’re going to want our data in.
Copy the API Endpoint into a text editor - or an R script! - (I prefer Sublime Text or Visual Studio Code— others like Word or Pages have a tendency to “auto-correct” certain letters and syntax which may mess you up).
Now, to initiate our query, we are going to add ?$query= to the end of our URL:
At the end our URL, we can now add some special code to filter the violations data for our download.
To do this, we’re going to want to first take a look at the API Documentation for our dataset of choice, which can be found by clicking on the “API” tab again on the dataset’s Open Data page and clicking the “API Docs” button. Specifically, this documentation gives us a run down of all of the columns in the data and how we can reference them by name in our API call.
For this example, we want to look at the most serious (class C) HPD Violations within the past month. So, we’re going to write out our query as such:
SELECT * – this selects all columns of the data
WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C' – this filters only rows where the inspectiondate value is between June 1st and Aug 31st, and the class of the violation is 'C'. The AND operator here allows us to include multiple filtering conditions at once, and could even include conditions on other columns. Note the special format that the dates come in… we were able to spot this by looking at the Documentation.
LIMIT 100000 — this sets the maximum number of downloadable rows to 100,000. It’s good practice to set a limit here so we don’t accidentially try downloading millions of rows at once. Note: if you don’t specify, the default limit is just 1,000 rows!
You can find more information on the types of queries you can write on the Socrata Developers Portal (Socrata is the special “flavor” of API that NYC Open Data uses).
Running our API Call
We add the above pieces in that order to our URL:
https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C' LIMIT 100000
Now, you can copy this full url into your browser and press ENTER— your special download should begin!
We can also use string operators to create and quickly modify the different components of our API call.
base_url <-"https://data.cityofnewyork.us/resource/wvxf-dwi5.csv"inspectiondate_range <-c("2021-06-01T00:00:01.000","2021-09-01T00:00:01.000")class <-"C"limit <-c("100000")full_api_call <-paste0(base_url, "?$query=SELECT * WHERE inspectiondate>='",inspectiondate_range[1],"' AND inspectiondate<'",inspectiondate_range[2],"' AND class='",class,"' LIMIT ",limit)
Importing your data directly into R Studio
Once you have your data downloaded via API Call, you can feel free to import it into your R project like any other CSV. If you want to use the URL you created to import it directly, you can do that as well:
library(tidyverse)library(fs)# R doesn't like weird characters like spaces and carats, so we need the `URLencode` function here to encode those symbols properlyurl_hpd_viol <-URLencode("https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000' AND class='C' LIMIT 100000")# Now, we can use our formatted url inside our `read_csv` functionsummer_hpd_viols <-read_csv(url_hpd_viol)
Note: Always check the size of your output
Sometimes, the limit on your API Call may make your data export smaller than your desired outcome, and you won’t necessarily be notified. Therefore, it is always very important to check the number of rows of your data from your API Call before proceeding with analysis— if the number of rows matches the exact number of your limit (or is 1000, the default limit), it’s very likely that your data got cut off and you don’t have the complete set of data that you wanted.
The below example illustrates this problem and shows how to diagnose. For the example, imagine that we didn’t include a LIMIT clause in our API Call query:
https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000'
url_viol_no_limit =URLencode("https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$query=SELECT * WHERE inspectiondate>='2021-06-01T00:00:01.000' AND inspectiondate<'2021-09-01T00:00:01.000'")summer_violations_cut_off <-read_csv(url_viol_no_limit)# Using the head() function won't actually reveal the cut-off problem:head(summer_violations_cut_off)
By looking at the head of your dataset, things appear to be fine. However, let’s use the nrow() function to get a sense of how many rows we have:
nrow(summer_violations_cut_off)
[1] 1000
Given that our data output is 1,000 rows, which is exactly the default limit for API Calls to NYC Open Data, it’s very likely that our data got cut off and there are more rows within our filtering conditions that we want.
Our next step would be to increase our LIMIT in our API CAll until get a number of outputs rows below the limit value. In our first example, you can see we’ve done just that— our LIMIT was set to 100,000 rows and we only received around 20K or so rows. Safe to say we got all of the rows that fit our filtering criteria…
Other APIs
Thousands of sites have APIs that work similarly. Each API will have specific documentation and formatting. Once you read up on the documentation for one, the others will be easier to decipher. Here are some examples
A number of developers have also created R packages that make certain APIs easier to use. The example we will look at next, tidycensus, is an example of a wrapper for the US Census API. It will construct API calls based on parameters we set in a function. Here are some other API wrappers.
RSocrata (Many Open Data Portals, including NYC’s use Socrata’s data system)