library(tidycensus)
library(tidyverse)
library(janitor)
16. Census Data with tidycensus
Video Tutorial
Intro to TidyCensus
Tidycensus is incredibly powerful and gives you access to a huge amount of census data. Once you get the hang of it, you’ll be able to grab a lot of data quickly.
https://walker-data.com/census-r/an-introduction-to-tidycensus.html
Request your own API key here: https://api.census.gov/data/key_signup.html
Install the API key:
census_api_key("8524147f6edf7fe4b7c85681397fe5acd6993d62")
To install your API key for use in future sessions, run this function with `install = TRUE`.
You can use this key for practicing this demo, but please request your own for your projects and future use.
When you get your own key, install it with census_api_key("key", install = TRUE) so you won’t have to call this function again in future sessions.
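To confirm that the install step worked, you can look for the key after restarting R. This is a minimal sketch in base R, not a tidycensus function. It assumes the key was saved to your .Renviron file under the name CENSUS_API_KEY, which is where tidycensus looks for it.

```r
# Check whether a Census API key is available in this R session.
# Assumption: the key lives in the CENSUS_API_KEY environment variable.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  message("Census API key found (", nchar(key), " characters).")
} else {
  message("No key found. Run census_api_key(\"key\", install = TRUE) and restart R.")
}
```

If the message says no key was found right after installing, restarting R (or running readRenviron("~/.Renviron")) usually picks it up.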
Browse available variables:
Use load_variables() to browse and find the variables of interest.
v20 <- load_variables(2020, "acs5", cache = TRUE)
AGE BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER
B16007_001 Estimate!!Total:
B16007_002 Estimate!!Total:!!5 to 17 years:
B16007_003 Estimate!!Total:!!5 to 17 years:!!Speak only English
B16007_004 Estimate!!Total:!!5 to 17 years:!!Speak Spanish
B16007_005 Estimate!!Total:!!5 to 17 years:!!Speak other Indo-European languages
B16007_006 Estimate!!Total:!!5 to 17 years:!!Speak Asian and Pacific Island languages
B16007_007 Estimate!!Total:!!5 to 17 years:!!Speak other languages
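Scrolling through thousands of variables gets tedious, so it can help to filter the table instead. The sketch below uses a small stand-in tibble (v20_sample, a hypothetical name) with the name, label, and concept columns that load_variables() returns, filled with a few rows from the listing above. With a real API key you would run the same filter on v20.

```r
library(dplyr)
library(stringr)

# Stand-in for v20: same column names, a handful of rows copied from the listing above
v20_sample <- tibble(
  name = c("B16007_001", "B16007_002", "B16007_003", "B16007_004"),
  label = c(
    "Estimate!!Total:",
    "Estimate!!Total:!!5 to 17 years:",
    "Estimate!!Total:!!5 to 17 years:!!Speak only English",
    "Estimate!!Total:!!5 to 17 years:!!Speak Spanish"
  ),
  concept = "AGE BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER"
)

# Keep only the language-at-home variables whose label mentions Spanish
spanish_vars <- v20_sample %>%
  filter(
    str_detect(concept, "LANGUAGE SPOKEN AT HOME"),
    str_detect(label, "Spanish")
  )

spanish_vars$name
```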
Pull data
I use the “Basic usage of tidycensus” webpage to find the right argument names.
langs_by_puma <- get_acs(
  geography = "public use microdata area",
  variables = c(
    totalkids = "B16007_002",
    englishkids = "B16007_003",
    spanishkids = "B16007_004",
    indoeurkids = "B16007_005",
    apikids = "B16007_006",
    otherkids = "B16007_007"
  ),
  state = "New York",
  year = 2020,
  survey = "acs5"
) %>%
  filter(str_detect(NAME, "NYC")) %>%
  mutate(moeshare = moe / estimate)
Getting data from the 2016-2020 5-year ACS
langs_by_boro <- get_acs(
  geography = "county",
  variables = c(
    totalkids = "B16007_002",
    englishkids = "B16007_003",
    spanishkids = "B16007_004",
    indoeurkids = "B16007_005",
    apikids = "B16007_006",
    otherkids = "B16007_007"
  ),
  state = "New York",
  year = 2020,
  survey = "acs5"
) %>%
  filter(
    NAME == "Kings County, New York" |
      NAME == "Queens County, New York" |
      NAME == "New York County, New York" |
      NAME == "Bronx County, New York" |
      NAME == "Richmond County, New York"
  ) %>%
  clean_names()
Getting data from the 2016-2020 5-year ACS
Write langs_by_boro as a CSV file into the same folder
write_csv(langs_by_boro, 'langs_by_boro.csv')
#this output might look familiar from our ggplot lesson!
Tidycensus also has a “wide” option to make calculating percentages, for example, easier.
get_acs(
  geography = "county",
  variables = c(
    totalkids = "B16007_002",
    englishkids = "B16007_003",
    spanishkids = "B16007_004",
    indoeurkids = "B16007_005",
    apikids = "B16007_006",
    otherkids = "B16007_007"
  ),
  state = "New York",
  year = 2020,
  survey = "acs5",
  output = "wide" # look here!
) %>%
  filter(
    NAME == "Kings County, New York" |
      NAME == "Queens County, New York" |
      NAME == "New York County, New York" |
      NAME == "Bronx County, New York" |
      NAME == "Richmond County, New York"
  ) %>%
  clean_names() %>%
  mutate(
    pct_englishkids = englishkids_e / totalkids_e, # _e is the estimate and _m is the margin of error
    pct_spanishkids = spanishkids_e / totalkids_e,
    pct_indoeurkids = indoeurkids_e / totalkids_e,
    pct_aapikids = apikids_e / totalkids_e,
    pct_otherkids = otherkids_e / totalkids_e
  ) %>%
  select(name, starts_with("pct"))
Getting data from the 2016-2020 5-year ACS
# A tibble: 5 × 6
name pct_englishkids pct_spanishkids pct_indoeurkids pct_aapikids
<chr> <dbl> <dbl> <dbl> <dbl>
1 New York County,… 0.577 0.265 0.0734 0.0545
2 Richmond County,… 0.722 0.108 0.0837 0.0487
3 Bronx County, Ne… 0.466 0.437 0.0391 0.00797
4 Kings County, Ne… 0.555 0.147 0.191 0.0686
5 Queens County, N… 0.495 0.251 0.129 0.105
# ℹ 1 more variable: pct_otherkids <dbl>
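To see why the wide layout makes percentages easier, here is a small sketch with made-up numbers (the county names and counts are purely illustrative). It reshapes long rows into wide columns with tidyr's pivot_wider(), which is roughly what output = "wide" does for you, except that tidycensus also adds E and M suffixes for the estimate and margin of error.

```r
library(dplyr)
library(tidyr)

# Made-up numbers in the long layout: one row per place and variable
long <- tibble(
  NAME = rep(c("County A", "County B"), each = 2),
  variable = rep(c("totalkids", "spanishkids"), times = 2),
  estimate = c(1000, 250, 400, 100)
)

# One row per place, one column per variable, so a share is a simple division
wide <- long %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(pct_spanishkids = spanishkids / totalkids)

wide
```

In the long layout the numerator and denominator sit on different rows, so the same calculation would need a group_by() or a join first.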
Crosswalks and matching census geographies
Let’s say I wanted to compare a number of statistics by Brooklyn neighborhood. But oh no! The census doesn’t provide data at the neighborhood level. Fear not: you can use something called a crosswalk to match data from different geographies. Census data makes this easy because it comes with standard GEOID columns that let you match data across geographies.
Census data is powerful because you can join it with data on whatever you are interested in at almost any geographic level.
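GEOIDs work well as join keys because they are nested: a tract GEOID is the 2-digit state code, then the 3-digit county code, then the 6-digit tract code. The sketch below pulls those pieces apart with base R. The tract GEOID is only an illustrative value.

```r
# Illustrative tract GEOID: 2-digit state + 3-digit county + 6-digit tract
tract_geoid <- "36047000100"

state_fips   <- substr(tract_geoid, 1, 2)   # "36" is New York
county_geoid <- substr(tract_geoid, 1, 5)   # "36047" is Kings County (Brooklyn)
tract_code   <- substr(tract_geoid, 6, 11)  # the tract within the county

c(state = state_fips, county = county_geoid, tract = tract_code)
```

This is also why GEOIDs should be stored as character rather than numeric: states with codes like "01" would lose their leading zero, and the join would silently fail to match.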
First I’ll grab some statistics at the tract level in Brooklyn.
pop_data <- get_acs(
  geography = "tract",
  variables = c(
    total_population = "B01003_001",
    median_household_income = "B19019_001",
    race_eth_denom = "B03002_001",
    white_nonhsp = "B03002_003"
  ),
  year = 2021,
  state = "New York",
  county = "Kings"
)
Getting data from the 2017-2021 5-year ACS
Then I’ll read in this crosswalk between neighborhoods and tracts from the NYC Open Data portal. By joining the two on GEOID, I can summarize at the neighborhood level.
nta_tract <- read_csv("https://data.cityofnewyork.us/resource/hm78-6dwm.csv",
  col_types = cols(geoid = col_character())) %>%
  filter(boroname == "Brooklyn")

pop_data_joined <- nta_tract %>%
  full_join(pop_data, by = c("geoid" = "GEOID"))

pop_data_joined %>%
  filter(variable == "total_population") %>%
  group_by(ntaname) %>%
  summarize(total_population = sum(estimate))
# A tibble: 53 × 2
ntaname total_population
<chr> <dbl>
1 Bath Beach 33765
2 Bay Ridge 85796
3 Bedford-Stuyvesant (East) 92298
4 Bedford-Stuyvesant (West) 89224
5 Bensonhurst 102241
6 Borough Park 84493
7 Brighton Beach 31295
8 Brooklyn Heights 24775
9 Brooklyn Navy Yard 0
10 Bushwick (East) 62987
# ℹ 43 more rows
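A full join keeps rows from both tables even when they have no match, so it is worth checking for unmatched rows before trusting the totals. The sketch below uses made-up tract IDs, neighborhood names, and counts, and shows one way to do that check. The na.rm = TRUE in the sum is my own addition for the toy data, not something the code above uses.

```r
library(dplyr)

# Made-up crosswalk and tract-level data
crosswalk <- tibble(
  geoid = c("001", "002", "003"),
  ntaname = c("Hood A", "Hood A", "Hood B")
)
tracts <- tibble(
  GEOID = c("001", "002", "004"),
  estimate = c(100, 200, 50)
)

joined <- crosswalk %>%
  full_join(tracts, by = c("geoid" = "GEOID"))

# Rows that found no partner on one side of the join
unmatched <- joined %>%
  filter(is.na(ntaname) | is.na(estimate))

# Neighborhood totals, ignoring tracts that have no neighborhood
by_nta <- joined %>%
  filter(!is.na(ntaname)) %>%
  group_by(ntaname) %>%
  summarize(total_population = sum(estimate, na.rm = TRUE))

unmatched
by_nta
```

Here tract "003" has a neighborhood but no data and tract "004" has data but no neighborhood, so both show up in unmatched.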
What if I just wanted to compare one neighborhood to the rest of Brooklyn? Using the pull() function, which extracts a single column from a data frame as a vector, I can mutate my way to that summary.
gowanus_tracts <- nta_tract %>%
  mutate(geoid = as.character(geoid)) %>%
  filter(ntaname == "Carroll Gardens-Cobble Hill-Gowanus-Red Hook") %>%
  pull(geoid)

pop_data_clean <- pop_data %>%
  mutate(gowanus = if_else(GEOID %in% gowanus_tracts, "Gowanus", "Rest of BK"))

pop_data_clean %>%
  filter(variable == "total_population") %>%
  group_by(gowanus) %>%
  summarize(total = sum(estimate))
# A tibble: 2 × 2
gowanus total
<chr> <dbl>
1 Gowanus 63167
2 Rest of BK 2649193
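The same pull() and %in% pattern can be tried without any census data. This sketch uses made-up tract IDs and counts to show each step: pull() returns a plain character vector, %in% tests each GEOID against it, and if_else() turns that test into a label.

```r
library(dplyr)

# Made-up crosswalk and tract populations
crosswalk <- tibble(
  geoid = c("001", "002", "003"),
  ntaname = c("Hood A", "Hood A", "Hood B")
)
tracts <- tibble(
  GEOID = c("001", "002", "003"),
  estimate = c(100, 200, 50)
)

# pull() returns a vector, not a data frame
hood_a_tracts <- crosswalk %>%
  filter(ntaname == "Hood A") %>%
  pull(geoid)

# Label each tract, then total by label
area_totals <- tracts %>%
  mutate(area = if_else(GEOID %in% hood_a_tracts, "Hood A", "Everywhere else")) %>%
  group_by(area) %>%
  summarize(total = sum(estimate))

area_totals
```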