12. ggplot

Author

Patrick Spauster

Video Tutorial

Preparing Data for ggplot

Data for ggplot needs to be in “long” format, meaning that all the values need to be in the same column and all the variables need to be in another column.

Let’s use that rent stabilization data to create a plot. In this case the data is wide by year, meaning there is a column for each year with different values. We want a dataset where each year has its own row - where each row is a year-building combination - that we can plot.

library(tidyverse)
library(janitor)

rent_stab_long <- read_csv("https://taxbillsnyc.s3.amazonaws.com/joined.csv") %>% 
  select(borough, ucbbl, ends_with("uc")) %>% 
  pivot_longer(
    ends_with("uc"),  # The multiple column names we want to mush into one column
    names_to = "year", # The title for the new column of names we're generating
    values_to = "units" # The title for the new column of values we're generating
  )

ggplot syntax

ggplots use similar syntax to regular R operations - they are groups of functions beginning with ggplot(). Instead of the pipe, ggplot uses + between different functions to build layers on top of each other.

Typically, ggplot will start with ggplot(dataframe) + a geometric function like geom_col() and an aesthetics or aes argument that indicates which variables to plot.

After that, ggplot has a ton of options to specify the labels, scales, axes, themes, legends,and more. It’s best shown through examples

Sample Line Chart

Line charts are often a good fit for time-series data. Let’s summarize the long data by year and borough and count the number of units.

rs_long_manhattan_summary <- rent_stab_long %>% 
  filter(borough %in% c("MN","BK") # Filter only Manhattan and Brooklyn values
          & !is.na(units)) %>% # Filter out null unit count values
  mutate(year = as.numeric(gsub("uc","", year))) %>% # Remove "uc" from year values
  select(year, borough, units) %>% 
  # Grouping by 2 columns means each row will have a unique pair of the two columns' values.
  # Our rows will look like: 2007 MN, 2007 BK, 2008 MN... 
  group_by(year, borough) %>% 
  summarise(total_units = sum(units))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Now let’s plot it. We start with the data, then add aesthetics, our chart type (line), axis specifications, and labels.

rs_over_time_graph <- ggplot(rs_long_manhattan_summary) +
    # Note these arguments inside 'geom_line' :
  geom_line(aes(x=year, y=total_units, color=borough)) +
    # Restyle the Y-axis labels: 
  scale_y_continuous(
    limits = c(0,300000),
    labels = scales::unit_format(scale = 1/1000, unit="K")) +
  scale_x_continuous(breaks = seq(2007, 2017, by = 1))+
    # Restyle the Legend: 
  scale_fill_discrete(
    name="Borough",
    breaks=c("BK", "MN"),
    labels=c("Brooklyn", "Manhattan")) +
  labs(
    title = "Total Rent Stabilized Units over Time",
    subtitle = "Manhattan and Brooklyn, 2007 to 2017",
    x = "Year",
    y = "Total Rent Stabilized Units",
    caption = "Source: taxbills.nyc"
  )+
  theme_minimal()

rs_over_time_graph

Sample Bar Chart + Faceting

Let’s make some bar charts out of the table of # of children by the languages they speak at home by borough.

We’ll learn how we pulled this data in an upcoming lesson, but for now just download this csv. Or run read_csv("https://raw.githubusercontent.com/pspauster/learning_R/master/langs_by_boro.csv")

langs_by_boro_for_graphing <- read_csv("langs_by_boro.csv") %>%
  mutate(
    labels = # Add a new column with neat category labels
      case_when(
        variable == 'englishkids' ~ 'English',
        variable == 'spanishkids' ~ 'Spanish',
        variable == 'indoeurkids' ~ 'Indo-European',
        variable == 'apikids' ~ 'Asian & Pacific Islander',
        variable == 'otherkids' ~ 'Other Languages'
      ),
    labels = fct_reorder(labels, estimate)
  ) %>% # Overwrite the variable column of langs_by_boro with a function that tells R to order the variable column by the estimate column when it gets plotted (like a sort). If I wanted it ordered in the other direction I would put estimate inside of desc().
  filter(variable != 'totalkids')

Produce a horizontal bar chart

# Note - this builds progressively; you can run all the code before any + sign as a step toward the final result.

barchart <- ggplot(langs_by_boro_for_graphing) +
  aes(x = estimate, y = labels) +
  geom_col() +
  scale_x_continuous(labels = scales::comma) + # format count labels with commas and thousands
  theme_minimal() +
  theme(panel.grid.major.y = element_blank()) +
  labs(
    title = 'What languages do NYC kids speak at home?',
    subtitle = 'Number of children ages 5-17 by the language spoken at home',
    x = NULL,
    y = NULL,
    caption = "Source: Census American Community Survey 2020 5-Year Estimates, Table B16007"
  )

barchart

Now let’s facet it by borough

barchart +
  facet_wrap( ~ name, ncol = 1)

Additional Resources

This guide from the Urban Institute has a number of helpful tutorials for how to create a wide number of graphs with ggplot.