11. Pivoting

Author

Patrick Spauster

Video Tutorial

“Wide” and “Long” Data

A dataset can be written as wide and long. A wide format contains values that do not repeat in the first column, whereas long format contains values that do repeat in the first column. Imagine keeping statistics for a basket ball game. You could do it two ways:

Long data is tidy - Every column is a variable Every row is an observation. Every cell is a single value.

We can use pivoting to convert data between wide and long formats. Here’s a visual for what pivoting does.

Pivoting Longer

For this demo, we want to make a timeline of rent stabilized units in NYC.

So, let’s start by loading in some data from https://github.com/talos/nyc-stabilization-unit-counts.

The dataset we want is here.

You can save this file in your project folder and read it in how we’re used to. Or we can read it directly into our environment, a cool feature of read_csv() that lets you pull directly from a weblink.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

rent_stab_raw <- read_csv("https://taxbillsnyc.s3.amazonaws.com/joined.csv")

Rows: 46461 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): borough, 2007est, 2008est, 2009est, 2009dhcr, 2009abat, 2010est, 2...
dbl (25): ucbbl, 2007uc, 2008uc, 2009uc, 2010uc, 2011uc, 2012uc, 2013uc, 201...
lgl  (9): 2007dhcr, 2007abat, 2008dhcr, 2008abat, 2010dhcr, 2014dhcr, 2015dh...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This project scraped data from PDFs of property tax documents to get estimates for rent stabilized units counts in buildings across NYC. You can read up on the various field names at the Github project page:

https://github.com/talos/nyc-stabilization-unit-counts#user-content-data-usage.

For this demo, we only want to look at rent stabilized unit counts, which according to the Github doccumentation corresponds to column names that end in “uc”. Let’s also grab BBL (which is a unique identifier for NYC buildings) and Borough while we’re at it:

rent_stab <- rent_stab_raw %>% select(borough, ucbbl, ends_with("uc"))

# starts_with(…) and ends_with(…) are neat selector functions to help you 
# grab names that fit a certain pattern

Annoyingly, the data separates unit counts for different years into different columns… to make a timeline, we need all of the yearly data to be stored in one column.

We can use the pivot_longer function included in tidyverse to transform our data accordingly. Here is how we apply the `pivot_longer` function to our data:

rs_long <- rent_stab %>% 
  pivot_longer(
    ends_with("uc"),  # The multiple column names we want to mush into one column
    names_to = "year", # The title for the new column of names we're generating
    values_to = "units" # The title for the new column of values we're generating
  )

Now we have data that is “tidy” there is one row for each year for each building. So each observation is a bbl-year pair.

Pivoting Wider

Additionally, you may have data that is in this “long” format and wish to transform it into the “wide” format we are used to. Luckily, there is an analogous function called `pivot_wider` that does just that:

rs_wide <- rs_long %>%
  pivot_wider(
  names_from = year, # The current column containing our future column names
  values_from = units # The current column containing the values for our future columns
  )