A dataset can be written as wide and long. A wide format contains values that do not repeat in the first column, whereas long format contains values that do repeat in the first column. Imagine keeping statistics for a basket ball game. You could do it two ways:
Long data is tidy - Every column is a variable Every row is an observation. Every cell is a single value.
We can use pivoting to convert data between wide and long formats. Here’s a visual for what pivoting does.
Pivoting Longer
For this demo, we want to make a timeline of rent stabilized units in NYC.
So, let’s start by loading in some data from https://github.com/talos/nyc-stabilization-unit-counts.
You can save this file in your project folder and read it in how we’re used to. Or we can read it directly into our environment, a cool feature of read_csv() that lets you pull directly from a weblink.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 46461 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): borough, 2007est, 2008est, 2009est, 2009dhcr, 2009abat, 2010est, 2...
dbl (25): ucbbl, 2007uc, 2008uc, 2009uc, 2010uc, 2011uc, 2012uc, 2013uc, 201...
lgl (9): 2007dhcr, 2007abat, 2008dhcr, 2008abat, 2010dhcr, 2014dhcr, 2015dh...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This project scraped data from PDFs of property tax documents to get estimates for rent stabilized units counts in buildings across NYC. You can read up on the various field names at the Github project page:
For this demo, we only want to look at rent stabilized unit counts, which according to the Github doccumentation corresponds to column names that end in “uc”. Let’s also grab BBL (which is a unique identifier for NYC buildings) and Borough while we’re at it:
rent_stab <- rent_stab_raw %>%select(borough, ucbbl, ends_with("uc"))# starts_with(…) and ends_with(…) are neat selector functions to help you # grab names that fit a certain pattern
Annoyingly, the data separates unit counts for different years into different columns… to make a timeline, we need all of the yearly data to be stored in one column.
We can use the pivot_longer function included in tidyverse to transform our data accordingly. Here is how we apply the `pivot_longer` function to our data:
rs_long <- rent_stab %>%pivot_longer(ends_with("uc"), # The multiple column names we want to mush into one columnnames_to ="year", # The title for the new column of names we're generatingvalues_to ="units"# The title for the new column of values we're generating )
Now we have data that is “tidy” there is one row for each year for each building. So each observation is a bbl-year pair.
Pivoting Wider
Additionally, you may have data that is in this “long” format and wish to transform it into the “wide” format we are used to. Luckily, there is an analogous function called `pivot_wider` that does just that:
rs_wide <- rs_long %>%pivot_wider(names_from = year, # The current column containing our future column namesvalues_from = units # The current column containing the values for our future columns )