5. Data cleaning and the pipe

Author

Patrick Spauster

Video Tutorial

Cleaning Data

Let’s read in data like we did last time. We’re going to “clean” it, which just means making it easier to use and getting it into tidy format.

library(tidyverse)
library(janitor)
fhv <- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv")

When we read this in, we have some unfriendly names of variables with spaces in them. To access those variables we have to use `backticks` which are clunky. The janitor package has helpful data cleaning functions. Install it and take a look at the clean_names() function.

For many functions, the first argument is always the name of a dataframe. In this case we want to clean the names of our fhv dataframe.

fhv_clean <- clean_names(fhv)

Now our names are clean - they are all lowercase, and have replaced all spaces with underscores. This will make it easier to refer to our column names as we transform data going forward.

But instead of assigning a new dataframe each time we want to apply a function, we should apply more than one function at once.

The Pipe

Let’s say we also wanted to change the name of a variable. Using the pipe %>% we can apply multiple functions to the same dataframe. Use the shortcut shift+command+m on mac or shift+ctrl+m on windows

Let’s try rename() a function to change the names of columns. I don’t know what the “veh” column means so I’m going to look it up in the data dictionary on the open data page.

It’s an indicator for whether the vehicle is hybrid, so i’m going to rename it “hybrid”. Use the documentation for rename() to figure out the right syntax.

To use the pipe, start with the name of the data frame you want to edit, and then chain the pipes after each function using some indenting to organize your code.

fhv_clean <- fhv %>% 
  clean_names() %>% 
  rename(hybrid = veh)

#clean_names() is empty because the first argument is just the name of the dataframe, which has been piped in for us

Now we have a dataset with clean names and a renamed column “hybrid”

As we learn more and more functions, we’ll have longer chains of pipes to clean and construct datasets.

fhv_clean

# A tibble: 98,318 × 23
   active vehicle_license_number name               license_type expiration_date
   <chr>  <chr>                  <chr>              <chr>        <chr>          
 1 YES    5608977                AMERICAN,UNITED,T… FOR HIRE VE… 04/30/2025     
 2 YES    5645622                RAMA,ILIR          FOR HIRE VE… 09/11/2023     
 3 YES    5192507                ORDONEZ,ELIAS      FOR HIRE VE… 03/08/2025     
 4 YES    5378856                RIVERA,ENMA        FOR HIRE VE… 11/12/2024     
 5 YES    5852121                A/VA,SERVICE,CORP  FOR HIRE VE… 04/11/2024     
 6 YES    5415237                REYES,JUAN,E       FOR HIRE VE… 10/31/2023     
 7 YES    5643301                BEGUM,TAZMINUR     FOR HIRE VE… 09/30/2025     
 8 YES    5701439                GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024     
 9 YES    5790931                GOMEZ,JOSE,A       FOR HIRE VE… 05/23/2025     
10 YES    5743759                HOSSAIN,SM,KAMAL   FOR HIRE VE… 12/08/2024     
# ℹ 98,308 more rows
# ℹ 18 more variables: permit_license_number <chr>,
#   dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
#   wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
#   vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
#   hybrid <chr>, base_telephone_number <chr>, website <chr>,
#   base_address <chr>, reason <chr>, order_date <lgl>, …