library(tidyverse)
library(janitor)
fhv <- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv")
5. Data cleaning and the pipe
Cleaning Data
Let’s read in data like we did last time. We’re going to “clean” it, which just means making it easier to use and getting it into tidy format.
When we read this in, some of the variable names are unfriendly because they have spaces in them. To access those variables we have to wrap their names in `backticks`, which is clunky. The janitor package has helpful data cleaning functions. Install it and take a look at the clean_names() function.
For many functions, the first argument is the name of a dataframe. In this case we want to clean the names of our fhv dataframe.
fhv_clean <- clean_names(fhv)
Now our names are clean: they are all lowercase, and all spaces have been replaced with underscores. This will make it easier to refer to our column names as we transform data going forward.
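To see the difference on something small, here is a minimal sketch using a made-up two-row table. The column names and values are assumptions that only imitate the style of the raw file; they are not taken from the real dataset.

```r
library(tibble)
library(janitor)

# Hypothetical toy data: names with spaces, like the raw file has
toy <- tibble(
  `Vehicle License Number` = c("A1", "B2"),
  `Base Name` = c("Base One", "Base Two")
)

# Before cleaning, a name with a space needs backticks
toy$`Base Name`

# After cleaning, names are lowercase with underscores, so no backticks needed
toy_clean <- clean_names(toy)
names(toy_clean)
toy_clean$base_name
```

The data itself is unchanged; only the column names are different.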
But instead of assigning a new dataframe each time we want to apply a function, we can chain several functions together in one step.
The Pipe
Let’s say we also wanted to change the name of a variable. Using the pipe %>% we can apply multiple functions to the same dataframe, one after another. To type it in RStudio, use the shortcut shift+command+m on Mac or shift+ctrl+m on Windows.
Let’s try rename(), a function to change the names of columns. I don’t know what the “veh” column means, so I’m going to look it up in the data dictionary on the open data page.
It’s an indicator for whether the vehicle is hybrid, so I’m going to rename it “hybrid”. Use the documentation for rename() to figure out the right syntax.
To use the pipe, start with the name of the data frame you want to edit, then put a pipe at the end of each line that is followed by another function. Give each function its own indented line to keep your code organized.
fhv_clean <- fhv %>%
  clean_names() %>%
  rename(hybrid = veh)
# clean_names() is empty because its first argument is the dataframe, which has been piped in for us
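If it helps to see what the pipe is doing, here is a small sketch that writes the same two steps both ways: nested inside each other, and piped. The toy table and its values are made up for illustration; only clean_names() and rename() come from the lesson.

```r
library(dplyr)
library(janitor)

# Hypothetical toy data standing in for fhv
toy <- tibble(
  `License Type` = c("X", "Y"),
  Veh = c("HYB", NA)
)

# Nested version: read from the inside out
nested <- rename(clean_names(toy), hybrid = veh)

# Piped version: read from top to bottom
piped <- toy %>%
  clean_names() %>%
  rename(hybrid = veh)

# Both versions give the same column names
names(nested)
names(piped)
```

The pipe hands whatever is on its left to the function on its right as the first argument, which is why the piped version reads in the order the steps happen.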
Now we have a dataset with clean names and a renamed column, “hybrid”.
As we learn more and more functions, we’ll have longer chains of pipes to clean and construct datasets.
fhv_clean
# A tibble: 98,318 × 23
active vehicle_license_number name license_type expiration_date
<chr> <chr> <chr> <chr> <chr>
1 YES 5608977 AMERICAN,UNITED,T… FOR HIRE VE… 04/30/2025
2 YES 5645622 RAMA,ILIR FOR HIRE VE… 09/11/2023
3 YES 5192507 ORDONEZ,ELIAS FOR HIRE VE… 03/08/2025
4 YES 5378856 RIVERA,ENMA FOR HIRE VE… 11/12/2024
5 YES 5852121 A/VA,SERVICE,CORP FOR HIRE VE… 04/11/2024
6 YES 5415237 REYES,JUAN,E FOR HIRE VE… 10/31/2023
7 YES 5643301 BEGUM,TAZMINUR FOR HIRE VE… 09/30/2025
8 YES 5701439 GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024
9 YES 5790931 GOMEZ,JOSE,A FOR HIRE VE… 05/23/2025
10 YES 5743759 HOSSAIN,SM,KAMAL FOR HIRE VE… 12/08/2024
# ℹ 98,308 more rows
# ℹ 18 more variables: permit_license_number <chr>,
# dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
# wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
# vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
# hybrid <chr>, base_telephone_number <chr>, website <chr>,
# base_address <chr>, reason <chr>, order_date <lgl>, …