library(tidyverse)
library(janitor)
<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>%
fhv_clean clean_names() %>%
rename(hybrid = veh)
6. Select and Filter
Video Tutorial
Select
Let’s read in our data and do some cleaning up of the names with the pipe
We have a lot of information in this data frame. What if we want to look at just a few rows and columns. Two core dplyr
functions, select
and filter
, help us do so. dplyr
is a core part of the tidyverse, and it has functions that modify dataframes (think of the pipe!)
Let’s try just keeping active
, vehicle_lisence_number
, name
, license_type
, vehicle_year
, base_name
, and base_type
. Select’s first argument is the dataframe, and the following arguments are all the names of columns. In R documentation, an ellipses argument ...
means that the function takes a list of arguments. In this case, a list of variables to select
%>%
fhv_clean select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type)
# A tibble: 98,318 × 7
active vehicle_license_number name license_type vehicle_year base_name
<chr> <chr> <chr> <chr> <dbl> <chr>
1 YES 5608977 AMERICAN,U… FOR HIRE VE… 2015 UBER USA…
2 YES 5645622 RAMA,ILIR FOR HIRE VE… 2022 UBER USA…
3 YES 5192507 ORDONEZ,EL… FOR HIRE VE… 2016 UBER USA…
4 YES 5378856 RIVERA,ENMA FOR HIRE VE… 2018 BELL LX …
5 YES 5852121 A/VA,SERVI… FOR HIRE VE… 2019 BAYRIDGE…
6 YES 5415237 REYES,JUAN… FOR HIRE VE… 2012 FIRST CL…
7 YES 5643301 BEGUM,TAZM… FOR HIRE VE… 2015 UBER USA…
8 YES 5701439 GONZALEZAL… FOR HIRE VE… 2016 UBER USA…
9 YES 5790931 GOMEZ,JOSE… FOR HIRE VE… 2017 UBER USA…
10 YES 5743759 HOSSAIN,SM… FOR HIRE VE… 2021 TRI-CITY…
# ℹ 98,308 more rows
# ℹ 1 more variable: base_type <chr>
#this dataframe has all our observations, but only 6 variables (columns)
For more advanced selection, check out the logical operations using the tidy-select
expressions. Check what -
does, for instance.
%>%
fhv_clean select(-active)
# A tibble: 98,318 × 22
vehicle_license_number name license_type expiration_date
<chr> <chr> <chr> <chr>
1 5608977 AMERICAN,UNITED,TRANSPOR… FOR HIRE VE… 04/30/2025
2 5645622 RAMA,ILIR FOR HIRE VE… 09/11/2023
3 5192507 ORDONEZ,ELIAS FOR HIRE VE… 03/08/2025
4 5378856 RIVERA,ENMA FOR HIRE VE… 11/12/2024
5 5852121 A/VA,SERVICE,CORP FOR HIRE VE… 04/11/2024
6 5415237 REYES,JUAN,E FOR HIRE VE… 10/31/2023
7 5643301 BEGUM,TAZMINUR FOR HIRE VE… 09/30/2025
8 5701439 GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024
9 5790931 GOMEZ,JOSE,A FOR HIRE VE… 05/23/2025
10 5743759 HOSSAIN,SM,KAMAL FOR HIRE VE… 12/08/2024
# ℹ 98,308 more rows
# ℹ 18 more variables: permit_license_number <chr>,
# dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
# wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
# vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
# hybrid <chr>, base_telephone_number <chr>, website <chr>,
# base_address <chr>, reason <chr>, order_date <lgl>, …
Filter
Filter does the same thing as select, but for rows that meet certain logical conditions. Let’s get all the uber vehicles. The first argument of filter is the dataframe. The second is a logical expression.
%>%
fhv_clean filter(base_name == "UBER USA, LLC")
# A tibble: 76,710 × 23
active vehicle_license_number name license_type expiration_date
<chr> <chr> <chr> <chr> <chr>
1 YES 5608977 AMERICAN,UNITED,T… FOR HIRE VE… 04/30/2025
2 YES 5645622 RAMA,ILIR FOR HIRE VE… 09/11/2023
3 YES 5192507 ORDONEZ,ELIAS FOR HIRE VE… 03/08/2025
4 YES 5643301 BEGUM,TAZMINUR FOR HIRE VE… 09/30/2025
5 YES 5701439 GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024
6 YES 5790931 GOMEZ,JOSE,A FOR HIRE VE… 05/23/2025
7 YES 5867611 HUSSAIN, TARIQ FOR HIRE VE… 05/08/2024
8 YES 5869802 LU,GUI,ZHAO FOR HIRE VE… 05/12/2024
9 YES 5715034 LI,PEI FOR HIRE VE… 08/15/2024
10 YES 5725892 HAILE,TEMESGEN,K FOR HIRE VE… 09/23/2024
# ℹ 76,700 more rows
# ℹ 18 more variables: permit_license_number <chr>,
# dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
# wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
# vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
# hybrid <chr>, base_telephone_number <chr>, website <chr>,
# base_address <chr>, reason <chr>, order_date <lgl>, …
#this dataframe has fewer rows because we have only kept the registered Ubers.
You use R’s logical operators to return the rows that you care about. Here I’ve returned all the rows where the base_name
column exactly matches the string “UBER USA, LLC.” Always use ==
for logical expressions. The single equals sign =
is just for defining the names of arguments and other list items, and will confuse R.
Here’s some other helpful logical operators you may find yourself using, to return certain strings, numbers, or lists.
%>%
fhv_clean filter(base_name %in% c("UBER USA, LLC", "Take Me 2 Inc"), #name is in the list
>= 2000, #year is greater than or equal to
vehicle_year != "HYB" #no hybrids
hybrid )
# A tibble: 6,433 × 23
active vehicle_license_number name license_type expiration_date
<chr> <chr> <chr> <chr> <chr>
1 YES 6025256 ALSAHYBI, SUHAIB FOR HIRE VE… 04/17/2025
2 YES 5707125 CITY,QUEENS,INC FOR HIRE VE… 07/12/2024
3 YES 5278357 LI,LIN FOR HIRE VE… 11/01/2023
4 YES 6015005 GULATI,SONU FOR HIRE VE… 01/23/2025
5 YES 5839092 WILSON',SONS,INC FOR HIRE VE… 12/28/2023
6 YES 5837702 AMERICAN,UNITED,T… FOR HIRE VE… 12/18/2023
7 YES 6036945 CCM NY LLC FOR HIRE VE… 08/02/2025
8 YES 6002683 WU, JINXIANG FOR HIRE VE… 08/23/2024
9 YES 5999878 ALL GREEN HAMSAF … FOR HIRE VE… 08/08/2024
10 YES 5661911 SINGH,SANDEEP FOR HIRE VE… 12/16/2023
# ℹ 6,423 more rows
# ℹ 18 more variables: permit_license_number <chr>,
# dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
# wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
# vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
# hybrid <chr>, base_telephone_number <chr>, website <chr>,
# base_address <chr>, reason <chr>, order_date <lgl>, …
Let’s combine it to get a subsample of columns and rows based on the criteria specified and assign it for further analysis
<- fhv_clean %>%
ubers_thiscentury select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type) %>%
filter(base_name == "UBER USA, LLC",
>= 2000, #year is greater than or equal to
vehicle_year )