6. Select and Filter

Author

Patrick Spauster

Video Tutorial

Select

Let’s read in our data and do some cleaning up of the names with the pipe

library(tidyverse)
library(janitor)

fhv_clean <- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>% 
  clean_names() %>% 
  rename(hybrid = veh)

We have a lot of information in this data frame. What if we want to look at just a few rows and columns. Two core dplyr functions, select and filter, help us do so. dplyr is a core part of the tidyverse, and it has functions that modify dataframes (think of the pipe!)

Let’s try just keeping active, vehicle_lisence_number, name, license_type, vehicle_year, base_name, and base_type. Select’s first argument is the dataframe, and the following arguments are all the names of columns. In R documentation, an ellipses argument ... means that the function takes a list of arguments. In this case, a list of variables to select

fhv_clean %>% 
  select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type)

# A tibble: 98,318 × 7
   active vehicle_license_number name        license_type vehicle_year base_name
   <chr>  <chr>                  <chr>       <chr>               <dbl> <chr>    
 1 YES    5608977                AMERICAN,U… FOR HIRE VE…         2015 UBER USA…
 2 YES    5645622                RAMA,ILIR   FOR HIRE VE…         2022 UBER USA…
 3 YES    5192507                ORDONEZ,EL… FOR HIRE VE…         2016 UBER USA…
 4 YES    5378856                RIVERA,ENMA FOR HIRE VE…         2018 BELL LX …
 5 YES    5852121                A/VA,SERVI… FOR HIRE VE…         2019 BAYRIDGE…
 6 YES    5415237                REYES,JUAN… FOR HIRE VE…         2012 FIRST CL…
 7 YES    5643301                BEGUM,TAZM… FOR HIRE VE…         2015 UBER USA…
 8 YES    5701439                GONZALEZAL… FOR HIRE VE…         2016 UBER USA…
 9 YES    5790931                GOMEZ,JOSE… FOR HIRE VE…         2017 UBER USA…
10 YES    5743759                HOSSAIN,SM… FOR HIRE VE…         2021 TRI-CITY…
# ℹ 98,308 more rows
# ℹ 1 more variable: base_type <chr>

#this dataframe has all our observations, but only 6 variables (columns)

For more advanced selection, check out the logical operations using the tidy-select expressions. Check what - does, for instance.

fhv_clean %>% 
  select(-active)

# A tibble: 98,318 × 22
   vehicle_license_number name                      license_type expiration_date
   <chr>                  <chr>                     <chr>        <chr>          
 1 5608977                AMERICAN,UNITED,TRANSPOR… FOR HIRE VE… 04/30/2025     
 2 5645622                RAMA,ILIR                 FOR HIRE VE… 09/11/2023     
 3 5192507                ORDONEZ,ELIAS             FOR HIRE VE… 03/08/2025     
 4 5378856                RIVERA,ENMA               FOR HIRE VE… 11/12/2024     
 5 5852121                A/VA,SERVICE,CORP         FOR HIRE VE… 04/11/2024     
 6 5415237                REYES,JUAN,E              FOR HIRE VE… 10/31/2023     
 7 5643301                BEGUM,TAZMINUR            FOR HIRE VE… 09/30/2025     
 8 5701439                GONZALEZALVARADO,L        FOR HIRE VE… 06/13/2024     
 9 5790931                GOMEZ,JOSE,A              FOR HIRE VE… 05/23/2025     
10 5743759                HOSSAIN,SM,KAMAL          FOR HIRE VE… 12/08/2024     
# ℹ 98,308 more rows
# ℹ 18 more variables: permit_license_number <chr>,
#   dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
#   wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
#   vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
#   hybrid <chr>, base_telephone_number <chr>, website <chr>,
#   base_address <chr>, reason <chr>, order_date <lgl>, …

Filter

Filter does the same thing as select, but for rows that meet certain logical conditions. Let’s get all the uber vehicles. The first argument of filter is the dataframe. The second is a logical expression.

fhv_clean %>% 
  filter(base_name == "UBER USA, LLC")

# A tibble: 76,710 × 23
   active vehicle_license_number name               license_type expiration_date
   <chr>  <chr>                  <chr>              <chr>        <chr>          
 1 YES    5608977                AMERICAN,UNITED,T… FOR HIRE VE… 04/30/2025     
 2 YES    5645622                RAMA,ILIR          FOR HIRE VE… 09/11/2023     
 3 YES    5192507                ORDONEZ,ELIAS      FOR HIRE VE… 03/08/2025     
 4 YES    5643301                BEGUM,TAZMINUR     FOR HIRE VE… 09/30/2025     
 5 YES    5701439                GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024     
 6 YES    5790931                GOMEZ,JOSE,A       FOR HIRE VE… 05/23/2025     
 7 YES    5867611                HUSSAIN, TARIQ     FOR HIRE VE… 05/08/2024     
 8 YES    5869802                LU,GUI,ZHAO        FOR HIRE VE… 05/12/2024     
 9 YES    5715034                LI,PEI             FOR HIRE VE… 08/15/2024     
10 YES    5725892                HAILE,TEMESGEN,K   FOR HIRE VE… 09/23/2024     
# ℹ 76,700 more rows
# ℹ 18 more variables: permit_license_number <chr>,
#   dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
#   wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
#   vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
#   hybrid <chr>, base_telephone_number <chr>, website <chr>,
#   base_address <chr>, reason <chr>, order_date <lgl>, …

#this dataframe has fewer rows because we have only kept the registered Ubers.

You use R’s logical operators to return the rows that you care about. Here I’ve returned all the rows where the base_name column exactly matches the string “UBER USA, LLC.” Always use == for logical expressions. The single equals sign = is just for defining the names of arguments and other list items, and will confuse R.

Here’s some other helpful logical operators you may find yourself using, to return certain strings, numbers, or lists.

fhv_clean %>% 
  filter(base_name %in% c("UBER USA, LLC", "Take Me 2 Inc"), #name is in the list
         vehicle_year >= 2000, #year is greater than or equal to
         hybrid != "HYB" #no hybrids
         )

# A tibble: 6,433 × 23
   active vehicle_license_number name               license_type expiration_date
   <chr>  <chr>                  <chr>              <chr>        <chr>          
 1 YES    6025256                ALSAHYBI, SUHAIB   FOR HIRE VE… 04/17/2025     
 2 YES    5707125                CITY,QUEENS,INC    FOR HIRE VE… 07/12/2024     
 3 YES    5278357                LI,LIN             FOR HIRE VE… 11/01/2023     
 4 YES    6015005                GULATI,SONU        FOR HIRE VE… 01/23/2025     
 5 YES    5839092                WILSON',SONS,INC   FOR HIRE VE… 12/28/2023     
 6 YES    5837702                AMERICAN,UNITED,T… FOR HIRE VE… 12/18/2023     
 7 YES    6036945                CCM NY LLC         FOR HIRE VE… 08/02/2025     
 8 YES    6002683                WU, JINXIANG       FOR HIRE VE… 08/23/2024     
 9 YES    5999878                ALL GREEN HAMSAF … FOR HIRE VE… 08/08/2024     
10 YES    5661911                SINGH,SANDEEP      FOR HIRE VE… 12/16/2023     
# ℹ 6,423 more rows
# ℹ 18 more variables: permit_license_number <chr>,
#   dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
#   wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
#   vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
#   hybrid <chr>, base_telephone_number <chr>, website <chr>,
#   base_address <chr>, reason <chr>, order_date <lgl>, …

Let’s combine it to get a subsample of columns and rows based on the criteria specified and assign it for further analysis

ubers_thiscentury <- fhv_clean %>% 
  select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type) %>% 
  filter(base_name == "UBER USA, LLC",
         vehicle_year >= 2000, #year is greater than or equal to
         )