6. Select and Filter

Author

Patrick Spauster

Video Tutorial

Select

Let’s read in our data and do some cleaning up of the names with the pipe

library(tidyverse)
library(janitor)

fhv_clean <- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>% 
  clean_names() %>% 
  rename(hybrid = veh)

We have a lot of information in this data frame. What if we want to look at just a few rows and columns. Two core dplyr functions, select and filter, help us do so. dplyr is a core part of the tidyverse, and it has functions that modify dataframes (think of the pipe!)

Let’s try just keeping active, vehicle_lisence_number, name, license_type, vehicle_year, base_name, and base_type. Select’s first argument is the dataframe, and the following arguments are all the names of columns. In R documentation, an ellipses argument ... means that the function takes a list of arguments. In this case, a list of variables to select

fhv_clean %>% 
  select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type)
#this dataframe has all our observations, but only 6 variables (columns)

For more advanced selection, check out the logical operations using the tidy-select expressions. Check what - does, for instance.

fhv_clean %>% 
  select(-active)

Filter

Filter does the same thing as select, but for rows that meet certain logical conditions. Let’s get all the uber vehicles. The first argument of filter is the dataframe. The second is a logical expression.

fhv_clean %>% 
  filter(base_name == "UBER USA, LLC")
#this dataframe has fewer rows because we have only kept the registered Ubers.

You use R’s logical operators to return the rows that you care about. Here I’ve returned all the rows where the base_name column exactly matches the string “UBER USA, LLC.” Always use == for logical expressions. The single equals sign = is just for defining the names of arguments and other list items, and will confuse R.

Here’s some other helpful logical operators you may find yourself using, to return certain strings, numbers, or lists.

fhv_clean %>% 
  filter(base_name %in% c("UBER USA, LLC", "Take Me 2 Inc"), #name is in the list
         vehicle_year >= 2000, #year is greater than or equal to
         hybrid != "HYB" #no hybrids
         )

Let’s combine it to get a subsample of columns and rows based on the criteria specified and assign it for further analysis

ubers_thiscentury <- fhv_clean %>% 
  select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type) %>% 
  filter(base_name == "UBER USA, LLC",
         vehicle_year >= 2000, #year is greater than or equal to
         )