library(tidyverse)
library(janitor)
<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>%
fhv_clean clean_names() %>%
rename(hybrid = veh)
6. Select and Filter
Video Tutorial
Select
Let’s read in our data and do some cleaning up of the names with the pipe
We have a lot of information in this data frame. What if we want to look at just a few rows and columns. Two core dplyr
functions, select
and filter
, help us do so. dplyr
is a core part of the tidyverse, and it has functions that modify dataframes (think of the pipe!)
Let’s try just keeping active
, vehicle_lisence_number
, name
, license_type
, vehicle_year
, base_name
, and base_type
. Select’s first argument is the dataframe, and the following arguments are all the names of columns. In R documentation, an ellipses argument ...
means that the function takes a list of arguments. In this case, a list of variables to select
%>%
fhv_clean select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type)
#this dataframe has all our observations, but only 6 variables (columns)
For more advanced selection, check out the logical operations using the tidy-select
expressions. Check what -
does, for instance.
%>%
fhv_clean select(-active)
Filter
Filter does the same thing as select, but for rows that meet certain logical conditions. Let’s get all the uber vehicles. The first argument of filter is the dataframe. The second is a logical expression.
%>%
fhv_clean filter(base_name == "UBER USA, LLC")
#this dataframe has fewer rows because we have only kept the registered Ubers.
You use R’s logical operators to return the rows that you care about. Here I’ve returned all the rows where the base_name
column exactly matches the string “UBER USA, LLC.” Always use ==
for logical expressions. The single equals sign =
is just for defining the names of arguments and other list items, and will confuse R.
Here’s some other helpful logical operators you may find yourself using, to return certain strings, numbers, or lists.
%>%
fhv_clean filter(base_name %in% c("UBER USA, LLC", "Take Me 2 Inc"), #name is in the list
>= 2000, #year is greater than or equal to
vehicle_year != "HYB" #no hybrids
hybrid )
Let’s combine it to get a subsample of columns and rows based on the criteria specified and assign it for further analysis
<- fhv_clean %>%
ubers_thiscentury select(active, vehicle_license_number, name, license_type, vehicle_year, base_name, base_type) %>%
filter(base_name == "UBER USA, LLC",
>= 2000, #year is greater than or equal to
vehicle_year )