<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>%
fhv_clean clean_names() %>%
rename(hybrid = veh)
8. Group By and Summarize
Video Tutorial
Why summarize data?
Having clean data is great, but when working with large datasets we are often looking for summary statistics to let us compare different groups. group_by
and summarize
, often used together in a pipe, are a powerful combo for generating statistics at the group level.
Summarizing allows us to compare means, medians, or top values based on different categories. It can also be a helpful data cleaning tool, depending on the level of observations in the data.
Summarize
Summarize takes a data frame and a ...
list of new variables to generate.
Let’s grab our code to read in the clean dataframe again.
Note - summarize
is different than summary
- a helpful function that provides the basic stats of all variables - a good first step when looking at a new data set
summary(fhv_clean)
active vehicle_license_number name
Length:98318 Length:98318 Length:98318
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
license_type expiration_date permit_license_number
Length:98318 Length:98318 Length:98318
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
dmv_license_plate_number vehicle_vin_number wheelchair_accessible
Length:98318 Length:98318 Length:98318
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
certification_date hack_up_date vehicle_year base_number
Length:98318 Length:98318 Min. :1949 Length:98318
Class :character Class :character 1st Qu.:2016 Class :character
Mode :character Mode :character Median :2018 Mode :character
Mean :2018
3rd Qu.:2021
Max. :5015
base_name base_type hybrid base_telephone_number
Length:98318 Length:98318 Length:98318 Length:98318
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
website base_address reason order_date
Length:98318 Length:98318 Length:98318 Mode:logical
Class :character Class :character Class :character NA's:98318
Mode :character Mode :character Mode :character
last_date_updated last_time_updated
Length:98318 Length:98318
Class :character Class1:hms
Mode :character Class2:difftime
Mode :numeric
With a basic application of summarize
, we take all the rows of the dataframe and turn them into one row of data. Here we use mean() to get the average
%>%
fhv_clean summarize(average_year = mean(vehicle_year))
# A tibble: 1 × 1
average_year
<dbl>
1 2018.
#careful - NAs are sticky so mean() will return NA if there are any missing values. Use na.rm = T to exclude missing in calculating the average
%>%
fhv_clean summarize(average_year = mean(vehicle_year, na.rm = TRUE))
# A tibble: 1 × 1
average_year
<dbl>
1 2018.
You can summarize multiple variables (...)
at once. Look at summarize documentation for the full list of operations you can summarize with.
%>%
fhv_clean summarize(total_cars = n(), #n() just counts the number of rows in the group
average_year = mean(vehicle_year, na.rm = TRUE),
median_year = median(vehicle_year, na.rm = T))
# A tibble: 1 × 3
total_cars average_year median_year
<int> <dbl> <dbl>
1 98318 2018. 2018
When using summarize it defaults to one big group - all the data is summarized. To summarize by group we can add group_by()
Group_by
Where summarize becomes really powerful is pairing it with group_by
%>%
fhv_clean group_by(hybrid) %>%
summarize(number_cars = n(),
mean_year = mean(vehicle_year, na.rm = T))
# A tibble: 9 × 3
hybrid number_cars mean_year
<chr> <int> <dbl>
1 BEV 2267 2022.
2 CNG 1 2020
3 DSE 1 2019
4 HYB 3267 2019.
5 N 1 2016
6 NON 1 2012
7 STR 33 2017.
8 WAV 6016 2019.
9 <NA> 86731 2018.
Often we are grouping on variables of interest and summarizing across that variable. Take the variable we made last time, for example
<- fhv_clean %>%
fhv_type_summary mutate(
ride_type = case_when(
== "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR RIDESHARE",
base_name != "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR NON-RIDESHARE",
base_name TRUE ~ base_type #if it doesn't meet either condition, return the base_type
%>%
)) group_by(ride_type) %>% #group by the variable we just created!
summarize(no_cars = n(),
average_year = mean(vehicle_year, na.rm = T))
fhv_type_summary
# A tibble: 4 × 3
ride_type no_cars average_year
<chr> <int> <dbl>
1 BLACK CAR NON-RIDESHARE 16225 2018.
2 BLACK CAR RIDESHARE 76710 2018.
3 LIVERY 3652 2015.
4 LUXURY 1731 2020.
Here we can see some interesting trends start to emerge, like how the livery cars tend to be the oldest and the luxury cars tend to be newer.
We can also group_by multiple variables, or group by logical expressions.
%>%
fhv_clean group_by(hybrid, base_type) %>%
summarize(total_cars = n())
`summarise()` has grouped output by 'hybrid'. You can override using the
`.groups` argument.
# A tibble: 18 × 3
# Groups: hybrid [9]
hybrid base_type total_cars
<chr> <chr> <int>
1 BEV BLACK-CAR 2232
2 BEV LIVERY 17
3 BEV LUXURY 18
4 CNG BLACK-CAR 1
5 DSE BLACK-CAR 1
6 HYB BLACK-CAR 3009
7 HYB LIVERY 174
8 HYB LUXURY 84
9 N BLACK-CAR 1
10 NON BLACK-CAR 1
11 STR BLACK-CAR 18
12 STR LUXURY 15
13 WAV BLACK-CAR 5904
14 WAV LIVERY 96
15 WAV LUXURY 16
16 <NA> BLACK-CAR 81768
17 <NA> LIVERY 3365
18 <NA> LUXURY 1598
%>%
fhv_clean group_by(base_type, vehicle_year >= 2000) %>%
summarize(total_cars = n())
`summarise()` has grouped output by 'base_type'. You can override using the
`.groups` argument.
# A tibble: 4 × 3
# Groups: base_type [3]
base_type `vehicle_year >= 2000` total_cars
<chr> <lgl> <int>
1 BLACK-CAR TRUE 92935
2 LIVERY TRUE 3652
3 LUXURY FALSE 5
4 LUXURY TRUE 1726
Group_by with mutate
Pair group_by with mutate to create helpful summary level variables without reducing the number of rows in the dataset.
%>%
fhv_clean group_by(base_name) %>%
mutate(total_by_name = n()) #variable with the total # of cars for each company name
# A tibble: 98,318 × 24
# Groups: base_name [771]
active vehicle_license_number name license_type expiration_date
<chr> <chr> <chr> <chr> <chr>
1 YES 5608977 AMERICAN,UNITED,T… FOR HIRE VE… 04/30/2025
2 YES 5645622 RAMA,ILIR FOR HIRE VE… 09/11/2023
3 YES 5192507 ORDONEZ,ELIAS FOR HIRE VE… 03/08/2025
4 YES 5378856 RIVERA,ENMA FOR HIRE VE… 11/12/2024
5 YES 5852121 A/VA,SERVICE,CORP FOR HIRE VE… 04/11/2024
6 YES 5415237 REYES,JUAN,E FOR HIRE VE… 10/31/2023
7 YES 5643301 BEGUM,TAZMINUR FOR HIRE VE… 09/30/2025
8 YES 5701439 GONZALEZALVARADO,L FOR HIRE VE… 06/13/2024
9 YES 5790931 GOMEZ,JOSE,A FOR HIRE VE… 05/23/2025
10 YES 5743759 HOSSAIN,SM,KAMAL FOR HIRE VE… 12/08/2024
# ℹ 98,308 more rows
# ℹ 19 more variables: permit_license_number <chr>,
# dmv_license_plate_number <chr>, vehicle_vin_number <chr>,
# wheelchair_accessible <chr>, certification_date <chr>, hack_up_date <chr>,
# vehicle_year <dbl>, base_number <chr>, base_name <chr>, base_type <chr>,
# hybrid <chr>, base_telephone_number <chr>, website <chr>,
# base_address <chr>, reason <chr>, order_date <lgl>, …
Use ungroup() to return to normal mutation operations
%>%
fhv_clean group_by(base_type) %>%
mutate(mean_by_type = mean(vehicle_year, na.rm =T)) %>%
ungroup() %>%
mutate(above_below_mean = if_else(
condition = vehicle_year > mean_by_type,
true = "above mean",
false = "below mean"
%>%
)) count(above_below_mean)
# A tibble: 2 × 2
above_below_mean n
<chr> <int>
1 above mean 44530
2 below mean 53788
#this creates a variable to show if this car is above or below the average year for the group
Normalizing with Groups
Group by, summarize, and mutate are crucial when comparing geographic areas. You can use group_by to get the proportion within a certain group.
Here’s a full example of how to normalize within a group, using the data on new york housing authority apartments by borough. We want to find the proportion of NYCHA apartments that are Section 8 (vouchers) in each borough.
<- read_csv('https://data.cityofnewyork.us/resource/evjd-dqpz.csv') %>%
nycha clean_names()
# Filter for developments that contain any Section 8 transition apartments
<- nycha %>%
section8devs filter(number_of_section_8_transition_apartments > 0)
# Get some stats on number of section 8 apts across all developments by borough
<- section8devs %>%
section8devs_by_boro group_by(borough) %>%
summarize(section8apts = sum(number_of_section_8_transition_apartments),
totalapts = sum(total_number_of_apartments),
median_section8apts = median(number_of_section_8_transition_apartments),
avg_section8apts = mean(number_of_section_8_transition_apartments))
# Is this all the info we need? What's missing? (Normalization)
# Section 8 units as a share of all units in mixed finance developments, by borough
<- section8devs %>%
section8devs_grouped group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments),
total_apts = sum(total_number_of_apartments))%>%
mutate(s8_share = total_s8_apts / total_apts)
# Section 8 units as a share of all NYCHA units, by borough
<- nycha %>%
nycha_grouped group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments, na.rm=TRUE),
total_apts = sum(total_number_of_apartments, na.rm=TRUE))%>%
mutate(s8_share = total_s8_apts / total_apts)