<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") %>%
fhv_clean clean_names() %>%
rename(hybrid = veh)
8. Group By and Summarize
Video Tutorial
Why summarize data?
Having clean data is great, but when working with large datasets we are often looking for summary statistics to let us compare different groups. group_by
and summarize
, often used together in a pipe, are a powerful combo for generating statistics at the group level.
Summarizing allows us to compare means, medians, or top values based on different categories. It can also be a helpful data cleaning tool, depending on the level of observations in the data.
Summarize
Summarize takes a data frame and a ...
list of new variables to generate.
Let’s grab our code to read in the clean dataframe again.
Note - summarize
is different than summary
- a helpful function that provides the basic stats of all variables - a good first step when looking at a new data set
summary(fhv_clean)
With a basic application of summarize
, we take all the rows of the dataframe and turn them into one row of data. Here we use mean() to get the average
%>%
fhv_clean summarize(average_year = mean(vehicle_year))
#careful - NAs are sticky so mean() will return NA if there are any missing values. Use na.rm = T to exclude missing in calculating the average
%>%
fhv_clean summarize(average_year = mean(vehicle_year, na.rm = TRUE))
You can summarize multiple variables (...)
at once. Look at summarize documentation for the full list of operations you can summarize with.
%>%
fhv_clean summarize(total_cars = n(), #n() just counts the number of rows in the group
average_year = mean(vehicle_year, na.rm = TRUE),
median_year = median(vehicle_year, na.rm = T))
When using summarize it defaults to one big group - all the data is summarized. To summarize by group we can add group_by()
Group_by
Where summarize becomes really powerful is pairing it with group_by
%>%
fhv_clean group_by(hybrid) %>%
summarize(number_cars = n(),
mean_year = mean(vehicle_year, na.rm = T))
Often we are grouping on variables of interest and summarizing across that variable. Take the variable we made last time, for example
<- fhv_clean %>%
fhv_type_summary mutate(
ride_type = case_when(
== "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR RIDESHARE",
base_name != "UBER USA, LLC" & base_type == "BLACK-CAR" ~ "BLACK CAR NON-RIDESHARE",
base_name TRUE ~ base_type #if it doesn't meet either condition, return the base_type
%>%
)) group_by(ride_type) %>% #group by the variable we just created!
summarize(no_cars = n(),
average_year = mean(vehicle_year, na.rm = T))
fhv_type_summary
Here we can see some interesting trends start to emerge, like how the livery cars tend to be the oldest and the luxury cars tend to be newer.
We can also group_by multiple variables, or group by logical expressions.
%>%
fhv_clean group_by(hybrid, base_type) %>%
summarize(total_cars = n())
%>%
fhv_clean group_by(base_type, vehicle_year >= 2000) %>%
summarize(total_cars = n())
Group_by with mutate
Pair group_by with mutate to create helpful summary level variables without reducing the number of rows in the dataset.
%>%
fhv_clean group_by(base_name) %>%
mutate(total_by_name = n()) #variable with the total # of cars for each company name
Use ungroup() to return to normal mutation operations
%>%
fhv_clean group_by(base_type) %>%
mutate(mean_by_type = mean(vehicle_year, na.rm =T)) %>%
ungroup() %>%
mutate(above_below_mean = if_else(
condition = vehicle_year > mean_by_type,
true = "above mean",
false = "below mean"
%>%
)) count(above_below_mean)
#this creates a variable to show if this car is above or below the average year for the group
Normalizing with Groups
Group by, summarize, and mutate are crucial when comparing geographic areas. You can use group_by to get the proportion within a certain group.
Here’s a full example of how to normalize within a group, using the data on new york housing authority apartments by borough. We want to find the proportion of NYCHA apartments that are Section 8 (vouchers) in each borough.
<- read_csv('https://data.cityofnewyork.us/resource/evjd-dqpz.csv') %>%
nycha clean_names()
# Filter for developments that contain any Section 8 transition apartments
<- nycha %>%
section8devs filter(number_of_section_8_transition_apartments > 0)
# Get some stats on number of section 8 apts across all developments by borough
<- section8devs %>%
section8devs_by_boro group_by(borough) %>%
summarize(section8apts = sum(number_of_section_8_transition_apartments),
totalapts = sum(total_number_of_apartments),
median_section8apts = median(number_of_section_8_transition_apartments),
avg_section8apts = mean(number_of_section_8_transition_apartments))
# Is this all the info we need? What's missing? (Normalization)
# Section 8 units as a share of all units in mixed finance developments, by borough
<- section8devs %>%
section8devs_grouped group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments),
total_apts = sum(total_number_of_apartments))%>%
mutate(s8_share = total_s8_apts / total_apts)
# Section 8 units as a share of all NYCHA units, by borough
<- nycha %>%
nycha_grouped group_by(borough) %>%
summarize(total_s8_apts = sum(number_of_section_8_transition_apartments, na.rm=TRUE),
total_apts = sum(total_number_of_apartments, na.rm=TRUE))%>%
mutate(s8_share = total_s8_apts / total_apts)