library(tidyverse)
band_members
band_instruments
10. Joins
Video Tutorial
Joining two Tables together
Joins are powerful functions that allow you to connect two datasets together through matching values. They can be useful with spatial and non-spatial data alike.
Joins rely on “keys” that match records across different datasets. This can be something like a name or ID number.
As a recap of how joins work, we’re going to show a simple example of two different kinds of joins: ‘left’ joins and ‘inner’ joins. For this example, we will be using the band_members
and band_instruments
dataframes, which are simple, 3-row datasets that comes included in the dplyr package.
As you would expect, the first two arguments in a join function are the two tables you are trying to connect together. The third argument is the “join field”, which is the matching column in both datasets we will use to pair up rows.
Left Joins
A left_join
keeps all of the rows in the first table you specify, appending data from the second table through matching values in the specified “join field”. Let’s see how this kind of join looks with our example data:
<-
band_members_and_instruments left_join(band_members, band_instruments, by = "name")
# When the "join field" column names don't match, you can use:
# by = c("column1" = "column2")
band_members_and_instruments
Inner Joins
An `inner_join` keeps only the rows that have matching values between both tables in the specified “join field.” Any other rows are discarded. Let’s see how this kind of join looks with our example data:
<-
band_members_with_instruments_only inner_join(band_members, band_instruments, by = "name")
band_members_with_instruments_only
Full Joins
A full_join
keeps all rows from both tables, even if a row from the join field isn’t present in one of them. It shows any missing values as NA.
<-
band_members_with_or_without_instruments full_join(band_members, band_instruments, by = "name")
band_members_with_or_without_instruments
Troubleshooting Joins
Joining when matching columns have different names
Often times, the “join field” in your first table has a different name than that of your second table. For example, you may be trying to join two tables on a common zip code, but the first table calls the column ‘Zip’ and the second table calls it ‘Postal Code’. There’s a special syntax here to make it work:
# Let's change the "name" column to be called "MusicalArtist"
<- band_instruments %>% rename(MusicalArtist = name)
band_instruments_renamed
# In our join function, we need to specify that the "name" column in the first table matches up with the "MusicalArtist" column in the second table. We do that by setting our "by" parameter differently:
<-
band_members_and_instruments_2 inner_join(band_members, band_instruments_renamed,
by = c("name" = "MusicalArtist"))
band_members_and_instruments_2
You can also join across multiple keys (variables), by giving by a list. It will return only rows that match both variables.
Detecting duplicate values
Duplicate values in your data can cause problems with joins. For example, what if our `band_instruments` dataset listed “John” twice:
# Let's add a new row to band_instruments… say "John" also plays "flute"
<- band_instruments %>% add_row(name = "John", plays = "flute")
band_instruments_with_dup
band_instruments_with_dup
When we try and join our `band_members` table with this new table, we now get 4 rows in the resulting table, even though our first table only had 3 rows… how can that be?
<-
band_members_and_instruments_dup left_join(band_members, band_instruments_with_dup, by = "name")
band_members_and_instruments_dup
In R, if there are multiple matches between the two tables, all combinations of the matches are returned. This GIF illustrates visually how this works:
In general, to avoid confusing and unexpected results like this, it’s important to always check for duplicate values in your data, _especially_ in columns that you intend to use as a join field.
Luckily, the `janitor` package has a function for that called `get_dupes`.
library(janitor)# Remember to run `install.packages('janitor')` in your console if you've
%>% get_dupes(name) band_instruments_with_dup
Sometimes you may expect duplicate values in a column, and sometimes they may come as a surprise. General Rule: always know what each row represents in your data and what should be unique values.
Sometimes you’ll want to return more rows than you start with, for example if you were matching census divisions to states in order to aggregate the total land area.
#crosswalk between regions and divisions
regions_divisions
#this doesn't have the region included! oh no!
state_area
<- left_join(regions_divisions, state_area, by = "division")
regions_states
#the join matches state to region and lets us summarize
regions_states
%>%
regions_states group_by(region) %>%
summarize(region_area = sum(area))