library(tidyverse)
4. Reading Data and Data Types
Video Tutorial
Reading Data
We can use R to read in a number of different types of data, manipulate it, and output it in different ways.
The core type of data we will be using in this class is the .csv
or a comma separated values file. A .csv
is a text file where each observation is in its own row and each variable or value is, you guessed it, separated by a comma. R can read these types of files in super easily. Let’s download our first comma separated file from the NYC Open Data Portal.
Let’s download the data on for hire vehicles in NYC and read it into R.
If we open a csv in a text editor it looks like this, but R will read it into something called a dataframe
which is the tidy format for tabular data (data that has rows and columns).
To read data into R, we are going to need the function read_csv()
and need to learn about file paths.
In order for R to read in the file, we need to tell R where the file is. We can do that with an absolute or a local path. An absolute path is the exact location of the file on your computer. For me, when I downloaded this file it went to my downloads folder - a path that looks something like this: /Users/patrickspauster/Downloads/For_Hire_Vehicles__FHV__-_Active.csv
. You can look up the path to a file by navigating to the file in finder or windows explorer and right clicking to “Get Info”. I could read it in by using read_csv("/Users/patrickspauster/Downloads/For_Hire_Vehicles__FHV__-_Active.csv")
.
But, not everyone who views my work or wants to run my code will have the same file structure on their computer. If i sent this code to someone and they tried to run it, they would get an error. That’s where the R project and a local path comes in handy.
Now, save a copy of For_Hire_Vehicles__FHV__-_Active.csv
to your project folder. When you do, you should see it appear in the file explorer in the bottom right of your R Studio window. Now we can access the .csv
using a local path. Because we have the R project open, R will start looking in the project folder. Now we can run read_csv on our file without having to look up the path.
(remember to load the tidyverse first! If you get an error like “the function function_name
can’t be found”, you probably forgot to load the proper package with library()
!)
read_csv("For_Hire_Vehicles__FHV__-_Active.csv")
Rows: 98318 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Active, Vehicle License Number, Name, License Type, Expiration Da...
dbl (1): Vehicle Year
lgl (1): Order Date
time (1): Last Time Updated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 98,318 × 23
Active `Vehicle License Number` Name `License Type` `Expiration Date`
<chr> <chr> <chr> <chr> <chr>
1 YES 5608977 AMERICAN,UN… FOR HIRE VEHI… 04/30/2025
2 YES 5645622 RAMA,ILIR FOR HIRE VEHI… 09/11/2023
3 YES 5192507 ORDONEZ,ELI… FOR HIRE VEHI… 03/08/2025
4 YES 5378856 RIVERA,ENMA FOR HIRE VEHI… 11/12/2024
5 YES 5852121 A/VA,SERVIC… FOR HIRE VEHI… 04/11/2024
6 YES 5415237 REYES,JUAN,E FOR HIRE VEHI… 10/31/2023
7 YES 5643301 BEGUM,TAZMI… FOR HIRE VEHI… 09/30/2025
8 YES 5701439 GONZALEZALV… FOR HIRE VEHI… 06/13/2024
9 YES 5790931 GOMEZ,JOSE,A FOR HIRE VEHI… 05/23/2025
10 YES 5743759 HOSSAIN,SM,… FOR HIRE VEHI… 12/08/2024
# ℹ 98,308 more rows
# ℹ 18 more variables: `Permit License Number` <chr>,
# `DMV License Plate Number` <chr>, `Vehicle VIN Number` <chr>,
# `Wheelchair Accessible` <chr>, `Certification Date` <chr>,
# `Hack Up Date` <chr>, `Vehicle Year` <dbl>, `Base Number` <chr>,
# `Base Name` <chr>, `Base Type` <chr>, VEH <chr>,
# `Base Telephone Number` <chr>, Website <chr>, `Base Address` <chr>, …
Let’s take a closer look at what read_csv()
is doing.
read_csv() ?
The function has one required argument, “file” and several optional arguments that we can change. The “file” argument asks for a path to a file as a “string” - remember if you see the words “character” or “string” think quotes. So let’s feed read_csv() the name of the file we want to read in in quotes, and assign it to something using our assignment operator <-
so we can further modify it.
<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv") fhv
(Aside - naming things is hard, and you will have to name a lot of different objects. Some general rules - don’t use spaces, and try to keep the names simple but informative, and be careful about overwriting the same name)
Now, fhv
is an object in our environment. R gives us some helpful details about the object in the environment menu and the dropdown arrow on the object itself.
R tells us how many observations (rows) and variables (columns) this object has - note how this is a “tidy” dataset. If you hover over the object itself, it will tell you the type of object and its size. In this case we have a dataframe
, the tidy format for data in R, often abbreviated df
. You can confirm this by running the function is.data.frame()
which identifies if an object is of a certain type.
is.data.frame(fhv)
[1] TRUE
Reading different types of data
read_csv()
is smart, but not perfect. You’ll notice that it has tried to identify the types of data in this dataframe. The Vehicle Year
is read in as a num
because it is made up of all digits. It correctly identified that Last Time Updated
is a Date
in the format hms
(hours:minutes:seconds). And the DMV License Plate Number
is a chr
(character), because it is a categorical string variable.
Numbers, characters, and dates, are three fundamental types of data that we will be using in R. We can use some of the other arguments of read_csv()
to make sure that we get columns in the correct format. For example, it missed that Expiration Date
should be a date.
When you start getting long arguments, and nested functions, it can be helpful to enter between each argument.
<- read_csv(file = "For_Hire_Vehicles__FHV__-_Active.csv",
fhv col_types = cols(`Expiration Date` = col_datetime(format = '%m/%d/%Y'))
)
#col types takes one argument - cols. cols() is a named list with the variable = the column's format.
#we'll learn more about how to parse dates and date data types in lesson 12
Be careful with numeric data types that aren’t actually numbers that will drop leading zeroes (think of a zip code like “06810” which starts with a 0 would get read in as an integer 6,810). If you wanted to match to another dataset with a zipcode you wouldn’t be able to! Another helpful note: you can change the default col_type using col_type = cols(.default = col_character())
. You can always change the types of columns back to numbers later.
Data types
Here’s a brief look at some other object types you might find in R.
A value is just one number, stored in an object.
<- 42
my_value my_value
[1] 42
A list is a group of values put together, separated by commas. In R the syntax to create a list starts with c()
. They are also called vectors in R.
<- c("Patrick", "Lucy", "Henry", "Ceinna")
my_character_vector my_character_vector
[1] "Patrick" "Lucy" "Henry" "Ceinna"
<- c(1, 3, 5, 7, 9, 11, 13, 17)
my_numeric_vector my_numeric_vector
[1] 1 3 5 7 9 11 13 17
Vectors can be named or unnamed. Named vectors are pairs of keys (names) and values separated by =
.
<- c("Patrick" = 42, "Lucy" = 12, "Ceinna" = 56, "Henry" = 44)
named_vector
named_vector
Patrick Lucy Ceinna Henry
42 12 56 44
You can get a vector of a particular variable in a dataframe by using $
with the dataframe name and the variable name.
<- head(fhv)
printme
$`Base Name` printme
[1] "UBER USA, LLC" "UBER USA, LLC"
[3] "UBER USA, LLC" "BELL LX INC"
[5] "BAYRIDGE EXPRESS LUXYRY INC" "FIRST CLASS C/L SVC CORP"
#head() only keeps the first few rows of a dataframe
You’ll also notice an important type of data - missing data - noted in R as NA
. In this dataset the Wheelchair Accessible
column is missing for the first few observations. This means that there is no value for that observation and variable. NA
values in R are sticky, meaning that unless you tell R to ignore them, R will carry them through all your operations and maybe mess up some of your calculations. For example…
1 + 2 + NA
[1] NA
sum(1,2,NA) #you should be able to figure this out based on what we've learned about functions so far!
[1] NA
sum(1,2,NA, na.rm = TRUE) #the na.rm = T argument removes NAs from a calculation.
[1] 3