We will focus today on three specific packages dplyr, data.table, and ggplot. dplyr and2 ggplot are part of a broader library called tidyverse which consolidates commonly-used data science packages. See more here: https://www.tidyverse.org/.

  • dplyr can be used for data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
  • data.table is an R package that provides an enhanced version of data.frames. Its speed makes it a preferable package to munipulate BIG data.
  • ggplot is used for creeating Graphics or Plots.

In addition to tidyverse, nycflights13, gapminder and ggthemes will be used in the tutorial. We use p_load function of pacman, to load all the packages at once. It will first check whether packages are installed. If not, it will install the yet to be installed automatically and then load all the packages.

## Loading required package: pacman

Using dplyr

Below is a table of popular dplyr commands. We will go through each one.

dplyr Command SQL equivalent Action
filter() WHERE Limit based on condition
select() SELECT Choose only certain variables
distinct() DISTINCT De-duplicate result-set
arrange() ORDER BY Order results
rename() SELECT Rename variables
mutate() SELECT Create new variable
group_by() GROUP BY Group rows
summarise() SELECT Create new variable in grouped setting

You can also view the full cheat sheet here: DPLYR Cheat Sheet

To illustrate these commands, we will use a pre-existing dataset the contains 336,776 flights that departed from New York City in 2013. Dplyr allows you to gather insight from a dataset without altering the original dataset. It is considered best practice not to alter the original dataset. For example in this case, we will never overwrite the existing dataset ‘flights’. We will first take a look at the summary statistics.

summary(flights)
##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##  sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124  
##  Median :1359   Median :  -2.00   Median :1535   Median :1556  
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536  
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945  
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##                 NA's   :8255      NA's   :8713                 
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776     
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character  
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character  
##  Mean   :   6.895                      Mean   :1972                     
##  3rd Qu.:  14.000                      3rd Qu.:3465                     
##  Max.   :1272.000                      Max.   :8500                     
##  NA's   :9430                                                           
##     origin              dest              air_time        distance   
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502  
##  Mode  :character   Mode  :character   Median :129.0   Median : 872  
##                                        Mean   :150.7   Mean   :1040  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##                                        NA's   :9430                  
##       hour           minute        time_hour                  
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 
head(flights, n = 30)
## # A tibble: 30 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 20 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
tail(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     9    30       NA           1842        NA       NA
## 2  2013     9    30       NA           1455        NA       NA
## 3  2013     9    30       NA           2200        NA       NA
## 4  2013     9    30       NA           1210        NA       NA
## 5  2013     9    30       NA           1159        NA       NA
## 6  2013     9    30       NA            840        NA       NA
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

Pipes

The %>% command is called a pipe. This means that the result of the code before %>% is sent, or “piped”, to the one after after %>%. Piping is a powerful tool for clearly expressing a sequence of multiple operations, as we will see shortly.

Below is a simple example with just numbers. This is the same as 3 + 4 + 5 which gives us an output of 12.

3 %>% 
  + 4 %>%
  + 5
## [1] 12

Pro Tip: You can use Command + Shift + M in a codeblock to put in a pipe.

Filter

The filter command will only display the subset of your dataset that match a certain condition. This command will only show flights on Jan 1st, 2013.

This code is the same as doing filter(flights, month == 1 & day == 1) since the %>% command passes the flights dataframe to the filter command.

firstdayofmonth <- flights %>%
  filter(month == 1 & day == 1)

glimpse(firstdayofmonth)
## Observations: 842
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

It is important to remember that this command does not alter the orignal flight dataset. If we want to save this subset as its own object, we run the following. Remember the <- is the assignment operator in R.

filteredFlight <- flights %>%
                  filter(month == 1 | day == 1)

Multiple conditions can be included in a filter command. The command below shows any flights from Jan through June to PHL or SLC airports.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6)
## # A tibble: 2,116 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      655            655         0     1021
##  2  2013     1     1      908            915        -7     1004
##  3  2013     1     1     1047           1050        -3     1401
##  4  2013     1     1     1245           1245         0     1616
##  5  2013     1     1     1323           1300        23     1651
##  6  2013     1     1     1543           1550        -7     1933
##  7  2013     1     1     1600           1610       -10     1712
##  8  2013     1     1     1909           1912        -3     2239
##  9  2013     1     1     1915           1920        -5     2238
## 10  2013     1     1     2000           2000         0     2054
## # ... with 2,106 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
flights %>%
  filter(dest %in% c("PHL","SLC") & month %in% c(1, 3, 6))
## # A tibble: 1,112 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      655            655         0     1021
##  2  2013     1     1      908            915        -7     1004
##  3  2013     1     1     1047           1050        -3     1401
##  4  2013     1     1     1245           1245         0     1616
##  5  2013     1     1     1323           1300        23     1651
##  6  2013     1     1     1543           1550        -7     1933
##  7  2013     1     1     1600           1610       -10     1712
##  8  2013     1     1     1909           1912        -3     2239
##  9  2013     1     1     1915           1920        -5     2238
## 10  2013     1     1     2000           2000         0     2054
## # ... with 1,102 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Select

Select will only return columns that are listed. In this case, the resulting dataset will consist of the Origin, Destination, and Carrier of flights that were destined for PHL or SLC in the first 6 months of the year. Remember, the pipe command sends the result of the current line to the next line. In this case, the filtered dataset is then piped into the select command.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier)
## # A tibble: 2,116 x 3
##    origin dest  carrier
##    <chr>  <chr> <chr>  
##  1 JFK    SLC   DL     
##  2 LGA    PHL   US     
##  3 JFK    SLC   DL     
##  4 JFK    SLC   DL     
##  5 EWR    SLC   DL     
##  6 JFK    SLC   DL     
##  7 JFK    PHL   9E     
##  8 JFK    SLC   B6     
##  9 JFK    SLC   DL     
## 10 JFK    PHL   9E     
## # ... with 2,106 more rows

On the contrary, we can use - to deselect columns. If we want to drop year, month and day, we just need to prefix - to each column name.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(-year, -month, -day)
## # A tibble: 2,116 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      655            655         0     1021           1030        -9
##  2      908            915        -7     1004           1033       -29
##  3     1047           1050        -3     1401           1410        -9
##  4     1245           1245         0     1616           1615         1
##  5     1323           1300        23     1651           1608        43
##  6     1543           1550        -7     1933           1925         8
##  7     1600           1610       -10     1712           1729       -17
##  8     1909           1912        -3     2239           2237         2
##  9     1915           1920        -5     2238           2257       -19
## 10     2000           2000         0     2054           2110       -16
## # ... with 2,106 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Distinct

Distinct will remove any duplicate rows from the given dataset. Notice in the previous command, it returned a subset with 2116 rows, but with distinct, we can see that only 8 carriers flew to PHL or SLC in the first half of the year.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct()
## # A tibble: 8 x 3
##   origin dest  carrier
##   <chr>  <chr> <chr>  
## 1 JFK    SLC   DL     
## 2 LGA    PHL   US     
## 3 EWR    SLC   DL     
## 4 JFK    PHL   9E     
## 5 JFK    SLC   B6     
## 6 EWR    PHL   EV     
## 7 JFK    PHL   US     
## 8 JFK    PHL   DL

Arrange

Arrange puts your data into alphabetical order. In this case the order is first by origin, then descending alpahbetical order of the destination, then alphabetical order of carrier.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier)
## # A tibble: 8 x 3
##   origin dest  carrier
##   <chr>  <chr> <chr>  
## 1 EWR    SLC   DL     
## 2 EWR    PHL   EV     
## 3 JFK    SLC   B6     
## 4 JFK    SLC   DL     
## 5 JFK    PHL   9E     
## 6 JFK    PHL   DL     
## 7 JFK    PHL   US     
## 8 LGA    PHL   US

Rename

The Rename function can be used to easily rename a column Header. Here, we rename carrier to airline.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier) %>%
  rename(airline = carrier)
## # A tibble: 8 x 3
##   origin dest  airline
##   <chr>  <chr> <chr>  
## 1 EWR    SLC   DL     
## 2 EWR    PHL   EV     
## 3 JFK    SLC   B6     
## 4 JFK    SLC   DL     
## 5 JFK    PHL   9E     
## 6 JFK    PHL   DL     
## 7 JFK    PHL   US     
## 8 LGA    PHL   US
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

Mutate

Mutate is used to create new columns based on current ones. This feature is very useful. Here, we create three new variables “gain”, “speed”, and “gain_per_hour”. Notice how “gain_per_hour” uses the column “gain”, which was created in the same mutate statement.

flights %>%
  mutate(gain          = dep_delay - arr_delay,
         speed         = distance / air_time * 60,
         gain_per_hour = gain / (air_time / 60)) %>%
  select(dep_delay, arr_delay, gain, distance, distance, air_time, speed, gain_per_hour)
## # A tibble: 336,776 x 7
##    dep_delay arr_delay  gain distance air_time speed gain_per_hour
##        <dbl>     <dbl> <dbl>    <dbl>    <dbl> <dbl>         <dbl>
##  1         2        11    -9     1400      227  370.         -2.38
##  2         4        20   -16     1416      227  374.         -4.23
##  3         2        33   -31     1089      160  408.        -11.6 
##  4        -1       -18    17     1576      183  517.          5.57
##  5        -6       -25    19      762      116  394.          9.83
##  6        -4        12   -16      719      150  288.         -6.4 
##  7        -5        19   -24     1065      158  404.         -9.11
##  8        -3       -14    11      229       53  259.         12.5 
##  9        -3        -8     5      944      140  405.          2.14
## 10        -2         8   -10      733      138  319.         -4.35
## # ... with 336,766 more rows
flights %>% 
  mutate(total_delay = dep_delay + arr_delay) %>% 
  select(carrier, total_delay) %>% 
  arrange(desc(total_delay)) %>% 
  group_by(carrier) %>% 
  summarise(average_delay = mean(total_delay))
## # A tibble: 16 x 2
##    carrier average_delay
##    <chr>           <dbl>
##  1 9E              NA   
##  2 AA              NA   
##  3 AS              NA   
##  4 B6              NA   
##  5 DL              NA   
##  6 EV              NA   
##  7 F9              NA   
##  8 FL              NA   
##  9 HA              -2.01
## 10 MQ              NA   
## 11 OO              NA   
## 12 UA              NA   
## 13 US              NA   
## 14 VX              NA   
## 15 WN              NA   
## 16 YV              NA

Group By

Reorginizes dataframe by rows according to the column that is grouped by; sumarise then gives statistics of that group. Here, the origin column had three categories, EWR, JFK, & LGA. The group_by(origin) command organizes the dataframe by the three origins. Then summarise() is used to get metrics related to each origin.

From this table, we can see that EWR had the most flights with 120835, and LGA had the lowest avg delay at 10.34

flights %>%
  group_by(origin) %>%
  summarise(num_of_flights = n(), 
            avg_delay      = mean(dep_delay, na.rm = TRUE)) # na.rm removes any NA data
## # A tibble: 3 x 3
##   origin num_of_flights avg_delay
##   <chr>           <int>     <dbl>
## 1 EWR            120835      15.1
## 2 JFK            111279      12.1
## 3 LGA            104662      10.3

group_by can also take expressions. The following returns the number of flights that started late but arrived early (or on time), started and arrived late etc.

flights %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  group_by(dep_delay > 0, arr_delay > 0) %>%
  summarise(num_of_flights = n())
## # A tibble: 4 x 3
## # Groups:   dep_delay > 0 [?]
##   `dep_delay > 0` `arr_delay > 0` num_of_flights
##   <lgl>           <lgl>                    <int>
## 1 FALSE           FALSE                   158900
## 2 FALSE           TRUE                     40701
## 3 TRUE            FALSE                    35442
## 4 TRUE            TRUE                     92303

Summarise

Summarise has a number of other functions that can be used within it. n_distinct(dest) returns the number of distinct destinations. From this table we can see that EWR has flights to the largest number of destinations (56). We can also see LGA flights has a lower average distance than those of EWR & JFK.

flights %>%
  group_by(origin) %>%
  summarise(destinations = n_distinct(dest), 
            avg_distance = mean(distance, na.rm = TRUE))
## # A tibble: 3 x 3
##   origin destinations avg_distance
##   <chr>         <int>        <dbl>
## 1 EWR              86        1057.
## 2 JFK              70        1266.
## 3 LGA              68         780.

Here we summarise the whole dataset. We can see we have 337,776 observations, 105 distinct destinations and a 12.6 min avg delay.

flights %>%
  group_by(carrier) %>% 
  summarise(num_of_flights = n(), 
            destinations   = n_distinct(dest), 
            avg_delay      = mean(dep_delay, na.rm = TRUE))
## # A tibble: 16 x 4
##    carrier num_of_flights destinations avg_delay
##    <chr>            <int>        <int>     <dbl>
##  1 9E               18460           49     16.7 
##  2 AA               32729           19      8.59
##  3 AS                 714            1      5.80
##  4 B6               54635           42     13.0 
##  5 DL               48110           40      9.26
##  6 EV               54173           61     20.0 
##  7 F9                 685            1     20.2 
##  8 FL                3260            3     18.7 
##  9 HA                 342            1      4.90
## 10 MQ               26397           20     10.6 
## 11 OO                  32            5     12.6 
## 12 UA               58665           47     12.1 
## 13 US               20536            6      3.78
## 14 VX                5162            5     12.9 
## 15 WN               12275           11     17.7 
## 16 YV                 601            3     19.0

Dpylr is a great way to answer initial questions about a dataset. For example, say we want to know what the farthest flight to leave NYC is.

To answer this, we can group by origin and destination, summarise the max distance for each pair, and then order by the maximum distance value we created. It is now easy to see that the max distance was from EWR or JFK to HNL.

flights %>%
  group_by(origin, dest) %>%
  summarise(max_distance = max(distance)) %>%
  arrange(desc(max_distance))
## # A tibble: 224 x 3
## # Groups:   origin [3]
##    origin dest  max_distance
##    <chr>  <chr>        <dbl>
##  1 JFK    HNL           4983
##  2 EWR    HNL           4963
##  3 EWR    ANC           3370
##  4 JFK    SFO           2586
##  5 JFK    OAK           2576
##  6 JFK    SJC           2569
##  7 EWR    SFO           2565
##  8 JFK    SMF           2521
##  9 JFK    LAX           2475
## 10 JFK    BUR           2465
## # ... with 214 more rows

Using data.table

As an enhanced version of data.frame, one can do a lot more than subsetting rows and selecting columns within the frame of data.table, i.e., within [ ... ]. The general form of data.table syntax is as shown below:

## DT[i, j, by]
## 
## ##   R:        i            j        by
## ## SQL:  where   select | update  group by

The way to read it (out loud) is as suggested:

Take DT, subset rows using i, then calculate j, grouped by by.

Let’s redo the work by dplyr above in the data.table way. This tutorial will only cover the basics of data.table. There are more advanced data.table operations that come in handy. See the data.table cheat sheet here. Comparisons between dplyr, data.table and pandas (a popular Python package) in terms of speed can be found here.

Create data.table

setDT transforms data.frame into data.table. Notice that since data.table is an enhanced version of data.frame, operations on data.frame are still avaiable to data.table. We first make a copy of flights and transform it into data.table.

flights2 <- copy(flights)
setDT(flights2)
class(flights2)
## [1] "data.table" "data.frame"

Subset rows in i

To show flights on Jan 1st, 2013:

dplyr:

flights %>% filter(month == 1 & day == 1)

data.table:

flights2[month == 1 & day == 1]

To get any flights from Jan through June to PHL or SLC airports:

dplyr:

flights %>% filter(dest %in% c("PHL","SLC") & month <= 6)

data.table:

flights2[dest %in% c("PHL","SLC") & month <= 6]

To get the first N row: dplyr:

flights[1:2,]

data.table:

flights2[1:2]

Sort rows in i

To sort column origin in ascending order, and then by dest in descending order: dplyr:

flights %>% 
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier)

data.table:

unique(flights2[order(origin, -dest), .(origin, dest, carrier)])

Select column(s) in j

To select Origin, Destination, and Carrier of flights: dplyr:

flights %>% select(origin, dest, carrier)

data.table:

flights2[, list(origin, dest, carrier)]

or

flights2[, .(origin, dest, carrier)]

or you can also store the columns names into a variable and select them using a special syntax ...

select_columns <- c("origin", "dest", "carrier")
flights2[, ..select_columns]

When selecting only one column, data.table allows us to return the column as a vector as what the base $ does.

If

To select origin and return it as a vector: dplyr:

flights$origin
flights %>% select(origin)

data.table:

flights2[, .(origin)]

To rename dest to destination and carrier to airline:

dplyr:

flights %>%
  select(origin, dest, carrier) %>% 
  rename(dest = destination, airline = carrier)

data.table:

flights2[, .(origin, destination = dest, airline = carrier)]

Notice that in dplyr, the syntax is rename(old_name = new_name) which be read as “rename old_name to new_name”; while data.table selects and renames at once: to take values in dest and to put the values to destination.

Compute in j

Other than selecting columns, j can also handle expressions. For example,

To compute how many trips had total delay < 0: base:

sum((flights$arr_delay + flights$dep_delay) < 0, na.rm = T)
## [1] 188401

data.table:

flights2[, sum((arr_delay + dep_delay) < 0, na.rm = T)]
## [1] 188401

Special symbol .N

There are some special symbol in data.table, e.g. .N, .SD, .SDcols. .N holds the number of observations in the current group. Think of it as a counterpart of n() of dplyr.

To count the total number of flights: dplyr:

flights %>%
  summarise(num_of_flights = n())
## # A tibble: 1 x 1
##   num_of_flights
##            <int>
## 1         336776

data.table:

flights2[, .(num_of_flights = .N)]
##    num_of_flights
## 1:         336776

Special symbol :=

To create new columns as what mutate of dplyr does:

dplyr:

flights %>%
  mutate(gain = dep_delay - arr_delay)

data.table:

flights2[, gain := dep_delay - arr_delay]

To remove gain: Use := NULL to remove columns.

flights2[, gain := NULL]
## Warning in `[.data.table`(flights2, , `:=`(gain, NULL)): Adding new column
## 'gain' then assigning NULL (deleting it).

Grouping using by

Similar to group_by of dplyr, data.table applies the operations in j to groups specified in by.

To count the total number of flights from each origin and average delay time of each: dplyr:

flights %>%
  group_by(origin) %>%
  summarise(num_of_flights = n(),
            avg_delay      = mean(dep_delay, na.rm = TRUE))

data.table:

flights2[, .(num_of_flights = .N, 
             avg_delay      = mean(dep_delay, na.rm = TRUE)), 
         by = origin]

by also takes expressions like group_by does.

To count the number of flights that started late but arrived early (or on time), started and arrived late etc.: dplyr:

flights %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  group_by(dep_delay > 0, arr_delay > 0) %>%
  summarise(num_of_flights = n())

data.table:

flights2[!is.na(dep_delay) & !is.na(arr_delay), 
         .N, by = .(dep_delay>0, arr_delay>0)]

Chaining

data.table can tack expressions one after another, forming a chain of operations similar to piping %>%, i.e., DT[ … ][ … ][ … ].

flights2[, .(max_distance = max(distance)), by = .(origin, dest)
       ][order(-max_distance)]

dplyr or data.table?

Choosing dplyr or data.table is a personal preference. Here is Stack Overflow post concerning this question. Both authors of dplyr (Hadley) and data.table (Arun) compare their packages in terms of speed, memory usage and syntax. In short, dplyr is more readable (though readability is a subjective question); while data.table performs faster than dplyr as the data size grows, and uses less memory in several functions by its nature.

Using ggplot

We now move on to ggplot. The basic idea of ggplot is to independently specify building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:

  • data
  • aesthetic mapping
  • geometric object
  • faceting

Aesthetic Mappings

In ggplot land aesthetic means “something you can see”. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • size

We now use a different dataset, gapminder, from the package gapminder. Let’s do a quick summary.

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Plots by Data Types

Data Plots Geom (ggplot command)
One Continuous Histogram geom_histogram
One Continuous + One Categorical Boxplot geom_boxplot
Two Continuous Scatter Plot geom_point
Three Continuous Scatter Plot + Size geom_point w/ size aesthetic
Two Continuous + One Categorical Scatter Plot + Color geom_point w/ color aesthetic
Categorical with reasonable number of levels Faceting!! facet_wrap()

Note: Time is always the x-axis.

There are many more geom types, but we will focus on the ones listed in the table above.

Here is an extremely useful cheatsheet that shows all of ggplots functions and how to use them.

One Continous / Geom_Histogram

The following shows the histogram of life Expectancy in 2007. Life expectancy is a continous variable, so we use geom_histogram().

Note how the %>% or “piping” also works with ggplot. If you are not piping in a dataframe, the first input to ggplot should be your dataframe. For example, the command would become ggplot(gapminder, aes(x = lifeExP)) + geom_histogram(binwidth = 2)

hist(gapminder$lifeExp)

gapminder %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2)

ggplot(gapminder, aes(x = lifeExp)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

One Continuous + One Categorical / Geom_boxplot

Now, we want to show lifeExp broken down by continent. Continent is a categorical variable, also called factors in R. For this, we use the geom_boxplot() command.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  geom_point(position = "jitter", alpha = 0.6, col = "blue")

Two Continous / Geom_Point

Using geom_point() we create a scatter plot of our two continous variables, gdpPercap and LifeExp.

plot(gapminder$gdpPercap, gapminder$lifeExp, pch=16)

gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(alpha = 0.2, col = "blue")

Some relationships will look better on different scales, and ggplot allows you to change scales very quickly. Here we log the x-axis, with scale_x_log10(), which makes the relationship between these two varibles much clearer.

gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  scale_x_log10()

Three Continuous / Geom_point With Size Aesthetic

If we want to show three continous variables at the same time, we can use the size aesthetic in ggplot. This will alter the size of the point by the value in the pop column of the gapminder data frame.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop)) + 
  geom_point() +
  geom_smooth() +
  scale_x_log10()

Two Continuous + One Categorical / Geom_point With Color Aesthetic

To show more insight into this graph, we can show each point by which continent it is from. Adding the color Aesthetic allows us to show a categorical variable, continent, as each point is colored by what continent it is from.

gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.6, position = "jitter") +
  scale_x_log10()

p <- gapminder %>%
  filter(year==1977) %>%
  ggplot( aes(gdpPercap, lifeExp, size = pop, color=continent)) +
  geom_point() +
  scale_x_log10() +
  theme_bw()

Faceting

Instead of changing the color of points on the graph by continent, you can also create a different graph for each continent by ‘faceting’. Depending on the number of factors and your dataset, faceting may look better than just changing colors. To do this we add the facet_wrap(~ continent) command.

# gapminder %>%
#   ggplot(aes(x = gdpPercap, y = )) +
#   geom_point() +
#   scale_x_log10() +
#   facet_wrap(~continent)

# gapminder %>%
#   ggplot(aes(x = pop, y = gdpPercap)) +
#   geom_point() +
#   scale_x_log10() +
#   facet_grid(. ~ continent, scale = "fixed")
# 
# glimpse(gapminder)

You can facet with any geom type. Here is an example with geom_histogram(). It is also possible to color and facet on the same variable, as shown below.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ continent)

Adding a linear model line quickly / Geom_smooth

ggplot can also quickly add a linear model to a graph. There are also other models geom_smooth can do (“lm”, “glm”, “gam”, “loess”, “rlm”). If you leaving it blank it will automatically choose one for you, but that is not recommended.

To add the linear model line, we add geom_smooth(method = 'lm', se = TRUE) to the command. se = TRUE tells it to plot the standard error ranges on the graph.

gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(alpha = year)) + 
  geom_smooth(method = 'lm', se = TRUE) + 
  scale_x_log10()

Changing Colors

ggplot makes it very easy to create very nice graphs, using whats called a theme. There are many default themes or you can make your own. Almost everything is customizable on ggplot, including sizes, colors, font types, etc. Below is a example of building a theme

To use a theme, it is simply added on the end of your ggplot string of commands. You can also add titles, change labels, etc. by adding a labs() command.

You can also use pre-existing themes. My favorite package is ‘ggthemes’, which recreates popular themes from websites and magazines. Here is one based on The Economist magazine

Running RStudio on Cloud Services

Lastly, we introduce a way to run Rstudio in the cloud. This can come in handy later on in the class when we do more complex analysis with larger datasets, as it will greatly speed up processing times.

There are many existing guides to help you set up an RStudio instance on the cloud with Google or Amazon Web Services.

Here are a few good guides for using Google Cloud Services:

and Amazon Web Services: