We will focus today on three specific packages: dplyr, data.table, and ggplot. dplyr and ggplot are part of a broader collection called the tidyverse, which consolidates commonly used data science packages. See more here: https://www.tidyverse.org/.
dplyr is used for data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. data.table is an R package that provides an enhanced version of data.frames; its speed makes it a preferable package for manipulating BIG data. ggplot is used for creating graphics and plots. In addition to the tidyverse, the nycflights13, gapminder and ggthemes packages will be used in this tutorial. We use the p_load function from pacman to load all the packages at once: it first checks whether each package is installed, automatically installs any that are missing, and then loads them all.
## Loading required package: pacman
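The setup chunk itself is not reproduced in this text; a minimal sketch of what it might look like (the package list is taken from the paragraph above) is:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, data.table, nycflights13, gapminder, ggthemes)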
dplyr

Below is a table of popular dplyr commands. We will go through each one.
| dplyr Command | SQL equivalent | Action |
|---|---|---|
| filter() | WHERE | Limit rows based on a condition |
| select() | SELECT | Choose only certain variables |
| distinct() | DISTINCT | De-duplicate the result set |
| arrange() | ORDER BY | Order results |
| rename() | SELECT | Rename variables |
| mutate() | SELECT | Create a new variable |
| group_by() | GROUP BY | Group rows |
| summarise() | SELECT | Create a new variable in a grouped setting |
You can also view the full cheat sheet here: DPLYR Cheat Sheet
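As a quick preview of how these verbs line up with SQL, here is a small sketch (not part of the original tutorial) that maps one SQL query onto a dplyr pipeline, using the flights dataset introduced just below:

# SQL: SELECT origin, dest FROM flights WHERE month = 1 ORDER BY dest
flights %>%
  filter(month == 1) %>%
  select(origin, dest) %>%
  arrange(dest)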
To illustrate these commands, we will use a pre-existing dataset that contains 336,776 flights that departed from New York City in 2013. dplyr allows you to gather insight from a dataset without altering the original data, and it is considered best practice not to alter the original dataset. In this case, we will never overwrite the existing dataset 'flights'. We will first take a look at the summary statistics.
summary(flights)

## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907
## Median :2013 Median : 7.000 Median :16.00 Median :1401
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400
## NA's :8255
## sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 106 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 906 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124
## Median :1359 Median : -2.00 Median :1535 Median :1556
## Mean :1344 Mean : 12.64 Mean :1502 Mean :1536
## 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945
## Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## NA's :8255 NA's :8713
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:336776 Min. : 1 Length:336776
## 1st Qu.: -17.000 Class :character 1st Qu.: 553 Class :character
## Median : -5.000 Mode :character Median :1496 Mode :character
## Mean : 6.895 Mean :1972
## 3rd Qu.: 14.000 3rd Qu.:3465
## Max. :1272.000 Max. :8500
## NA's :9430
## origin dest air_time distance
## Length:336776 Length:336776 Min. : 20.0 Min. : 17
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 502
## Mode :character Mode :character Median :129.0 Median : 872
## Mean :150.7 Mean :1040
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## NA's :9430
## hour minute time_hour
## Min. : 1.00 Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :13.00 Median :29.00 Median :2013-07-03 10:00:00
## Mean :13.18 Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:17.00 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :23.00 Max. :59.00 Max. :2013-12-31 23:00:00
##
head(flights, n = 30)

## # A tibble: 30 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 20 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
tail(flights)

## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 9 30 NA 1842 NA NA
## 2 2013 9 30 NA 1455 NA NA
## 3 2013 9 30 NA 2200 NA NA
## 4 2013 9 30 NA 1210 NA NA
## 5 2013 9 30 NA 1159 NA NA
## 6 2013 9 30 NA 840 NA NA
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
glimpse(flights)

## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
The %>% command is called a pipe. This means that the result of the code before %>% is sent, or “piped”, to the code after %>%. Piping is a powerful tool for clearly expressing a sequence of multiple operations, as we will see shortly.
Below is a simple example with just numbers. This is the same as 3 + 4 + 5 which gives us an output of 12.
3 %>%
+ 4 %>%
  + 5

## [1] 12
Pro Tip: In the RStudio editor, Command + Shift + M (Ctrl + Shift + M on Windows/Linux) inserts a pipe.
The filter command will only display the subset of your dataset that matches a certain condition. This command will only show flights on Jan 1st, 2013.
This code is the same as doing filter(flights, month == 1 & day == 1) since the %>% command passes the flights dataframe to the filter command.
firstdayofmonth <- flights %>%
filter(month == 1 & day == 1)
glimpse(firstdayofmonth)

## Observations: 842
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
glimpse(flights)

## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
It is important to remember that this command does not alter the original flights dataset. If we want to save this subset as its own object, we run the following. Remember, <- is the assignment operator in R.
filteredFlight <- flights %>%
  filter(month == 1 & day == 1)

Multiple conditions can be included in a filter command. The command below shows all flights from January through June to the PHL or SLC airports.
flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6)

## # A tibble: 2,116 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 655 655 0 1021
## 2 2013 1 1 908 915 -7 1004
## 3 2013 1 1 1047 1050 -3 1401
## 4 2013 1 1 1245 1245 0 1616
## 5 2013 1 1 1323 1300 23 1651
## 6 2013 1 1 1543 1550 -7 1933
## 7 2013 1 1 1600 1610 -10 1712
## 8 2013 1 1 1909 1912 -3 2239
## 9 2013 1 1 1915 1920 -5 2238
## 10 2013 1 1 2000 2000 0 2054
## # ... with 2,106 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
flights %>%
  filter(dest %in% c("PHL","SLC") & month %in% c(1, 3, 6))

## # A tibble: 1,112 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 655 655 0 1021
## 2 2013 1 1 908 915 -7 1004
## 3 2013 1 1 1047 1050 -3 1401
## 4 2013 1 1 1245 1245 0 1616
## 5 2013 1 1 1323 1300 23 1651
## 6 2013 1 1 1543 1550 -7 1933
## 7 2013 1 1 1600 1610 -10 1712
## 8 2013 1 1 1909 1912 -3 2239
## 9 2013 1 1 1915 1920 -5 2238
## 10 2013 1 1 2000 2000 0 2054
## # ... with 1,102 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Select will only return columns that are listed. In this case, the resulting dataset will consist of the Origin, Destination, and Carrier of flights that were destined for PHL or SLC in the first 6 months of the year. Remember, the pipe command sends the result of the current line to the next line. In this case, the filtered dataset is then piped into the select command.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier)

## # A tibble: 2,116 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 JFK SLC DL
## 2 LGA PHL US
## 3 JFK SLC DL
## 4 JFK SLC DL
## 5 EWR SLC DL
## 6 JFK SLC DL
## 7 JFK PHL 9E
## 8 JFK SLC B6
## 9 JFK SLC DL
## 10 JFK PHL 9E
## # ... with 2,106 more rows
Conversely, we can use - to deselect columns. If we want to drop year, month and day, we just prefix each column name with -.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(-year, -month, -day)

## # A tibble: 2,116 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 655 655 0 1021 1030 -9
## 2 908 915 -7 1004 1033 -29
## 3 1047 1050 -3 1401 1410 -9
## 4 1245 1245 0 1616 1615 1
## 5 1323 1300 23 1651 1608 43
## 6 1543 1550 -7 1933 1925 8
## 7 1600 1610 -10 1712 1729 -17
## 8 1909 1912 -3 2239 2237 2
## 9 1915 1920 -5 2238 2257 -19
## 10 2000 2000 0 2054 2110 -16
## # ... with 2,106 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
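select() also accepts tidyselect helpers such as starts_with() and contains(); a brief sketch (not shown in the original tutorial):

# keep origin plus every column whose name contains "time"
flights %>%
  select(origin, contains("time"))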
Distinct will remove any duplicate rows from the given dataset. Notice that the previous command returned a subset with 2,116 rows, but with distinct() we can see there are only 8 unique origin–destination–carrier combinations for flights to PHL or SLC in the first half of the year.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
  distinct()

## # A tibble: 8 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 JFK SLC DL
## 2 LGA PHL US
## 3 EWR SLC DL
## 4 JFK PHL 9E
## 5 JFK SLC B6
## 6 EWR PHL EV
## 7 JFK PHL US
## 8 JFK PHL DL
Arrange sorts your data. In this case the rows are ordered first by origin, then in descending alphabetical order of the destination, then in alphabetical order of carrier.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
distinct() %>%
  arrange(origin, desc(dest), carrier)

## # A tibble: 8 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 EWR SLC DL
## 2 EWR PHL EV
## 3 JFK SLC B6
## 4 JFK SLC DL
## 5 JFK PHL 9E
## 6 JFK PHL DL
## 7 JFK PHL US
## 8 LGA PHL US
The rename function can be used to easily rename a column header. Here, we rename carrier to airline.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
distinct() %>%
arrange(origin, desc(dest), carrier) %>%
  rename(airline = carrier)

## # A tibble: 8 x 3
## origin dest airline
## <chr> <chr> <chr>
## 1 EWR SLC DL
## 2 EWR PHL EV
## 3 JFK SLC B6
## 4 JFK SLC DL
## 5 JFK PHL 9E
## 6 JFK PHL DL
## 7 JFK PHL US
## 8 LGA PHL US
glimpse(flights)

## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
Mutate is used to create new columns based on current ones. This feature is very useful. Here, we create three new variables “gain”, “speed”, and “gain_per_hour”. Notice how “gain_per_hour” uses the column “gain”, which was created in the same mutate statement.
flights %>%
mutate(gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / (air_time / 60)) %>%
  select(dep_delay, arr_delay, gain, distance, air_time, speed, gain_per_hour)

## # A tibble: 336,776 x 7
## dep_delay arr_delay gain distance air_time speed gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 11 -9 1400 227 370. -2.38
## 2 4 20 -16 1416 227 374. -4.23
## 3 2 33 -31 1089 160 408. -11.6
## 4 -1 -18 17 1576 183 517. 5.57
## 5 -6 -25 19 762 116 394. 9.83
## 6 -4 12 -16 719 150 288. -6.4
## 7 -5 19 -24 1065 158 404. -9.11
## 8 -3 -14 11 229 53 259. 12.5
## 9 -3 -8 5 944 140 405. 2.14
## 10 -2 8 -10 733 138 319. -4.35
## # ... with 336,766 more rows
flights %>%
mutate(total_delay = dep_delay + arr_delay) %>%
select(carrier, total_delay) %>%
arrange(desc(total_delay)) %>%
group_by(carrier) %>%
  summarise(average_delay = mean(total_delay))

## # A tibble: 16 x 2
## carrier average_delay
## <chr> <dbl>
## 1 9E NA
## 2 AA NA
## 3 AS NA
## 4 B6 NA
## 5 DL NA
## 6 EV NA
## 7 F9 NA
## 8 FL NA
## 9 HA -2.01
## 10 MQ NA
## 11 OO NA
## 12 UA NA
## 13 US NA
## 14 VX NA
## 15 WN NA
## 16 YV NA
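Most carriers show NA above because mean() returns NA whenever any of its inputs is missing. A sketch of the same summary with missing values dropped (this chunk is not in the original):

flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%
  group_by(carrier) %>%
  summarise(average_delay = mean(total_delay, na.rm = TRUE))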
group_by reorganizes the dataframe into groups according to the column that is grouped by; summarise then gives statistics for each group. Here, the origin column has three categories: EWR, JFK, and LGA. The group_by(origin) command organizes the dataframe by the three origins, and summarise() is then used to compute metrics for each origin.
From this table, we can see that EWR had the most flights with 120,835, and LGA had the lowest average delay at about 10.3 minutes.
flights %>%
group_by(origin) %>%
summarise(num_of_flights = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE)) # na.rm removes any NA data

## # A tibble: 3 x 3
## origin num_of_flights avg_delay
## <chr> <int> <dbl>
## 1 EWR 120835 15.1
## 2 JFK 111279 12.1
## 3 LGA 104662 10.3
group_by can also take expressions. The following returns the number of flights that started late but arrived early (or on time), started and arrived late etc.
flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
group_by(dep_delay > 0, arr_delay > 0) %>%
  summarise(num_of_flights = n())

## # A tibble: 4 x 3
## # Groups: dep_delay > 0 [?]
## `dep_delay > 0` `arr_delay > 0` num_of_flights
## <lgl> <lgl> <int>
## 1 FALSE FALSE 158900
## 2 FALSE TRUE 40701
## 3 TRUE FALSE 35442
## 4 TRUE TRUE 92303
Summarise has a number of other functions that can be used within it. n_distinct(dest) returns the number of distinct destinations. From this table we can see that EWR has flights to the largest number of destinations (86). We can also see that LGA flights have a lower average distance than those of EWR and JFK.
flights %>%
group_by(origin) %>%
summarise(destinations = n_distinct(dest),
            avg_distance = mean(distance, na.rm = TRUE))

## # A tibble: 3 x 3
## origin destinations avg_distance
## <chr> <int> <dbl>
## 1 EWR 86 1057.
## 2 JFK 70 1266.
## 3 LGA 68 780.
Across the whole dataset there are 336,776 flights, 105 distinct destinations, and an average departure delay of about 12.6 minutes. Below we compute the same kind of summary broken down by carrier.
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n(),
destinations = n_distinct(dest),
            avg_delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 16 x 4
## carrier num_of_flights destinations avg_delay
## <chr> <int> <int> <dbl>
## 1 9E 18460 49 16.7
## 2 AA 32729 19 8.59
## 3 AS 714 1 5.80
## 4 B6 54635 42 13.0
## 5 DL 48110 40 9.26
## 6 EV 54173 61 20.0
## 7 F9 685 1 20.2
## 8 FL 3260 3 18.7
## 9 HA 342 1 4.90
## 10 MQ 26397 20 10.6
## 11 OO 32 5 12.6
## 12 UA 58665 47 12.1
## 13 US 20536 6 3.78
## 14 VX 5162 5 12.9
## 15 WN 12275 11 17.7
## 16 YV 601 3 19.0
dplyr is a great way to answer initial questions about a dataset. For example, say we want to know what the farthest flight to leave NYC is.
To answer this, we can group by origin and destination, summarise the maximum distance for each pair, and then order by the maximum distance we created. It is now easy to see that the longest flights go from JFK and EWR to HNL.
flights %>%
group_by(origin, dest) %>%
summarise(max_distance = max(distance)) %>%
  arrange(desc(max_distance))

## # A tibble: 224 x 3
## # Groups: origin [3]
## origin dest max_distance
## <chr> <chr> <dbl>
## 1 JFK HNL 4983
## 2 EWR HNL 4963
## 3 EWR ANC 3370
## 4 JFK SFO 2586
## 5 JFK OAK 2576
## 6 JFK SJC 2569
## 7 EWR SFO 2565
## 8 JFK SMF 2521
## 9 JFK LAX 2475
## 10 JFK BUR 2465
## # ... with 214 more rows
data.table

As an enhanced version of data.frame, one can do a lot more than subsetting rows and selecting columns within the frame of a data.table, i.e., within [ ... ]. The general form of data.table syntax is shown below:
## DT[i, j, by]
##
## ## R: i j by
## ## SQL: where select | update group by
The way to read it (out loud) is as suggested:
Take DT, subset rows using i, then calculate j, grouped by by.
Let’s redo the work by dplyr above in the data.table way. This tutorial will only cover the basics of data.table. There are more advanced data.table operations that come in handy. See the data.table cheat sheet here. Comparisons between dplyr, data.table and pandas (a popular Python package) in terms of speed can be found here.
setDT() transforms a data.frame into a data.table. Notice that since data.table is an enhanced version of data.frame, operations on data.frames are still available for data.tables. We first make a copy of flights and transform it into a data.table.
flights2 <- copy(flights)
setDT(flights2)
class(flights2)

## [1] "data.table" "data.frame"
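Putting i, j, and by together, here is a quick sketch (not from the original) of the general form applied to flights2:

flights2[origin == "JFK",                               # i: subset rows
         .(avg_delay = mean(dep_delay, na.rm = TRUE)),  # j: compute
         by = month]                                    # by: group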
i

To show flights on Jan 1st, 2013:
dplyr:
flights %>% filter(month == 1 & day == 1)

data.table:

flights2[month == 1 & day == 1]

To get any flights from January through June to the PHL or SLC airports:

dplyr:

flights %>% filter(dest %in% c("PHL","SLC") & month <= 6)

data.table:

flights2[dest %in% c("PHL","SLC") & month <= 6]

To get the first N rows:

dplyr:

flights[1:2, ]

data.table:

flights2[1:2]

i

To sort column origin in ascending order, and then by dest in descending order:

dplyr:
flights %>%
select(origin, dest, carrier) %>%
distinct() %>%
  arrange(origin, desc(dest), carrier)

data.table:

unique(flights2[order(origin, -dest), .(origin, dest, carrier)])

j

To select Origin, Destination, and Carrier of flights:

dplyr:
flights %>% select(origin, dest, carrier)

data.table:

flights2[, list(origin, dest, carrier)]

or

flights2[, .(origin, dest, carrier)]

or you can also store the column names in a variable and select them with the special .. prefix:

select_columns <- c("origin", "dest", "carrier")
flights2[, ..select_columns]

When selecting only one column, data.table can return it either as a one-column data.table or as a plain vector (as the base $ does), depending on whether the name is wrapped in .().

To select origin and return it as a vector:

base R and dplyr:

flights$origin
flights %>% pull(origin)  # select(origin) would return a one-column tibble instead

data.table:

flights2[, origin]     # no .(), so a vector is returned
flights2[, .(origin)]  # wrapped in .(), so a one-column data.table is returned

To rename dest to destination and carrier to airline:
dplyr:
flights %>%
select(origin, dest, carrier) %>%
  rename(destination = dest, airline = carrier)

data.table:

flights2[, .(origin, destination = dest, airline = carrier)]

Notice that in dplyr the syntax is rename(new_name = old_name), read as “rename old_name to new_name”; data.table, by contrast, selects and renames at once: it takes the values in dest and puts them in a column called destination.
j

Other than selecting columns, j can also handle expressions. For example:
To compute how many trips had total delay < 0:

base R:
sum((flights$arr_delay + flights$dep_delay) < 0, na.rm = T)

## [1] 188401
data.table:
flights2[, sum((arr_delay + dep_delay) < 0, na.rm = T)]

## [1] 188401
.N

There are some special symbols in data.table, e.g. .N, .SD, and .SDcols. .N holds the number of observations in the current group; think of it as the counterpart of dplyr's n().
To count the total number of flights:

dplyr:
flights %>%
  summarise(num_of_flights = n())

## # A tibble: 1 x 1
## num_of_flights
## <int>
## 1 336776
data.table:
flights2[, .(num_of_flights = .N)]

## num_of_flights
## 1: 336776
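.SD and .SDcols are not demonstrated in the original tutorial; a brief sketch of what they do, assuming flights2 as above:

# .SD is the Subset of Data for each group; .SDcols restricts which columns it holds.
# Average departure and arrival delay per origin:
flights2[, lapply(.SD, mean, na.rm = TRUE),
         .SDcols = c("dep_delay", "arr_delay"),
         by = origin]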
:=

To create new columns, as dplyr's mutate does:
dplyr:
flights %>%
  mutate(gain = dep_delay - arr_delay)

data.table:
flights2[, gain := dep_delay - arr_delay]

To remove gain, assign NULL with :=.
flights2[, gain := NULL]

## Warning in `[.data.table`(flights2, , `:=`(gain, NULL)): Adding new column
## 'gain' then assigning NULL (deleting it).
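For completeness, several columns can be created in one := call, mirroring the earlier mutate example. This sketch is not in the original; note that, unlike mutate(), a column created in the same := call cannot be referenced immediately, so gain_per_hour repeats the expression:

flights2[, c("gain", "speed", "gain_per_hour") :=
             .(dep_delay - arr_delay,
               distance / air_time * 60,
               (dep_delay - arr_delay) / (air_time / 60))]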
by

Similar to group_by of dplyr, data.table applies the operations in j to the groups specified in by.
To count the total number of flights from each origin and the average delay of each:

dplyr:
flights %>%
group_by(origin) %>%
summarise(num_of_flights = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE))

data.table:
flights2[, .(num_of_flights = .N,
avg_delay = mean(dep_delay, na.rm = TRUE)),
         by = origin]

by also takes expressions, just like group_by does.
To count the number of flights that started late but arrived early (or on time), started and arrived late, etc.:

dplyr:
flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
group_by(dep_delay > 0, arr_delay > 0) %>%
  summarise(num_of_flights = n())

data.table:
flights2[!is.na(dep_delay) & !is.na(arr_delay),
         .N, by = .(dep_delay > 0, arr_delay > 0)]

data.table can also chain expressions one after another, forming a sequence of operations similar to piping with %>%, i.e., DT[ ... ][ ... ][ ... ].
flights2[, .(max_distance = max(distance)), by = .(origin, dest)
][order(-max_distance)]

dplyr or data.table?

Choosing dplyr or data.table is a matter of personal preference. Here is a Stack Overflow post concerning this question, in which the authors of dplyr (Hadley) and data.table (Arun) compare their packages in terms of speed, memory usage and syntax. In short, dplyr is generally more readable (though readability is subjective), while data.table performs faster than dplyr as the data size grows and uses less memory in several operations by its nature.
ggplot

We now move on to ggplot. The basic idea of ggplot is to independently specify building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include the data, aesthetic mappings, geometric objects (geoms), statistical transformations, scales, and facets.
In ggplot land, aesthetic means “something you can see”. Examples include position (x and y), color, fill, shape, size, and transparency (alpha).
We now use a different dataset, gapminder, from the package gapminder. Let’s do a quick summary.
summary(gapminder)

## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
glimpse(gapminder)

## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
| Data | Plots | Geom (ggplot command) |
|---|---|---|
| One Continuous | Histogram | geom_histogram |
| One Continuous + One Categorical | Boxplot | geom_boxplot |
| Two Continuous | Scatter Plot | geom_point |
| Three Continuous | Scatter Plot + Size | geom_point w/ size aesthetic |
| Two Continuous + One Categorical | Scatter Plot + Color | geom_point w/ color aesthetic |
| Categorical with reasonable number of levels | Faceting!! | facet_wrap() |
Note: when one of the variables is time, time goes on the x-axis.
There are many more geom types, but we will focus on the ones listed in the table above.
Here is an extremely useful cheatsheet that shows all of ggplot's functions and how to use them.
The following shows a histogram of life expectancy. Life expectancy is a continuous variable, so we use geom_histogram().
Note how the %>% (“piping”) also works with ggplot. If you are not piping in a dataframe, the first argument to ggplot() should be your dataframe. For example, the command would become ggplot(gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 2).
hist(gapminder$lifeExp)

gapminder %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, we want to show lifeExp broken down by continent. Continent is a categorical variable, also called a factor in R. For this, we use the geom_boxplot() command.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot() +
  geom_point(position = "jitter", alpha = 0.6, col = "blue")

Using geom_point() we create a scatter plot of our two continuous variables, gdpPercap and lifeExp.
plot(gapminder$gdpPercap, gapminder$lifeExp, pch = 16)

gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2, col = "blue")

Some relationships look better on different scales, and ggplot allows you to change scales very quickly. Here we log the x-axis with scale_x_log10(), which makes the relationship between these two variables much clearer.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
  scale_x_log10()

If we want to show three continuous variables at the same time, we can use the size aesthetic in ggplot. This scales the size of each point by the value in the pop column of the gapminder data frame.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, size = pop)) +
geom_point() +
geom_smooth() +
scale_x_log10()To show more insight into this graph, we can show each point by which continent it is from. Adding the color Aesthetic allows us to show a categorical variable, continent, as each point is colored by what continent it is from.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.6, position = "jitter") +
  scale_x_log10()

p <- gapminder %>%
filter(year==1977) %>%
ggplot( aes(gdpPercap, lifeExp, size = pop, color=continent)) +
geom_point() +
scale_x_log10() +
  theme_bw()

Instead of changing the color of points by continent, you can also create a separate graph for each continent by 'faceting'. Depending on the number of factor levels and your dataset, faceting may look better than just changing colors. To do this we add the facet_wrap(~ continent) command, as in the (commented-out) chunks and the sketch below.
# gapminder %>%
# ggplot(aes(x = gdpPercap, y = lifeExp)) +
# geom_point() +
# scale_x_log10() +
# facet_wrap(~continent)
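Since the chunk above is commented out, here is a working sketch of the same idea (not part of the original output): faceting the log-scaled GDP vs. life-expectancy scatter plot by continent.

gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  facet_wrap(~ continent)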
# gapminder %>%
# ggplot(aes(x = pop, y = gdpPercap)) +
# geom_point() +
# scale_x_log10() +
# facet_grid(. ~ continent, scale = "fixed")
#
# glimpse(gapminder)

You can facet with any geom type. Here is an example with geom_histogram(). It is also possible to color and facet on the same variable, as shown below.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_histogram(binwidth = 2) +
  facet_wrap(~ continent)

ggplot can also quickly add a linear model to a graph. geom_smooth() supports other methods as well ("lm", "glm", "gam", "loess", "rlm"). If you leave the method blank it will automatically choose one for you, but that is not recommended.
To add the linear model line, we add geom_smooth(method = 'lm', se = TRUE) to the command. se = TRUE tells it to plot the standard error ranges on the graph.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(alpha = year)) +
geom_smooth(method = 'lm', se = TRUE) +
  scale_x_log10()

ggplot makes it very easy to create nice graphs using what's called a theme. There are many built-in themes, or you can make your own. Almost everything in ggplot is customizable, including sizes, colors, font types, etc. Below is an example of building a theme.
To use a theme, it is simply added on the end of your ggplot string of commands. You can also add titles, change labels, etc. by adding a labs() command.
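The theme-building example itself did not survive in this text, so here is a minimal sketch of a hand-rolled theme plus labs(), assuming the 2007 gapminder scatter plot from earlier (the theme settings are illustrative, not the original author's):

my_theme <- theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "bottom",
        panel.grid.minor = element_blank())

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  labs(title = "Life expectancy vs. GDP per capita (2007)",
       x = "GDP per capita (log scale)",
       y = "Life expectancy") +
  my_theme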
You can also use pre-existing themes. My favorite package is 'ggthemes', which recreates popular themes from websites and magazines. Here is one based on The Economist magazine, sketched below.
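The rendered plot is not reproduced here; a sketch of what that Economist-styled call might look like, using theme_economist() and scale_color_economist() from ggthemes:

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  theme_economist() +
  scale_color_economist()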
Lastly, we introduce a way to run RStudio in the cloud. This can come in handy later in the class when we do more complex analysis on larger datasets, as it will greatly speed up processing times.
There are many existing guides to help you set up an RStudio instance on the cloud with Google or Amazon Web Services.
Here are a few good guides for using Google Cloud Services:
and Amazon Web Services: