This tutorial is designed to get you up and running with R as quickly as possible. If you have any questions, come to office hours (listed on Canvas), or ask questions on Piazza.
First, we need to install ‘base R.’ Installers for Windows, macOS, and Linux are available on CRAN; pick the one that matches your computer. If you already have R installed, I strongly advise updating to the most recent version. If you are familiar with command-line package managers, e.g. brew, feel free to use those toolchains.
Base R comes with a passable ‘graphical user interface’ (GUI). Because R is an interpreted programming language, you can write R using any text editor. However, we strongly recommend using RStudio, an excellent interface for coding in R.
RStudio integrates the command line, a graphical file directory explorer, a graphical environment/variable inspector, and figure/plot output. It also has first-class support for R Markdown, which will come in very handy for your assignments.
You can install RStudio for free from their website; download the personal desktop version. If you are feeling adventurous, can tolerate crashes, and won’t complain to the teaching staff, you can try the RStudio nightly build for many improvements.
After opening RStudio, you will want to open a new R Markdown document by going to File > New File > R Markdown, or by selecting R Markdown from the top-left dropdown in RStudio. You will likely be asked to install a number of packages; go ahead and do that. If the install fails (permission issues), you can try the following:
install.packages("rmarkdown")
install.packages("knitr")
An R code file is fundamentally a list of commands to be executed. You can write code directly in the console area, but it is preferable to use the R Scripting area to visualize and keep track of your commands. This is especially important for complicated analyses that you wish to reproduce or share.
R Markdown files are perfect for reproducing and sharing data analysis. Each file is broken down into text chunks, which are lightly formatted using Markdown, and code chunks, which run R code. The chunks run in order and share one R session, so each code chunk can use the objects created by the chunks before it. A cheat sheet provided by RStudio can be found here.
Code chunks look like the following:
## ```{r}
## # code goes here
## ```
Three backticks indicate that the block is a code block, and {r} indicates that the code is in the R language. You can show or hide the code with echo=TRUE or echo=FALSE, and you can run a chunk while showing neither its code nor its output with include=FALSE. More options are available. You can also click on the settings icon (the little gear at the top-right corner of each chunk) to check options.
This file itself is an R Markdown file! You might be reading it as a PDF document or as a Word document. RStudio (technically, the knitr R package) builds and can save documents in a number of different formats. You may need to install Pandoc for exporting to PDF or Word, but exporting to HTML should pretty much always work. The Wharton lab computers should work.
There is a button at the top of RStudio that should read ‘Knit [format]’; clicking it will knit this document in that format. Try it out!
For more help with R Markdown, check out Hadley’s writeup or this cheatsheet. For time and brevity, we cannot cover everything about it. You can also examine the source of this Rmd file to understand how these files are laid out. Additionally, our homework files will be presented in Rmd to examine as well.
At its most basic, R can be used as a calculator. The following code block shows a super simple operation.
(39 + 14) / 7
## [1] 7.571429
You can assign values to variable names with the ‘left arrow’ operator, <-, and then access them with that name.
x <- pi
x
## [1] 3.141593
In RStudio, you can use alt + - to create the arrow operator. Do not use the ‘equals’ sign assignment!
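One reason often given for preferring the arrow is that = does double duty in R: inside a function call it names an argument rather than assigning. The sketch below is only an illustration; the variable names (values, doubled, and so on) are made up for this example.

```r
values <- c(4, 8, 15)      # assignment with the arrow
total <- sum(values)

# Inside a call, `=` matches an argument name. Here `x` is the first
# argument of median(); no variable called x is created by this line.
med_named <- median(x = values)

# With `<-` inside a call, the assignment happens first and its value is
# passed on, so `doubled` now exists in the workspace as well.
med_arrow <- median(doubled <- values * 2)

total
med_named
med_arrow
doubled
```

Sticking to <- for assignment and = for arguments keeps the two jobs visually distinct.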
There is, of course, far more to R code than just this. More complex projects include packages to write books, serve interactive data applications (examples), or ‘just’ do machine learning.
R packages can also take advantage of C++ integration (example: RPresto), or integrate tightly with the command line or system processes.
There are even packages to mansplain the code :)
You can evaluate code in your file in several ways:
Here are a few more nice key shortcuts:
RStudio has many helpful shortcuts and tooling - check their Shortcut cheatsheet and their Tip Twitter for more.
Getting help is pretty straightforward in R. Here are a few ways to get help on a function such as read.csv. The first two commands don’t output anything to your command line but open up a help file in the help viewer; the third prints a list of matching function names.
Three key ways to look up help pages:
?read.csv
help(read.csv)
apropos("read") # List all the functions with "read" as part of the name. Very useful!
Google is your best friend! If you pair “R” with some phrase related to statistics or data, Google usually does a good job, e.g. “R how to read csv files” or “R plot histogram”. Stack Overflow has answers to many of the questions you might bump into. Developers of packages answer questions on Stack Overflow too, including Hadley Wickham (author of ggplot2, dplyr, etc.) and Dirk Eddelbuettel (author of Rcpp)!
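Two more base R helpers are handy when you only need a quick reminder rather than the full help page. This is a small sketch: apropos() also accepts a regular expression, and args() prints a function's argument list in the console. The pattern used here is just an example.

```r
# Names of all visible functions that start with "read."
read_functions <- apropos("^read\\.")
read_functions

# Show the arguments of read.csv without opening the help viewer
args(read.csv)

# For a full-text search of the installed help pages, try this interactively:
# ??"histogram"
```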
If you’re having trouble importing data, RStudio makes reading data easy with Environment tab > Import Dataset.
R is an open source statistical language, and contributors can write “packages” with supplemental functions that let us apply the algorithms we learn about in class to actual data! The vast number of packages for R is one of its biggest strengths. There are over 11,000 packages available on CRAN, the Comprehensive R Archive Network.
One of the best ways to explore the packages available for R is through the Task Views page, describing the packages available in R. We will be using many statistical and machine learning algorithms, plotting functions, and datasets that are not available in the base distribution of R.
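Because these packages have to be installed once before they can be loaded, it can be convenient to check what is missing first. The following is a minimal sketch, not a required workflow; the package list in wanted is only an example, and the install step is guarded by interactive() so that it never runs while knitting.

```r
wanted <- c("MASS", "dplyr")   # example list; replace with what you need

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]

# Only install when running by hand in the console
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

missing_pkgs   # packages that still need installing (possibly none)
```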
The following code explains a few of the core package operations.
install.packages("MASS") # Install MASS from CRAN
library("MASS") # Load package MASS
help(package = "MASS") # Get information about MASS
vignette(package = "dplyr") # List the vignettes available for dplyr
detach("package:MASS", # Detach package
       unload = TRUE)
Your working directory is where R will find and save data files, plots, etc. We recommend making a folder in your Dropbox directory for this class and its assorted files (see appendix).
getwd()
## [1] "C:/Users/Jeffrey Yang/Dropbox/STAT 471 TA"
You can set your working directory with setwd(path). Make sure you always check your working directory before reading data! This is especially important if you are working on several analyses that assume different working directories.
# d <- "/Users/lzhao/Dropbox/Stat471/Canvas_Spring_2016/R Tutorial/R_tutorial"
# setwd(d)
The data for the rest of this section is in the Data folder in Canvas (Courses > Stat 471 > Files > Data). Download both of these files (Survey_results_final.csv and tips.txt) and put them in your working directory.
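Before reading the files, it can save some confusion to confirm that R actually sees them in the working directory. A minimal sketch using base R's file.exists(); the two file names are the ones mentioned above, and everything else is illustrative.

```r
needed_files <- c("Survey_results_final.csv", "tips.txt")

found <- file.exists(needed_files)   # one TRUE/FALSE per file
names(found) <- needed_files
found

# Report anything missing, together with the directory R is looking in
if (!all(found)) {
  message("Missing from ", getwd(), ": ",
          paste(needed_files[!found], collapse = ", "))
}
```

If a file is reported missing, either move it into the directory shown or change the working directory with setwd().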
Here’s an example of how to read in a .csv data file located in your working directory, using the read.csv function in R:
radio <- read.csv("Survey_results_final.csv", header = TRUE,
                  stringsAsFactors = FALSE)
The most important thing to note is the path to the file. If you set your working directory correctly, and the file is in the working directory, this will work. You can also use a full path, e.g. C:/Users/Jeffrey Yang/Dropbox/STAT 471 TA/Survey_results_final.csv. Alternatively, you can pass a URL that points to content on the internet, and R will open the connection to download the file.
This example downloads some county level data from the internet, parses the data as a table, and then returns the first 10 rows.
ff <- "https://cdn.rawgit.com/Keno/8573181/raw/7e97f56f521d1f49b966e04457687e87da1b062b/gistfile1.txt"
ff_example <- read.csv(curl::curl(ff), stringsAsFactors = F)
head(ff_example, 10)
##                 NAME STATE_NAME STATE_FIPS CNTY_FIPS  FIPS
## 1  Lake of the Woods  Minnesota         27        77 27077
## 2              Ferry Washington         53        19 53019
## 3            Stevens Washington         53        65 53065
## 4           Okanogan Washington         53        47 53047
## 5       Pend Oreille Washington         53        51 53051
## 6           Boundary      Idaho         16        21 16021
## 7            Lincoln    Montana         30        53 30053
## 8           Flathead    Montana         30        29 30029
## 9            Glacier    Montana         30        35 30035
## 10             Toole    Montana         30       101 30101
You can add additional parameters to customize the import process. Use ?read.csv to see the available options (or type read.csv( and press Tab in RStudio).
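As an illustration of such parameters, the sketch below reads a small semicolon-separated table in which -99 is used as a missing-value code. The data is invented for this example and is supplied through the text argument so that the code runs without any file.

```r
# Made-up data: ';' as separator, -99 as a code for "missing"
raw_text <- "id;score;group
1;3.5;a
2;NA;b
3;-99;a"

scores <- read.csv(text = raw_text,
                   sep = ";",                     # field separator
                   na.strings = c("NA", "-99"),   # values to treat as NA
                   stringsAsFactors = FALSE)

str(scores)
sum(is.na(scores$score))   # number of missing scores after import
```

With a real file you would pass the file name as the first argument instead of text.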
Before you conduct your analysis it is always wise to take a quick look at the data and try to spot anything abnormal such as missing data.
In R, data is usually stored in an object called a ‘data frame’. Each row is an observation and each column is a variable/feature.
class(radio)
## [1] "data.frame"
As noted above, you can type radio into the console and get the full representation of the object. However, this won’t display nicely when there are a lot of columns. We often examine the structure, head, or tail of the data to get a feel for it.
str(radio)
head(radio)
tail(radio)
ncol(radio)
You can also check the dimensions of the dataset with dim(). Other useful functions include length(), nrow(), ncol(). Variable names are accessed with the names() function.
In RStudio, you can also go to the Environment panel and click on a particular object to open a visual representation of the object. You can also access that with View() (capital V).
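To spot missing data, as suggested at the start of this section, counting NA values per column is a quick first check. The sketch below uses a tiny made-up data frame (toy) so that it runs on its own; the same two calls work on any data frame.

```r
# Toy data with a few missing values (invented for illustration)
toy <- data.frame(a = c(1, 2, 3),
                  b = c(NA, "x", "y"),
                  c = c(NA, NA, 1))

na_per_column <- colSums(is.na(toy))   # number of NAs in each column
na_per_column

complete_rows <- complete.cases(toy)   # TRUE where a row has no NA
sum(complete_rows)                     # number of fully observed rows
```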
You can subset with brackets. names(radio) returns a character vector, and to access its first element you use names(radio)[1].
names(radio)[1] <- "hit_id"
names(radio)[1:10]
##  [1] "hit_id"                      "HITTypeId"
##  [3] "Title"                       "Description"
##  [5] "Keywords"                    "Reward"
##  [7] "CreationTime"                "MaxAssignments"
##  [9] "RequesterAnnotation"         "AssignmentDurationInSeconds"
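Renaming by position works, but it silently renames the wrong column if the column order ever changes. A slightly more defensive sketch is to match on the old name instead. The data frame survey and the old column name "HITId" below are hypothetical, chosen only to make the example runnable.

```r
# Hypothetical two-column data frame standing in for the survey data
survey <- data.frame(HITId = 1:2,
                     Reward = c("$0.10", "$0.10"),
                     stringsAsFactors = FALSE)

# Rename whichever column is currently called "HITId"
names(survey)[names(survey) == "HITId"] <- "hit_id"
names(survey)
```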
The lm() command stands for linear model and allows you to run regressions. Here we’ll run a quick analysis of the relationship between tipping behavior and total bill size.
tips <- read.csv("tips.txt", stringsAsFactors = T)
str(tips)
## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
Run some of these commands to explore the dataset.
dim(tips) # the size of the data
head(tips) # look at the first few entries
head(tips, 10) # look at the first ten entries
tail(tips)
names(tips) # see the name of the columns
summary(tips) # get a simple summary of each variable
It’s easy to create a new variable as a function of other variables. Here, tips$tip denotes the tip column in the tips data and tips$total_bill denotes the total_bill column.
tips$percent <- 100*tips$tip/tips$total_bill # create a new variable
str(tips)
## 'data.frame':    244 obs. of  8 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
##  $ percent   : num  5.94 16.05 16.66 13.98 14.68 ...
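The same column arithmetic can be tried on a tiny data frame where the numbers are easy to check by hand. This is a sketch with invented values; bills is a hypothetical stand-in for tips, and transform() is shown as an alternative base R way to add a column without repeating the data frame name.

```r
# Two made-up restaurant bills
bills <- data.frame(total_bill = c(20, 50),
                    tip = c(3, 10))

# New column computed element by element from existing columns
bills$percent <- 100 * bills$tip / bills$total_bill

# Equivalent with transform(): columns are referred to by bare name
bills2 <- transform(bills, tip_share = tip / total_bill)

bills
bills2
```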
Here are some basic but important plotting functions that come with the base distribution of R. For the most part, this is all you’ll need, give or take a few other plot types (e.g. qqplot(), qqline(), abline()).
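As a small taste of one of those other plot types, the sketch below draws a normal quantile-quantile plot. It uses qqnorm(), the companion of qqline() for comparing a sample against the normal distribution, on simulated data; the sample and the seed are arbitrary choices for illustration.

```r
set.seed(471)                       # arbitrary seed, for reproducibility
sample_values <- rnorm(100)         # 100 draws from a standard normal

qq <- qqnorm(sample_values,         # points: theoretical vs. sample quantiles
             main = "Normal Q-Q Plot of Simulated Data")
qqline(sample_values, col = "red")  # reference line through the quartiles
```

Points that fall close to the line suggest the sample is roughly normal.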
Often base R plot() returns a passable graphical representation. The rest of this section details a few options you can set to create different graphics. R has amazing graphical capability; in particular, ggplot2 is a great package to use for plotting and graphics.
plot(tips$total_bill, tips$percent)
The same plot with some bells and whistles. This is included to show the capabilities of base R graphics, but I would strongly recommend using ggplot2 instead if you want to make serious, involved graphics.
plot(tips$total_bill, tips$percent,
     main = "Total Bill v. Percent Tip", # give plot a title
     ylab = "Percent", # label the y-axis
     xlab = "Total Bill", # label the x-axis
     pch = 16, # change the type of plot point
     col = "red", # set the color of plot point
     lwd = 2, # set the line width
     xlim = c(0,60), # change limits of x-axis
     ylim = c(0,50)) # change the limits of y-axis
A simple linear regression
model <- lm(percent ~ total_bill, data = tips) # save your regression as an object
model # show modelling results
summary(model) # show more detailed results
# plotting the results
plot(tips$total_bill, tips$percent,
     main = "Total Bill v. Percent Tip", # give plot a title
     ylab = "Percent", # label the y-axis
     xlab = "Total Bill", # label the x-axis
     pch = 16, # change the type of plot point
     col = "red", # set the color of plot point
     lwd = 2, # set the line width
     xlim = c(0,60), # change limits of x-axis
     ylim = c(0,50)) # change the limits of y-axis
abline(model) # add best fit line
The most important guideline in writing code is to keep it simple.
Hadley’s R style guide is excellent and valuable for writing readable, meaningful, and sharable code in R. We will not enforce adherence to these guidelines, but it is definitely worth reading through to understand their techniques and reasoning. Google’s R style guide is a bit more comprehensive but also older; there are no bad ideas in here, but a few outdated or possibly internal quirks that I disagree with.
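As one illustration of that advice, here is a short sketch written with descriptive names, spaces around operators, and one step per line. It reuses the lm() workflow from earlier, but on the cars data set that ships with R so it runs without any course files; the object names are my own choices, not anything the style guides prescribe.

```r
# Fit stopping distance as a linear function of speed
stopping_model <- lm(dist ~ speed, data = cars)

# Pull out the fitted intercept and slope
model_coefs <- coef(stopping_model)

# Predict stopping distance at two new speeds
new_speeds <- data.frame(speed = c(10, 20))
predicted_dist <- predict(stopping_model, newdata = new_speeds)

model_coefs
predicted_dist
```

Each line does one thing and each name says what it holds, which makes the analysis easier to reread later.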
You can perform basic math, vector algebra, etc. using R. In fact, these basic commands are the building blocks of many of the sophisticated methods you will learn later in the course.
R is all about functions. There are many built-in functions and you can even define your own. Here are some familiar ones:
1 + 1
exp(2)
pi
log(3) # this is the NATURAL log not base 10.
cos(2)
Here’s a simple function definition:
square <- function(x) {
  return(x^2)
}
square(12)
## [1] 144
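Functions can take several arguments, and arguments can have default values. The sketch below defines a hypothetical helper, percent_of(), purely for illustration. It also relies on the fact that a function returns the value of its last expression, so an explicit return() is optional.

```r
# part / whole as a percentage, rounded to `digits` decimal places
percent_of <- function(part, whole, digits = 1) {
  round(100 * part / whole, digits)
}

single <- percent_of(3, 20)                         # uses the default digits = 1
several <- percent_of(c(1, 2), c(3, 3), digits = 2) # works on vectors too

single
several
```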
Indexing into a data frame can be tricky at first. Recall the tips data set.
head(tips)
##   total_bill  tip    sex smoker day   time size   percent
## 1      16.99 1.01 Female     No Sun Dinner    2  5.944673
## 2      10.34 1.66   Male     No Sun Dinner    3 16.054159
## 3      21.01 3.50   Male     No Sun Dinner    3 16.658734
## 4      23.68 3.31   Male     No Sun Dinner    2 13.978041
## 5      24.59 3.61 Female     No Sun Dinner    4 14.680765
## 6      25.29 4.71   Male     No Sun Dinner    4 18.623962
Then, we can get a column, if we know the name of the column, using the $ operator:
head(tips$percent, 2)
## [1]  5.944673 16.054159
Data frames are matrix-like in R. You can access a particular cell using location indices:
# 4th row, 8th column ('percent')
tips[4, 8]
## [1] 13.97804
To get a full row or column, leave the other index blank:
tips[4,] # 4th row/observation
##   total_bill  tip  sex smoker day   time size  percent
## 4      23.68 3.31 Male     No Sun Dinner    2 13.97804
head(tips[,8], 2) # 8th column
## [1]  5.944673 16.054159
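Beyond numeric positions, the same brackets accept logical conditions and names, which is usually how rows get filtered in practice. The sketch below uses the built-in mtcars data set rather than tips so that it runs on its own; the threshold of 30 miles per gallon is an arbitrary choice.

```r
# Rows where mpg exceeds 30, keeping only two named columns
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
efficient

# A single cell, addressed by row name and column name
fiat_mpg <- mtcars["Fiat 128", "mpg"]
fiat_mpg
```

The condition mtcars$mpg > 30 produces one TRUE or FALSE per row, and only the TRUE rows are kept.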
We can assign values to variables in our workspace.
x <- 1 # assign a value to x
x # print the value of x
y <- pi
z <- -10
ls() # see what variables are stored in your workspace
rm(x) # remove x from your workspace
# x
# what will happen
ls()
rm(list = ls()) # remove everything in your workspace (very handy trick)
# y
# what will happen
The c() function “combines” values into a vector. All elements of a vector share one class, so mixed inputs are converted to a common one. We use it to collect numbers into a vector.
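What “the same class” means in practice: when the inputs to c() differ, R converts them to a common class rather than raising an error. A quick sketch with made-up values:

```r
all_numbers <- c(1, 2.5, 3)        # numeric stays numeric
with_text <- c(1, "a", 3)          # one string turns everything into text
with_logical <- c(1, TRUE, FALSE)  # TRUE/FALSE become 1/0

class(all_numbers)
class(with_text)
class(with_logical)
with_logical
```

This is worth remembering when a column you expected to be numeric shows up as character after import: a single stray text value is enough to cause it.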
x <- c(1,2,3,4,5) # variables can store collections of numbers
y <- 11:15 # use ":" as a quick way to write sequence of numbers
z <- c(x,y) # glue two vectors together
length(x) # find the length of x
sum(x) # find the sum of elements in x
max(x) # find the maximum value, ...
min(x)
mean(x)
sd(x)
summary(x)
x[1] <- 100 # change the value at a location
x
y[c(2,3)] <- c(1,1) # change the value at multiple locations
y
z <- x+y # math is done "component wise"
z
x*y # element by element multiplication
sqrt(x) # even "scalar" functions operate on vectors
You should store your files under some form of version control. Doing so allows you to view your file history and restore your files in case your hard drive dies or you make a mistake.
If you are familiar with Git, I would recommend creating a private repo on GitHub or Bitbucket. Otherwise, it’s perfectly reasonable to put your working directory in your Dropbox account.
For further study, we highly recommend the ‘tidyverse’, a collection of R packages and paradigms that encourage the use of tidy data. For an introduction, check out this vignette.
Hadley’s new book, R for Data Science is a phenomenal resource for understanding the modern data science workflow.
Airbnb uses R extensively internally and wrote about that experience.
Some more R references: