R is a programming language designed for statistical computing. It is built for working with data and has an extensive collection of libraries for this purpose. A large part of the value of any programming language is its existing collection of libraries.
I personally find R's programming syntax to be a bit strange, but I expect that we will learn to love it together.
R is often run in interactive mode. This means that there is an interpreter running in a loop; it reads your command, executes it, and then displays the output. This is much like a calculator.
In particular we can use R as a calculator:
sin(3)
## [1] 0.14112
2^32
## [1] 4294967296
Of course there are numeric types (integers, floating point numbers).
3L; typeof(3L);
## [1] 3
## [1] "integer"
3; typeof(3)
## [1] 3
## [1] "double"
Here I executed multiple commands by separating them with a semicolon.
There are character/string types for storing text:
"this is text"; typeof("this is text")
## [1] "this is text"
## [1] "character"
Single and double quotes are interpreted the same:
'this' == "this"
## [1] TRUE
But having both is convenient when you need text that contains quotation marks:
'these are "quotes"'
## [1] "these are \"quotes\""
Vectors are used for lists of elements of the same type:
x <- c(1,3,5); x; typeof(x); x[2]+1
## [1] 1 3 5
## [1] "double"
## [1] 4
Note that we stored the value of the vector in a variable x via the assignment operator <- . One can also use the = operator for assignment, but this can only be used in a simple assignment expression and may not be compatible with old versions of S-plus. I generally consider restricting assignment to simple expressions to be a good thing.
We also accessed the second element of our vector by x[2]. Note that most programming languages start the indexing at 0, but not R. It starts at 1. This is probably more intuitive for most people, but confusing to those who have programmed with other languages.
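A few more indexing patterns are worth seeing at this point. The following is only a small sketch (the vector name v is made up for illustration): it shows the first and last elements, a negative index, which drops an element rather than counting from the end, and an out-of-range index, which gives NA instead of an error.

```r
v <- c(10, 20, 30, 40)

first <- v[1]            # indexing starts at 1
last  <- v[length(v)]    # the last element
rest  <- v[-1]           # a negative index drops that position
pair  <- v[c(2, 4)]      # index with a vector to pick several elements
oob   <- v[10]           # out of range: NA, not an error

print(first); print(last); print(rest); print(pair); print(oob)
```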
Vectors of mixed types are coerced into one type:
y <- c(1,"a"); y; typeof(y)
## [1] "1" "a"
## [1] "character"
We can see that R decided that I did not really want to include the number 1 in my vector and turned it into a string. This is both helpful and confusing. It can lead to unexpected errors:
y[1]+1
## Error in y[1] + 1: non-numeric argument to binary operator
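If the numbers are needed back, one option is to convert explicitly. Here is a minimal sketch, assuming a made-up vector called mixed: as.numeric() converts strings that look like numbers and turns everything else into NA (with a warning, which is suppressed here). The last three lines illustrate the order in which R coerces: logical, then integer, then double, then character.

```r
mixed <- c(1, "a")

# Strings that look like numbers are converted; the rest become NA.
nums <- suppressWarnings(as.numeric(mixed))
print(nums)
print(nums[1] + 1)

# Mixing types always moves toward the more general type.
print(typeof(c(TRUE, 2L)))    # logical + integer
print(typeof(c(2L, 3.5)))     # integer + double
print(typeof(c(3.5, "x")))    # double + character
```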
We can also generate vector sequences easily:
seq1 <- 1:5; seq1; typeof(seq1)
## [1] 1 2 3 4 5
## [1] "integer"
seq2 <- seq(0,10,2); seq2
## [1] 0 2 4 6 8 10
To put mixed data types together we can create a list:
list1 <- list(1, 2, "three", c(1,2,3)); list1; typeof(list1)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] "three"
##
## [[4]]
## [1] 1 2 3
## [1] "list"
We can access the elements as with vectors:
list1[3]; typeof(list1[3])
## [[1]]
## [1] "three"
## [1] "list"
Note that indexing with single brackets still returns a list, so arithmetic on the result fails:
list1[2]+2
## Error in list1[2] + 2: non-numeric argument to binary operator
Instead we need:
list1[[2]]+2
## [1] 4
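The same distinction shows up with named lists, where $ is a common shorthand. This is just a sketch with a made-up list called person: single brackets give back a list of length one, while double brackets and $ give the element itself.

```r
person <- list(name = "Martha", age = 18, scores = c(1, 2, 3))

sub  <- person["age"]      # single brackets: still a list
val  <- person[["age"]]    # double brackets: the element itself
val2 <- person$age         # $ is shorthand for [[ ]] with a name

print(typeof(sub)); print(typeof(val))
print(val + 2)
print(person$scores[2])    # index into the vector stored in the list
```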
We can build a matrix from a vector (the entries are filled in column by column):
M1 = matrix( 1:20, nrow = 4, ncol = 5); M1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
Transpose operation:
t(M1)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
Matrix multiplication is done as follows:
t(M1)%*%M1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 30 70 110 150 190
## [2,] 70 174 278 382 486
## [3,] 110 278 446 614 782
## [4,] 150 382 614 846 1078
## [5,] 190 486 782 1078 1374
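To keep track of the shapes involved, here is a smaller sketch with a made-up 2 x 3 matrix A. It assumes nothing beyond base R: crossprod(A) is a built-in that computes the same product as t(A) %*% A, and A * A is the componentwise product.

```r
A <- matrix(1:6, nrow = 2)    # 2 x 3, filled column by column

B <- t(A) %*% A               # (3 x 2) times (2 x 3) gives 3 x 3
C <- A %*% t(A)               # (2 x 3) times (3 x 2) gives 2 x 2

print(dim(B)); print(dim(C))
print(C)
print(all.equal(B, crossprod(A)))   # crossprod(A) computes t(A) %*% A
print(A * A)                        # componentwise, same shape as A
```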
The standard multiplication is componentwise:
1:5*2:6; 1:5 %*% 2:6
## [1] 2 6 12 20 30
## [,1]
## [1,] 70
Note that R tries to be helpful and changes the vectors into row/column vectors to make sense of the multiplication. Multiplying a vector by a single number also works:
1:5*3
## [1] 3 6 9 12 15
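Multiplying by a single number works because R recycles the shorter vector until it matches the longer one. A short sketch (the names a, scaled, and alternating are made up) shows this with a vector of length two as well, and checks that the matrix product of two vectors is the sum of the componentwise products.

```r
a <- 1:6

scaled      <- a * 2          # the 2 is reused for every element
alternating <- a * c(1, 10)   # c(1, 10) is reused three times

print(scaled)
print(alternating)

# The inner product is the sum of the componentwise product.
print(sum(1:5 * 2:6))
print(drop(1:5 %*% 2:6))      # drop() turns the 1 x 1 matrix into a number
```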
We can also create arrays, which are essentially tensors:
T1 <- array(1:24, dim = c(2,3,4)); T1
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
We can access entries in arrays/matrices in the expected way:
M1[[2,3]]; T1[[2,1,3]];
## [1] 10
## [1] 14
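Leaving an index empty selects everything along that dimension, which is how whole rows, columns, and slices are extracted. The sketch below rebuilds the same matrix and array under the made-up names M and Tn so that it runs on its own. Note that a single row comes back as a plain vector unless drop = FALSE is given.

```r
M  <- matrix(1:20, nrow = 4, ncol = 5)
Tn <- array(1:24, dim = c(2, 3, 4))

row2     <- M[2, ]                  # second row, as a vector
col3     <- M[, 3]                  # third column, as a vector
block    <- M[2:3, c(1, 5)]         # rows 2-3, columns 1 and 5
row2_mat <- M[2, , drop = FALSE]    # keep it as a 1 x 5 matrix
slab     <- Tn[, , 3]               # the third 2 x 3 slice of the array

print(row2); print(col3); print(block); print(dim(row2_mat)); print(slab)
```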
In data analysis we often have categorical values, such as labels that take one of a few fixed values. These correspond to factors in R.
genders <- c('male', 'female', 'female', 'male'); factor(genders);
## [1] male female female male
## Levels: female male
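Under the hood a factor stores integer codes together with a table of level names. A small sketch (genders_f is a made-up name) makes this visible and shows table(), which counts how often each level occurs.

```r
genders_f <- factor(c("male", "female", "female", "male"))

lv     <- levels(genders_f)       # level names, sorted alphabetically by default
codes  <- as.integer(genders_f)   # the integer code of each entry
counts <- table(genders_f)        # number of entries per level

print(lv); print(codes); print(counts)
```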
One of the most common structures in data science is the data frame. This can be thought of as a large table, like a spreadsheet, in which all of the entries within a column must be of the same type (different columns may have different types).
participants <- data.frame(
gender = c("male", "male", "female", "male"),
first_name = c("John", "Moritz", "Martha", "Bernd"),
age = c(19, 20, 18, 30)
);
print(participants)
## gender first_name age
## 1 male John 19
## 2 male Moritz 20
## 3 female Martha 18
## 4 male Bernd 30
We can access columns as follows:
participants$first_name
## [1] John Moritz Martha Bernd
## Levels: Bernd John Martha Moritz
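Note that R tried to be helpful and coerced all of the strings into factors. This was the default before R 4.0.0; from R 4.0.0 onward, stringsAsFactors defaults to FALSE and strings stay as character vectors. Either way, factors are not desirable for the names, so we set the option explicitly: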
participants <- data.frame(
gender = c("male", "male", "female", "male"),
first_name = c("John", "Moritz", "Martha", "Bernd"),
age = c(19, 20, 18, 30), stringsAsFactors = FALSE
);
print(participants)
## gender first_name age
## 1 male John 19
## 2 male Moritz 20
## 3 female Martha 18
## 4 male Bernd 30
We do want gender to be a factor:
participants$gender = factor(participants$gender); participants
## gender first_name age
## 1 male John 19
## 2 male Moritz 20
## 3 female Martha 18
## 4 male Bernd 30
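The printed table looks the same before and after the conversion, so it helps to check the column types directly. The sketch below rebuilds the table under the made-up name people so that it runs on its own, then uses sapply(people, class) and str() to inspect it.

```r
people <- data.frame(
  gender = c("male", "male", "female", "male"),
  first_name = c("John", "Moritz", "Martha", "Bernd"),
  age = c(19, 20, 18, 30),
  stringsAsFactors = FALSE
)
people$gender <- factor(people$gender)

col_classes <- sapply(people, class)   # one class name per column
print(col_classes)
str(people)                            # compact overview of the structure
```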
We can get columns and rows:
participants[1:2]; participants[1:2,]
## gender first_name
## 1 male John
## 2 male Moritz
## 3 female Martha
## 4 male Bernd
## gender first_name age
## 1 male John 19
## 2 male Moritz 20
We can also select from the frame:
subset(participants, gender == "male" & age < 25)
## gender first_name age
## 1 male John 19
## 2 male Moritz 20
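The same selection can be written with a logical vector inside the brackets, which is roughly what subset() does for us. This is a sketch on a standalone copy of the table (the names people, keep, by_index, and by_subset are made up); the last line also restricts the columns.

```r
people <- data.frame(
  gender = c("male", "male", "female", "male"),
  first_name = c("John", "Moritz", "Martha", "Bernd"),
  age = c(19, 20, 18, 30),
  stringsAsFactors = FALSE
)

keep <- people$gender == "male" & people$age < 25   # one TRUE/FALSE per row
print(keep)

by_index  <- people[keep, ]                               # rows where keep is TRUE
by_subset <- subset(people, gender == "male" & age < 25)  # same rows via subset()

print(by_index)
print(all.equal(by_index, by_subset))
print(people[keep, c("first_name", "age")])               # rows and columns at once
```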
R has precisely 3 gazillion stats functions. Let's play with some standard ones.
Here let's grab 20 samples from a standard normal distribution:
y <- rnorm(20); y
## [1] -0.14934151 -0.65468880 -0.07507725 1.26377024 0.73371480
## [6] 1.12150385 -0.91064302 -0.26950052 -0.49669479 0.33415133
## [11] 1.63102991 0.12585310 -0.42296972 -0.64616378 0.30149956
## [16] -0.09799198 1.44958537 1.98008501 0.53238166 0.83266742
Similarly from the uniform distribution (on the interval [0, 1]):
x <- runif(20); x
## [1] 0.55227375 0.17454460 0.84282352 0.85941586 0.09597788 0.51938783
## [7] 0.15086052 0.57803668 0.83860268 0.80295214 0.99171717 0.66124529
## [13] 0.19920251 0.64923401 0.02518160 0.13439947 0.50767363 0.45769632
## [19] 0.14298703 0.03797556
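These numbers will differ on every run. If reproducible output is wanted, one common approach is to fix the seed of the random number generator first. Here is a sketch (the seed value 42 and the variable names are arbitrary choices), which also shows the optional parameters of runif and rnorm.

```r
set.seed(42)
a1 <- rnorm(5)
set.seed(42)
a2 <- rnorm(5)
print(identical(a1, a2))     # same seed, same samples

u <- runif(1000, min = 2, max = 5)     # uniform on [2, 5]
z <- rnorm(1000, mean = 10, sd = 2)    # normal with mean 10 and sd 2

print(range(u))
print(mean(z)); print(sd(z))           # close to 10 and 2, but not exact
```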
We can then plot this:
plot(x,y)
Or if we want to get fancy (and we do), we can use ggplot2:
library(ggplot2);
df <- data.frame( xvals = x, yvals = y);
ggplot(df, aes(x = xvals, y = yvals)) + geom_point();
Or the interactive version (just for this type of webpage):
library(plotly);
df <- data.frame( xvals = x, yvals = y);
pl <- ggplot(df, aes(x = xvals, y = yvals)) + geom_point();
ggplotly(pl)
Okay let's get fancy. Let's plot a two variable standard Gaussian.
x = seq(-3,3, length = 20)
y = x
f = function(x,y) { exp(-(x^2+y^2)/2)/(2*pi) }
g = outer(x,y,f)
contour(x,y,g)
persp(x,y,g, theta = 30, phi = 10)
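As a sanity check on the formula, the density of two independent standard normals is the product of the one-dimensional densities, so the built-in dnorm gives an independent way to compute the same surface. The sketch below uses the normalizing constant 1/(2*pi) and made-up names (gx, gy, gauss2d) so as not to overwrite the variables above; it also approximates the total volume under the surface, which should be close to 1.

```r
gx <- seq(-3, 3, length.out = 20)
gy <- gx

# Two-variable standard Gaussian density with normalizing constant 1/(2*pi).
gauss2d <- function(x, y) exp(-(x^2 + y^2) / 2) / (2 * pi)

g1 <- outer(gx, gy, gauss2d)
g2 <- outer(dnorm(gx), dnorm(gy))   # product of the two 1-d densities
print(all.equal(g1, g2))

# Riemann sum on a finer, wider grid to approximate the total volume.
h     <- 0.05
fine  <- seq(-6, 6, by = h)
total <- sum(outer(fine, fine, gauss2d)) * h^2
print(total)
```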
Let's load our library of datasets and examine one. First let's see the top of the table.
library(ISLR)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Now the bottom:
tail(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 392 27 4 151 90 2950 17.3 82 1
## 393 27 4 140 86 2790 15.6 82 1
## 394 44 4 97 52 2130 24.6 82 2
## 395 32 4 135 84 2295 11.6 82 1
## 396 28 4 120 79 2625 18.6 82 1
## 397 31 4 119 82 2720 19.4 82 1
## name
## 392 chevrolet camaro
## 393 ford mustang gl
## 394 vw pickup
## 395 dodge rampage
## 396 ford ranger
## 397 chevy s-10
Now a summary:
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
The basic shape:
dim(Auto)
## [1] 392 9
The column and row names:
colnames(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
rownames(Auto)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
## [12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
## [23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "34"
## [34] "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [45] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [56] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67"
## [67] "68" "69" "70" "71" "72" "73" "74" "75" "76" "77" "78"
## [78] "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
## [89] "90" "91" "92" "93" "94" "95" "96" "97" "98" "99" "100"
## [100] "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111"
## [111] "112" "113" "114" "115" "116" "117" "118" "119" "120" "121" "122"
## [122] "123" "124" "125" "126" "128" "129" "130" "131" "132" "133" "134"
## [133] "135" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145"
## [144] "146" "147" "148" "149" "150" "151" "152" "153" "154" "155" "156"
## [155] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167"
## [166] "168" "169" "170" "171" "172" "173" "174" "175" "176" "177" "178"
## [177] "179" "180" "181" "182" "183" "184" "185" "186" "187" "188" "189"
## [188] "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200"
## [199] "201" "202" "203" "204" "205" "206" "207" "208" "209" "210" "211"
## [210] "212" "213" "214" "215" "216" "217" "218" "219" "220" "221" "222"
## [221] "223" "224" "225" "226" "227" "228" "229" "230" "231" "232" "233"
## [232] "234" "235" "236" "237" "238" "239" "240" "241" "242" "243" "244"
## [243] "245" "246" "247" "248" "249" "250" "251" "252" "253" "254" "255"
## [254] "256" "257" "258" "259" "260" "261" "262" "263" "264" "265" "266"
## [265] "267" "268" "269" "270" "271" "272" "273" "274" "275" "276" "277"
## [276] "278" "279" "280" "281" "282" "283" "284" "285" "286" "287" "288"
## [287] "289" "290" "291" "292" "293" "294" "295" "296" "297" "298" "299"
## [298] "300" "301" "302" "303" "304" "305" "306" "307" "308" "309" "310"
## [309] "311" "312" "313" "314" "315" "316" "317" "318" "319" "320" "321"
## [320] "322" "323" "324" "325" "326" "327" "328" "329" "330" "332" "333"
## [331] "334" "335" "336" "338" "339" "340" "341" "342" "343" "344" "345"
## [342] "346" "347" "348" "349" "350" "351" "352" "353" "354" "356" "357"
## [353] "358" "359" "360" "361" "362" "363" "364" "365" "366" "367" "368"
## [364] "369" "370" "371" "372" "373" "374" "375" "376" "377" "378" "379"
## [375] "380" "381" "382" "383" "384" "385" "386" "387" "388" "389" "390"
## [386] "391" "392" "393" "394" "395" "396" "397"
Let's examine the interactions further:
plot(Auto$mpg, Auto$cylinders)
Hmm... it looks like cylinders is being interpreted numerically, even though certain values (such as 7) cannot happen. Let's fix this and try again:
Auto$cylinders = as.factor(Auto$cylinders)
plot(mpg ~ cylinders, data = Auto)
Let's see how ggplot handles this:
p <- ggplot(Auto, aes(x = cylinders, y = mpg)) + geom_boxplot()
ggplotly(p)
Now a histogram:
hist(Auto$mpg)
Now with ggplot:
p <- ggplot(Auto, aes(x = mpg))+geom_histogram(binwidth = 5)
ggplotly(p)
Check pairs:
pairs(Auto)
pairs(~ mpg + weight + horsepower, Auto)
Okay now the ggplot equivalent with GGally.
library(GGally)
ggpairs(Auto[,-9])
p<- ggpairs(Auto[,c("mpg","weight","horsepower")])
p
ggplotly(p)
## Warning: Can only have one: highlight
## Warning: Can only have one: highlight
Looks like there is a roughly quadratic relation between mpg and horsepower.
plot(Auto$horsepower, Auto$mpg)
lm.fit=lm(mpg ~ poly(horsepower,2), data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
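The coefficients above are hard to read off directly because poly() uses orthogonal polynomials by default. Here is a sketch of how one might compare this with a fit on the raw powers and make predictions. To keep it runnable without ISLR, it uses the built-in mtcars data as a stand-in (an assumption on my part; mtcars has mpg and hp columns), so the numbers will not match the Auto fit.

```r
# Stand-in data: built-in mtcars instead of Auto, with hp in place of horsepower.
fit_orth <- lm(mpg ~ poly(hp, 2), data = mtcars)               # orthogonal basis
fit_raw  <- lm(mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)   # hp and hp^2 directly

# Both describe the same quadratic curve, so the fitted values agree.
print(all.equal(fitted(fit_orth), fitted(fit_raw)))

# The raw coefficients are those of a + b*hp + c*hp^2.
print(coef(fit_raw))

# Predictions for new values go through predict() with a data frame.
new_cars <- data.frame(hp = c(100, 150, 200))
pred <- predict(fit_orth, newdata = new_cars)
print(pred)
```

The orthogonal basis changes only how the curve is parameterized, which is why the two fits agree while their coefficients look nothing alike.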