Introduction to R

R is a programming language designed for statistical computing and for working with data, and it has an extensive collection of libraries for this purpose. A large part of the value of any programming language is its existing collection of libraries.

I personally find R's programming syntax to be a bit strange, but I expect that we will learn to love it together.

R as a calculator

R is often run in interactive mode. This means that there is an interpreter running in a loop: it reads your command, executes it, and then displays the output. This is much like a calculator.

In particular we can use R as a calculator:

sin(3)
## [1] 0.14112
2^32
## [1] 4294967296
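
A few more operators are handy in calculator mode. The sketch below uses only base R; the particular numbers are my own picks for illustration. The last two lines show that floating point results are approximate, so equality is better checked with a tolerance.

```r
# Integer division and remainder
17 %/% 5
## [1] 3
17 %% 5
## [1] 2

# Floating point is approximate, so compare with a tolerance
sqrt(2)^2 == 2
## [1] FALSE
isTRUE(all.equal(sqrt(2)^2, 2))
## [1] TRUE
```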

Data types

Of course there are numeric types (integers and floating point numbers). The L suffix marks an integer literal; a plain number is stored as a double:

3L; typeof(3L);
## [1] 3
## [1] "integer"
3; typeof(3)
## [1] 3
## [1] "double"

Here I executed multiple commands by separating them with a semicolon.
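
The integer/double distinction shows up in arithmetic too. Here is a small sketch (base R only) of how the type of a result is decided:

```r
typeof(1L + 1L)        # integer plus integer stays integer
## [1] "integer"
typeof(1L + 1)         # mixing with a double promotes to double
## [1] "double"
typeof(5L / 2L)        # division always gives a double
## [1] "double"
.Machine$integer.max   # largest integer R can store
## [1] 2147483647
```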

There are character/string types for storing text:

"this is text"; typeof("this is text")
## [1] "this is text"
## [1] "character"

Single and double quotes are interpreted the same:

'this' == "this"
## [1] TRUE

But having both is convenient when you need text that contains quotation marks:

'these are "quotes"'
## [1] "these are \"quotes\""
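
The backslashes in the printed output are only escapes added when the string is displayed; they are not part of the string itself. A quick check with cat and nchar (the variable name s is just for this sketch):

```r
s <- 'these are "quotes"'
cat(s, "\n")   # cat writes the raw characters
## these are "quotes" 
nchar(s)       # 18 characters, no backslashes counted
## [1] 18
```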

Vectors

Vectors are used for lists of elements of the same type:

x <- c(1,3,5); x; typeof(x); x[2]+1
## [1] 1 3 5
## [1] "double"
## [1] 4

Note that we stored the value of the vector in a variable x via the assignment operator <- . One can also use = for assignment, but this can only be used in a simple assignment expression and may not be compatible with old versions of S-plus. I generally consider restricting assignment to simple expressions to be a good thing.

We also accessed the second element of our vector by x[2]. Note that many programming languages start indexing at 0, but not R: it starts at 1. This is probably more intuitive for most people, but confusing to those who have programmed in other languages.
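
Indexing can do more than pick out one element. A short sketch of the common forms, using the same vector as above:

```r
x <- c(1, 3, 5)
x[c(1, 3)]   # several positions at once
## [1] 1 5
x[-1]        # a negative index drops that position
## [1] 3 5
x[x > 2]     # a logical condition keeps the matching entries
## [1] 3 5
x[4]         # past the end gives NA rather than an error
## [1] NA
```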

Vectors of mixed types are coerced into one type:

y <- c(1,"a"); y; typeof(y)
## [1] "1" "a"
## [1] "character"

We can see that R decided that I did not really want to include the number 1 in my vector and turned it into a string. This is both helpful and confusing. It can lead to unexpected errors:

y[1]+1
## Error in y[1] + 1: non-numeric argument to binary operator
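
If the string really does hold a number, we can convert it back explicitly. A sketch with as.numeric; text that is not a number becomes NA (with a warning, suppressed here):

```r
y <- c(1, "a")
as.numeric(y[1]) + 1
## [1] 2
z <- suppressWarnings(as.numeric(y[2]))   # "a" is not a number
is.na(z)
## [1] TRUE
```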

We can also generate vector sequences easily:

seq1 <- 1:5; seq1; typeof(seq1)
## [1] 1 2 3 4 5
## [1] "integer"
seq2 <- seq(0,10,2); seq2
## [1]  0  2  4  6  8 10
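
A few more ways to build sequences, all from base R. The last two lines show a common pitfall: 1:0 counts down instead of giving an empty sequence, which is why seq_len is the safer choice in loops.

```r
seq(0, 1, length.out = 5)   # fix the number of points instead of the step
## [1] 0.00 0.25 0.50 0.75 1.00
rep(c(1, 2), times = 3)     # repeat a pattern
## [1] 1 2 1 2 1 2
seq_len(0)                  # empty sequence
## integer(0)
1:0                         # counts down instead of being empty
## [1] 1 0
```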

Lists

To put mixed data types together we can create a list:

list1 <- list(1, 2, "three", c(1,2,3)); list1; typeof(list1)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] "three"
## 
## [[4]]
## [1] 1 2 3
## 
## [1] "list"

We can access the elements as with vectors:

list1[3]; typeof(list1[3])
## [[1]]
## [1] "three"
## [1] "list"

Note that indexing with single brackets still returns a list, so arithmetic on the result fails:

list1[2]+2
## Error in list1[2] + 2: non-numeric argument to binary operator

Instead we need double brackets, which extract the element itself:

list1[[2]]+2
## [1] 4
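
List elements can also be given names, which is usually easier to read than positions. A minimal sketch; the list person and its fields are made up for illustration:

```r
person <- list(name = "Ada", scores = c(90, 85))
person$name              # same as person[["name"]]
## [1] "Ada"
person[["scores"]][2]    # extract the vector, then index into it
## [1] 85
names(person)
## [1] "name"   "scores"
```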

Matrices / Arrays

We can build a matrix from a vector (it is filled column by column):

M1 = matrix( 1:20, nrow = 4, ncol = 5); M1
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

Transpose operation:

t(M1)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

Matrix multiplication is done as follows:

t(M1)%*%M1
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   30   70  110  150  190
## [2,]   70  174  278  382  486
## [3,]  110  278  446  614  782
## [4,]  150  382  614  846 1078
## [5,]  190  486  782 1078 1374

The standard multiplication * is componentwise, while %*% applied to two vectors of the same length gives their inner product:

1:5*2:6; 1:5 %*% 2:6
## [1]  2  6 12 20 30
##      [,1]
## [1,]   70

Note that R tries to be helpful and changes the vectors into row/column vectors to make sense of the matrix multiplication. Multiplying a vector by a single number scales every entry:

1:5*3
## [1]  3  6  9 12 15
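
The scalar case is an instance of a general rule called recycling: when two vectors differ in length, the shorter one is reused from the start. A small sketch (the matrix M is just for illustration):

```r
1:6 * c(1, 10)   # the shorter vector is reused: 1, 10, 1, 10, ...
## [1]  1 20  3 40  5 60
M <- matrix(1:6, nrow = 2)
M * 2            # every entry is scaled
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12
```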

We can also create arrays, which are essentially tensors:

T1 <- array(1:24, dim = c(2,3,4)); T1
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24

We can access entries in arrays/matrices in the expected way:

M1[[2,3]]; T1[[2,1,3]];
## [1] 10
## [1] 14
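
Leaving an index empty selects everything along that dimension, which gives whole rows, columns, or sub-matrices. A sketch using the same M1 as above; drop = FALSE keeps a single row as a matrix instead of collapsing it to a vector:

```r
M1 <- matrix(1:20, nrow = 4, ncol = 5)
M1[2, ]                      # second row
## [1]  2  6 10 14 18
M1[, 3]                      # third column
## [1]  9 10 11 12
M1[1:2, c(1, 5)]             # a sub-matrix
##      [,1] [,2]
## [1,]    1   17
## [2,]    2   18
dim(M1[2, , drop = FALSE])   # keep the result as a 1 x 5 matrix
## [1] 1 5
```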

Factors

In data analysis we often have categorical (non-numeric) values, such as labels drawn from a fixed set. These correspond to factors in R:

genders <- c('male', 'female', 'female', 'male'); factor(genders);
## [1] male   female female male  
## Levels: female male
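
Under the hood a factor stores each value as an integer code pointing into its levels. A short sketch that peeks inside the factor built above:

```r
genders <- c('male', 'female', 'female', 'male')
f <- factor(genders)
levels(f)              # the distinct categories, sorted alphabetically
## [1] "female" "male"
as.integer(f)          # each value is an index into the levels
## [1] 2 1 1 2
table(f)[["female"]]   # table() counts the entries per level
## [1] 2
```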

Data Frames

One of the most common structures in data science is the data frame. This can be thought of as a large table, like a spreadsheet, but all of the entries within a column must be of the same type (different columns may have different types).

participants <- data.frame(
  gender = c("male", "male", "female", "male"),
  first_name = c("John", "Moritz", "Martha", "Bernd"),
  age = c(19, 20, 18, 30)
);
print(participants)
##   gender first_name age
## 1   male       John  19
## 2   male     Moritz  20
## 3 female     Martha  18
## 4   male      Bernd  30

We can access columns as follows:

participants$first_name
## [1] John   Moritz Martha Bernd 
## Levels: Bernd John Martha Moritz

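Note that R tried to be helpful and coerced all of the strings into factors (this was the default before R 4.0.0; newer versions keep strings as characters). This is not desirable for the names, so we turn the conversion off: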

participants <- data.frame(
  gender = c("male", "male", "female", "male"),
  first_name = c("John", "Moritz", "Martha", "Bernd"),
  age = c(19, 20, 18, 30), stringsAsFactors = FALSE
);
print(participants)
##   gender first_name age
## 1   male       John  19
## 2   male     Moritz  20
## 3 female     Martha  18
## 4   male      Bernd  30

We do want gender to be a factor:

participants$gender = factor(participants$gender); participants
##   gender first_name age
## 1   male       John  19
## 2   male     Moritz  20
## 3 female     Martha  18
## 4   male      Bernd  30

We can get columns and rows (a single index selects columns; an index before the comma selects rows):

participants[1:2]; participants[1:2,]
##   gender first_name
## 1   male       John
## 2   male     Moritz
## 3 female     Martha
## 4   male      Bernd
##   gender first_name age
## 1   male       John  19
## 2   male     Moritz  20

We can also select from the frame:

subset(participants, gender == "male" & age < 25)
##   gender first_name age
## 1   male       John  19
## 2   male     Moritz  20
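
The same selection can be written with plain bracket indexing, which makes it clearer what subset is doing: it builds a logical vector and keeps the rows where it is TRUE. A sketch that rebuilds the frame so it runs on its own (the name keep is mine):

```r
participants <- data.frame(
  gender = c("male", "male", "female", "male"),
  first_name = c("John", "Moritz", "Martha", "Bernd"),
  age = c(19, 20, 18, 30), stringsAsFactors = FALSE
)
keep <- participants$gender == "male" & participants$age < 25
keep
## [1]  TRUE  TRUE FALSE FALSE
participants[keep, c("first_name", "age")]   # rows by condition, columns by name
##   first_name age
## 1       John  19
## 2     Moritz  20
mean(participants$age[participants$gender == "male"])
## [1] 23
```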

Probability functions

R has precisely 3 gazillion stats functions. Let's play with some standard ones.

Here let's grab 20 samples from a standard normal distribution:

y <- rnorm(20); y
##  [1] -0.14934151 -0.65468880 -0.07507725  1.26377024  0.73371480
##  [6]  1.12150385 -0.91064302 -0.26950052 -0.49669479  0.33415133
## [11]  1.63102991  0.12585310 -0.42296972 -0.64616378  0.30149956
## [16] -0.09799198  1.44958537  1.98008501  0.53238166  0.83266742

Similarly from the uniform distribution (on the interval [0, 1]):

x <- runif(20); x
##  [1] 0.55227375 0.17454460 0.84282352 0.85941586 0.09597788 0.51938783
##  [7] 0.15086052 0.57803668 0.83860268 0.80295214 0.99171717 0.66124529
## [13] 0.19920251 0.64923401 0.02518160 0.13439947 0.50767363 0.45769632
## [19] 0.14298703 0.03797556
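
Two things are worth knowing here. First, the draws differ on every run unless we fix the seed. Second, each distribution comes as a family of functions: r for random draws, d for the density, p for the cumulative probability, and q for quantiles. A sketch for the normal distribution; the seed 42 is an arbitrary choice:

```r
set.seed(42)          # any fixed seed makes the draws repeatable
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)
## [1] TRUE

dnorm(0)              # density at 0, i.e. 1/sqrt(2*pi)
## [1] 0.3989423
pnorm(1.96)           # P(Z <= 1.96)
## [1] 0.9750021
qnorm(0.975)          # inverse of pnorm
## [1] 1.959964
```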

We can then plot this:

plot(x,y)

Or if we want to get fancy (and we do), we can use ggplot2:

library(ggplot2);
df <- data.frame( xvals = x, yvals = y);
ggplot(df, aes(x = xvals, y = yvals)) + geom_point();

Or the interactive version (just for this type of webpage):

library(plotly);
df <- data.frame( xvals = x, yvals = y);
pl <- ggplot(df, aes(x = xvals, y = yvals)) + geom_point();
ggplotly(pl)

Okay, let's get fancier and plot the density of a two-variable standard Gaussian.

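x = seq(-3,3, length.out = 20)
y = x
# density of two independent standard normals: exp(-(x^2+y^2)/2) / (2*pi)
f = function(x,y) { exp(-(x^2+y^2)/2)/(2*pi) }
g = outer(x,y,f)  # evaluate f at every grid point
contour(x,y,g)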

persp(x,y,g, theta = 30, phi = 10)
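
As a sanity check on the density of two independent standard normals, we can confirm that it equals the product of the one-dimensional densities and that it integrates to roughly 1. This is only a sketch: the name f2, the test point, and the grid over [-6, 6] with step 0.05 are my own choices.

```r
# Joint density of two independent standard normals
f2 <- function(x, y) exp(-(x^2 + y^2) / 2) / (2 * pi)

# It should agree with the product of the one-dimensional densities
all.equal(f2(0.3, -1.2), dnorm(0.3) * dnorm(-1.2))
## [1] TRUE

# Riemann sum on a fine grid over [-6, 6]^2
h <- 0.05
grid <- seq(-6, 6, by = h)
total <- sum(outer(grid, grid, f2)) * h^2
round(total, 4)
## [1] 1
```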

Playing with a data set

Let's load our library of datasets and examine one. First let's see the top of the table.

library(ISLR)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Now the bottom:

tail(Auto)
##     mpg cylinders displacement horsepower weight acceleration year origin
## 392  27         4          151         90   2950         17.3   82      1
## 393  27         4          140         86   2790         15.6   82      1
## 394  44         4           97         52   2130         24.6   82      2
## 395  32         4          135         84   2295         11.6   82      1
## 396  28         4          120         79   2625         18.6   82      1
## 397  31         4          119         82   2720         19.4   82      1
##                 name
## 392 chevrolet camaro
## 393  ford mustang gl
## 394        vw pickup
## 395    dodge rampage
## 396      ford ranger
## 397       chevy s-10

Now a summary:

summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365

The basic shape:

dim(Auto)
## [1] 392   9

Other parts:

colnames(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
rownames(Auto)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
##  [12] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22" 
##  [23] "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "34" 
##  [34] "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45" 
##  [45] "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56" 
##  [56] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67" 
##  [67] "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78" 
##  [78] "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88"  "89" 
##  [89] "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99"  "100"
## [100] "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111"
## [111] "112" "113" "114" "115" "116" "117" "118" "119" "120" "121" "122"
## [122] "123" "124" "125" "126" "128" "129" "130" "131" "132" "133" "134"
## [133] "135" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145"
## [144] "146" "147" "148" "149" "150" "151" "152" "153" "154" "155" "156"
## [155] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167"
## [166] "168" "169" "170" "171" "172" "173" "174" "175" "176" "177" "178"
## [177] "179" "180" "181" "182" "183" "184" "185" "186" "187" "188" "189"
## [188] "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200"
## [199] "201" "202" "203" "204" "205" "206" "207" "208" "209" "210" "211"
## [210] "212" "213" "214" "215" "216" "217" "218" "219" "220" "221" "222"
## [221] "223" "224" "225" "226" "227" "228" "229" "230" "231" "232" "233"
## [232] "234" "235" "236" "237" "238" "239" "240" "241" "242" "243" "244"
## [243] "245" "246" "247" "248" "249" "250" "251" "252" "253" "254" "255"
## [254] "256" "257" "258" "259" "260" "261" "262" "263" "264" "265" "266"
## [265] "267" "268" "269" "270" "271" "272" "273" "274" "275" "276" "277"
## [276] "278" "279" "280" "281" "282" "283" "284" "285" "286" "287" "288"
## [287] "289" "290" "291" "292" "293" "294" "295" "296" "297" "298" "299"
## [298] "300" "301" "302" "303" "304" "305" "306" "307" "308" "309" "310"
## [309] "311" "312" "313" "314" "315" "316" "317" "318" "319" "320" "321"
## [320] "322" "323" "324" "325" "326" "327" "328" "329" "330" "332" "333"
## [331] "334" "335" "336" "338" "339" "340" "341" "342" "343" "344" "345"
## [342] "346" "347" "348" "349" "350" "351" "352" "353" "354" "356" "357"
## [353] "358" "359" "360" "361" "362" "363" "364" "365" "366" "367" "368"
## [364] "369" "370" "371" "372" "373" "374" "375" "376" "377" "378" "379"
## [375] "380" "381" "382" "383" "384" "385" "386" "387" "388" "389" "390"
## [386] "391" "392" "393" "394" "395" "396" "397"
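
Looking closely, the row names above skip a few values (33, for example), which is why dim reports 392 rows while the labels run up to 397. Row names are labels, not positions, and they survive when rows are removed. The toy data frame below is hypothetical and only illustrates the mechanism; the commented line at the end applies the same idea to Auto.

```r
# Toy data frame with one incomplete row
toy <- data.frame(a = c(1, NA, 3, 4), b = c("w", "x", "y", "z"))
clean <- na.omit(toy)        # drops row 2 but keeps the old labels
rownames(clean)
## [1] "1" "3" "4"
setdiff(1:4, as.integer(rownames(clean)))   # which labels are missing
## [1] 2

# The same idea applied to Auto (needs the ISLR package loaded):
# setdiff(1:397, as.integer(rownames(Auto)))
```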

Let's examine the interactions further:

plot(Auto$mpg, Auto$cylinders)

Hmm... it looks like cylinders is being interpreted numerically, even though certain values (such as 7) cannot occur. Let's fix this and try again:

Auto$cylinders = as.factor(Auto$cylinders)
plot(Auto$mpg ~ Auto$cylinders, xlab = "cylinders", ylab = "mpg")
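
One pitfall after converting numbers to a factor: as.numeric on the factor returns the internal level codes, not the original numbers. To get the numbers back, go through as.character first. A sketch with made-up cylinder counts:

```r
cyl <- c(4, 8, 6, 4, 3)          # made-up values for illustration
f <- as.factor(cyl)
levels(f)
## [1] "3" "4" "6" "8"
as.numeric(f)                    # level codes, not the original numbers
## [1] 2 4 3 2 1
as.numeric(as.character(f))      # recovers the original numbers
## [1] 4 8 6 4 3
```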

Let's see how ggplot handles this:

p <- ggplot(Auto, aes(x = cylinders, y = mpg)) + geom_boxplot()
ggplotly(p)

Now a histogram:

hist(Auto$mpg)

Now with ggplot:

p <- ggplot(Auto, aes(x = mpg))+geom_histogram(binwidth = 5)
ggplotly(p)

Check pairs:

pairs(Auto)

pairs(~ mpg + weight + horsepower, Auto)

Okay, now the ggplot equivalent with GGally:

library(GGally)
ggpairs(Auto[,-9])

p<- ggpairs(Auto[,c("mpg","weight","horsepower")])
p

ggplotly(p)
## Warning: Can only have one: highlight

## Warning: Can only have one: highlight

Looks like there is a roughly quadratic relation between mpg and horsepower, so let's plot the two and fit a degree-2 polynomial:

plot(Auto$horsepower, Auto$mpg)

lm.fit=lm(mpg ~ poly(horsepower,2), data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
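
One thing to watch in the summary: poly() uses an orthogonal polynomial basis by default, so the two poly(horsepower, 2) coefficients are not the coefficients of horsepower and horsepower^2. The sketch below uses noise-free synthetic data (my assumption, not the Auto set) to show that raw = TRUE gives the plain coefficients, while the fitted curve is the same either way.

```r
# Noise-free synthetic data: y = 1 + 2x + 3x^2
x <- seq(-2, 2, length.out = 25)
y <- 1 + 2 * x + 3 * x^2
d <- data.frame(x = x, y = y)

fit_orth <- lm(y ~ poly(x, 2), data = d)              # orthogonal basis
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE), data = d)  # plain x and x^2

round(unname(coef(fit_raw)), 6)
## [1] 1 2 3
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))
## [1] TRUE

# Predictions at new points agree with the true curve
new <- data.frame(x = c(0, 1))
round(unname(predict(fit_orth, newdata = new)), 6)
## [1] 1 6
```

For the Auto fit, the analogous call would be lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto) if the plain coefficients are wanted.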