Although we could use the ISLR package, we will download the dataset into our local directory instead.
download.file("http://www-bcf.usc.edu/~gareth/ISL/College.csv", "College.csv")
Load our data frame from the file.
df <- read.csv("College.csv")
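Had we gone the package route, the equivalent (assuming the ISLR package is installed) would have been:
library(ISLR)
df <- College  # the packaged copy already has the college names as row names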
Let’s look at it:
head(df)
## X Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University Yes 1660 1232 721 23
## 2 Adelphi University Yes 2186 1924 512 16
## 3 Adrian College Yes 1428 1097 336 22
## 4 Agnes Scott College Yes 417 349 137 60
## 5 Alaska Pacific University Yes 193 146 55 16
## 6 Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1 52 2885 537 7440 3300 450 2200 70
## 2 29 2683 1227 12280 6450 750 1500 29
## 3 50 1036 99 11250 3750 400 1165 53
## 4 89 510 63 12960 5450 450 875 92
## 5 44 249 869 7560 4120 800 1500 76
## 6 62 678 41 13500 3335 500 675 67
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 78 18.1 12 7041 60
## 2 30 12.2 16 10527 56
## 3 66 12.9 30 8735 54
## 4 97 7.7 37 19016 59
## 5 72 11.9 2 10922 15
## 6 73 9.4 11 9727 55
summary(df)
## X Private Apps
## Abilene Christian University: 1 No :212 Min. : 81
## Adelphi University : 1 Yes:565 1st Qu.: 776
## Adrian College : 1 Median : 1558
## Agnes Scott College : 1 Mean : 3002
## Alaska Pacific University : 1 3rd Qu.: 3624
## Albertson College : 1 Max. :48094
## (Other) :771
## Accept Enroll Top10perc Top25perc
## Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
##
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
##
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
##
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
##
Uh oh. We can see that the college names have been converted into a factor column. This is not especially helpful. Let’s make them the row names instead.
rownames(df) <- df[,1]
head(df)
## X Private Apps
## Abilene Christian University Abilene Christian University Yes 1660
## Adelphi University Adelphi University Yes 2186
## Adrian College Adrian College Yes 1428
## Agnes Scott College Agnes Scott College Yes 417
## Alaska Pacific University Alaska Pacific University Yes 193
## Albertson College Albertson College Yes 587
## Accept Enroll Top10perc Top25perc F.Undergrad
## Abilene Christian University 1232 721 23 52 2885
## Adelphi University 1924 512 16 29 2683
## Adrian College 1097 336 22 50 1036
## Agnes Scott College 349 137 60 89 510
## Alaska Pacific University 146 55 16 44 249
## Albertson College 479 158 38 62 678
## P.Undergrad Outstate Room.Board Books
## Abilene Christian University 537 7440 3300 450
## Adelphi University 1227 12280 6450 750
## Adrian College 99 11250 3750 400
## Agnes Scott College 63 12960 5450 450
## Alaska Pacific University 869 7560 4120 800
## Albertson College 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni
## Abilene Christian University 2200 70 78 18.1 12
## Adelphi University 1500 29 30 12.2 16
## Adrian College 1165 53 66 12.9 30
## Agnes Scott College 875 92 97 7.7 37
## Alaska Pacific University 1500 76 72 11.9 2
## Albertson College 675 67 73 9.4 11
## Expend Grad.Rate
## Abilene Christian University 7041 60
## Adelphi University 10527 56
## Adrian College 8735 54
## Agnes Scott College 19016 59
## Alaska Pacific University 10922 15
## Albertson College 9727 55
summary(df)
## X Private Apps
## Abilene Christian University: 1 No :212 Min. : 81
## Adelphi University : 1 Yes:565 1st Qu.: 776
## Adrian College : 1 Median : 1558
## Agnes Scott College : 1 Mean : 3002
## Alaska Pacific University : 1 3rd Qu.: 3624
## Albertson College : 1 Max. :48094
## (Other) :771
## Accept Enroll Top10perc Top25perc
## Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
## 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1110 Median : 434 Median :23.00 Median : 54.0
## Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
## 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
##
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
##
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
##
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median :13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
##
We are supposed to use the fix command. Let’s see what that is.
?fix
Okay, that was not especially helpful. Let’s run it.
fix(df)
I see. This gives me an incredibly ugly data frame editor.
Okay, let’s get rid of the unnecessary column.
df <- df[,-1]
Note that the “-1” here means that we select all of the columns except the first one. This behavior differs from Python, where -1 typically refers to the last column and -2 to the second-to-last.
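As an aside, we could have avoided the factor detour entirely by letting read.csv assign the row names at load time:
df <- read.csv("College.csv", row.names = 1)  # column 1 becomes the row names and is dropped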
Let’s see this in our ugly editor again.
fix(df)
Okay, let’s look at our cleaned data frame:
summary(df)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Good, the unnecessary factor column is gone, and apparently some school has an awful graduation rate. Which one is it?
which.min(df$Grad.Rate)
## [1] 586
Who is the culprit?
df[586,]
## Private Apps Accept Enroll Top10perc Top25perc
## Texas Southern University No 4345 3245 2604 15 85
## F.Undergrad P.Undergrad Outstate Room.Board
## Texas Southern University 5584 3101 7860 3360
## Books Personal PhD Terminal S.F.Ratio
## Texas Southern University 600 1700 65 75 18.2
## perc.alumni Expend Grad.Rate
## Texas Southern University 21 3605 10
Wow, we should avoid Texas Southern University.
Now, let us look at a pairwise scatterplot of the first ten columns.
pairs(df[,1:10])
That was ugly. Let’s try this again.
library(GGally)
ggpairs(df[,1:10])
Now let us construct a boxplot:
library(ggplot2)
ggplot(df, aes(y = Outstate, x = Private)) + geom_boxplot()
Okay, let us construct a new Elite feature, which indicates whether more than half of a school’s students come from the top 10% of their high school class. First, populate the entries:
Elite = rep("No", nrow(df))
Mark the elite schools:
Elite[df$Top10perc > 50] = "Yes"
Convert the strings to a factor:
Elite = as.factor(Elite)
Add to our data frame.
df$Elite = Elite
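For what it is worth, the same feature can be built in a single line:
df$Elite <- factor(ifelse(df$Top10perc > 50, "Yes", "No"))  # equivalent one-liner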
colnames(df)
## [1] "Private" "Apps" "Accept" "Enroll" "Top10perc"
## [6] "Top25perc" "F.Undergrad" "P.Undergrad" "Outstate" "Room.Board"
## [11] "Books" "Personal" "PhD" "Terminal" "S.F.Ratio"
## [16] "perc.alumni" "Expend" "Grad.Rate" "Elite"
Okay, check the new summary:
summary(df)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :699
## 1st Qu.: 53.00 Yes: 78
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Looks good. Now let us check out the out-of-state numbers for elite schools.
ggplot(df, aes(x=Elite, y=Outstate)) + geom_boxplot()
Let us check this out side by side:
p1 <- ggplot(df, aes(x = Private, y = Outstate)) + geom_boxplot()
p2 <- ggplot(df, aes(x = Elite, y = Outstate)) + geom_boxplot()
library(ggpubr)
## Loading required package: magrittr
ggarrange(p1, p2)
Cool.
Let’s check out some more distributions.
p1 <- ggplot(df, aes(x = Room.Board)) + geom_histogram(bins = 15)
p2 <- ggplot(df, aes(x = PhD)) + geom_histogram(bins = 20)
p3 <- ggplot(df, aes(x = Grad.Rate)) + geom_histogram(bins = 10)
p4 <- ggplot(df, aes(x = Apps)) + geom_histogram(bins = 20)
ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Finally let’s check out how many applications Elite schools get:
ggplot(df, aes(x = Elite, y = Apps)) + geom_boxplot()
Wow, who is that outlier?
df[which.max(df$Apps),]
## Private Apps Accept Enroll Top10perc Top25perc
## Rutgers at New Brunswick No 48094 26330 4520 36 79
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Rutgers at New Brunswick 21401 3712 7410 4748 690
## Personal PhD Terminal S.F.Ratio perc.alumni
## Rutgers at New Brunswick 2009 90 95 19.5 19
## Expend Grad.Rate Elite
## Rutgers at New Brunswick 10474 77 No
What is going on there?
Let us clear our workspace.
rm(list = ls())
Okay, I have grown a little tired of R’s built-in idiosyncrasies. I am going to try to work in the tidyverse.
Let us load a spooky dataset: the training set from Kaggle’s Spooky Author Identification competition, with sentences by Edgar Allan Poe (EAP), H.P. Lovecraft (HPL), and Mary Shelley (MWS).
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
spooky <- read_csv("train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## text = col_character(),
## author = col_character()
## )
head(spooky)
## # A tibble: 6 x 3
## id
## <chr>
## 1 id26305
## 2 id17569
## 3 id11008
## 4 id27763
## 5 id12958
## 6 id22965
## # ... with 2 more variables: text <chr>, author <chr>
Okay, let us clean this up.
spooky = spooky[,-1]
spooky$author <- as.factor(spooky$author)
head(spooky)
## # A tibble: 6 x 2
## text
## <chr>
## 1 This process, however, afforded me no means of ascertaining the dimensions
## 2 It never once occurred to me that the fumbling might be a mere mistake.
## 3 In his left hand was a gold snuff box, from which, as he capered down the h
## 4 How lovely is spring As we looked from Windsor Terrace on the sixteen ferti
## 5 Finding nothing else, not even gold, the Superintendent abandoned his attem
## 6 A youth passed in solitude, my best years spent under your gentle and femin
## # ... with 1 more variables: author <fctr>
We are going to manipulate some text, so let us load stringr (the tidyverse installs it, but this version does not attach it automatically, so we load it explicitly).
library(stringr)
?stringr
Now let us check the use of punctuation.
spooky$commas = str_count(spooky$text, ",")
spooky$exclam = str_count(spooky$text, "!")
spooky$quest = str_count(spooky$text, "\\?")
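Note the escaped "\\?" in the last call: the question mark is a regex metacharacter, so we either escape it or pass fixed("?"). A quick sanity check:
str_count("What? Really?!", "\\?")  # counts the 2 question marks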
A summary:
summary(spooky)
## text author commas exclam
## Length:19579 EAP:7900 Min. : 0.000 Min. :0
## Class :character HPL:5635 1st Qu.: 1.000 1st Qu.:0
## Mode :character MWS:6044 Median : 1.000 Median :0
## Mean : 1.952 Mean :0
## 3rd Qu.: 3.000 3rd Qu.:0
## Max. :48.000 Max. :0
## quest
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.05608
## 3rd Qu.:0.00000
## Max. :4.00000
spooky %>% filter(commas == max(commas))
## # A tibble: 1 x 5
## text
## <chr>
## 1 "To chambers of painted state farewell To midnight revelry, and the panting
## # ... with 4 more variables: author <fctr>, commas <int>, exclam <int>,
## # quest <int>
Okay, so exclamation points are no help; the count is zero for every sentence:
spooky = spooky %>% select(-exclam)
ggpairs(spooky %>% select(-text))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
?Corpus
corp <- Corpus(VectorSource(spooky$text))
inspect(corp[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 231
##
## This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
Okay, let’s clean the data, starting with the punctuation:
corp <- tm_map(corp, removePunctuation)
inspect(corp[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 224
##
## This process however afforded me no means of ascertaining the dimensions of my dungeon as I might make its circuit and return to the point whence I set out without being aware of the fact so perfectly uniform seemed the wall
corp <- corp %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)
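A caveat: in newer versions of tm, plain base functions such as tolower should be wrapped so the documents keep their class, e.g.:
corp <- tm_map(corp, content_transformer(tolower))  # the safer form in recent tm releases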
inspect(corp[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 136
##
## process howev afford mean ascertain dimens dungeon might make circuit return point whenc set without awar fact perfect uniform seem wall
Create a document-term matrix:
dtm <- DocumentTermMatrix(corp)
inspect(dtm[1:5,1:4])
## <<DocumentTermMatrix (documents: 5, terms: 4)>>
## Non-/sparse entries: 4/16
## Sparsity : 80%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs afford ascertain awar circuit
## 1 1 1 1 1
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
dtmfr <- as_tibble(as.matrix(dtm))
head(dtmfr)
## # A tibble: 6 x 15,013
## afford ascertain awar circuit dimens dungeon fact howev make mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 1 1 1 1 1
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## # ... with 15003 more variables: might <dbl>, perfect <dbl>, point <dbl>,
## # process <dbl>, return <dbl>, seem <dbl>, set <dbl>, uniform <dbl>,
## # wall <dbl>, whenc <dbl>, without <dbl>, fumbl <dbl>, mere <dbl>,
## # mistak <dbl>, never <dbl>, occur <dbl>, air <dbl>, box <dbl>,
## # caper <dbl>, cut <dbl>, fantast <dbl>, gold <dbl>, greatest <dbl>,
## # hand <dbl>, hill <dbl>, incess <dbl>, left <dbl>, manner <dbl>,
## # possibl <dbl>, satisfact <dbl>, self <dbl>, snuff <dbl>, step <dbl>,
## # took <dbl>, beneath <dbl>, cheer <dbl>, cottag <dbl>, counti <dbl>,
## # fair <dbl>, fertil <dbl>, former <dbl>, happi <dbl>, heart <dbl>,
## # look <dbl>, love <dbl>, sixteen <dbl>, speckl <dbl>, spread <dbl>,
## # spring <dbl>, terrac <dbl>, town <dbl>, wealthier <dbl>,
## # windsor <dbl>, year <dbl>, abandon <dbl>, attempt <dbl>,
## # counten <dbl>, desk <dbl>, els <dbl>, even <dbl>, find <dbl>,
## # noth <dbl>, occasion <dbl>, perplex <dbl>, sit <dbl>, steal <dbl>,
## # superintend <dbl>, think <dbl>, abl <dbl>, believ <dbl>, best <dbl>,
## # board <dbl>, brutal <dbl>, charact <dbl>, crew <dbl>, distast <dbl>,
## # equal <dbl>, exercis <dbl>, felt <dbl>, feminin <dbl>, fortun <dbl>,
## # fosterag <dbl>, gentl <dbl>, groundwork <dbl>, heard <dbl>,
## # intens <dbl>, kindli <dbl>, marin <dbl>, necessari <dbl>, note <dbl>,
## # obedi <dbl>, overcom <dbl>, paid <dbl>, pass <dbl>, peculiar <dbl>,
## # refin <dbl>, respect <dbl>, secur <dbl>, servic <dbl>, ship <dbl>, ...
dtmfr$Author = spooky$author
Wow, using tibbles was much faster than data frames.
Now we will aggregate the counts by author and find the words whose totals vary the most among the three authors:
counts <- dtmfr %>%
  group_by(Author) %>%
  summarize_all(sum)
ord = order(apply(select(counts,-Author), 2, var), decreasing = TRUE)[1:10]
counts %>% select(-Author) %>% select(ord)
## # A tibble: 3 x 10
## upon love will thing old raymond say heart life thus
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1025 74 410 221 139 0 334 111 105 254
## 2 186 37 78 433 392 2 123 20 130 15
## 3 200 425 468 71 85 270 63 290 333 136
Just for fun, let us try a naive Bayes classifier. We need to shrink the data first to make this manageable, so we will keep only the 400 most frequent words. We will also set aside the last 5000 entries for validation.
library(e1071)
new_order = order(apply(select(counts, -Author), 2, sum), decreasing = TRUE)[1:400]
tiny = dtmfr %>% select(-Author) %>% select(new_order)
tiny$Author = dtmfr$Author
model <- naiveBayes(Author ~ ., data = tiny[1:14579,])
Now let us test it against the validation data. We will check the accuracy on the 5000 held-out entries and then look at a confusion matrix to see what kinds of errors our classifier makes.
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
test <- (tiny %>% select(-Author))[14580:19579,]
preds <- predict(model, newdata = test)
preds_p <- predict(model, type = "raw", newdata = test)
sum(preds == tiny$Author[14580:19579])/5000
## [1] 0.6094
MultiLogLoss(preds_p, tiny$Author[14580:19579])
## [1] 10.27673
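For reference, this metric is just the average negative log of the probability the model assigned to the true author (a sketch; it assumes the columns of preds_p follow the factor levels of Author, which is how naiveBayes orders them):
y_true <- tiny$Author[14580:19579]
mean(-log(pmax(preds_p[cbind(seq_along(y_true), as.integer(y_true))], 1e-15)))  # should roughly match MultiLogLoss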
table(preds, tiny$Author[14580:19579])
##
## preds EAP HPL MWS
## EAP 1388 353 475
## HPL 430 936 347
## MWS 207 141 723
First the good news: 60.94% accuracy is significantly better than random guessing (a hypothesis test, sketched below, confirms this). The bad news: our log loss of 10.28 is far worse than the public leaders’, which makes me think I have misread the rules.
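A one-sided binomial test against uniform guessing over the three authors (a sketch; the majority-class baseline of roughly 40% EAP would be a stricter yardstick):
binom.test(round(0.6094 * 5000), 5000, p = 1/3, alternative = "greater")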
Let’s see how a logistic classifier holds up:
library(nnet)
model_lc <- multinom(Author ~ ., data = tiny[1:14579,], MaxNWts = 3000)
## # weights: 1206 (802 variable)
## initial value 16016.668556
## iter 10 value 11628.629580
## iter 20 value 10908.090402
## iter 30 value 10747.013589
## iter 40 value 10732.296042
## iter 50 value 10731.489415
## iter 60 value 10731.442760
## iter 70 value 10731.423366
## iter 80 value 10731.370125
## iter 90 value 10731.313671
## iter 100 value 10731.282456
## final value 10731.282456
## stopped after 100 iterations
summary(model_lc)
## Call:
## multinom(formula = Author ~ ., data = tiny[1:14579, ], MaxNWts = 3000)
##
## Coefficients:
## (Intercept) one upon now will time
## HPL -0.6712898 -0.0396722 -1.565461 -0.1699132 -1.2480167 0.1551636
## MWS -0.6587432 -0.0432382 -1.491000 -0.1533031 0.1203872 0.2656451
## even man day eye thing yet
## HPL 0.02409388 0.2573360 -0.2120391 -0.2306069 0.9366177 -0.1411803
## MWS 0.07859653 0.2630459 0.1029332 0.1391803 -1.1900740 0.2761100
## said seem like might old first
## HPL -0.6503126 0.7321371 0.57697660 0.4901314 1.0595776 -0.24906272
## MWS -0.5292448 -0.2374424 0.05696853 0.7461682 -0.6056544 0.05686004
## night must thought look found never
## HPL 0.4088301 0.3170074 -0.03437579 0.3819927 0.1361783 0.1965915
## MWS 0.2500908 0.5577168 0.23892539 0.4990832 -0.2889225 -0.0718980
## life great made long love everi
## HPL 0.4165398 0.06024384 -0.45386350 -0.1964061 -0.811931 -0.4723819
## MWS 1.1035625 -0.59270974 -0.06573172 -0.3981582 1.145623 0.5123791
## littl still say place saw mani
## HPL -0.3738058 -0.17415286 -0.5407533 0.4546974 0.9294185 0.3154346
## MWS -0.3601869 0.01801602 -1.4454504 0.3867138 0.4076899 -0.2199853
## well appear hand hous came much
## HPL -0.6176318 -0.7933379 -0.04520392 0.6120965 0.8044274 -0.1326384
## MWS -0.8081276 0.1754485 0.21623109 0.1221886 0.3284930 -0.6715427
## see year fear may natur two
## HPL 0.60573058 0.4740895 1.079838 -0.2976259 -0.20575620 0.06686507
## MWS 0.02851436 0.2591825 1.176245 0.1484540 0.03262476 -0.41774080
## word come can death heart mind
## HPL -1.15131105 1.3019883 -0.2811540 -0.005998768 -1.6220977 0.1786874
## MWS -0.02200052 0.7266197 0.5350109 0.419867642 0.7573054 0.2976594
## feel know ever light thus near
## HPL -0.5990999 0.5947941 0.4000024 0.5262952 -2.2886393 -0.0936151
## MWS 0.8076846 -0.0279369 0.5149314 0.1825601 -0.4192177 0.2149737
## whose make friend without far open
## HPL 0.7221844 -0.08866683 -0.2114020 -0.1025191 -0.1576165 0.09228348
## MWS 0.5291968 -0.29114580 0.5917266 -0.3752697 -0.3539289 -0.31003197
## form shall heard men howev earth
## HPL 0.2096011 -0.6732518 0.8206809 1.3250012 -0.864004 0.07141642
## MWS 0.1804578 0.3638667 0.4538310 0.4608323 -1.268178 0.15506944
## last left part hour world head
## HPL 0.8573859 -0.04352952 0.2377415 -0.2499458 0.25876999 -0.1061357
## MWS 0.7242287 -0.29706192 0.7661523 -0.2100075 0.09082829 -0.3230781
## room pass felt door live dream
## HPL 0.2438951 -0.1611681 0.2715014 0.1705943 0.6405385 0.83524356
## MWS -0.4401720 0.6668573 0.5146259 -0.1952871 0.9271164 -0.07099548
## strang dark call way voic though
## HPL 1.3096092 0.43099795 -0.001652164 0.1088651 -0.09962387 3.522883
## MWS 0.7179855 -0.01780144 -0.400680643 -0.9401002 -0.11207445 2.059099
## moment inde human turn return hope
## HPL -0.2605924 0.07223905 0.1299142 0.1859901 0.0340694 -0.2919771
## MWS 0.0821032 -0.53463790 0.7341542 -0.2008771 0.7957967 1.3833473
## back toward let within noth whole
## HPL 0.1096008 1.277124 -0.9742786 -0.7603882 -0.2986702 -0.8508702
## MWS -0.5754191 1.694123 -0.2711971 -0.9969182 -0.3273708 -0.6435991
## good point becam away take mean
## HPL -0.8795458 -0.7112956 0.3706576 0.8575531 -0.06503354 -1.0335017
## MWS -0.1445913 -1.0602374 0.4867340 0.4784960 -0.15815571 -0.6046746
## certain die father sound find bodi seen
## HPL 1.129344 0.1090338 1.194651 0.69951981 0.3716554 0.11985 -0.07356314
## MWS -1.351921 0.6834364 2.178543 0.06116842 0.0582771 -1.32697 -1.00748984
## present face among length person knew
## HPL -0.9410518 1.0099214 -0.5135681 -1.069182 -0.8454418 0.8861121
## MWS 0.1450428 0.1308696 0.3353719 -0.764736 -0.7739436 -0.1490288
## remain raymond soon close three sea
## HPL 0.3134632 6.906738 0.5633482 0.05246229 0.005057285 0.6627028
## MWS 0.7168372 11.076235 0.7869682 -0.19087508 -1.198569913 0.6498779
## water citi spirit new chang half
## HPL -0.3837666 0.8105671 -1.4920659 0.57605564 1.174704 0.3545811
## MWS -0.4935173 -0.2141342 0.6934554 0.01164793 1.922898 -0.1628590
## beauti street beyond although end think
## HPL -0.4685792 1.3207557 0.6800065 -1.6446716 -0.4742255 0.3765827
## MWS -0.1571525 -0.2503289 -0.9972657 -0.7657746 -0.1468291 -0.1019803
## power soul alon observ air idea
## HPL -0.4780959 -0.09590163 0.4529120 -1.5547068 -0.7490732 -1.12381902
## MWS 0.8549776 0.27762584 0.5317043 -0.6370111 0.1183782 -0.03491193
## object just anoth less wall kind
## HPL -0.01612585 -0.6541325 0.8924173 0.03232287 0.3386471 0.7891610
## MWS -0.22598142 -1.3613651 1.0247239 -0.22155207 -0.4681757 0.5493638
## continu sinc name high happi gave
## HPL -0.7222088 0.4903320 -0.09891292 0.2877814 -0.4557131 0.1452382
## MWS -0.1572161 -0.5402618 0.14330979 0.1218856 1.4002225 0.2633416
## window enter follow horror express imagin
## HPL 0.3894553 -0.2810855 0.8145551 0.43897747 -0.08003865 0.22986708
## MWS -0.8365273 1.1879456 0.6992749 0.04450259 0.52215562 0.01444484
## almost mere small wish black matter
## HPL 1.4292878 -0.580112 0.4884050 0.6140826 0.6670094 -1.163222
## MWS 0.5381814 -1.313922 0.1607619 0.9273372 -0.9710789 -1.714175
## young took around full quit wonder
## HPL 0.35091769 -0.2178550 0.54838080 -0.003955085 -0.51211916 0.0104690
## MWS 0.05628391 -0.1090416 0.05436416 -0.659625667 0.09428088 -0.2204143
## side dead lay someth cours exist
## HPL 0.1384926 0.3959645 -0.1516873 1.0168633 -0.5598006 -0.1994335
## MWS 0.0536306 0.1384244 0.1066368 -0.1301605 -0.7940545 0.3256657
## tell work went told believ sudden
## HPL 0.3620058 0.5872606 0.0695603 1.5572630 -0.1992236 0.39132833
## MWS -0.4577638 0.5236975 0.2080567 0.2536619 0.1482011 0.02157818
## give becom scene leav reason arm
## HPL -0.1977696 0.1393761 0.2461689 0.01960883 -0.8957183 -0.52986715
## MWS 0.2151664 0.7871082 1.0032416 0.89345926 -0.9064348 0.02595853
## god caus also manner perhap speak
## HPL 0.41571152 0.2793080 -0.6464642 -2.2141128 0.09602863 -0.2675078
## MWS 0.04455414 0.8693414 0.2791693 -0.7044164 -0.19791120 -0.3555598
## general wind possibl tree right state
## HPL -0.6161065 0.1438807 0.4478163 -0.10264907 -0.07298696 -0.7188274
## MWS -1.3701622 0.6992059 -1.0326562 0.07388377 -0.52179702 0.2683152
## reach sens direct wild put deep
## HPL 0.112169616 -0.008102586 -0.5633598 0.2401963 -0.3851525 -0.05150622
## MWS 0.001368847 -0.770182657 -0.9036619 0.3057393 -0.3830607 0.15939980
## care question set possess utter sight
## HPL 0.1291901 0.19885747 0.7065219 -0.3422381 0.00708707 1.2033479
## MWS 0.3153708 -0.05669003 0.2914727 0.2569324 -0.22346312 0.3242558
## feet use sever evid doubt immedi
## HPL -0.5738387 0.74984691 0.02634754 -0.7730532 -0.6794186 -1.328802
## MWS -0.8882131 -0.06306273 0.78746824 -1.4974631 -1.1343181 -0.997358
## age began town known fact least
## HPL 0.2745762 1.4541961 1.454089 0.426662 -0.7216491 0.1769029
## MWS -0.1994432 0.5280776 1.398367 -1.034332 -1.8795486 -0.2900365
## step interest often mad moon lost
## HPL -0.3724678 0.09630872 0.7093904 0.87304640 0.2683780 -0.4788316
## MWS 0.1311770 0.09650610 0.9199753 -0.05698348 -0.3130555 0.7799730
## repli home thousand morn fanci rather
## HPL -2.1942710 -0.1537096 -0.9372408 -0.1664466 -0.08356306 -0.5208451
## MWS -0.1823151 -0.3605622 0.2030441 0.1569298 -0.24665375 -0.5195005
## mountain sun read visit second hill
## HPL 0.8940084 -1.4861705 0.8490446 0.4610521 -0.1334212 1.4889475
## MWS 0.8645928 0.8739729 0.8195299 1.0381436 -1.4699955 0.1162009
## stone heaven rememb watch peopl sometim
## HPL 1.5456067 -0.8530627 0.08042333 1.121059 0.2577938 1.548605
## MWS -0.6084312 0.2307745 -0.43521364 1.183967 -0.5184615 1.258596
## charact brought view book larg affect
## HPL -1.150583 0.1824834 0.04672884 0.40516782 -0.5450343 -0.06623483
## MWS -1.300090 0.1074299 -0.42446175 0.04064702 -0.8632639 1.23462183
## perdita alway taken secret countri sleep
## HPL -1.324141 0.6539962 -0.2818156 0.46255382 0.4314401 0.3708728
## MWS 10.679775 -0.6003435 0.2159734 0.05729402 2.0036952 0.4890235
## get perceiv land suffer ancient case
## HPL 0.3701533 -1.5749740 1.282089 -0.3489746 1.61505081 0.0004804619
## MWS -1.9721628 -0.1485994 1.081708 0.6827955 0.04515496 -1.5608372612
## entir regard stood sure spoke circumst
## HPL -1.3362894 -0.7373667 0.03929624 -0.7146996 -0.10473010 -0.97809196
## MWS -0.5990079 -0.7352640 0.03213757 -1.0221298 -0.03046066 -0.05767761
## gentl reflect discov proceed fell attent
## HPL -0.2467913 -0.04797975 -0.9259138 -1.47475882 0.3171402 -1.6065352
## MWS 1.0389742 0.82441247 0.4222872 -0.03037744 0.3842206 -0.6700781
## arriv minut forc tear shadow period
## HPL -0.4411455 -1.929841 0.4790925 -1.123994 0.2082803 -1.331336
## MWS 1.0634379 -1.699129 -0.1916206 1.339649 -0.6700201 -1.006817
## move rest appar degre pain white
## HPL 0.3173116 0.6306424 -0.2549077 -0.7441478 -1.2104179 0.9682860
## MWS 0.0800344 0.8163082 -0.7146816 0.2627532 0.7637222 -0.1265203
## mother floor expect excit relat true
## HPL 0.8885127 0.9226296 0.2461755 -0.7954789 0.3771277 -0.6731686
## MWS 1.3004799 -0.9579621 0.5490836 -0.4241952 1.0149429 -0.4268049
## terribl account talk longer letter famili
## HPL 0.8139453 -0.1327228 1.4166108 -0.5810062 -0.9350487 0.7386304
## MWS -0.4067545 -0.2235025 0.9999955 -0.1084941 -0.4500920 0.7610563
## passion peculiar effect ground other suppos
## HPL -2.8789850 0.1122965 -0.0000144838 0.2331737 1.523782 -1.680288
## MWS 0.9010292 -1.3583118 -0.0980044100 0.5030340 1.754273 -1.053643
## better west month desir truth receiv
## HPL 0.1853768 2.9541930 -0.1797906 0.1210762 -0.95560920 0.01984439
## MWS 0.3331220 0.2055228 0.8015554 1.0848527 -0.01084531 0.69800654
## grew listen tri approach despair done walk
## HPL 0.1125965 0.6740614 3.284406 0.1529271 -1.182099 -0.8070523 0.910660
## MWS -1.0178047 1.0685488 1.768460 0.6366253 1.312783 -1.0205017 1.264679
## vast evil memori late fill line
## HPL -0.1894932 0.4823457 0.5757542 -0.2366506 -0.1002902 -0.008361823
## MWS -0.7446772 0.7200443 -0.6839914 -0.4154363 1.2307967 -0.880678549
## posit beneath subject escap adrian alreadi
## HPL -0.8877981 -0.1958092 -0.4213285 0.42843444 -1.465258 -1.4314767
## MWS -2.7492379 0.2554528 -0.2435097 -0.09337106 8.877616 0.3143522
## ask clear usual hideous fire poor
## HPL 0.7940028 0.4976363 -0.6722979 0.9479556 0.01348264 0.8317882
## MWS 0.7978548 -0.2843757 -0.2801564 -1.9119643 0.37892786 1.4758537
## attend counten breath impress hear suffici
## HPL -0.7337794 -1.7959002 -0.9291009 0.3318729 0.8116621 -2.0474993
## MWS 0.4051958 0.8738952 -1.3875777 -0.5288110 0.8528495 -0.5335557
## dear past purpos low dare cri till
## HPL -1.609684 1.340865 -0.97706754 0.4698974 1.088141 0.4267356 2.833866
## MWS 1.415510 1.030456 -0.05036787 -0.8179133 1.144108 0.5408059 1.981657
## fellow anim chamber event short figur
## HPL 0.2636688 0.1272983 -1.169447 -0.3663693 -0.2782079 -0.742283
## MWS 0.5464234 0.2385033 -1.122878 0.3831414 -0.1135372 -1.834829
## creatur final star wood dread hard
## HPL 0.7038123 0.3219756 0.6698859 0.5187576 0.1383283 0.6818811
## MWS 1.4567356 -1.7503251 0.3615295 1.3311690 0.5585247 1.4315458
## busi space cold either ill `next`
## HPL -0.57148403 1.6094161 1.199020 -0.5749888 0.3759970 0.7932166
## MWS -0.04428204 0.4648739 1.677301 -1.2954711 0.8091589 -0.3121052
## youth none studi given delight five child
## HPL 1.037516 0.3694499 1.020938 -0.3389876 -1.206512 -0.5335428 0.3479535
## MWS 1.120719 0.2755419 1.171242 -0.1488712 1.069165 -0.9900519 1.1227548
## attempt unknown murder togeth order terror
## HPL -0.9973991 0.8497529 -1.2075509 -0.5841953 0.5373530 -0.02590554
## MWS -0.8363839 -0.5513333 0.1653636 0.3161136 0.3092768 -1.40477473
## companion instant spot smile river sky
## HPL 0.1982159 -1.2562162 -0.3406070 -0.4922511 -0.5615726 1.089963
## MWS 0.9161791 -0.8655652 0.6243694 0.4349721 -0.3364643 1.046874
## motion origin paper best fall want
## HPL 0.2249229 -0.1669384 0.3935401 -0.5036965 -0.1643942 1.0681045
## MWS -0.4689432 -1.3620864 -0.2427141 -0.1318715 0.2384420 0.9137371
## led
## HPL 0.6597864
## MWS 0.9534994
##
## Std. Errors:
## (Intercept) one upon now will time
## HPL 0.04347801 0.08895336 0.1226958 0.1079919 0.1727309 0.1289752
## MWS 0.04280188 0.09010291 0.1187788 0.1054332 0.1057692 0.1270855
## even man day eye thing yet said
## HPL 0.1322215 0.1301098 0.1362692 0.1435903 0.1252612 0.1519076 0.1448455
## MWS 0.1261912 0.1319441 0.1254368 0.1356817 0.2139607 0.1347225 0.1355507
## seem like might old first night must
## HPL 0.1358842 0.1376152 0.1505361 0.1504316 0.1513716 0.1471832 0.1495566
## MWS 0.1627163 0.1613118 0.1443352 0.2108473 0.1466235 0.1574988 0.1403239
## thought look found never life great made
## HPL 0.156508 0.1574394 0.1456482 0.1436096 0.1877579 0.1422788 0.1596337
## MWS 0.149921 0.1603747 0.1585931 0.1595474 0.1656205 0.1718363 0.1510207
## long love everi littl still say place
## HPL 0.1523636 0.2877107 0.1762552 0.1508735 0.1566622 0.1661121 0.1656796
## MWS 0.1664314 0.1692010 0.1425770 0.1619226 0.1583802 0.2090354 0.1749273
## saw mani well appear hand hous came
## HPL 0.1653155 0.1596949 0.1624583 0.2021352 0.1706379 0.1628086 0.1717530
## MWS 0.1900313 0.1739111 0.1753911 0.1538881 0.1595616 0.1919001 0.1885887
## much see year fear may natur two
## HPL 0.1543541 0.1736933 0.1737153 0.1949145 0.2022461 0.1966978 0.1683057
## MWS 0.1929701 0.1936198 0.1785572 0.1873831 0.1623035 0.1702931 0.1998374
## word come can death heart mind feel
## HPL 0.2553243 0.1834933 0.1957871 0.1996895 0.3524163 0.1947685 0.2641336
## MWS 0.1695109 0.1958721 0.1627352 0.1791986 0.1671444 0.1809335 0.1754081
## know ever light thus near whose make
## HPL 0.1655467 0.1985341 0.1842129 0.3309437 0.1944548 0.1867658 0.1916569
## MWS 0.1887437 0.1866209 0.1982558 0.1695663 0.1902422 0.2042636 0.2043752
## friend without far open form shall heard
## HPL 0.2253639 0.1808312 0.1796225 0.1715548 0.1979849 0.2236583 0.1821378
## MWS 0.1806318 0.1919262 0.1903245 0.2129033 0.1957393 0.1673812 0.2039585
## men howev earth last left part hour
## HPL 0.2187685 0.2078769 0.2089845 0.2007967 0.1798080 0.1983000 0.1997898
## MWS 0.2397644 0.2164029 0.2022268 0.2020679 0.1962952 0.1821261 0.1938762
## world head room pass felt door live
## HPL 0.2021445 0.1814914 0.1813809 0.2135439 0.2067695 0.1896509 0.2159018
## MWS 0.2050159 0.2078845 0.2260336 0.1827435 0.1982195 0.2351617 0.2033611
## dream strang dark call way voic though
## HPL 0.2076134 0.2218328 0.1999766 0.1954435 0.1924081 0.2173221 0.4020666
## MWS 0.2417436 0.2432587 0.2298038 0.2067110 0.2416431 0.2117281 0.4287814
## moment inde human turn return hope back
## HPL 0.2174737 0.1994910 0.2231213 0.2132197 0.2374778 0.3222585 0.1938875
## MWS 0.1935804 0.2204381 0.2014919 0.2222703 0.2009601 0.2247801 0.2599978
## toward let within noth whole good point
## HPL 0.2590906 0.2490110 0.2168900 0.2007712 0.2340186 0.2443912 0.2234647
## MWS 0.2497348 0.1727581 0.2360819 0.2185736 0.2241095 0.2030713 0.2504574
## becam away take mean certain die father
## HPL 0.2175011 0.2223121 0.2185384 0.2587004 0.2133700 0.2450334 0.3489673
## MWS 0.2072798 0.2364243 0.2069655 0.2145502 0.3730971 0.2063040 0.3015671
## sound find bodi seen present face among
## HPL 0.1995711 0.2251692 0.1924481 0.1993202 0.2688553 0.2269417 0.2409137
## MWS 0.2365239 0.2280221 0.2993335 0.2604141 0.1989643 0.2617907 0.2064902
## length person knew remain raymond soon close
## HPL 0.2460720 0.2377020 0.2072597 0.2396220 15.19228 0.2422976 0.2105110
## MWS 0.2224213 0.2380263 0.2689763 0.2125934 15.17481 0.2233256 0.2405814
## three sea water citi spirit new chang
## HPL 0.1998812 0.2323773 0.2290755 0.2243734 0.3917374 0.2236477 0.2792093
## MWS 0.3115542 0.2359226 0.2453829 0.2637741 0.2165968 0.2508697 0.2581277
## half beauti street beyond although end think
## HPL 0.2017411 0.2606622 0.2300385 0.210065 0.3371831 0.2441597 0.2144933
## MWS 0.2254320 0.2184780 0.3363138 0.312999 0.2424540 0.2275864 0.2351065
## power soul alon observ air idea object
## HPL 0.2944896 0.2569111 0.2378772 0.3157754 0.2664946 0.3086191 0.2269136
## MWS 0.2172934 0.2144994 0.2290060 0.2375295 0.2220928 0.2167734 0.2258748
## just anoth less wall kind continu sinc
## HPL 0.2076999 0.2606883 0.2221822 0.2100931 0.2389226 0.2662268 0.2237478
## MWS 0.2963536 0.2554840 0.2343383 0.2976529 0.2343380 0.2133851 0.2728339
## name high happi gave window enter follow
## HPL 0.2530096 0.2365445 0.4040968 0.2526482 0.2269949 0.2823359 0.2503172
## MWS 0.2297608 0.2310594 0.2450192 0.2316719 0.3421747 0.2185926 0.2536151
## horror express imagin almost mere small wish
## HPL 0.2238457 0.2997476 0.2521915 0.2642755 0.2538001 0.2350065 0.2667779
## MWS 0.2509089 0.2450439 0.2449013 0.2860476 0.2876228 0.2736684 0.2468518
## black matter young took around full quit
## HPL 0.2168660 0.2694658 0.2395420 0.2526196 0.2309350 0.2394983 0.2772579
## MWS 0.3542973 0.3438529 0.2604514 0.2415861 0.2622884 0.2742344 0.2394517
## wonder side dead lay someth cours exist
## HPL 0.2558813 0.2274641 0.2504499 0.2484624 0.2247089 0.2628364 0.2720783
## MWS 0.2631927 0.2673379 0.2528336 0.2523118 0.2902002 0.2740917 0.2379652
## tell work went told believ sudden give
## HPL 0.2369266 0.2413111 0.2372299 0.2674060 0.2489387 0.2422841 0.2815722
## MWS 0.2826676 0.2430061 0.2324169 0.3323418 0.2413835 0.2613046 0.2413741
## becom scene leav reason arm god caus
## HPL 0.2907462 0.3037672 0.2824968 0.2632055 0.2631889 0.2472735 0.2780095
## MWS 0.2508124 0.2587994 0.2381121 0.2663296 0.2394994 0.2509788 0.2521331
## also manner perhap speak general wind possibl
## HPL 0.2917458 0.4069391 0.2334034 0.2643088 0.2518508 0.2820329 0.2365931
## MWS 0.2321593 0.2466122 0.2566501 0.2514581 0.3180925 0.2621862 0.3280712
## tree right state reach sens direct wild
## HPL 0.2419262 0.2452483 0.2900765 0.2458656 0.2399767 0.2626257 0.2674641
## MWS 0.2328793 0.2858555 0.2414047 0.2819423 0.2855388 0.3009016 0.2693141
## put deep care question set possess utter
## HPL 0.2625795 0.2728066 0.2733781 0.2601362 0.2573335 0.2884919 0.2755218
## MWS 0.2691238 0.2685264 0.2632439 0.2520933 0.2906732 0.2405185 0.2671267
## sight feet use sever evid doubt immedi
## HPL 0.2621083 0.2635096 0.2494567 0.2844662 0.2709303 0.2891241 0.3490540
## MWS 0.2971248 0.3359998 0.2909943 0.2500513 0.3220162 0.3147344 0.2865018
## age began town known fact least step
## HPL 0.2436768 0.2918019 0.3197795 0.2487740 0.2823448 0.2584542 0.290921
## MWS 0.2778622 0.3258793 0.3159653 0.3518601 0.4258727 0.2964588 0.271407
## interest often mad moon lost repli home
## HPL 0.2922188 0.3218335 0.2832078 0.2489794 0.3256443 0.4866690 0.2698739
## MWS 0.2858130 0.2979872 0.3324676 0.3109963 0.2671907 0.2507681 0.2835546
## thousand morn fanci rather mountain sun read
## HPL 0.3902115 0.2805144 0.2835175 0.2660703 0.3024512 0.4027862 0.2952346
## MWS 0.2442105 0.2555264 0.3063516 0.2775808 0.3084722 0.2664746 0.2874299
## visit second hill stone heaven rememb watch
## HPL 0.3250629 0.2454598 0.3080484 0.2988458 0.3352422 0.2626317 0.3006567
## MWS 0.2866541 0.3519044 0.3695914 0.5039573 0.2550328 0.2837529 0.2989842
## peopl sometim charact brought view book larg
## HPL 0.2653211 0.3667626 0.3164126 0.2677256 0.2765012 0.2739805 0.2668594
## MWS 0.3210877 0.3571694 0.3154434 0.2752773 0.3144105 0.3191491 0.3330494
## affect perdita alway taken secret countri sleep
## HPL 0.4478972 49.64799 0.2680880 0.2944089 0.2702187 0.3401416 0.2864222
## MWS 0.3066841 15.27691 0.3260181 0.2727998 0.3078306 0.2913971 0.2843143
## get perceiv land suffer ancient case entir
## HPL 0.2530289 0.4375199 0.3265479 0.3765420 0.3217041 0.2701866 0.3807066
## MWS 0.5411373 0.2845191 0.3422383 0.2751984 0.4198403 0.4288339 0.2819162
## regard stood sure spoke circumst gentl reflect
## HPL 0.3439060 0.2862194 0.2513222 0.3213001 0.3746145 0.3979570 0.3564313
## MWS 0.3017984 0.2904766 0.3016954 0.2842698 0.2894518 0.2920714 0.2853477
## discov proceed fell attent arriv minut forc
## HPL 0.3463049 0.4039452 0.3098120 0.3978628 0.3642553 0.4132961 0.2843212
## MWS 0.2545922 0.2572715 0.2850902 0.2972019 0.2456290 0.3796597 0.3192826
## tear shadow period move rest appar degre
## HPL 0.5941206 0.2798835 0.3445128 0.2651095 0.3209778 0.2735570 0.3643210
## MWS 0.3286354 0.3357031 0.3005516 0.2863444 0.3044127 0.3041284 0.2864957
## pain white mother floor expect excit relat
## HPL 0.4327013 0.2790448 0.3936947 0.2584485 0.3121575 0.3320807 0.3268129
## MWS 0.2573709 0.3467556 0.3506591 0.4792446 0.2904756 0.2861743 0.2949068
## true terribl account talk longer letter famili
## HPL 0.3386763 0.2896511 0.3231945 0.3075655 0.3578201 0.3843125 0.3086888
## MWS 0.2770083 0.3857649 0.3027158 0.3155546 0.3029336 0.3036273 0.3041059
## passion peculiar effect ground other suppos better
## HPL 1.0696483 0.2792654 0.3002151 0.3195528 0.3820827 0.4562231 0.3013698
## MWS 0.2943032 0.3807536 0.2919244 0.3204750 0.3637185 0.3841793 0.2972543
## west month desir truth receiv grew listen
## HPL 0.3777245 0.3535168 0.3873883 0.3672437 0.3652824 0.2770041 0.3396231
## MWS 0.5316733 0.2790656 0.2919653 0.2831360 0.3077479 0.3550346 0.3153912
## tri approach despair done walk vast evil
## HPL 0.5304824 0.3247982 0.6011709 0.2940288 0.3645054 0.2977065 0.3301212
## MWS 0.5713321 0.2860258 0.3020996 0.3187712 0.3684199 0.3675172 0.2937783
## memori late fill line posit beneath subject
## HPL 0.3079064 0.3064624 0.4406264 0.2828886 0.3452203 0.3158088 0.3777960
## MWS 0.3731046 0.3139449 0.3555449 0.3800402 0.6275683 0.3351625 0.3137865
## escap adrian alreadi ask clear usual hideous
## HPL 0.2856375 20.13541 0.4259040 0.3201343 0.2779753 0.3220448 0.2877327
## MWS 0.2988481 8.34721 0.2788231 0.3099150 0.3577910 0.3011289 0.5605165
## fire poor attend counten breath impress hear
## HPL 0.3231077 0.3517843 0.3729445 0.6395099 0.3418551 0.2777533 0.3175466
## MWS 0.3083570 0.3112843 0.2750567 0.2992271 0.3385736 0.3521397 0.3214880
## suffici dear past purpos low dare cri
## HPL 0.5516409 0.7734607 0.3501606 0.3917928 0.2972859 0.3697516 0.3325559
## MWS 0.3240643 0.3165662 0.3673496 0.2800435 0.4160026 0.3591644 0.3129643
## till fellow anim chamber event short figur
## HPL 0.5572014 0.3311995 0.3266877 0.3271600 0.3430434 0.3338472 0.3015080
## MWS 0.5597706 0.3171120 0.3006762 0.3656637 0.2952402 0.3147527 0.4978633
## creatur final star wood dread hard busi
## HPL 0.4125686 0.2850478 0.3224423 0.3792680 0.3377706 0.3739193 0.3150507
## MWS 0.3639274 0.4974412 0.3257494 0.3550286 0.3000471 0.3289955 0.2885643
## space cold either ill `next` youth none
## HPL 0.3574864 0.4161970 0.3406185 0.3610789 0.3149796 0.4143747 0.3253118
## MWS 0.4650457 0.3933513 0.4016952 0.3206572 0.3661955 0.4202231 0.3278768
## studi given delight five child attempt unknown
## HPL 0.3463570 0.3054849 0.5065025 0.3425079 0.3674458 0.3805589 0.3069884
## MWS 0.3380743 0.2999451 0.3148131 0.4001457 0.3104979 0.3163136 0.4031078
## murder togeth order terror companion instant spot
## HPL 0.4411708 0.3812219 0.2904356 0.2959606 0.3781021 0.3785187 0.3483151
## MWS 0.2649089 0.3112236 0.3180901 0.4060880 0.3120298 0.3353669 0.3337312
## smile river sky motion origin paper best
## HPL 0.4014302 0.3397778 0.3922403 0.3054318 0.3308058 0.2961408 0.3518996
## MWS 0.2940460 0.3573274 0.4090401 0.3715193 0.4466205 0.3626197 0.2949877
## fall want led
## HPL 0.3368749 0.3316937 0.3608234
## MWS 0.3402750 0.3330111 0.3483958
##
## Residual Deviance: 21462.56
## AIC: 23066.56
preds_lc <- predict(model_lc, type = "class", newdata = test)
preds_lcp <- predict(model_lc, type = "probs", newdata = test)
sum(preds_lc == tiny$Author[14580:19579])/5000
## [1] 0.653
table(preds_lc, tiny$Author[14580:19579])
##
## preds_lc EAP HPL MWS
## EAP 1518 437 447
## HPL 245 835 186
## MWS 262 158 912
MultiLogLoss(preds_lcp, tiny$Author[14580:19579])
## [1] 0.7857041
Okay, so we ran this through a logistic classifier and got our accuracy up to 65.3%. Our multiclass log loss also dropped dramatically, from 10.28 to 0.79; the naive Bayes number was so bad because that classifier is extremely overconfident in its predictions. Not bad for our second shot. Admittedly, it takes more and more effort to move up from here, but this looks like a reasonable track.
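One cheap remedy for an overconfident model is to smooth its predicted probabilities toward the uniform distribution before scoring (a sketch; the 0.9 mixing weight is arbitrary, not tuned):
preds_smooth <- 0.9 * preds_p + 0.1 / 3  # shrink each row toward (1/3, 1/3, 1/3)
MultiLogLoss(preds_smooth, tiny$Author[14580:19579])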
My suggestions: use more of the words; use all of the training data; maybe clean the data further. If you are going to use a logistic classifier or a neural net, normalizing the features is a good idea (a sketch follows). Add an unknown-word marker to deal with words that appear only in the test data.
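For the normalization step, a minimal sketch (any zero-variance columns would need guarding before scale is applied):
X <- as.matrix(tiny %>% select(-Author))
X_norm <- scale(X)  # center each word-count column and divide by its standard deviation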
Of course, you still have to transform the testing data into the same format in order to submit.
If you win, throw a party for the class :)