ISLR Exercise 8

Although we could use the ISLR package, we will download the dataset into our local directory.

download.file("", "College.csv")

Load our data frame from the file.

df <- read.csv("College.csv")

Let’s look at the first few rows and a summary:

##                              X Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University     Yes 1660   1232    721        23
## 2           Adelphi University     Yes 2186   1924    512        16
## 3               Adrian College     Yes 1428   1097    336        22
## 4          Agnes Scott College     Yes  417    349    137        60
## 5    Alaska Pacific University     Yes  193    146     55        16
## 6            Albertson College     Yes  587    479    158        38
##   Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1        52        2885         537     7440       3300   450     2200  70
## 2        29        2683        1227    12280       6450   750     1500  29
## 3        50        1036          99    11250       3750   400     1165  53
## 4        89         510          63    12960       5450   450      875  92
## 5        44         249         869     7560       4120   800     1500  76
## 6        62         678          41    13500       3335   500      675  67
##   Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1       78      18.1          12   7041        60
## 2       30      12.2          16  10527        56
## 3       66      12.9          30   8735        54
## 4       97       7.7          37  19016        59
## 5       72      11.9           2  10922        15
## 6       73       9.4          11   9727        55
##                             X       Private        Apps      
##  Abilene Christian University:  1   No :212   Min.   :   81  
##  Adelphi University          :  1   Yes:565   1st Qu.:  776  
##  Adrian College              :  1             Median : 1558  
##  Agnes Scott College         :  1             Mean   : 3002  
##  Alaska Pacific University   :  1             3rd Qu.: 3624  
##  Albertson College           :  1             Max.   :48094  
##  (Other)                     :771                            
##      Accept          Enroll       Top10perc       Top25perc    
##  Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0  
##  1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0  
##  Median : 1110   Median : 434   Median :23.00   Median : 54.0  
##  Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8  
##  3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0  
##  Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board  
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780  
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597  
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200  
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358  
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124  
##      Books           Personal         PhD            Terminal    
##  Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0  
##  1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0  
##  Median : 500.0   Median :1200   Median : 75.00   Median : 82.0  
##  Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7  
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0  
##  Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0  
##    S.F.Ratio      perc.alumni        Expend        Grad.Rate     
##  Min.   : 2.50   Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median :13.60   Median :21.00   Median : 8377   Median : 65.00  
##  Mean   :14.09   Mean   :22.74   Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :39.80   Max.   :64.00   Max.   :56233   Max.   :118.00  

Uh oh. We can see that the college names have been converted into factors. This is not especially helpful. Let’s make these into the names of the rows.

rownames(df) <- df[,1]
##                                                         X Private Apps
## Abilene Christian University Abilene Christian University     Yes 1660
## Adelphi University                     Adelphi University     Yes 2186
## Adrian College                             Adrian College     Yes 1428
## Agnes Scott College                   Agnes Scott College     Yes  417
## Alaska Pacific University       Alaska Pacific University     Yes  193
## Albertson College                       Albertson College     Yes  587
##                              Accept Enroll Top10perc Top25perc F.Undergrad
## Abilene Christian University   1232    721        23        52        2885
## Adelphi University             1924    512        16        29        2683
## Adrian College                 1097    336        22        50        1036
## Agnes Scott College             349    137        60        89         510
## Alaska Pacific University       146     55        16        44         249
## Albertson College               479    158        38        62         678
##                              P.Undergrad Outstate Room.Board Books
## Abilene Christian University         537     7440       3300   450
## Adelphi University                  1227    12280       6450   750
## Adrian College                        99    11250       3750   400
## Agnes Scott College                   63    12960       5450   450
## Alaska Pacific University            869     7560       4120   800
## Albertson College                     41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni
## Abilene Christian University     2200  70       78      18.1          12
## Adelphi University               1500  29       30      12.2          16
## Adrian College                   1165  53       66      12.9          30
## Agnes Scott College               875  92       97       7.7          37
## Alaska Pacific University        1500  76       72      11.9           2
## Albertson College                 675  67       73       9.4          11
##                              Expend Grad.Rate
## Abilene Christian University   7041        60
## Adelphi University            10527        56
## Adrian College                 8735        54
## Agnes Scott College           19016        59
## Alaska Pacific University     10922        15
## Albertson College              9727        55
##                             X       Private        Apps      
##  Abilene Christian University:  1   No :212   Min.   :   81  
##  Adelphi University          :  1   Yes:565   1st Qu.:  776  
##  Adrian College              :  1             Median : 1558  
##  Agnes Scott College         :  1             Mean   : 3002  
##  Alaska Pacific University   :  1             3rd Qu.: 3624  
##  Albertson College           :  1             Max.   :48094  
##  (Other)                     :771                            
##      Accept          Enroll       Top10perc       Top25perc    
##  Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0  
##  1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0  
##  Median : 1110   Median : 434   Median :23.00   Median : 54.0  
##  Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8  
##  3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0  
##  Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board  
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780  
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597  
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200  
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358  
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124  
##      Books           Personal         PhD            Terminal    
##  Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0  
##  1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0  
##  Median : 500.0   Median :1200   Median : 75.00   Median : 82.0  
##  Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7  
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0  
##  Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0  
##    S.F.Ratio      perc.alumni        Expend        Grad.Rate     
##  Min.   : 2.50   Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median :13.60   Median :21.00   Median : 8377   Median : 65.00  
##  Mean   :14.09   Mean   :22.74   Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :39.80   Max.   :64.00   Max.   :56233   Max.   :118.00  

We are supposed to use the fix command. Let’s see what that is.


Okay, that was not especially helpful. Let’s run it.


I see. This gives me an incredibly ugly data frame editor.

Okay, let’s get rid of the unnecessary column.

df <- df[,-1]

Note that the “-1” here selects every column except the first. This differs from Python, where -1 typically refers to the last element and -2 to the second to last.
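
A small sketch of the difference, using a made-up three-column data frame (the names `toy`, `name`, `a`, and `b` are illustrative and not part of the College data):

```r
# Toy data frame; the column names are hypothetical.
toy <- data.frame(name = c("x", "y"), a = 1:2, b = 3:4)

# Negative index: drop the first column, keep everything else.
dropped <- toy[, -1]
print(names(dropped))

# To get the last column (what -1 means in Python), index by position.
last_col <- toy[, ncol(toy)]
print(last_col)
```

Here `toy[, -1]` keeps `a` and `b`, while `toy[, ncol(toy)]` is the R way to ask for the final column.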

Let’s see this in our ugly editor again.


Okay, let’s look at our cleaned data frame:

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

Good, the unnecessary factor data is gone and apparently some school has an awful graduation rate.

Which one is it?

## [1] 586

Who is the culprit?

##                           Private Apps Accept Enroll Top10perc Top25perc
## Texas Southern University      No 4345   3245   2604        15        85
##                           F.Undergrad P.Undergrad Outstate Room.Board
## Texas Southern University        5584        3101     7860       3360
##                           Books Personal PhD Terminal S.F.Ratio
## Texas Southern University   600     1700  65       75      18.2
##                           perc.alumni Expend Grad.Rate
## Texas Southern University          21   3605        10
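
The code behind this lookup is not shown. One way to do it, sketched here on a made-up data frame, is `which.min()`, which returns the position of the first minimum; that position can then index the row. The school names and values below are illustrative only.

```r
# Hypothetical miniature of the College data, with school names as row names.
toy <- data.frame(Grad.Rate = c(60, 15, 77),
                  row.names = c("School A", "School B", "School C"))

# Position of the smallest graduation rate.
worst <- which.min(toy$Grad.Rate)
print(worst)

# Use that position to pull the full row, including the school name.
print(toy[worst, , drop = FALSE])
print(rownames(toy)[worst])
```

The `drop = FALSE` argument keeps the result as a data frame even when only one column is present.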

Wow, we should avoid Texas Southern University.

Now, let us look at a pairwise scatterplot of the first ten columns.


That was ugly. Let’s try this again.


Now let us construct a boxplot:

library(ggplot2)
ggplot(df, aes(y = Outstate, x = Private)) + geom_boxplot()

Okay, let us construct a new Elite feature which indicates if more than half of the students come from the top 10% of their class. First populate the entries.

Elite = rep("No", nrow(df))

Mark the elite schools:

Elite[df$Top10perc > 50]="Yes"

Make these strings factors:

Elite = as.factor(Elite)

Add to our data frame.

df$Elite = Elite
##  [1] "Private"     "Apps"        "Accept"      "Enroll"      "Top10perc"  
##  [6] "Top25perc"   "F.Undergrad" "P.Undergrad" "Outstate"    "Room.Board" 
## [11] "Books"       "Personal"    "PhD"         "Terminal"    "S.F.Ratio"  
## [16] "perc.alumni" "Expend"      "Grad.Rate"   "Elite"
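
The same kind of feature can also be built in one step with `ifelse()` and `factor()`. This is a sketch on made-up values rather than the code used above; `top10` stands in for `df$Top10perc`.

```r
# Hypothetical percentages of students from the top 10% of their class.
top10 <- c(23, 60, 50, 96)

# Schools strictly above 50 are "Yes"; exactly 50 stays "No".
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

print(elite)
print(table(elite))
```

Setting `levels` explicitly fixes the order of the categories, which controls the order of the boxes in later plots.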

Okay, check the new summary:

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :699  
##  1st Qu.: 53.00   Yes: 78  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

Looks good. Now let us check out the out-of-state numbers for elite schools.

ggplot(df, aes(x=Elite, y=Outstate)) + geom_boxplot()

Let us check this out side by side:

p1 <- ggplot(df, aes(x = Private, y = Outstate)) + geom_boxplot()
p2 <- ggplot(df, aes(x = Elite, y = Outstate)) + geom_boxplot()
library(ggpubr)
## Loading required package: magrittr
ggarrange(p1, p2)


Let’s check out some more distributions.

p1 <- ggplot(df, aes(x = Room.Board)) + geom_histogram(bins = 15)

p2 <- ggplot(df, aes(x = PhD)) + geom_histogram(bins = 20)

p3 <- ggplot(df, aes(x = Grad.Rate)) + geom_histogram(bins = 10)

p4 <- ggplot(df, aes(x = Apps)) + geom_histogram(bins = 20)
ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

Finally, let’s check out how many applications Elite schools get:

ggplot(df, aes(x = Elite, y = Apps)) + geom_boxplot()

Wow, who is that outlier?

##                          Private  Apps Accept Enroll Top10perc Top25perc
## Rutgers at New Brunswick      No 48094  26330   4520        36        79
##                          F.Undergrad P.Undergrad Outstate Room.Board Books
## Rutgers at New Brunswick       21401        3712     7410       4748   690
##                          Personal PhD Terminal S.F.Ratio perc.alumni
## Rutgers at New Brunswick     2009  90       95      19.5          19
##                          Expend Grad.Rate Elite
## Rutgers at New Brunswick  10474        77    No

What is going on there?

Some fun

Let us clear our workspace.

rm(list = ls())

Okay, I have grown a little tired of R’s built-in idiosyncrasies. I am going to try to work in the tidyverse.

Let us load a spooky dataset:

library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
spooky <- read_csv("train.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   text = col_character(),
##   author = col_character()
## )
## # A tibble: 6 x 3
##        id
##     <chr>
## 1 id26305
## 2 id17569
## 3 id11008
## 4 id27763
## 5 id12958
## 6 id22965
## # ... with 2 more variables: text <chr>, author <chr>

Okay, let us clean this up.

spooky = spooky[,-1]
spooky$author <- as.factor(spooky$author)
## # A tibble: 6 x 2
##                                                                          text
##                                                                         <chr>
## 1 This process, however, afforded me no means of ascertaining the dimensions 
## 2     It never once occurred to me that the fumbling might be a mere mistake.
## 3 In his left hand was a gold snuff box, from which, as he capered down the h
## 4 How lovely is spring As we looked from Windsor Terrace on the sixteen ferti
## 5 Finding nothing else, not even gold, the Superintendent abandoned his attem
## 6 A youth passed in solitude, my best years spent under your gentle and femin
## # ... with 1 more variables: author <fctr>

We are going to manipulate some text, so let us load stringr (this is unnecessary after loading tidyverse).


Now let us check the use of punctuation.

spooky$commas = str_count(spooky$text, ",")
spooky$exclam = str_count(spooky$text, "!")
spooky$quest = str_count(spooky$text, "\\?")
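
The question mark is escaped above because the pattern is treated as a regular expression, where `?` is a quantifier. The sketch below shows the same counting idea in base R on made-up sentences; `count_matches` is a hypothetical helper written for this illustration, not a stringr function.

```r
# Count non-overlapping matches of a pattern in each string.
# Hypothetical helper; stringr's str_count() does the same job.
count_matches <- function(text, pattern, ...) {
  hits <- gregexpr(pattern, text, ...)
  # gregexpr() returns -1 when there is no match, so count positive positions.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

sentences <- c("Who, me? Really?", "No commas here", "One, two, three!")

commas <- count_matches(sentences, ",")
quest  <- count_matches(sentences, "\\?")              # escape the regex quantifier
exclam <- count_matches(sentences, "!", fixed = TRUE)  # or match literally

print(commas)
print(quest)
print(exclam)
```

Passing `fixed = TRUE` is an alternative to escaping when the pattern is a literal character.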

A summary:

##      text           author         commas           exclam 
##  Length:19579       EAP:7900   Min.   : 0.000   Min.   :0  
##  Class :character   HPL:5635   1st Qu.: 1.000   1st Qu.:0  
##  Mode  :character   MWS:6044   Median : 1.000   Median :0  
##                                Mean   : 1.952   Mean   :0  
##                                3rd Qu.: 3.000   3rd Qu.:0  
##                                Max.   :48.000   Max.   :0  
##      quest        
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05608  
##  3rd Qu.:0.00000  
##  Max.   :4.00000
spooky %>% filter(commas == max(commas))
## # A tibble: 1 x 5
##                                                                          text
##                                                                         <chr>
## 1 "To chambers of painted state farewell To midnight revelry, and the panting
## # ... with 4 more variables: author <fctr>, commas <int>, exclam <int>,
## #   quest <int>

Okay, so exclamation points are no help:

spooky = spooky %>% select(-exclam)
library(GGally)  # provides ggpairs()
ggpairs(spooky %>% select(-text))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(tm)
## Loading required package: NLP
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##     annotate
corp <- Corpus(VectorSource(spooky$text))
## Warning in as.POSIXlt.POSIXct(Sys.time(), tz = "GMT"): unknown timezone
## 'default/Europe/Berlin'
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 231
## This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.

Okay, clean the text.

corp <- tm_map(corp, removePunctuation)
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 224
## This process however afforded me no means of ascertaining the dimensions of my dungeon as I might make its circuit and return to the point whence I set out without being aware of the fact so perfectly uniform seemed the wall
corp <- corp %>% 
  tm_map(removeNumbers) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)  # final step inferred from the stemmed output below
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 136
## process howev afford mean ascertain dimens dungeon might make circuit return point whenc set without awar fact perfect uniform seem wall

Create a document term matrix:

dtm <- DocumentTermMatrix(corp)
## <<DocumentTermMatrix (documents: 5, terms: 4)>>
## Non-/sparse entries: 4/16
## Sparsity           : 80%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs afford ascertain awar circuit
##    1      1         1    1       1
##    2      0         0    0       0
##    3      0         0    0       0
##    4      0         0    0       0
##    5      0         0    0       0
dtmfr <- as_tibble(as.matrix(dtm))
## # A tibble: 6 x 15,013
##   afford ascertain  awar circuit dimens dungeon  fact howev  make  mean
##    <dbl>     <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1      1         1     1       1      1       1     1     1     1     1
## 2      0         0     0       0      0       0     0     0     0     0
## 3      0         0     0       0      0       0     0     0     0     0
## 4      0         0     0       0      0       0     0     0     0     0
## 5      0         0     0       0      0       0     0     0     0     0
## 6      0         0     0       0      0       0     0     0     0     0
## # ... with 15003 more variables: might <dbl>, perfect <dbl>, point <dbl>,
## #   process <dbl>, return <dbl>, seem <dbl>, set <dbl>, uniform <dbl>,
## #   wall <dbl>, whenc <dbl>, without <dbl>, fumbl <dbl>, mere <dbl>,
## #   mistak <dbl>, never <dbl>, occur <dbl>, air <dbl>, box <dbl>,
## #   caper <dbl>, cut <dbl>, fantast <dbl>, gold <dbl>, greatest <dbl>,
## #   hand <dbl>, hill <dbl>, incess <dbl>, left <dbl>, manner <dbl>,
## #   possibl <dbl>, satisfact <dbl>, self <dbl>, snuff <dbl>, step <dbl>,
## #   took <dbl>, beneath <dbl>, cheer <dbl>, cottag <dbl>, counti <dbl>,
## #   fair <dbl>, fertil <dbl>, former <dbl>, happi <dbl>, heart <dbl>,
## #   look <dbl>, love <dbl>, sixteen <dbl>, speckl <dbl>, spread <dbl>,
## #   spring <dbl>, terrac <dbl>, town <dbl>, wealthier <dbl>,
## #   windsor <dbl>, year <dbl>, abandon <dbl>, attempt <dbl>,
## #   counten <dbl>, desk <dbl>, els <dbl>, even <dbl>, find <dbl>,
## #   noth <dbl>, occasion <dbl>, perplex <dbl>, sit <dbl>, steal <dbl>,
## #   superintend <dbl>, think <dbl>, abl <dbl>, believ <dbl>, best <dbl>,
## #   board <dbl>, brutal <dbl>, charact <dbl>, crew <dbl>, distast <dbl>,
## #   equal <dbl>, exercis <dbl>, felt <dbl>, feminin <dbl>, fortun <dbl>,
## #   fosterag <dbl>, gentl <dbl>, groundwork <dbl>, heard <dbl>,
## #   intens <dbl>, kindli <dbl>, marin <dbl>, necessari <dbl>, note <dbl>,
## #   obedi <dbl>, overcom <dbl>, paid <dbl>, pass <dbl>, peculiar <dbl>,
## #   refin <dbl>, respect <dbl>, secur <dbl>, servic <dbl>, ship <dbl>, ...
dtmfr$Author  = spooky$author
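
To make the structure of a document-term matrix concrete, here is a hand-rolled sketch in base R on three made-up documents. It is not how the tm package builds its matrix internally; it only shows the shape of the result, with one row per document and one column per term.

```r
# Three tiny "documents", already lower-cased and stripped of punctuation.
docs <- c("gold snuff box", "gold gold wall", "wall")

# Split each document into words.
tokens <- strsplit(docs, " ", fixed = TRUE)

# One column per distinct term, one row per document.
terms <- sort(unique(unlist(tokens)))
dtm_toy <- t(vapply(tokens,
                    function(words) as.integer(table(factor(words, levels = terms))),
                    integer(length(terms))))
colnames(dtm_toy) <- terms
print(dtm_toy)
```

Most entries in a real matrix of this kind are zero, which is why the output above reports a sparsity figure.
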

Wow, using tibbles was much faster than data frames.

Now we will aggregate the data and find the words whose counts vary the most among the authors:

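# Sum each term's count per author (aggregation step inferred from the output below).
counts <- dtmfr %>% group_by(Author) %>% summarise_all(sum)
# Indices of the ten terms whose per-author totals vary the most.
ord = order(apply(select(counts,-Author), 2, var), decreasing = TRUE)[1:10] 
counts %>% select(-Author) %>% select(ord)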
## # A tibble: 3 x 10
##    upon  love  will thing   old raymond   say heart  life  thus
##   <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1025    74   410   221   139       0   334   111   105   254
## 2   186    37    78   433   392       2   123    20   130    15
## 3   200   425   468    71    85     270    63   290   333   136
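
The aggregate-then-rank idea can be sketched in base R with `rowsum()` on a made-up count matrix. The terms, counts, and author labels below are invented for illustration, and `rowsum()` is a swapped-in base R alternative to the dplyr pipeline.

```r
# Hypothetical term counts: one row per sentence, one column per term.
term_counts <- matrix(c(2, 0, 1,
                        1, 0, 1,
                        0, 3, 1,
                        0, 2, 1),
                      ncol = 3, byrow = TRUE,
                      dimnames = list(NULL, c("upon", "love", "the")))
author <- c("EAP", "EAP", "MWS", "MWS")

# Total count of each term per author.
per_author <- rowsum(term_counts, group = author)
print(per_author)

# Variance of each term's totals across authors, largest first.
spread <- apply(per_author, 2, var)
ranked <- names(sort(spread, decreasing = TRUE))
print(ranked)
```

A term used equally by every author (here `the`) has zero variance and lands at the bottom, which is why this ranking surfaces author-specific vocabulary.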

Just for fun let us try a naive Bayes classifier. We need to shrink the data first to make this manageable. We will also set aside the last 5000 entries for validation.

new_order = order(apply(select(counts,-Author), 2, sum),decreasing = TRUE)[1:400]
tiny = dtmfr %>% select(-Author) %>% select(new_order) 
tiny$Author = dtmfr$Author
library(e1071)  # provides naiveBayes()
model <- naiveBayes(Author ~ ., data = tiny[1:14579,])
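
The hard-coded row numbers come from holding out the last 5000 of the 19579 rows. A sketch of computing those indices instead of typing them (the variable names are illustrative):

```r
n_total <- 19579   # rows in the training file, per the summary above
n_valid <- 5000    # rows held out for validation

train_idx <- seq_len(n_total - n_valid)
valid_idx <- seq(n_total - n_valid + 1, n_total)

print(range(train_idx))
print(range(valid_idx))
```

Taking the last rows as the validation set assumes the file is not sorted in a way that makes those rows unrepresentative; a random sample would avoid relying on that.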

Now let us test it against the validation data.

We will check it on the 5000 held-out entries to assess the accuracy, and then look at a confusion matrix to see what kinds of errors our classifier makes.

library(MLmetrics)  # provides MultiLogLoss()
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##     Recall
test <- (tiny %>% select(-Author))[14580:19579,]
preds <- predict(model, newdata = test)
preds_p <- 
  predict(model, type = "raw", newdata = test)
sum(preds == tiny$Author[14580:19579])/5000
## [1] 0.6094
MultiLogLoss(preds_p, tiny$Author[14580:19579])
## [1] 10.27673
table(preds, tiny$Author[14580:19579])
## preds  EAP  HPL  MWS
##   EAP 1388  353  475
##   HPL  430  936  347
##   MWS  207  141  723
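
The accuracy can be recomputed from the confusion matrix above and compared against two simple baselines with an exact binomial test. This is a sketch: the choice of `binom.test()` and of the two baselines is an assumption made for illustration, not something the original analysis ran.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual).
conf <- matrix(c(1388, 353, 475,
                  430, 936, 347,
                  207, 141, 723),
               nrow = 3, byrow = TRUE,
               dimnames = list(pred = c("EAP", "HPL", "MWS"),
                               actual = c("EAP", "HPL", "MWS")))

n_correct <- sum(diag(conf))
n_total   <- sum(conf)
accuracy  <- n_correct / n_total
print(accuracy)

# Baseline 1: uniform guessing among three authors.
test_uniform <- binom.test(n_correct, n_total, p = 1 / 3, alternative = "greater")

# Baseline 2: always predicting the most common author in the validation set.
majority_rate <- max(colSums(conf)) / n_total
test_majority <- binom.test(n_correct, n_total, p = majority_rate,
                            alternative = "greater")

print(majority_rate)
print(test_uniform$p.value)
print(test_majority$p.value)
```

The majority-class baseline is the stricter of the two, since one author accounts for more than a third of the validation rows.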

First, the good news: 60.94% accuracy on the validation set is significantly better than random guessing (run a hypothesis test to check this). The bad news: our log loss of 10.28 is far worse than the public leader's (lower is better), which makes me think I have misread the rules.
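
To put a log loss figure in context, here is a hand-written sketch of multiclass log loss on made-up predictions. The function `multi_log_loss` is hypothetical and written for this illustration; it is not the MLmetrics implementation, and the clipping constant `eps` is an assumption.

```r
# Multiclass log loss: mean negative log of the probability given to the true class.
# `probs` has one row per observation and one named column per class.
multi_log_loss <- function(probs, truth, eps = 1e-15) {
  probs <- pmin(pmax(probs, eps), 1 - eps)  # clip to avoid log(0)
  picked <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]
  -mean(log(picked))
}

authors <- c("EAP", "HPL", "MWS")
truth <- c("EAP", "MWS", "HPL", "EAP")

# Uniform guessing scores log(3) regardless of the truth.
uniform <- matrix(1 / 3, nrow = 4, ncol = 3, dimnames = list(NULL, authors))
print(multi_log_loss(uniform, truth))

# A confident but wrong prediction is punished heavily.
overconfident <- matrix(c(1, 0, 0), nrow = 4, ncol = 3, byrow = TRUE,
                        dimnames = list(NULL, authors))
print(multi_log_loss(overconfident, truth))
```

Uniform guessing over three classes scores log(3), about 1.10, so a value near 10 suggests many confident wrong predictions. Naive Bayes is known to produce probabilities close to 0 or 1, which may be part of the explanation here.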

Let’s see how a multinomial logistic classifier would hold up:

library(nnet)  # provides multinom()
model_lc <- multinom(Author ~ ., data = tiny[1:14579,], MaxNWts = 3000)
## # weights:  1206 (802 variable)
## initial  value 16016.668556 
## iter  10 value 11628.629580
## iter  20 value 10908.090402
## iter  30 value 10747.013589
## iter  40 value 10732.296042
## iter  50 value 10731.489415
## iter  60 value 10731.442760
## iter  70 value 10731.423366
## iter  80 value 10731.370125
## iter  90 value 10731.313671
## iter 100 value 10731.282456
## final  value 10731.282456 
## stopped after 100 iterations
## Call:
## multinom(formula = Author ~ ., data = tiny[1:14579, ], MaxNWts = 3000)
## Coefficients:
##     (Intercept)        one      upon        now       will      time
## HPL  -0.6712898 -0.0396722 -1.565461 -0.1699132 -1.2480167 0.1551636
## MWS  -0.6587432 -0.0432382 -1.491000 -0.1533031  0.1203872 0.2656451
##           even       man        day        eye      thing        yet
## HPL 0.02409388 0.2573360 -0.2120391 -0.2306069  0.9366177 -0.1411803
## MWS 0.07859653 0.2630459  0.1029332  0.1391803 -1.1900740  0.2761100
##           said       seem       like     might        old       first
## HPL -0.6503126  0.7321371 0.57697660 0.4901314  1.0595776 -0.24906272
## MWS -0.5292448 -0.2374424 0.05696853 0.7461682 -0.6056544  0.05686004
##         night      must     thought      look      found      never
## HPL 0.4088301 0.3170074 -0.03437579 0.3819927  0.1361783  0.1965915
## MWS 0.2500908 0.5577168  0.23892539 0.4990832 -0.2889225 -0.0718980
##          life       great        made       long      love      everi
## HPL 0.4165398  0.06024384 -0.45386350 -0.1964061 -0.811931 -0.4723819
## MWS 1.1035625 -0.59270974 -0.06573172 -0.3981582  1.145623  0.5123791
##          littl       still        say     place       saw       mani
## HPL -0.3738058 -0.17415286 -0.5407533 0.4546974 0.9294185  0.3154346
## MWS -0.3601869  0.01801602 -1.4454504 0.3867138 0.4076899 -0.2199853
##           well     appear        hand      hous      came       much
## HPL -0.6176318 -0.7933379 -0.04520392 0.6120965 0.8044274 -0.1326384
## MWS -0.8081276  0.1754485  0.21623109 0.1221886 0.3284930 -0.6715427
##            see      year     fear        may       natur         two
## HPL 0.60573058 0.4740895 1.079838 -0.2976259 -0.20575620  0.06686507
## MWS 0.02851436 0.2591825 1.176245  0.1484540  0.03262476 -0.41774080
##            word      come        can        death      heart      mind
## HPL -1.15131105 1.3019883 -0.2811540 -0.005998768 -1.6220977 0.1786874
## MWS -0.02200052 0.7266197  0.5350109  0.419867642  0.7573054 0.2976594
##           feel       know      ever     light       thus       near
## HPL -0.5990999  0.5947941 0.4000024 0.5262952 -2.2886393 -0.0936151
## MWS  0.8076846 -0.0279369 0.5149314 0.1825601 -0.4192177  0.2149737
##         whose        make     friend    without        far        open
## HPL 0.7221844 -0.08866683 -0.2114020 -0.1025191 -0.1576165  0.09228348
## MWS 0.5291968 -0.29114580  0.5917266 -0.3752697 -0.3539289 -0.31003197
##          form      shall     heard       men     howev      earth
## HPL 0.2096011 -0.6732518 0.8206809 1.3250012 -0.864004 0.07141642
## MWS 0.1804578  0.3638667 0.4538310 0.4608323 -1.268178 0.15506944
##          last        left      part       hour      world       head
## HPL 0.8573859 -0.04352952 0.2377415 -0.2499458 0.25876999 -0.1061357
## MWS 0.7242287 -0.29706192 0.7661523 -0.2100075 0.09082829 -0.3230781
##           room       pass      felt       door      live       dream
## HPL  0.2438951 -0.1611681 0.2715014  0.1705943 0.6405385  0.83524356
## MWS -0.4401720  0.6668573 0.5146259 -0.1952871 0.9271164 -0.07099548
##        strang        dark         call        way        voic   though
## HPL 1.3096092  0.43099795 -0.001652164  0.1088651 -0.09962387 3.522883
## MWS 0.7179855 -0.01780144 -0.400680643 -0.9401002 -0.11207445 2.059099
##         moment        inde     human       turn    return       hope
## HPL -0.2605924  0.07223905 0.1299142  0.1859901 0.0340694 -0.2919771
## MWS  0.0821032 -0.53463790 0.7341542 -0.2008771 0.7957967  1.3833473
##           back   toward        let     within       noth      whole
## HPL  0.1096008 1.277124 -0.9742786 -0.7603882 -0.2986702 -0.8508702
## MWS -0.5754191 1.694123 -0.2711971 -0.9969182 -0.3273708 -0.6435991
##           good      point     becam      away        take       mean
## HPL -0.8795458 -0.7112956 0.3706576 0.8575531 -0.06503354 -1.0335017
## MWS -0.1445913 -1.0602374 0.4867340 0.4784960 -0.15815571 -0.6046746
##       certain       die   father      sound      find     bodi        seen
## HPL  1.129344 0.1090338 1.194651 0.69951981 0.3716554  0.11985 -0.07356314
## MWS -1.351921 0.6834364 2.178543 0.06116842 0.0582771 -1.32697 -1.00748984
##        present      face      among    length     person       knew
## HPL -0.9410518 1.0099214 -0.5135681 -1.069182 -0.8454418  0.8861121
## MWS  0.1450428 0.1308696  0.3353719 -0.764736 -0.7739436 -0.1490288
##        remain   raymond      soon       close        three       sea
## HPL 0.3134632  6.906738 0.5633482  0.05246229  0.005057285 0.6627028
## MWS 0.7168372 11.076235 0.7869682 -0.19087508 -1.198569913 0.6498779
##          water       citi     spirit        new    chang       half
## HPL -0.3837666  0.8105671 -1.4920659 0.57605564 1.174704  0.3545811
## MWS -0.4935173 -0.2141342  0.6934554 0.01164793 1.922898 -0.1628590
##         beauti     street     beyond   although        end      think
## HPL -0.4685792  1.3207557  0.6800065 -1.6446716 -0.4742255  0.3765827
## MWS -0.1571525 -0.2503289 -0.9972657 -0.7657746 -0.1468291 -0.1019803
##          power        soul      alon     observ        air        idea
## HPL -0.4780959 -0.09590163 0.4529120 -1.5547068 -0.7490732 -1.12381902
## MWS  0.8549776  0.27762584 0.5317043 -0.6370111  0.1183782 -0.03491193
##          object       just     anoth        less       wall      kind
## HPL -0.01612585 -0.6541325 0.8924173  0.03232287  0.3386471 0.7891610
## MWS -0.22598142 -1.3613651 1.0247239 -0.22155207 -0.4681757 0.5493638
##        continu       sinc        name      high      happi      gave
## HPL -0.7222088  0.4903320 -0.09891292 0.2877814 -0.4557131 0.1452382
## MWS -0.1572161 -0.5402618  0.14330979 0.1218856  1.4002225 0.2633416
##         window      enter    follow     horror     express     imagin
## HPL  0.3894553 -0.2810855 0.8145551 0.43897747 -0.08003865 0.22986708
## MWS -0.8365273  1.1879456 0.6992749 0.04450259  0.52215562 0.01444484
##        almost      mere     small      wish      black    matter
## HPL 1.4292878 -0.580112 0.4884050 0.6140826  0.6670094 -1.163222
## MWS 0.5381814 -1.313922 0.1607619 0.9273372 -0.9710789 -1.714175
##          young       took     around         full        quit     wonder
## HPL 0.35091769 -0.2178550 0.54838080 -0.003955085 -0.51211916  0.0104690
## MWS 0.05628391 -0.1090416 0.05436416 -0.659625667  0.09428088 -0.2204143
##          side      dead        lay     someth      cours      exist
## HPL 0.1384926 0.3959645 -0.1516873  1.0168633 -0.5598006 -0.1994335
## MWS 0.0536306 0.1384244  0.1066368 -0.1301605 -0.7940545  0.3256657
##           tell      work      went      told     believ     sudden
## HPL  0.3620058 0.5872606 0.0695603 1.5572630 -0.1992236 0.39132833
## MWS -0.4577638 0.5236975 0.2080567 0.2536619  0.1482011 0.02157818
##           give     becom     scene       leav     reason         arm
## HPL -0.1977696 0.1393761 0.2461689 0.01960883 -0.8957183 -0.52986715
## MWS  0.2151664 0.7871082 1.0032416 0.89345926 -0.9064348  0.02595853
##            god      caus       also     manner      perhap      speak
## HPL 0.41571152 0.2793080 -0.6464642 -2.2141128  0.09602863 -0.2675078
## MWS 0.04455414 0.8693414  0.2791693 -0.7044164 -0.19791120 -0.3555598
##        general      wind    possibl        tree       right      state
## HPL -0.6161065 0.1438807  0.4478163 -0.10264907 -0.07298696 -0.7188274
## MWS -1.3701622 0.6992059 -1.0326562  0.07388377 -0.52179702  0.2683152
##           reach         sens     direct      wild        put        deep
## HPL 0.112169616 -0.008102586 -0.5633598 0.2401963 -0.3851525 -0.05150622
## MWS 0.001368847 -0.770182657 -0.9036619 0.3057393 -0.3830607  0.15939980
##          care    question       set    possess       utter     sight
## HPL 0.1291901  0.19885747 0.7065219 -0.3422381  0.00708707 1.2033479
## MWS 0.3153708 -0.05669003 0.2914727  0.2569324 -0.22346312 0.3242558
##           feet         use      sever       evid      doubt    immedi
## HPL -0.5738387  0.74984691 0.02634754 -0.7730532 -0.6794186 -1.328802
## MWS -0.8882131 -0.06306273 0.78746824 -1.4974631 -1.1343181 -0.997358
##            age     began     town     known       fact      least
## HPL  0.2745762 1.4541961 1.454089  0.426662 -0.7216491  0.1769029
## MWS -0.1994432 0.5280776 1.398367 -1.034332 -1.8795486 -0.2900365
##           step   interest     often         mad       moon       lost
## HPL -0.3724678 0.09630872 0.7093904  0.87304640  0.2683780 -0.4788316
## MWS  0.1311770 0.09650610 0.9199753 -0.05698348 -0.3130555  0.7799730
##          repli       home   thousand       morn       fanci     rather
## HPL -2.1942710 -0.1537096 -0.9372408 -0.1664466 -0.08356306 -0.5208451
## MWS -0.1823151 -0.3605622  0.2030441  0.1569298 -0.24665375 -0.5195005
##      mountain        sun      read     visit     second      hill
## HPL 0.8940084 -1.4861705 0.8490446 0.4610521 -0.1334212 1.4889475
## MWS 0.8645928  0.8739729 0.8195299 1.0381436 -1.4699955 0.1162009
##          stone     heaven      rememb    watch      peopl  sometim
## HPL  1.5456067 -0.8530627  0.08042333 1.121059  0.2577938 1.548605
## MWS -0.6084312  0.2307745 -0.43521364 1.183967 -0.5184615 1.258596
##       charact   brought        view       book       larg      affect
## HPL -1.150583 0.1824834  0.04672884 0.40516782 -0.5450343 -0.06623483
## MWS -1.300090 0.1074299 -0.42446175 0.04064702 -0.8632639  1.23462183
##       perdita      alway      taken     secret   countri     sleep
## HPL -1.324141  0.6539962 -0.2818156 0.46255382 0.4314401 0.3708728
## MWS 10.679775 -0.6003435  0.2159734 0.05729402 2.0036952 0.4890235
##            get    perceiv     land     suffer    ancient          case
## HPL  0.3701533 -1.5749740 1.282089 -0.3489746 1.61505081  0.0004804619
## MWS -1.9721628 -0.1485994 1.081708  0.6827955 0.04515496 -1.5608372612
##          entir     regard      stood       sure       spoke    circumst
## HPL -1.3362894 -0.7373667 0.03929624 -0.7146996 -0.10473010 -0.97809196
## MWS -0.5990079 -0.7352640 0.03213757 -1.0221298 -0.03046066 -0.05767761
##          gentl     reflect     discov     proceed      fell     attent
## HPL -0.2467913 -0.04797975 -0.9259138 -1.47475882 0.3171402 -1.6065352
## MWS  1.0389742  0.82441247  0.4222872 -0.03037744 0.3842206 -0.6700781
##          arriv     minut       forc      tear     shadow    period
## HPL -0.4411455 -1.929841  0.4790925 -1.123994  0.2082803 -1.331336
## MWS  1.0634379 -1.699129 -0.1916206  1.339649 -0.6700201 -1.006817
##          move      rest      appar      degre       pain      white
## HPL 0.3173116 0.6306424 -0.2549077 -0.7441478 -1.2104179  0.9682860
## MWS 0.0800344 0.8163082 -0.7146816  0.2627532  0.7637222 -0.1265203
##        mother      floor    expect      excit     relat       true
## HPL 0.8885127  0.9226296 0.2461755 -0.7954789 0.3771277 -0.6731686
## MWS 1.3004799 -0.9579621 0.5490836 -0.4241952 1.0149429 -0.4268049
##        terribl    account      talk     longer     letter    famili
## HPL  0.8139453 -0.1327228 1.4166108 -0.5810062 -0.9350487 0.7386304
## MWS -0.4067545 -0.2235025 0.9999955 -0.1084941 -0.4500920 0.7610563
##        passion   peculiar        effect    ground    other    suppos
## HPL -2.8789850  0.1122965 -0.0000144838 0.2331737 1.523782 -1.680288
## MWS  0.9010292 -1.3583118 -0.0980044100 0.5030340 1.754273 -1.053643
##        better      west      month     desir       truth     receiv
## HPL 0.1853768 2.9541930 -0.1797906 0.1210762 -0.95560920 0.01984439
## MWS 0.3331220 0.2055228  0.8015554 1.0848527 -0.01084531 0.69800654
##           grew    listen      tri  approach   despair       done     walk
## HPL  0.1125965 0.6740614 3.284406 0.1529271 -1.182099 -0.8070523 0.910660
## MWS -1.0178047 1.0685488 1.768460 0.6366253  1.312783 -1.0205017 1.264679
##           vast      evil     memori       late       fill         line
## HPL -0.1894932 0.4823457  0.5757542 -0.2366506 -0.1002902 -0.008361823
## MWS -0.7446772 0.7200443 -0.6839914 -0.4154363  1.2307967 -0.880678549
##          posit    beneath    subject       escap    adrian    alreadi
## HPL -0.8877981 -0.1958092 -0.4213285  0.42843444 -1.465258 -1.4314767
## MWS -2.7492379  0.2554528 -0.2435097 -0.09337106  8.877616  0.3143522
##           ask      clear      usual    hideous       fire      poor
## HPL 0.7940028  0.4976363 -0.6722979  0.9479556 0.01348264 0.8317882
## MWS 0.7978548 -0.2843757 -0.2801564 -1.9119643 0.37892786 1.4758537
##         attend    counten     breath    impress      hear    suffici
## HPL -0.7337794 -1.7959002 -0.9291009  0.3318729 0.8116621 -2.0474993
## MWS  0.4051958  0.8738952 -1.3875777 -0.5288110 0.8528495 -0.5335557
##          dear     past      purpos        low     dare       cri     till
## HPL -1.609684 1.340865 -0.97706754  0.4698974 1.088141 0.4267356 2.833866
## MWS  1.415510 1.030456 -0.05036787 -0.8179133 1.144108 0.5408059 1.981657
##        fellow      anim   chamber      event      short     figur
## HPL 0.2636688 0.1272983 -1.169447 -0.3663693 -0.2782079 -0.742283
## MWS 0.5464234 0.2385033 -1.122878  0.3831414 -0.1135372 -1.834829
##       creatur      final      star      wood     dread      hard
## HPL 0.7038123  0.3219756 0.6698859 0.5187576 0.1383283 0.6818811
## MWS 1.4567356 -1.7503251 0.3615295 1.3311690 0.5585247 1.4315458
##            busi     space     cold     either       ill     `next`
## HPL -0.57148403 1.6094161 1.199020 -0.5749888 0.3759970  0.7932166
## MWS -0.04428204 0.4648739 1.677301 -1.2954711 0.8091589 -0.3121052
##        youth      none    studi      given   delight       five     child
## HPL 1.037516 0.3694499 1.020938 -0.3389876 -1.206512 -0.5335428 0.3479535
## MWS 1.120719 0.2755419 1.171242 -0.1488712  1.069165 -0.9900519 1.1227548
##        attempt    unknown     murder     togeth     order      terror
## HPL -0.9973991  0.8497529 -1.2075509 -0.5841953 0.5373530 -0.02590554
## MWS -0.8363839 -0.5513333  0.1653636  0.3161136 0.3092768 -1.40477473
##     companion    instant       spot      smile      river      sky
## HPL 0.1982159 -1.2562162 -0.3406070 -0.4922511 -0.5615726 1.089963
## MWS 0.9161791 -0.8655652  0.6243694  0.4349721 -0.3364643 1.046874
##         motion     origin      paper       best       fall      want
## HPL  0.2249229 -0.1669384  0.3935401 -0.5036965 -0.1643942 1.0681045
## MWS -0.4689432 -1.3620864 -0.2427141 -0.1318715  0.2384420 0.9137371
##           led
## HPL 0.6597864
## MWS 0.9534994
## Std. Errors:
##     (Intercept)        one      upon       now      will      time
## HPL  0.04347801 0.08895336 0.1226958 0.1079919 0.1727309 0.1289752
## MWS  0.04280188 0.09010291 0.1187788 0.1054332 0.1057692 0.1270855
##          even       man       day       eye     thing       yet      said
## HPL 0.1322215 0.1301098 0.1362692 0.1435903 0.1252612 0.1519076 0.1448455
## MWS 0.1261912 0.1319441 0.1254368 0.1356817 0.2139607 0.1347225 0.1355507
##          seem      like     might       old     first     night      must
## HPL 0.1358842 0.1376152 0.1505361 0.1504316 0.1513716 0.1471832 0.1495566
## MWS 0.1627163 0.1613118 0.1443352 0.2108473 0.1466235 0.1574988 0.1403239
##      thought      look     found     never      life     great      made
## HPL 0.156508 0.1574394 0.1456482 0.1436096 0.1877579 0.1422788 0.1596337
## MWS 0.149921 0.1603747 0.1585931 0.1595474 0.1656205 0.1718363 0.1510207
##          long      love     everi     littl     still       say     place
## HPL 0.1523636 0.2877107 0.1762552 0.1508735 0.1566622 0.1661121 0.1656796
## MWS 0.1664314 0.1692010 0.1425770 0.1619226 0.1583802 0.2090354 0.1749273
##           saw      mani      well    appear      hand      hous      came
## HPL 0.1653155 0.1596949 0.1624583 0.2021352 0.1706379 0.1628086 0.1717530
## MWS 0.1900313 0.1739111 0.1753911 0.1538881 0.1595616 0.1919001 0.1885887
##          much       see      year      fear       may     natur       two
## HPL 0.1543541 0.1736933 0.1737153 0.1949145 0.2022461 0.1966978 0.1683057
## MWS 0.1929701 0.1936198 0.1785572 0.1873831 0.1623035 0.1702931 0.1998374
##          word      come       can     death     heart      mind      feel
## HPL 0.2553243 0.1834933 0.1957871 0.1996895 0.3524163 0.1947685 0.2641336
## MWS 0.1695109 0.1958721 0.1627352 0.1791986 0.1671444 0.1809335 0.1754081
##          know      ever     light      thus      near     whose      make
## HPL 0.1655467 0.1985341 0.1842129 0.3309437 0.1944548 0.1867658 0.1916569
## MWS 0.1887437 0.1866209 0.1982558 0.1695663 0.1902422 0.2042636 0.2043752
##        friend   without       far      open      form     shall     heard
## HPL 0.2253639 0.1808312 0.1796225 0.1715548 0.1979849 0.2236583 0.1821378
## MWS 0.1806318 0.1919262 0.1903245 0.2129033 0.1957393 0.1673812 0.2039585
##           men     howev     earth      last      left      part      hour
## HPL 0.2187685 0.2078769 0.2089845 0.2007967 0.1798080 0.1983000 0.1997898
## MWS 0.2397644 0.2164029 0.2022268 0.2020679 0.1962952 0.1821261 0.1938762
##         world      head      room      pass      felt      door      live
## HPL 0.2021445 0.1814914 0.1813809 0.2135439 0.2067695 0.1896509 0.2159018
## MWS 0.2050159 0.2078845 0.2260336 0.1827435 0.1982195 0.2351617 0.2033611
##         dream    strang      dark      call       way      voic    though
## HPL 0.2076134 0.2218328 0.1999766 0.1954435 0.1924081 0.2173221 0.4020666
## MWS 0.2417436 0.2432587 0.2298038 0.2067110 0.2416431 0.2117281 0.4287814
##        moment      inde     human      turn    return      hope      back
## HPL 0.2174737 0.1994910 0.2231213 0.2132197 0.2374778 0.3222585 0.1938875
## MWS 0.1935804 0.2204381 0.2014919 0.2222703 0.2009601 0.2247801 0.2599978
##        toward       let    within      noth     whole      good     point
## HPL 0.2590906 0.2490110 0.2168900 0.2007712 0.2340186 0.2443912 0.2234647
## MWS 0.2497348 0.1727581 0.2360819 0.2185736 0.2241095 0.2030713 0.2504574
##         becam      away      take      mean   certain       die    father
## HPL 0.2175011 0.2223121 0.2185384 0.2587004 0.2133700 0.2450334 0.3489673
## MWS 0.2072798 0.2364243 0.2069655 0.2145502 0.3730971 0.2063040 0.3015671
##         sound      find      bodi      seen   present      face     among
## HPL 0.1995711 0.2251692 0.1924481 0.1993202 0.2688553 0.2269417 0.2409137
## MWS 0.2365239 0.2280221 0.2993335 0.2604141 0.1989643 0.2617907 0.2064902
##        length    person      knew    remain  raymond      soon     close
## HPL 0.2460720 0.2377020 0.2072597 0.2396220 15.19228 0.2422976 0.2105110
## MWS 0.2224213 0.2380263 0.2689763 0.2125934 15.17481 0.2233256 0.2405814
##         three       sea     water      citi    spirit       new     chang
## HPL 0.1998812 0.2323773 0.2290755 0.2243734 0.3917374 0.2236477 0.2792093
## MWS 0.3115542 0.2359226 0.2453829 0.2637741 0.2165968 0.2508697 0.2581277
##          half    beauti    street   beyond  although       end     think
## HPL 0.2017411 0.2606622 0.2300385 0.210065 0.3371831 0.2441597 0.2144933
## MWS 0.2254320 0.2184780 0.3363138 0.312999 0.2424540 0.2275864 0.2351065
##         power      soul      alon    observ       air      idea    object
## HPL 0.2944896 0.2569111 0.2378772 0.3157754 0.2664946 0.3086191 0.2269136
## MWS 0.2172934 0.2144994 0.2290060 0.2375295 0.2220928 0.2167734 0.2258748
##          just     anoth      less      wall      kind   continu      sinc
## HPL 0.2076999 0.2606883 0.2221822 0.2100931 0.2389226 0.2662268 0.2237478
## MWS 0.2963536 0.2554840 0.2343383 0.2976529 0.2343380 0.2133851 0.2728339
##          name      high     happi      gave    window     enter    follow
## HPL 0.2530096 0.2365445 0.4040968 0.2526482 0.2269949 0.2823359 0.2503172
## MWS 0.2297608 0.2310594 0.2450192 0.2316719 0.3421747 0.2185926 0.2536151
##        horror   express    imagin    almost      mere     small      wish
## HPL 0.2238457 0.2997476 0.2521915 0.2642755 0.2538001 0.2350065 0.2667779
## MWS 0.2509089 0.2450439 0.2449013 0.2860476 0.2876228 0.2736684 0.2468518
##         black    matter     young      took    around      full      quit
## HPL 0.2168660 0.2694658 0.2395420 0.2526196 0.2309350 0.2394983 0.2772579
## MWS 0.3542973 0.3438529 0.2604514 0.2415861 0.2622884 0.2742344 0.2394517
##        wonder      side      dead       lay    someth     cours     exist
## HPL 0.2558813 0.2274641 0.2504499 0.2484624 0.2247089 0.2628364 0.2720783
## MWS 0.2631927 0.2673379 0.2528336 0.2523118 0.2902002 0.2740917 0.2379652
##          tell      work      went      told    believ    sudden      give
## HPL 0.2369266 0.2413111 0.2372299 0.2674060 0.2489387 0.2422841 0.2815722
## MWS 0.2826676 0.2430061 0.2324169 0.3323418 0.2413835 0.2613046 0.2413741
##         becom     scene      leav    reason       arm       god      caus
## HPL 0.2907462 0.3037672 0.2824968 0.2632055 0.2631889 0.2472735 0.2780095
## MWS 0.2508124 0.2587994 0.2381121 0.2663296 0.2394994 0.2509788 0.2521331
##          also    manner    perhap     speak   general      wind   possibl
## HPL 0.2917458 0.4069391 0.2334034 0.2643088 0.2518508 0.2820329 0.2365931
## MWS 0.2321593 0.2466122 0.2566501 0.2514581 0.3180925 0.2621862 0.3280712
##          tree     right     state     reach      sens    direct      wild
## HPL 0.2419262 0.2452483 0.2900765 0.2458656 0.2399767 0.2626257 0.2674641
## MWS 0.2328793 0.2858555 0.2414047 0.2819423 0.2855388 0.3009016 0.2693141
##           put      deep      care  question       set   possess     utter
## HPL 0.2625795 0.2728066 0.2733781 0.2601362 0.2573335 0.2884919 0.2755218
## MWS 0.2691238 0.2685264 0.2632439 0.2520933 0.2906732 0.2405185 0.2671267
##         sight      feet       use     sever      evid     doubt    immedi
## HPL 0.2621083 0.2635096 0.2494567 0.2844662 0.2709303 0.2891241 0.3490540
## MWS 0.2971248 0.3359998 0.2909943 0.2500513 0.3220162 0.3147344 0.2865018
##           age     began      town     known      fact     least     step
## HPL 0.2436768 0.2918019 0.3197795 0.2487740 0.2823448 0.2584542 0.290921
## MWS 0.2778622 0.3258793 0.3159653 0.3518601 0.4258727 0.2964588 0.271407
##      interest     often       mad      moon      lost     repli      home
## HPL 0.2922188 0.3218335 0.2832078 0.2489794 0.3256443 0.4866690 0.2698739
## MWS 0.2858130 0.2979872 0.3324676 0.3109963 0.2671907 0.2507681 0.2835546
##      thousand      morn     fanci    rather  mountain       sun      read
## HPL 0.3902115 0.2805144 0.2835175 0.2660703 0.3024512 0.4027862 0.2952346
## MWS 0.2442105 0.2555264 0.3063516 0.2775808 0.3084722 0.2664746 0.2874299
##         visit    second      hill     stone    heaven    rememb     watch
## HPL 0.3250629 0.2454598 0.3080484 0.2988458 0.3352422 0.2626317 0.3006567
## MWS 0.2866541 0.3519044 0.3695914 0.5039573 0.2550328 0.2837529 0.2989842
##         peopl   sometim   charact   brought      view      book      larg
## HPL 0.2653211 0.3667626 0.3164126 0.2677256 0.2765012 0.2739805 0.2668594
## MWS 0.3210877 0.3571694 0.3154434 0.2752773 0.3144105 0.3191491 0.3330494
##        affect  perdita     alway     taken    secret   countri     sleep
## HPL 0.4478972 49.64799 0.2680880 0.2944089 0.2702187 0.3401416 0.2864222
## MWS 0.3066841 15.27691 0.3260181 0.2727998 0.3078306 0.2913971 0.2843143
##           get   perceiv      land    suffer   ancient      case     entir
## HPL 0.2530289 0.4375199 0.3265479 0.3765420 0.3217041 0.2701866 0.3807066
## MWS 0.5411373 0.2845191 0.3422383 0.2751984 0.4198403 0.4288339 0.2819162
##        regard     stood      sure     spoke  circumst     gentl   reflect
## HPL 0.3439060 0.2862194 0.2513222 0.3213001 0.3746145 0.3979570 0.3564313
## MWS 0.3017984 0.2904766 0.3016954 0.2842698 0.2894518 0.2920714 0.2853477
##        discov   proceed      fell    attent     arriv     minut      forc
## HPL 0.3463049 0.4039452 0.3098120 0.3978628 0.3642553 0.4132961 0.2843212
## MWS 0.2545922 0.2572715 0.2850902 0.2972019 0.2456290 0.3796597 0.3192826
##          tear    shadow    period      move      rest     appar     degre
## HPL 0.5941206 0.2798835 0.3445128 0.2651095 0.3209778 0.2735570 0.3643210
## MWS 0.3286354 0.3357031 0.3005516 0.2863444 0.3044127 0.3041284 0.2864957
##          pain     white    mother     floor    expect     excit     relat
## HPL 0.4327013 0.2790448 0.3936947 0.2584485 0.3121575 0.3320807 0.3268129
## MWS 0.2573709 0.3467556 0.3506591 0.4792446 0.2904756 0.2861743 0.2949068
##          true   terribl   account      talk    longer    letter    famili
## HPL 0.3386763 0.2896511 0.3231945 0.3075655 0.3578201 0.3843125 0.3086888
## MWS 0.2770083 0.3857649 0.3027158 0.3155546 0.3029336 0.3036273 0.3041059
##       passion  peculiar    effect    ground     other    suppos    better
## HPL 1.0696483 0.2792654 0.3002151 0.3195528 0.3820827 0.4562231 0.3013698
## MWS 0.2943032 0.3807536 0.2919244 0.3204750 0.3637185 0.3841793 0.2972543
##          west     month     desir     truth    receiv      grew    listen
## HPL 0.3777245 0.3535168 0.3873883 0.3672437 0.3652824 0.2770041 0.3396231
## MWS 0.5316733 0.2790656 0.2919653 0.2831360 0.3077479 0.3550346 0.3153912
##           tri  approach   despair      done      walk      vast      evil
## HPL 0.5304824 0.3247982 0.6011709 0.2940288 0.3645054 0.2977065 0.3301212
## MWS 0.5713321 0.2860258 0.3020996 0.3187712 0.3684199 0.3675172 0.2937783
##        memori      late      fill      line     posit   beneath   subject
## HPL 0.3079064 0.3064624 0.4406264 0.2828886 0.3452203 0.3158088 0.3777960
## MWS 0.3731046 0.3139449 0.3555449 0.3800402 0.6275683 0.3351625 0.3137865
##         escap   adrian   alreadi       ask     clear     usual   hideous
## HPL 0.2856375 20.13541 0.4259040 0.3201343 0.2779753 0.3220448 0.2877327
## MWS 0.2988481  8.34721 0.2788231 0.3099150 0.3577910 0.3011289 0.5605165
##          fire      poor    attend   counten    breath   impress      hear
## HPL 0.3231077 0.3517843 0.3729445 0.6395099 0.3418551 0.2777533 0.3175466
## MWS 0.3083570 0.3112843 0.2750567 0.2992271 0.3385736 0.3521397 0.3214880
##       suffici      dear      past    purpos       low      dare       cri
## HPL 0.5516409 0.7734607 0.3501606 0.3917928 0.2972859 0.3697516 0.3325559
## MWS 0.3240643 0.3165662 0.3673496 0.2800435 0.4160026 0.3591644 0.3129643
##          till    fellow      anim   chamber     event     short     figur
## HPL 0.5572014 0.3311995 0.3266877 0.3271600 0.3430434 0.3338472 0.3015080
## MWS 0.5597706 0.3171120 0.3006762 0.3656637 0.2952402 0.3147527 0.4978633
##       creatur     final      star      wood     dread      hard      busi
## HPL 0.4125686 0.2850478 0.3224423 0.3792680 0.3377706 0.3739193 0.3150507
## MWS 0.3639274 0.4974412 0.3257494 0.3550286 0.3000471 0.3289955 0.2885643
##         space      cold    either       ill    `next`     youth      none
## HPL 0.3574864 0.4161970 0.3406185 0.3610789 0.3149796 0.4143747 0.3253118
## MWS 0.4650457 0.3933513 0.4016952 0.3206572 0.3661955 0.4202231 0.3278768
##         studi     given   delight      five     child   attempt   unknown
## HPL 0.3463570 0.3054849 0.5065025 0.3425079 0.3674458 0.3805589 0.3069884
## MWS 0.3380743 0.2999451 0.3148131 0.4001457 0.3104979 0.3163136 0.4031078
##        murder    togeth     order    terror companion   instant      spot
## HPL 0.4411708 0.3812219 0.2904356 0.2959606 0.3781021 0.3785187 0.3483151
## MWS 0.2649089 0.3112236 0.3180901 0.4060880 0.3120298 0.3353669 0.3337312
##         smile     river       sky    motion    origin     paper      best
## HPL 0.4014302 0.3397778 0.3922403 0.3054318 0.3308058 0.2961408 0.3518996
## MWS 0.2940460 0.3573274 0.4090401 0.3715193 0.4466205 0.3626197 0.2949877
##          fall      want       led
## HPL 0.3368749 0.3316937 0.3608234
## MWS 0.3402750 0.3330111 0.3483958
## Residual Deviance: 21462.56 
## AIC: 23066.56
preds_lc <- predict(model_lc, type = "class", newdata = test)
preds_lcp <- predict(model_lc, type = "probs", newdata = test)
sum(preds_lc == tiny$Author[14580:19579])/5000
## [1] 0.653
table(preds_lc, tiny$Author[14580:19579])
## preds_lc  EAP  HPL  MWS
##      EAP 1518  437  447
##      HPL  245  835  186
##      MWS  262  158  912
MultiLogLoss(preds_lcp, tiny$Author[14580:19579])
## [1] 0.7857041

Okay, so we ran this through a logistic classifier and got our accuracy up to 65.3%. But our multiclass log loss increased, to about 0.786. This seems to be because the NaiveBayesClassifier is extremely confident about its predictions. Not bad for our second shot. Admittedly, it takes more and more effort to move up from here, but this looks like a reasonable track.
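Accuracy and log loss can move in different directions because log loss looks at the predicted probabilities, not just the predicted class. Here is a minimal sketch of the calculation on made-up numbers. `multi_log_loss` is a hypothetical helper written for illustration, not the `MultiLogLoss` function used above, and the probabilities are invented.

```r
# Hypothetical helper: mean negative log of the probability given to the true class.
multi_log_loss <- function(probs, actual, eps = 1e-15) {
  # probs: matrix, one row per observation, one named column per class
  # actual: vector of true class labels
  probs[] <- pmin(pmax(probs, eps), 1 - eps)  # clip so log(0) never occurs
  idx <- cbind(seq_along(actual), match(as.character(actual), colnames(probs)))
  -mean(log(probs[idx]))
}

classes <- c("EAP", "HPL", "MWS")
actual  <- c("EAP", "HPL")  # both models below predict EAP twice: one hit, one miss

# Made-up probabilities: same predicted classes, different confidence
hedged <- matrix(c(0.60, 0.20, 0.20,
                   0.60, 0.20, 0.20),
                 nrow = 2, byrow = TRUE, dimnames = list(NULL, classes))
confident <- matrix(c(0.98, 0.01, 0.01,
                      0.98, 0.01, 0.01),
                    nrow = 2, byrow = TRUE, dimnames = list(NULL, classes))

loss_hedged    <- multi_log_loss(hedged, actual)
loss_confident <- multi_log_loss(confident, actual)
```

Both toy models have 50% accuracy, but their log losses differ: confidence pays off on the correct row and costs a lot on the wrong one.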

My suggestions: use more of the words. Use all of the training data. Maybe clean the data. If you are going to use a logistic classifier or a neural net, normalizing the data is a good idea. Add an unknown word marker to deal with new words.
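For the normalizing suggestion, one possible approach is to standardize each word column using the mean and standard deviation of the training data, then apply those same values to the test data. This is only a sketch: `train_x` and `test_x` are toy matrices made up here, standing in for the real word-count features.

```r
# Toy word-count matrices (invented values, not the real data)
words   <- c("one", "upon", "now")
train_x <- matrix(c(2, 0, 1,
                    4, 1, 0,
                    6, 2, 2),
                  nrow = 3, byrow = TRUE, dimnames = list(NULL, words))
test_x  <- matrix(c(3, 1, 1), nrow = 1, dimnames = list(NULL, words))

# Learn the centering and scaling values from the training data only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)
sigma[sigma == 0] <- 1  # guard: a constant column would otherwise divide by zero

# Apply the same transformation to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)
```

Reusing the training statistics keeps the test features on the same scale the model was fitted on.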

Of course, you still have to transform the testing data into the same format in order to submit.
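One way to do that, sketched below under some assumptions: keep the training vocabulary fixed, count each test document against it, and send every word outside the vocabulary to a single unknown-word column. `count_terms`, the `<unk>` label, and the token lists are all hypothetical. The sketch also assumes the text has already been tokenized and stemmed the same way as the training text, and that the model was trained with a matching `<unk>` column.

```r
# Hypothetical helper: count tokens per document against a fixed vocabulary.
count_terms <- function(docs_tokens, vocab, unk = "<unk>") {
  cols <- c(vocab, unk)
  out <- matrix(0, nrow = length(docs_tokens), ncol = length(cols),
                dimnames = list(NULL, cols))
  for (i in seq_along(docs_tokens)) {
    toks <- docs_tokens[[i]]
    if (length(toks) == 0) next          # empty document: leave the row at zero
    toks[!(toks %in% vocab)] <- unk      # unseen words share one marker
    counts <- table(toks)
    out[i, names(counts)] <- as.numeric(counts)
  }
  out
}

vocab <- c("one", "upon", "now")         # stands in for the training word list
test_tokens <- list(c("upon", "one", "one", "eldritch"),
                    c("now", "raven", "raven"))

test_dtm <- count_terms(test_tokens, vocab)
```

The result has exactly the training columns, in the training order, plus the unknown-word column, so it can be passed as `newdata` after conversion to a data frame.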

If you win, throw a party for the class :)
