From Problem Set 3
For the mode, one sets the derivative to 0 and finds that the extreme point is at \(\theta=(\alpha-1)/(\alpha+\beta-2)\) (provided \(\alpha+\beta\neq 2\)). If \(\alpha\) or \(\beta\) is less than 1, then this distribution has no maximum on \((0,1)\) because the density is unbounded. Otherwise, this is indeed the maximum, by a tedious second-derivative calculation.
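For concreteness, here is a sketch of the calculation, assuming (as the stated mode suggests) that the density in question is proportional to the Beta(\(\alpha,\beta\)) kernel \(\theta^{\alpha-1}(1-\theta)^{\beta-1}\):
\[
\frac{d}{d\theta}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}
  = \theta^{\alpha-2}(1-\theta)^{\beta-2}\bigl[(\alpha-1)(1-\theta) - (\beta-1)\theta\bigr],
\]
which vanishes on \((0,1)\) exactly when \((\alpha-1)(1-\theta) = (\beta-1)\theta\), i.e. at \(\theta=(\alpha-1)/(\alpha+\beta-2)\).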
Let’s load our datasets:
library(tidyverse)
tr_tbl = read_csv("train.csv")
test_tbl = read_csv("test.csv")
all_text = c(tr_tbl$text, test_tbl$text)
Let’s make a dfm:
library(quanteda)
corp = corpus(all_text)
all_dfm = dfm(corp, remove = stopwords("english"), stem = T, remove_punct = T, remove_numbers = T)
all_dfm_trm = dfm_trim(all_dfm, min_count = 5, max_count = 1000, verbose = T)
Transform the counts into binary presence/absence indicators.
all_dfm_trm_b = tf(all_dfm_trm, scheme="boolean")
Build our train and validation data frames.
train_df = as.tibble(as.data.frame(all_dfm_trm_b[1:length(tr_tbl$author)]))
train_df$author = tr_tbl$author
library(caret)
train_ix = createDataPartition(train_df$author, p = 0.8, list = F, times = 1)
train = slice(train_df,train_ix)
valid = slice(train_df, -train_ix)
Construct our model. For numerical stability reasons I will use the logarithms of the smoothed probabilities as the weights.
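Concretely, for author \(a\) and feature \(j\), the smoothed (add-one / Laplace) estimate computed below is
\[
\hat{p}_{a,j} = \frac{n_{a,j} + 1}{n_a + 2},
\]
where \(n_{a,j}\) is the number of training documents by author \(a\) that contain feature \(j\) and \(n_a\) is the total number of training documents by author \(a\); the model works with \(\log \hat{p}_{a,j}\).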
I apologize for the terrible code here. For whatever reason, using the “length” command instead of the “n” command to normalize the data is incredibly slow once the number of features gets large (the code would not complete on my laptop), so the simple code had to be replaced with a much more convoluted block that does the same thing but is somehow fast enough to execute.
# This more readable code runs 35 times slower than the mess that follows it.
#smoother = function(x) { return( log( (sum(x)+1)/(length(x)+2)) )}
#system.time({ber_model = train %>% group_by(author) %>% summarize_all(.funs = smoother)})
# I need to recover the size of the groups
# Turn the means into sums
# Add one to all entries for smoothing
# Divide by the size of each group + 2 (for smoothing)
system.time({ber_model <- train %>% group_by(author) %>% mutate(sz = n()) %>% summarize_all(.funs = mean)
ber_mode_mat = diag(ber_model$sz) %*% as.matrix(select(ber_model,-author,-sz))
ber_mode_mat2 = ber_mode_mat+matrix(data = rep.int(1,prod(dim(ber_mode_mat))),nrow = nrow(ber_mode_mat))
ber_mode_mat3 = diag(map(ber_model$sz, ~1/(.+2))) %*% ber_mode_mat2 })
## user system elapsed
## 9.215 0.301 9.634
Now let us run the model:
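Spelled out, for a validation document with binary feature vector \(x\) the code below computes, for each author \(a\), the score
\[
s_a(x) = \sum_j x_j \log \hat{p}_{a,j},
\]
and then normalizes the three scores with a log-sum-exp, \(\log q_a(x) = s_a(x) - \log\sum_{a'} e^{s_{a'}(x)}\), so that each column of log_probs exponentiates to weights summing to one; the predicted author is the one with the largest score. (No class prior term is added, so the scores depend only on the word indicators.)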
#log_probs = as.matrix(select(ber_model,-author)) %*% t(as.matrix(select(valid,-author)))
log_probs = log(ber_mode_mat3) %*% t(as.matrix(select(valid,-author)))
log_probs = apply(log_probs, c(1,2), function(n) max(n,-500))
log_part = apply(log_probs,2,function(x){log(sum(exp(x)))})
log_part = matrix(rep(log_part,3),nrow = 3, byrow = TRUE)
log_probs = log_probs - log_part
preds = apply(log_probs, 2, function(x){return(ber_model$author[which.max(x)])})
Calculate the accuracy of our model:
mean(preds == valid$author)
## [1] 0.7897829
Not too bad.
Let’s calculate the average negative log loss; for this we will use a so-called one-hot encoding matrix of the validation labels.
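In symbols, with \(y_{a,i}\) the one-hot indicator that validation document \(i\) was written by author \(a\) and \(\log q_{a,i}\) the normalized log weights from above, the quantity computed below is
\[
-\frac{1}{N}\sum_{i=1}^{N}\sum_{a} y_{a,i}\,\log q_{a,i},
\]
where \(N\) is the number of validation documents: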
val_labels = slice(train_df,-train_ix) %>%
select(author)
val_matrix = t(model.matrix(~author -1, data = val_labels))
-sum(val_matrix*log_probs)/length(val_labels$author)
## [1] 0.6016365
That is just barely good enough to hit the 68th percentile in this Kaggle competition as of the time of this writing. Not great, but it is still nice to see this all work explicitly.