Sid Rajaram
April 6, 2017
\[ y = \beta_{0} + \beta_{1}x_{1} + \ldots + \beta_{m}x_{m} + \epsilon, \]
\[ \textrm{where }\epsilon \sim \mathcal{N}(0,\,\sigma^{2}). \]
# Simulate data for ordinary least squares: n samples, m uniform predictors
n <- 200
m <- 3
intercept <- 5
true_coeffs <- c(-6, 3, -2)
mat <- matrix(runif(n * m, 0, 1), nrow = n, ncol = m,
              dimnames = list(NULL, c("X1", "X2", "X3")))
x <- data.frame(mat)
# Noise-free linear combination of the predictors, then Gaussian noise added for the OLS target
x$cleantarget <- as.matrix(x) %*% true_coeffs + intercept
x$olstarget <- x$cleantarget + rnorm(n, mean = 0, sd = 0.1)
# Fit the linear model and inspect the coefficient estimates
mdl <- lm(olstarget ~ X1 + X2 + X3, data = x)
summary(mdl)
Call:
lm(formula = olstarget ~ X1 + X2 + X3, data = x)
Residuals:
Min 1Q Median 3Q Max
-0.261629 -0.056514 0.003524 0.066996 0.232048
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.98747 0.02092 238.35 <2e-16 ***
X1 -5.99994 0.02401 -249.84 <2e-16 ***
X2 2.98252 0.02247 132.76 <2e-16 ***
X3 -1.94995 0.02364 -82.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09466 on 196 degrees of freedom
Multiple R-squared: 0.9977, Adjusted R-squared: 0.9977
F-statistic: 2.834e+04 on 3 and 196 DF, p-value: < 2.2e-16
# Plot fitted values against the observed target
plot(x$olstarget, fitted(mdl), main = "Ordinary Least Squares",
     xlab = "Target variable", ylab = "Predicted value", pch = 19)
Let \( y \in \{0, 1\} \) be a random variable (the target) and \( X = x_1, \ldots, x_m \) be a set of independent variables (the predictors). Then
\[ y \sim \mathcal{Ber} (p(X)) \]
where
\[ p(X) = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m) \]
and \( f \) is the logistic function
\[ f(t) = \frac{1}{1 + e^{-t}} \]
# Success probability via the logistic function of the clean linear score
x$logistic <- 1 / (1 + exp(-x$cleantarget))
# Draw Bernoulli targets with those probabilities
x$logistictarget <- rbinom(n, 1, x$logistic)
# Fit a logistic regression (binomial GLM with logit link)
logisticmdl <- glm(logistictarget ~ X1 + X2 + X3, data = x, family = "binomial")
# Wald (normal-approximation) confidence intervals for the coefficients
confint.default(logisticmdl)
2.5 % 97.5 %
(Intercept) 2.761958 6.562481
X1 -9.009325 -4.117794
X2 1.072077 4.435680
X3 -2.043255 1.203091
# Plot the binary targets and the fitted probabilities against the clean linear score
plot(x$cleantarget, x$logistictarget, main = "Logistic Fit",
     xlab = "w^T.x", ylab = "Target and Predicted variable", pch = 19)
points(x$cleantarget, fitted(logisticmdl), col = 2, pch = 19)
legend(5, 0.5, legend = c("Target variable", "Predicted value"),
       col = c("black", "red"), pch = 19, cex = 0.8)
If the feature space (or the number of degrees of freedom in your model) is very large relative to the number of samples, the model will overfit the training data (a small illustration follows this list)
In regression models, we can correct for this by penalizing large coefficients
Other techniques include dimensionality reduction and clustering
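As a small illustration (not part of the original code), fitting OLS with nearly as many random predictors as samples produces a high training \( R^2 \) even when the target is pure noise; the sample size, seed, and variable names below are chosen just for this sketch.

# Illustration of overfitting: many random predictors, few samples (assumed example)
set.seed(1)
n_small <- 30
noise_df <- data.frame(matrix(runif(n_small * 25), nrow = n_small))
noise_df$y <- rnorm(n_small)       # target is pure noise, unrelated to the predictors
overfit_mdl <- lm(y ~ ., data = noise_df)
summary(overfit_mdl)$r.squared     # training R^2 is large despite there being no signal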
Simply add a penalty term to the loss function:
\[ \underset{\beta}{\operatorname{argmin}}\, {\cal L}(X) + \lambda \sum_{i=1}^{m} \beta_i^2 \]
Convex, smooth loss function
Leads to the max-margin property (maximally separating hyperplane): http://cs229.stanford.edu/notes/cs229-notes3.pdf
Closed form solution (for linear regression); a short sketch follows this list
Pushes coefficients of useless features to be very close to (but not exactly equal to) zero
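As a sketch of the closed form on the simulated data above: the ridge estimate is \( \hat{\beta} = (X^\top X + \lambda P)^{-1} X^\top y \), where \( P \) is used here to leave the intercept unpenalized. The variable names (Xmat, beta_ridge) and the choice \( \lambda = 1 \) are just for this example and are not from the original code.

# Ridge closed form on the simulated data (illustrative sketch)
Xmat <- cbind(Intercept = 1, as.matrix(x[, c("X1", "X2", "X3")]))
y <- x$olstarget
lambda <- 1
P <- diag(c(0, 1, 1, 1))           # zero entry: do not penalize the intercept
beta_ridge <- solve(t(Xmat) %*% Xmat + lambda * P, t(Xmat) %*% y)
beta_ridge                         # slope estimates are shrunk towards zero relative to lm()

With \( \lambda = 0 \) this reduces to the ordinary least-squares estimate; increasing \( \lambda \) shrinks the slopes towards zero but never makes them exactly zero.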
\[ \underset{\beta}{\operatorname{argmin}}\, {\cal L}(X) + \lambda \sum_{i=1}^{m} |\beta_i| \]
Convex loss function, but not smooth
Drives useless coefficients to be exactly zero
No closed form solution (see the sketch after this list)
Sparse solution is more interpretable
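Since there is no closed form, the lasso is usually fit iteratively. A minimal sketch, assuming the glmnet package is installed (it is not used in the original code); in glmnet, alpha = 1 selects the pure L1 penalty, and the value of lambda below is arbitrary:

# Lasso via glmnet (assumed package, not part of the original code)
library(glmnet)
Xmat <- as.matrix(x[, c("X1", "X2", "X3")])
lassomdl <- glmnet(Xmat, drop(x$olstarget), alpha = 1, lambda = 0.05)
coef(lassomdl)                     # with a large enough lambda, weak coefficients become exactly zero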
\[ \underset{\beta}{\operatorname{argmin}}\, {\cal L}(X) + (1 - \alpha)\,\lambda \sum_{i=1}^{m} \beta_i^2 + \alpha\,\lambda \sum_{i=1}^{m} |\beta_i| \]
Simply add both previous forms of penalty
Gives you the performance and max-margin property of Ridge, with the sparseness of Lasso
Has an extra hyperparameter to optimize, \( \alpha \), in addition to \( \lambda \) (a sketch follows below)
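A minimal elastic-net sketch, again assuming glmnet (not part of the original code). Note that glmnet parameterizes the penalty slightly differently (its ridge term is scaled by \( (1-\alpha)/2 \)), but \( \alpha \) plays the same mixing role, and cv.glmnet chooses \( \lambda \) by cross-validation:

# Elastic net via glmnet (assumed package): alpha mixes the L1 and L2 penalties
library(glmnet)
Xmat <- as.matrix(x[, c("X1", "X2", "X3")])
cvfit <- cv.glmnet(Xmat, drop(x$olstarget), alpha = 0.5)   # cross-validation over a grid of lambda
coef(cvfit, s = "lambda.min")                              # coefficients at the lambda with lowest CV error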
Regularization exercise…