
Wine quality example #104

Open
fortisil opened this issue Jan 18, 2018 · 3 comments


@fortisil

I am trying to use SGD for classification of a huge dataset, and wanted first to try it on the wine quality example you provided. But with the wine quality data, something doesn't work for me. When I try to predict using sgd.theta, all the results are wrong and don't predict the quality at all.

Here is the code:


```r
# Wine quality (Cortez et al., 2009): Logistic regression
set.seed(42)
#data("winequality")
#dat <- winequality
dat <- read.csv("../Data/winequality-red.csv", header = TRUE, sep = ";",
                stringsAsFactors = FALSE)
dat$quality <- as.numeric(dat$quality > 5)  # transform to binary
test.set <- sample(1:nrow(dat), size = nrow(dat) / 8, replace = FALSE)
dat.test <- dat[test.set, ]
dat <- dat[-test.set, ]
sgd.theta <- sgd(quality ~ ., data = dat,
                 model = "glm", model.control = binomial(link = "logit"),
                 sgd.control = list(reltol = 1e-5, npasses = 200),
                 lr.control = c(scale = 1, gamma = 1, alpha = 30, c = 1))
sgd.theta
sprintf("Mean squared error: %0.3f",
        mean((theta - as.numeric(sgd.theta$coefficients))^2))

sum(dat[190, ] * sgd.theta$coefficients)
```

Output:

```
> sgd.theta
               [,1]
 [1,] -0.21988224003
 [2,] -0.14012303981
 [3,] -0.70007323510
 [4,]  0.24427708544
 [5,] -0.27235118177
 [6,] -0.06525874448
 [7,]  0.07980364712
 [8,] -0.04987978826
 [9,] -0.21990837019
[10,] -0.74766256961
[11,]  0.24841101887
[12,]  0.55420888771
> sprintf("Mean squared error: %0.3f", mean((theta - as.numeric(sgd.theta$coefficients))^2))
[1] "Mean squared error: 26.214"
> sum(dat[190, ] * sgd.theta$coefficients)
[1] 4.761156998
```


Am I missing something?
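For what it's worth, a likely culprit in the last line is that `dat[190, ]` still includes the `quality` outcome column, there is no intercept term in the product, and the linear predictor is never passed through the logistic link. A minimal sketch of the computation that the logit model implies, using made-up coefficient and feature values (not from the wine data):

```r
# expit maps the linear predictor to a probability under the logit link
expit <- function(x) exp(x) / (1 + exp(x))

coefs <- c(0.5, -0.2, 0.3)       # hypothetical intercept + two feature coefficients
row   <- c(1.2, -0.7)            # one hypothetical test row: FEATURES ONLY, no outcome

eta  <- sum(c(1, row) * coefs)   # prepend 1 so the intercept is included
p    <- expit(eta)               # predicted probability that quality > 5
pred <- as.numeric(p > 0.5)      # binary prediction
```

The point is only the shape of the computation: drop the outcome column from the row, add the intercept, then apply the link before thresholding.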

@ptoulis
Contributor

ptoulis commented Jan 19, 2018

SGD seems to be comparable to full GLM here:


```r
theta.glm = glm(quality ~ ., data = dat, family = binomial)  # runs full logistic regression

Xtest = as.matrix(dat.test[, -ncol(dat.test)])
Xtest = cbind(rep(1, nrow(Xtest)), Xtest)  # test features, add bias
Ytest = dat.test[, ncol(dat.test)]         # test outcome

# predict outcomes from SGD and GLM
expit = function(x) exp(x) / (1 + exp(x))
y_sgd = as.numeric(predict(sgd.theta, Xtest) > 0.5)
y_glm = as.numeric(expit(predict(theta.glm, newdata = dat.test[, -ncol(dat.test)])) > 0.5)

# confusion tables
M_glm = table(y_glm, Ytest)
M_sgd = table(y_sgd, Ytest)
print(M_glm)
print(M_sgd)

# print accuracy
print(sprintf("GLM accuracy = %.2f -- SGD accuracy = %.2f",
              sum(diag(M_glm)) / sum(M_glm), sum(diag(M_sgd)) / sum(M_sgd)))
```

I get:

```
GLM accuracy = 0.71 -- SGD accuracy = 0.66
```

@fortisil
Author

What would be the criteria for choosing SGD for classification over other classification methods?

@ptoulis
Contributor

ptoulis commented Jan 22, 2018

Note that SGD is not a classification method per se. It is a method for fitting models, including classification models. However, it is a noisy method, and in many cases it will perform worse than a deterministic method such as maximum likelihood (the glm() output is the maximum likelihood estimate in the example above).
That said, SGD may be the only choice when you have a large model or dataset to fit.

So, in the example above you wouldn't want to use SGD, because the dataset is small enough to use glm().
However, in your application you might have, say, 200 features to train on, and then you might want to use SGD.
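The point that SGD is a general model-fitting method, not a classifier, can be illustrated with a minimal hand-rolled SGD for logistic regression in base R. This is a toy sketch on simulated data (not the sgd package's implementation, which uses more sophisticated learning-rate schedules):

```r
set.seed(1)
n <- 5000
x <- cbind(1, rnorm(n))                 # design matrix: intercept + one feature
beta_true <- c(-0.5, 1.0)
p <- 1 / (1 + exp(-drop(x %*% beta_true)))
y <- rbinom(n, 1, p)                    # simulated binary outcomes

beta <- c(0, 0)                         # start from zero
lr <- 0.05                              # small constant learning rate
for (i in 1:n) {                        # one pass over the data
  eta  <- sum(x[i, ] * beta)
  # gradient of the log-loss at a single observation
  grad <- (1 / (1 + exp(-eta)) - y[i]) * x[i, ]
  beta <- beta - lr * grad              # noisy incremental update
}

beta_glm <- coef(glm(y ~ x[, 2], family = binomial))  # maximum likelihood fit
```

After a single pass, `beta` typically lands near the glm() estimate, up to noise on the order of the learning rate; the same update loop scales to datasets where glm() cannot hold the full design matrix in memory, which is the regime described above.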
