Data Science and Big Data Analytics
Chapter 6: Advanced Analytical Theory and Methods: Regression
1
Chapter Sections
6.1 Linear Regression
6.2 Logistic Regression
6.3 Reasons to Choose and Cautions
6.4 Additional Regression Models
Summary
2
6 Regression
Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable
Questions regression might answer
What is a person’s expected income?
What is the probability that an applicant will default on a loan?
Regression can find the input variables having the greatest statistical influence on the outcome
One can then try to improve the values of those input variables
E.g. – if 10-year-old reading level predicts students’ later success, then try to improve early age reading levels
3
6.1 Linear Regression
Models the relationship between several input variables and a continuous outcome variable
Assumption is that the relationship is linear
Various transformations can be used to achieve a linear relationship
Linear regression models are probabilistic
Involves randomness and uncertainty
Not deterministic like Ohm’s Law (V=IR)
4
6.1.1 Use Cases
Real estate example
Predict residential home prices
Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes
Demand forecasting example
Restaurant predicts quantity of food needed
Possible inputs – weather, day of week, etc.
Medical example
Analyze effect of proposed radiation treatment
Possible inputs – radiation treatment duration, frequency
5
6.1.2 Model Description
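The model equation does not survive in this text version of the slide; as a standard statement of the model it describes, a linear regression with p-1 input variables can be written as
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{p-1} x_{p-1} + \epsilon
where y is the outcome variable, x_1, ..., x_{p-1} are the input variables, \beta_0 is the intercept, \beta_1, ..., \beta_{p-1} are the parameters to be estimated, and \epsilon is the error term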
6
6.1.2 Model Description Example
Predict a person’s annual income as a function of age and education
Ordinary Least Squares (OLS) is a common technique to estimate the parameters
7
6.1.2 Model Description Example
OLS
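The OLS formula itself does not appear in this text version of the slide; in the standard formulation (assumed here), OLS chooses the parameter estimates that minimize the sum of squared residuals over the n observations:
\min_{\beta_0,\ldots,\beta_{p-1}} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_{p-1} x_{i,p-1} \right)^2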
8
6.1.2 Model Description Example
9
6.1.2 Model Description With Normally Distributed Errors
Making additional assumptions on the error term provides further capabilities
It is common to assume the error term is a normally distributed random variable
Mean zero and constant variance
That is, the error term $\epsilon \sim N(0, \sigma^2)$
10
6.1.2 Model Description With Normally Distributed Errors
With this assumption, the expected value is $E(y) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}$
And the variance is $V(y) = \sigma^2$
11
6.1.2 Model Description With Normally Distributed Errors
Normality assumption illustrated with one input variable
E.g., for x = 8, E(y) ≈ 20, but individual outcomes vary roughly between 15 and 25
12
6.1.2 Model Description Example in R
Be sure to get publisher's R downloads: http://www.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html
> income_input = as.data.frame(read.csv("c:/data/income.csv"))
> income_input[1:10,]
> summary(income_input)
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input,
axis.line.tck=0, axis.text.alpha=0)
13
6.1.2 Model Description Example in R
Scatterplot matrix
Examine the bottom row of plots
income ~ age: strong positive trend
income ~ educ: slight positive trend
income ~ gender: no apparent trend
14
6.1.2 Model Description Example in R
Quantify the linear relationship trends
> results <- lm(Income~Age+Education+Gender,income_input)
> summary(results)
Intercept: income of $7,263 for a newborn female with zero years of education
Age coefficient: ~1, so each additional year of age adds about $1,000 of income
Education coefficient: ~1.76, so each additional year of education adds about $1,760 of income
Gender coefficient: ~-0.93, so being male decreases income by about $930
Residuals – assumed to be normally distributed – vary from about -37 to +37 (more on this shortly)
15
6.1.2 Model Description Example in R
Examine residuals – uncertainty or sampling error
Small p-values indicate statistically significant results
Age and Education are highly significant, p < 2e-16
Gender p = 0.13 is large, so Gender is not significant at the 90% confidence level
Therefore, drop the variable Gender from the linear model
> results2 <- lm(Income~Age+Education,income_input)
> summary(results2) # results about the same as before
Residual standard error: the residual standard deviation
R-squared (R2): the proportion of the variation in the data explained by the model
Here ~64% (R2 = 1 means the model explains the data perfectly)
F-statistic: tests the entire model – here the p-value is small, so the model is significant
16
6.1.2 Model Description Categorical Variables
In the example in R, Gender is a binary variable
Variables like Gender are categorical variables in contrast to numeric variables where numeric differences are meaningful
The book section discusses how income by state, a categorical variable with many levels, could be implemented; a sketch follows below
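As an illustration only (the State variable, its values, and results3 are hypothetical, not part of the book's income dataset), R represents a categorical input as a factor, and lm() then creates binary indicator (dummy) variables for all but one reference level:
> income_input$State <- factor(sample(c("CA","NY","TX"), nrow(income_input), replace=TRUE))  # hypothetical State column for illustration
> results3 <- lm(Income~Age+Education+State, data=income_input)  # lm() expands the factor into dummy variables, one level held out as the reference
> summary(results3)  # one coefficient per non-reference state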
17
6.1.2 Model Description Confidence Intervals on the Parameters
Once an acceptable linear model is developed, it is often useful to draw some inferences
R provides confidence intervals using confint() function
> confint(results2, level = .95)
For example, the Education coefficient was 1.76, and the corresponding 95% confidence interval is (1.53, 1.99)
18
6.1.2 Model Description Confidence Interval on Expected Outcome
In the income example, the regression line provides the expected income for a given Age and Education
Using the predict() function in R, a confidence interval on the expected outcome can be obtained
> Age <- 41
> Education <- 12
> new_pt <- data.frame(Age, Education)
> conf_int_pt <- predict(results2,new_pt,level=.95,
interval="confidence")
> conf_int_pt
Expected income = $68699, conf interval ($67831,$69567)
19
6.1.2 Model Description Prediction Interval on a Particular Outcome
The predict() function in R also provides upper/lower bounds on a particular outcome, prediction intervals
> pred_int_pt <- predict(results2,new_pt,level=.95,
interval="prediction")
> pred_int_pt
Expected income = $68699, pred interval ($44988,$92409)
This is a much wider interval because the confidence interval applies to the expected outcome, which falls on the regression line, whereas the prediction interval applies to a particular outcome, which may fall anywhere within the normal distribution around the line
20
6.1.3 Diagnostics Evaluating the Linearity Assumption
A major assumption in linear regression modeling is that the relationship between the input and output variables is linear
The most fundamental way to evaluate this is to plot the outcome variable against each input variable
In the following figure a linear model would not apply
In such cases, a transformation might allow a linear model to apply
21
6.1.3 Diagnostics Evaluating the Linearity Assumption
Income as a quadratic function of Age
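When the scatterplot shows a curved pattern like this, one option is to add a squared term to the model; a minimal sketch reusing the earlier income data (the model results_quad is illustrative, not from the book):
> results_quad <- lm(Income ~ Age + I(Age^2) + Education, data=income_input)  # I() treats Age^2 as a literal squared term
> summary(results_quad)  # check whether the quadratic term is significant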
22
6.1.3 Diagnostics Evaluating the Residuals
The error term was assumed to be normally distributed with zero mean and constant variance
> with(results2,{plot(fitted.values,residuals,ylim=c(-40,40)) })
23
6.1.3 Diagnostics Evaluating the Residuals
The next four figures do not satisfy the zero-mean, constant-variance assumption
Nonlinear trend in residuals
Residuals not centered on zero
24
6.1.3 Diagnostics Evaluating the Residuals
Variance not constant
Residuals not centered on zero
25
6.1.3 Diagnostics Evaluating the Normality Assumption
The normality assumption still has to be validated
> hist(results2$residuals)
Residuals centered on zero and appear normally distributed
26
6.1.3 Diagnostics Evaluating the Normality Assumption
Another option is to examine a Q-Q plot, comparing the observed data against the quantiles (Q) of the assumed distribution
> qqnorm(results2$residuals)
> qqline(results2$residuals)
27
6.1.3 Diagnostics Evaluating the Normality Assumption
Normally distributed residuals
Non-normally distributed residuals
28
6.1.3 Diagnostics N-Fold Cross-Validation
To prevent overfitting, a common practice splits the dataset into training and test sets, develops the model on the training set and evaluates it on the test set
If the dataset is too small for such a split, an N-fold cross-validation technique can be used (see the sketch after this list)
The dataset is randomly split into N datasets (folds) of equal size
Model trained on N-1 of the sets, tested on remaining one
Process repeated N times
Average the N model errors over the N folds
Note: if N equals the size of the dataset, this is the leave-one-out procedure
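A minimal sketch of N-fold cross-validation for the earlier income model (the choice of N = 5, the fold assignment, and mean squared error as the error measure are illustrative choices, not from the book):
# Illustrative 5-fold cross-validation of the income model
N <- 5
folds <- sample(rep(1:N, length.out = nrow(income_input)))  # random fold assignment
cv_errors <- numeric(N)
for (k in 1:N) {
  train <- income_input[folds != k, ]
  test  <- income_input[folds == k, ]
  fit   <- lm(Income ~ Age + Education, data = train)   # train on N-1 folds
  pred  <- predict(fit, newdata = test)                 # test on the remaining fold
  cv_errors[k] <- mean((test$Income - pred)^2)          # mean squared error on the held-out fold
}
mean(cv_errors)  # average error over the N folds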
29
6.1.3 Diagnostics Other Diagnostic Considerations
The model might be improved by including additional input variables
However, the adjusted R2 applies a penalty as the number of parameters increases
Residual plots should be examined for outliers
Points markedly different from the majority of points
They result from bad data, data processing errors, or actual rare occurrences
Finally, the magnitude and signs of the estimated parameters should be examined to see if they make sense
30
6.2 Logistic Regression Introduction
In linear regression modeling, the outcome variable is continuous – e.g., income ~ age and education
In logistic regression, the outcome variable is categorical, and this chapter focuses on two-valued outcomes like true/false, pass/fail, or yes/no
31
6.2.1 Logistic Regression Use Cases
Medical
Probability of a patient’s successful response to a specific medical treatment – input could include age, weight, etc.
Finance
Probability an applicant defaults on a loan
Marketing
Probability a wireless customer switches carriers (churns)
Engineering
Probability a mechanical part malfunctions or fails
32
6.2.2 Logistic Regression Model Description
Logistic regression is based on the logistic function
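The formula for the logistic function does not survive in this text version of the slide; its standard definition is
f(y) = \frac{e^{y}}{1 + e^{y}} = \frac{1}{1 + e^{-y}}, \qquad -\infty < y < \infty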
As y -> infinity, f(y)->1; and as y->-infinity, f(y)->0
33
6.2.2 Logistic Regression Model Description
With the range of f(y) as (0,1), the logistic function models the probability of an outcome occurring
In contrast to linear regression, the values of y are not directly observed; only the values of f(y) in terms of success or failure are observed.
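In the model (standard form, since the slide's original equation is unavailable here), the probability p of the outcome is obtained by applying the logistic function to a linear combination of the input variables, which can be rearranged as
\ln\!\left(\frac{p}{1-p}\right) = y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}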
The quantity $\ln(p/(1-p))$ is called the log odds ratio, or the logit of p
Maximum Likelihood Estimation (MLE) is used to estimate the model parameters. MLE is beyond the scope of this book.
34
6.2.2 Logistic Regression Model Description: customer churn example
A wireless telecom company estimates probability of a customer churning (switching companies)
Variables collected for each customer: age (years), married (y/n), duration as customer (years), churned contacts (count), churned (true/false)
After analyzing the data and fitting a logistic regression model, Age and Churned_contacts were selected as the best predictor variables
35
6.2.2 Logistic Regression Model Description: customer churn example
36
6.2.3 Diagnostics Model Description: customer churn example
> head(churn_input) # Churned = 1 if cust churned
> sum(churn_input$Churned) # 1743/8000 churned
Use the Generalized Linear Model function glm()
> Churn_logistic1<-glm(Churned~Age+Married+Cust_years+Churned_contacts,data=churn_input,family=binomial(link="logit"))
> summary(Churn_logistic1) # Age + Churned_contacts best
> Churn_logistic3<-glm(Churned~Age+Churned_contacts,data=churn_input,family=binomial(link="logit"))
> summary(Churn_logistic3) # Age + Churned_contacts
37
6.2.3 Diagnostics Deviance and the Pseudo-R2
In logistic regression, deviance = -2logL
where L is the maximized value of the likelihood function used to obtain the parameter estimates
Two deviance values are provided
Null deviance = deviance based on only the y-intercept term
Residual deviance = deviance based on all parameters
Pseudo-R2 = 1 - (residual deviance / null deviance) measures how well the fitted model explains the data
Value near 1 indicates a good fit over the null model
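Both deviances and the pseudo-R2 can be read directly from the fitted glm object; a quick check, assuming the Churn_logistic3 model fitted earlier:
> Churn_logistic3$null.deviance   # deviance of the intercept-only (null) model
> Churn_logistic3$deviance        # residual deviance of the fitted model
> 1 - Churn_logistic3$deviance/Churn_logistic3$null.deviance   # pseudo-R-squared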
38
6.2.3 Diagnostics Receiver Operating Characteristic (ROC) Curve
Logistic regression is often used to classify
In the Churn example, a customer can be classified as Churn if the model predicts high probability of churning
Although 0.5 is often used as the probability threshold, other values can be chosen depending on the desired trade-off between false positives and false negatives
For two classes, C and nC, we have
True Positive: predict C, when actually C
True Negative: predict nC, when actually nC
False Positive: predict C, when actually nC
False Negative: predict nC, when actually C
39
6.2.3 Diagnostics Receiver Operating Characteristic (ROC) Curve
The Receiver Operating Characteristic (ROC) curve
Plots the True Positive Rate, TPR = TP / (TP + FN), against the False Positive Rate, FPR = FP / (FP + TN)
40
6.2.3 Diagnostics Receiver Operating Characteristic (ROC) Curve
> library(ROCR)
> Pred = predict(Churn_logistic3, type="response")
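The slide stops after computing the predicted probabilities; a minimal sketch of the remaining ROCR steps to draw the curve (how the book completes the example is assumed here; these are standard ROCR calls):
> pred_obj <- prediction(Pred, churn_input$Churned)               # pair predicted probabilities with actual outcomes
> perf <- performance(pred_obj, measure="tpr", x.measure="fpr")   # TPR and FPR across thresholds
> plot(perf)                                                      # draws the ROC curve
> performance(pred_obj, measure="auc")@y.values                   # area under the ROC curve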
41
6.2.3 Diagnostics Receiver Operating Characteristic (ROC) Curve
42
6.2.3 Diagnostics Histogram of the Probabilities
It is interesting to visualize the counts of the customers who churned and who didn’t churn against the estimated churn probability.
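One way to build such a plot, assuming Pred and churn_input from the earlier code (the breaks and layout are illustrative choices):
> par(mfrow=c(2,1))   # stack the two histograms
> hist(Pred[churn_input$Churned == 1], breaks=20, main="Customers who churned", xlab="Estimated churn probability")
> hist(Pred[churn_input$Churned == 0], breaks=20, main="Customers who did not churn", xlab="Estimated churn probability")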
43
6.3 Reasons to Choose and Cautions
Linear regression – outcome variable continuous
Logistic regression – outcome variable categorical
Both models assume a linear additive function of the input variables
If this is not true, the models perform poorly
In linear regression, the further assumption of normally distributed error terms is important for many statistical inferences
Although a set of input variables may be a good predictor of an output variable, “correlation does not imply causation”
44
6.4 Additional Regression Models
Multicollinearity is the condition when several input variables are highly correlated
This can lead to inappropriately large coefficients
To mitigate this problem
Ridge regression applies a penalty proportional to the sum of the squared coefficients
Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients (see the sketch after this list)
Multinomial logistic regression – used for a more-than-two-state categorical outcome variable
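As an illustration only (the glmnet package and the variable choices are not from the book; they reuse the earlier income example), ridge and lasso fits differ only in the alpha argument:
> library(glmnet)
> x <- model.matrix(Income ~ Age + Education + Gender, income_input)[, -1]   # input matrix, intercept column dropped
> y <- income_input$Income
> ridge_fit <- glmnet(x, y, alpha=0)   # alpha = 0: ridge penalty (sum of squared coefficients)
> lasso_fit <- glmnet(x, y, alpha=1)   # alpha = 1: lasso penalty (sum of absolute coefficients)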
45