theoretical regression line). Even though several methods/algorithms have been proposed to identify the regression line, the one that is most commonly used is called the ordinary least squares (OLS) method. The OLS method aims to minimize the sum of squared residuals (squared vertical distances between the observations and the regression line) and leads to mathematical expressions for the estimated parameters of the regression line (known as the b parameters). For simple linear regression, the aforementioned relationship between the response variable (y) and the explanatory variable (x) can be shown as a simple equation as follows:
y = b0 + b1x
In this equation, b0 is called the intercept, and b1 is called the slope. Once OLS determines the values of these two coefficients, the simple equation can be used to forecast the values of y for given values of x. The sign and the magnitude of b1 also reveal the direction and the strength of the relationship between the two variables.
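To make the computation concrete, here is a minimal Python sketch of the OLS closed-form solution for simple linear regression; the data values are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: explanatory variable x vs. response variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.1])

# OLS closed-form estimates for simple linear regression:
# b1 = cov(x, y) / var(x);  b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"fitted line: y = {b0:.3f} + {b1:.3f} * x")
print("forecast at x = 7:", b0 + b1 * 7)  # plug a new x into the equation
```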
If the model is of the multiple linear regression type, then there are more coefficients to be determined, one for each additional explanatory variable. As the following formula shows, each additional explanatory variable is multiplied by its own bi coefficient, and the products are summed to form a linear additive representation of the response variable.
y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
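The multiple-regression case can be sketched with numpy's least-squares solver, which is equivalent to OLS; the data values and the prediction point below are illustrative assumptions, not real measurements:

```python
import numpy as np

# Hypothetical data: two explanatory variables, one response variable
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.8, 10.2, 9.9, 13.0])

# Prepend a column of ones so the intercept b0 is estimated as well
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of y = X1 @ b (the OLS estimates)
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("b0, b1, b2 =", np.round(b, 3))
print("prediction for x1=6, x2=4:", np.array([1.0, 6.0, 4.0]) @ b)
```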
How Do We Know If the Model Is Good Enough?
For a variety of reasons, models as representations of reality sometimes do not prove to be good. Regardless of the number of explanatory variables included, there is always a possibility of not having a good model, and therefore the linear regression model needs to be assessed for its fit (the degree to which it represents the response variable). In the simplest sense, a well-fitting regression model results in predicted values close to the observed data values. For the numerical assessment, three statistical measures are often used in evaluating the fit of a regression model: R² (R-squared), the overall F-test, and the root mean square error (RMSE). All three of these measures are based on sums of squared errors (how far the data are from the mean and how far the data are from the model's predicted values). Different combinations of these two values provide different information about how the regression model compares to the mean model.
Of the three, R² has the most useful and understandable meaning because of its intuitive scale. The value of R² ranges from 0 to 1 (corresponding to the proportion of variability explained), with 0 indicating that the proposed model has no explanatory or predictive power and 1 indicating a perfect fit that produces exact predictions (which is almost never the case). Good R² values usually come close to 1, but how close is a matter of the phenomenon being modeled: whereas an R² value of 0.3 for a linear regression model in the social sciences can be considered good enough, an R² value of 0.7 in engineering might be considered not a good enough fit. Improvement in a regression model can be achieved by adding more explanatory variables or using different data transformation techniques, which would result in comparative increases in the R² value. Figure 3.14 shows the process flow of developing regression models. As can be seen in the process flow, the model development task is followed by the model assessment task, in which not only is the fit of the model assessed but, because of the restrictive assumptions with which linear models have to comply, the validity of the model is also put under the microscope.

[Figure 3.14 depicts the following process flow: tabulated data feed a data assessment step (scatter plots, correlations), followed by model fitting (transform data, estimate parameters), model assessment (test assumptions, assess model fit), and deployment (one-time use or recurrent use).]

FIGURE 3.14 A Process Flow for Developing Regression Models.
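The fit measures discussed above can be computed directly from the two sums of squared errors. The following is an illustrative Python sketch on made-up data; note that RMSE is computed here with n in the denominator, whereas some treatments divide by the degrees of freedom:

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit with an intercept; returns the coefficient vector b."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b

def assess_fit(X, y, b):
    """R-squared and RMSE, both built from sums of squared errors."""
    X1 = np.column_stack([np.ones(len(X)), X])
    y_hat = X1 @ b
    sse = np.sum((y - y_hat) ** 2)     # distance of data from the model
    sst = np.sum((y - y.mean()) ** 2)  # distance of data from the mean model
    return 1.0 - sse / sst, np.sqrt(sse / len(y))

# Made-up data for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
b = fit_ols(X, y)
r2, rmse = assess_fit(X, y, b)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```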
What Are the Most Important Assumptions in Linear Regression?
Even though they are still the choice of many for data analyses (both for explanatory and for predictive modeling purposes), linear regression models suffer from several highly restrictive assumptions. The validity of the linear model built depends on its ability to comply with these assumptions. Here are the most commonly cited assumptions:
1. Linearity. This assumption states that the relationship between the response variable and the explanatory variables is linear. That is, the expected value of the response variable is a straight-line function of each explanatory variable while holding all other explanatory variables fixed. Also, the slope of the line does not depend on the values of the other variables. It also implies that the effects of different explanatory variables on the expected value of the response variable are additive in nature.
2. Independence (of errors). This assumption states that the errors of the response variable are uncorrelated with each other. This independence of the errors is weaker than actual statistical independence, which is a stronger condition and is often not needed for linear regression analysis.
3. Normality (of errors). This assumption states that the errors of the response variable are normally distributed. That is, the errors should be purely random and should not exhibit any systematic (nonrandom) patterns.
4. Constant variance (of errors). This assumption, also called homoscedasticity, states that the errors of the response variable have the same variance regardless of the values of the explanatory variables. In practice, this assumption is likely to be violated if the response variable varies over a wide enough range/scale.
5. Multicollinearity. This assumption states that the explanatory variables are not correlated with one another (i.e., they do not replicate the same information but provide different perspectives on the information needed for the model). Multicollinearity can be triggered by having two or more perfectly correlated explanatory variables present in the model (e.g., if the same explanatory variable is mistakenly included in the model twice, once in its original form and once with a slight transformation). A correlation-based data assessment usually catches this error.
Statistical techniques have been developed to identify violations of these assumptions, along with techniques to mitigate them. The most important thing for a modeler is to be aware of their existence and to put in place the means to assess a model and make sure it complies with the assumptions it is built on. A rough illustration of such diagnostic checks follows.
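As an illustration (not a substitute for the formal statistical tests alluded to above), the following Python sketch runs a few rough diagnostics on synthetic data: a Shapiro-Wilk test for normality of errors, a lag-1 autocorrelation of residuals as a quick look at independence, and a predictor correlation matrix as a quick look at multicollinearity. The data-generating values are assumptions made for the example:

```python
import numpy as np
from scipy import stats

def diagnostics(X, y):
    """Rough, illustrative checks of the linear regression assumptions."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b

    # Normality of errors: Shapiro-Wilk test (null hypothesis = normal)
    _, p_norm = stats.shapiro(resid)
    print(f"Shapiro-Wilk p-value: {p_norm:.3f}")

    # Independence of errors: lag-1 residual autocorrelation (~0 is good)
    print("lag-1 residual autocorrelation:",
          round(np.corrcoef(resid[:-1], resid[1:])[0, 1], 3))

    # Multicollinearity: pairwise correlations among explanatory variables
    if X.shape[1] > 1:
        print("predictor correlation matrix:\n", np.round(np.corrcoef(X.T), 3))

# Synthetic data with two predictors and known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=40)
diagnostics(X, y)
```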
Logistic Regression
Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. It was developed in the 1940s as a complement to linear regression and linear discriminant analysis methods. It has been used extensively in numerous disciplines, including the medical and social sciences fields. Logistic regression is similar to linear regression in that it also aims to regress to a mathematical function that explains the relationship between the response variable and the explanatory variables using a sample of past observations (training data). Logistic regression differs from linear regression in one major respect: its output (response variable) is a class as opposed to a numerical variable. That is, whereas linear regression is used to estimate a continuous numerical variable, logistic regression is used to classify a categorical variable. Even though the original form of logistic regression was developed for a binary output variable (e.g., 1/0, yes/no, pass/fail, accept/reject), the present-day modified version is capable of predicting multiclass output variables (i.e., multinomial logistic regression). If there is only one predictor variable and one predicted variable, the method is called simple logistic regression (similar to calling linear regression models with only one independent variable simple linear regression).
In predictive analytics, logistic regression models are used to develop probabilistic models between one or more explanatory/predictor variables (which can be a mix of both continuous and categorical in nature) and a class/response variable (which can be binomial/binary or multinomial/multiclass). Unlike ordinary linear regression, logistic regression is used for predicting categorical (often binary) outcomes of the response variable, treating the response variable as the outcome of a Bernoulli trial. Therefore, logistic regression takes the natural logarithm of the odds of the response variable to create a continuous criterion as a transformed version of the response variable. This logit transformation is referred to as the link function in logistic regression: even though the response variable in logistic regression is categorical or binomial, the logit is the continuous criterion on which linear regression is conducted.
Figure 3.15 shows a logistic regression function, in which the x-axis represents the log odds (b0 + b1x, a linear function of the independent variables) and the y-axis shows the probabilistic outcome (i.e., response variable values between 0 and 1).
The logistic function, f(y), shown in Figure 3.15, is the core of logistic regression; it can take values only between 0 and 1. The following equation is a simple mathematical representation of this function:
f(y) = 1 / (1 + e^(-(b0 + b1x)))
The logistic regression coefficients (the b values) are usually estimated using the maximum likelihood estimation method. Unlike linear regression with normally distributed residuals, there is no closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead. This process begins with a tentative starting solution, revises the parameters slightly to see if the solution can be improved, and repeats this revision until the improvement is nonexistent or negligible, at which point the process is said to have converged.
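The following is a minimal Python sketch of that iterative process, using plain gradient ascent on the log-likelihood. Practical implementations typically use Newton-type methods such as iteratively reweighted least squares; gradient ascent is chosen here only for transparency, and the data, learning rate, and tolerance are illustrative assumptions:

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, max_iter=10000, tol=1e-10):
    """Maximum-likelihood logistic regression via plain gradient ascent:
    start from a tentative solution (all zeros), nudge the coefficients in
    the direction that raises the log-likelihood, and stop when the
    improvement becomes negligible (convergence)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    b = np.zeros(X1.shape[1])
    prev_ll = -np.inf
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ b)))     # the logistic function
        ll = np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if ll - prev_ll < tol:                  # no meaningful improvement
            break
        prev_ll = ll
        b += lr * X1.T @ (y - p) / len(y)       # log-likelihood gradient step
    return b

# Hypothetical binary data generated from known coefficients (0.5, 2.0)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))).astype(float)
print("estimated b0, b1:", np.round(fit_logistic(X, y), 2))
```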
Sports analytics (the use of data and statistical/analytics techniques to better manage sports teams/organizations) has been gaining tremendous popularity. The use of data-driven analytics techniques has become mainstream not only for professional teams but also for college and amateur sports. Application Case 3.4 is an example of how existing and readily available public data sources can be used to predict college football bowl game outcomes using both classification- and regression-type prediction models.
Time-Series Forecasting
Sometimes the variable that we are interested in (i.e., the response variable) might not have distinctly identifiable explanatory variables, or there might be too many of them in a highly complex relationship. In such cases, if the data are available in the desired format, a prediction model of the so-called time-series type can be developed. A time series is a sequence of data points of the variable of interest, measured and represented at successive points in time spaced at uniform time intervals. Examples of time series include monthly rain volumes in a geographic area, the daily closing values of stock market indexes, and
[Figure 3.15 plots the logistic function f(y): an S-shaped curve that rises from 0 toward 1 (passing through 0.5) as b0 + b1x increases from -6 to 6 on the x-axis.]

FIGURE 3.15 The Logistic Function.
Application Case 3.4 Predicting NCAA Bowl Game Outcomes
Predicting the outcome of a college football game (or any sports game, for that matter) is an interesting and challenging problem. Therefore, challenge-seeking researchers from both academia and industry have spent a great deal of effort on forecasting the outcomes of sporting events. Large amounts of historical data exist in different media outlets (often publicly available) regarding the structure and outcomes of sporting events, in the form of a variety of numerically or symbolically represented factors that are assumed to contribute to those outcomes.
The end-of-season bowl games are very important to colleges in terms of both finance (bringing in millions of dollars of additional revenue) and reputation, which matters for recruiting quality students and highly regarded high school athletes for their athletic programs (Freeman & Brewer, 2016). Teams that are selected to compete in a given bowl game split a purse, the size of which depends on the specific bowl (some bowls are more prestigious and have higher payouts for the two teams); therefore, securing an invitation to a bowl game is the main goal of any Division I-A college football program. The decision makers of the bowl games are given the authority to select and invite bowl-eligible (a team that has six
wins against its Division I-A opponents in that season), successful teams (as per the ratings and rankings) that will play an exciting and competitive game, attract fans of both schools, and keep the remaining fans tuned in via a variety of media outlets for advertising.
In a recent data mining study, Delen et al. (2012) used eight years of bowl game data along with three popular data mining techniques (decision trees, neural networks, and support vector machines) to predict both the classification-type outcome of a game (win versus loss) and the regression-type outcome (the projected point difference between the scores of the two opponents). What follows is a shorthand description of their study.
The Methodology
In this research, Delen and his colleagues followed a popular data mining methodology, CRISP-DM (Cross-Industry Standard Process for Data Mining), which is a six-step process. This popular methodology, which is covered in detail in Chapter 4, provided them with a systematic and structured way to conduct the underlying data mining study and hence improved the likelihood of obtaining accurate
and reliable results. To objectively assess the prediction power of the different model types, they used a cross-validation methodology, namely k-fold cross-validation. Details on k-fold cross-validation can be found in Chapter 4. Figure 3.16 graphically illustrates the methodology employed by the researchers.

[Figure 3.16 depicts the methodology: raw data sources (DBs) feed a data collection, organization, cleaning, and transformation step; the resulting data flow into two parallel streams, classification modeling (output: binary win/loss) and regression modeling (output: integer point difference), each employing classification and regression trees, neural networks, and support vector machines with 10-fold cross-validation; the models are tested, the results are tabulated (with regression outputs transformed into win/loss labels), and the prediction results of the two streams are compared.]

FIGURE 3.16 The Graphical Illustration of the Methodology Employed in the Study.
Data Acquisition and Data Preprocessing
The sample data for this study were collected from a variety of sports databases available on the Web, including jhowel.net, ESPN.com, Covers.com, ncaa.org, and rauzulusstreet.com. The data set included 244 bowl games, representing the complete set of eight seasons of college football bowl games played between 2002 and 2009. Delen et al. also included an out-of-sample data set (2010–2011 bowl games) for additional validation purposes. Exercising one of the popular data mining rules of thumb, they included as much relevant information in the model as possible. Therefore, after an in-depth variable identification and
collection process, they ended up with a data set that included 36 variables, of which the first 6 were identifying variables (i.e., the name and year of the bowl game, home and away team names, and their athletic conferences; see variables 1–6 in Table 3.5), followed by 28 input variables (which included variables delineating a team's seasonal statistics on offense and defense, game outcomes, team composition characteristics, athletic conference characteristics, and how they fared against the odds; see
TABLE 3.5 Description of Variables Used in the Study
No Cat Variable Name Description
1 ID YEAR Year of the bowl game
2 ID BOWLGAME Name of the bowl game
3 ID HOMETEAM Home team (as listed by the bowl organizers)
4 ID AWAYTEAM Away team (as listed by the bowl organizers)
5 ID HOMECONFERENCE Conference of the home team
6 ID AWAYCONFERENCE Conference of the away team
7 I1 DEFPTPGM Defensive points per game
8 I1 DEFRYDPGM Defensive rush yards per game
9 I1 DEFYDPGM Defensive yards per game
10 I1 PPG Average number of points a given team scored per game
11 I1 PYDPGM Average total pass yards per game
12 I1 RYDPGM Team’s average total rush yards per game
13 I1 YRDPGM Average total offensive yards per game
14 I2 HMWIN% Home winning percentage
15 I2 LAST7 How many games the team won out of their last 7 games
16 I2 MARGOVIC Average margin of victory
17 I2 NCTW Nonconference team winning percentage
18 I2 PREVAPP Did the team appear in a bowl game previous year
19 I2 RDWIN% Road winning percentage
20 I2 SEASTW Winning percentage for the year
21 I2 TOP25 Winning percentage against AP top 25 teams for the year
22 I3 TSOS Strength of schedule for the year
23 I3 FR% Percentage of games played by freshmen class players for the year
24 I3 SO% Percentage of games played by sophomore class players for the year
25 I3 JR% Percentage of games played by junior class players for the year
26 I3 SR% Percentage of games played by senior class players for the year
27 I4 SEASOvUn% Percentage of times a team went over the O/U in the current season
28 I4 ATSCOV% Against the spread cover percentage of the team in previous bowl games
29 I4 UNDER% Percentage of times a team went under in previous bowl games
30 I4 OVER% Percentage of times a team went over in previous bowl games
31 I4 SEASATS% Percentage of covering against the spread for the current season
32 I5 CONCH Did the team win their respective conference championship game
33 I5 CONFSOS Conference strength of schedule
34 I5 CONFWIN% Conference winning percentage
35 O1 ScoreDiff Score difference (HomeTeamScore – AwayTeamScore)

36 O2 WinLoss Whether the home team wins or loses the game
Notes: ID: identifier variables; O1: output variable for regression models; O2: output variable for classification models. Input variable categories: I1: offense/defense; I2: game outcome; I3: team configuration; I4: against the odds; I5: conference stats. O/U (over/under): whether or not a team will go over or under the expected score difference. Output variables: ScoreDiff for regression models and WinLoss for binary classification models.
variables 7–34 in Table 3.5), and finally the last two were the output variables (i.e., ScoreDiff, the score difference between the home team and the away team represented as an integer, and WinLoss, whether the home team won or lost the bowl game, represented as a nominal label).

In the formulation of the data set, each row (a.k.a. tuple, case, sample, example, etc.) represented a bowl game, and each column stood for a variable (i.e., identifier, input, or output type). To represent the game-related comparative characteristics of the two opponent teams in the input variables, Delen et al. calculated and used the differences between the measures of the home and away teams, with all variable values calculated from the home team's perspective. For instance, the variable PPG (average number of points a team scored per game) represents the difference between the home team's PPG and the away team's PPG. The output variables represent whether the home team won or lost the bowl game. That is, if the ScoreDiff variable takes a positive integer value, the home team is expected to win the game by that margin; otherwise (if ScoreDiff takes a negative integer value), the home team is expected to lose by that margin. In the case of WinLoss, the output variable is a binary label, "Win" or "Loss," indicating the outcome of the game for the home team.
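This formulation can be illustrated with a small Python sketch; all numbers below are made up for the example and do not come from the study's data set:

```python
import numpy as np

# Hypothetical season averages and scores for three bowl games
home_ppg = np.array([31.2, 24.5, 28.9])   # home team's points per game
away_ppg = np.array([27.8, 30.1, 21.4])   # away team's points per game
home_score = np.array([34, 17, 31])       # actual game scores
away_score = np.array([27, 24, 28])

# Input variables are home-minus-away differences (home team's perspective)
PPG = home_ppg - away_ppg

# Output variables: ScoreDiff (integer) and WinLoss (nominal label)
ScoreDiff = home_score - away_score
WinLoss = np.where(ScoreDiff > 0, "Win", "Loss")

print("PPG input variable:", PPG)    # e.g., [ 3.4 -5.6  7.5]
print("ScoreDiff:", ScoreDiff)       # [ 7 -7  3]
print("WinLoss:", list(WinLoss))     # ['Win', 'Loss', 'Win']
```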
The Results and Evaluation
In this study, three popular prediction techniques were used to build models (and to compare them to one another): artificial neural networks, decision trees, and support vector machines. These prediction techniques were selected based on their capability of modeling both classification- and regression-type prediction problems and their popularity in the recently published data mining literature. More details about these popular data mining methods can be found in Chapter 4.
To compare the predictive accuracy of all models to one another, the researchers used a stratified k-fold cross-validation methodology. In a stratified version of k-fold cross-validation, the folds are created such that they contain approximately the same proportion of class labels as the original data set. In this study, the value of k was set to 10 (i.e., the complete set of 244 samples was split into 10 subsets, each having about 25 samples), which is a common practice in predictive data mining applications. A graphical depiction of 10-fold cross-validation was shown earlier in this chapter. To compare the prediction models developed using the aforementioned three data mining techniques, the researchers chose three common performance criteria: accuracy, sensitivity, and specificity. The simple formulas for these metrics were also explained earlier in this chapter.
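A generic sketch of this evaluation setup follows, using Python with scikit-learn; the synthetic data merely stand in for the study's 244-game data set, and DecisionTreeClassifier stands in for one of the three model types (the study's exact configurations are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the bowl-game data set (244 rows, 28 inputs)
X, y = make_classification(n_samples=244, n_features=28, random_state=42)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc, sens, spec = [], [], []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=42).fit(X[train_idx], y[train_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], model.predict(X[test_idx])).ravel()
    acc.append((tp + tn) / (tp + tn + fp + fn))
    sens.append(tp / (tp + fn))   # true-positive rate
    spec.append(tn / (tn + fp))   # true-negative rate

print(f"accuracy={np.mean(acc):.3f} sensitivity={np.mean(sens):.3f} "
      f"specificity={np.mean(spec):.3f}")
```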
The prediction results of the three modeling techniques are presented in Tables 3.6 and 3.7. Table 3.6 presents the 10-fold cross-validation results of the classification methodology, in which the three data mining techniques were formulated to have a binary-nominal output variable (i.e., WinLoss). Table 3.7 presents the 10-fold cross-validation results of the regression-based classification methodology, in which the three data mining techniques were formulated to have a numerical output variable (i.e., ScoreDiff). In the regression-based classification prediction, the numerical output of the models is converted to a classification type by labeling the positive ScoreDiff predictions with a "Win" and the negative ScoreDiff predictions with a "Loss" and then tabulating them in the confusion matrices. Using the confusion matrices, the overall prediction accuracy, sensitivity, and specificity of each model type were calculated and presented in Tables 3.6 and 3.7. As the results indicate, the classification-type prediction methods performed better than the regression-based classification methodology. Among the three data mining techniques, classification and regression trees produced better prediction accuracy in both methodologies. Overall, classification and regression tree models produced a 10-fold cross-validation accuracy of 86.48 percent, followed by support vector machines
TABLE 3.6 Prediction Results for the Direct Classification Methodology

Prediction Method        Confusion Matrix    Accuracy2   Sensitivity   Specificity
(classification1)          Win     Loss       (in %)      (in %)        (in %)
ANN (MLP)      Win          92      42         75.00       68.66         82.73
               Loss         19      91
SVM (RBF)      Win         105      29         79.51       78.36         80.91
               Loss         21      89
DT (C&RT)      Win         113      21         86.48       84.33         89.09
               Loss         12      98

1The output variable is a binary categorical variable (Win or Loss). 2Differences were significant.
TABLE 3.7 Prediction Results for the Regression-Based Classification Methodology

Prediction Method        Confusion Matrix    Accuracy2   Sensitivity   Specificity
(regression based1)        Win     Loss       (in %)      (in %)        (in %)
ANN (MLP)      Win          94      40         72.54       70.15         75.45
               Loss         27      83
SVM (RBF)      Win         100      34         74.59       74.63         74.55
               Loss         28      82
DT (C&RT)      Win         106      28         77.87       76.36         79.10
               Loss         26      84

1The output variable is a numerical/integer variable (point difference). 2Differences were significant at p < 0.01.
daily sales totals for a grocery store. Often, time series are visualized using a line chart. Figure 3.17 shows an example time series of sales volumes for the years 2008 through 2012 on a quarterly basis.
Time-series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values. Time-series plots/charts look and feel very similar to simple linear regression in that, as was the case in simple linear regression, a time series involves two variables: the response variable and the time variable, presented in a scatter plot. Beyond this similarity in appearance, there is hardly any other commonality between the two. Whereas regression analysis is often employed in testing theories to see if current values of one or more explanatory variables explain (and hence predict) the response variable, time-series models are focused on extrapolating the time-varying behavior of the variable to estimate its future values.
Time-series forecasting assumes that all of the explanatory variables are aggregated into the response variable as time-variant behavior. Therefore, capturing that time-variant behavior is the way to predict future values of the response variable. To do that, the pattern is analyzed and decomposed into its main components: random variations, time trends, and seasonal cycles. The time-series example shown in Figure 3.17 illustrates all of these distinct patterns.
The techniques used to develop time-series forecasts range from the very simple (the naïve forecast, which suggests that today's forecast is the same as yesterday's actual) to the very complex, such as ARIMA (a method that combines autoregressive and moving average patterns in data). The most popular techniques are perhaps the averaging methods, which include the simple average, moving average, weighted moving average, and exponential smoothing. Many of these techniques also have advanced versions in which seasonality and trend can be taken into account for better and more accurate forecasting. The accuracy of a method is usually assessed by computing its error (the calculated deviation between actuals and forecasts for past observations) via the mean absolute error (MAE), mean squared error (MSE), or mean absolute percent error (MAPE). Even though they all use the same
(with a 10-fold cross-validation accuracy of 79.51 percent) and neural networks (with a 10-fold cross-validation accuracy of 75.00 percent). Using a t-test, the researchers found that these accuracy values were significantly different at the 0.05 alpha level; that is, the decision tree is a significantly better predictor in this domain than the neural network and the support vector machine, and the support vector machine is a significantly better predictor than the neural network.
The results of the study showed that the classification-type models predict the game outcomes better than the regression-based classification models. Even though these results are specific to the application domain and the data used in this study, and therefore should not be generalized beyond its scope, they are exciting because decision trees were not only the best predictors but also the easiest to understand and deploy, compared to the other two machine-learning techniques employed in this study. More details about this study can be found in Delen et al. (2012).
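The regression-based classification conversion and confusion-matrix tabulation described earlier in this case can be sketched as follows; the predicted and actual point differences are made-up values for illustration only:

```python
import numpy as np

def regression_to_classification(pred_scorediff, actual_scorediff):
    """Convert numeric ScoreDiff predictions into Win/Loss labels and
    tabulate a 2x2 confusion matrix (rows = actual, columns = predicted),
    mirroring the regression-based classification evaluation above."""
    pred = np.where(np.asarray(pred_scorediff) > 0, "Win", "Loss")
    true = np.where(np.asarray(actual_scorediff) > 0, "Win", "Loss")
    labels = ("Win", "Loss")
    return np.array([[np.sum((true == t) & (pred == p)) for p in labels]
                     for t in labels])

# Hypothetical predicted vs. actual point differences for eight games
cm = regression_to_classification(
    pred_scorediff=[7.5, -3.2, 0.8, -10.0, 2.1, -1.5, 12.0, 4.4],
    actual_scorediff=[10, -7, -3, -14, 6, 3, 9, -2])

tp, fn = cm[0]          # actual-Win row
fp, tn = cm[1]          # actual-Loss row
print("confusion matrix:\n", cm)
print("accuracy:",    round((tp + tn) / cm.sum(), 3))
print("sensitivity:", round(tp / (tp + fn), 3))
print("specificity:", round(tn / (tn + fp), 3))
```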
Questions for Case 3.4
1. What are the foreseeable challenges in predicting sporting event outcomes (e.g., college bowl games)?
2. How did the researchers formulate/design the prediction problem (i.e., what were the inputs and output, and what was the representation of a single sample—row of data)?
3. How successful were the prediction results? What else can they do to improve the accuracy?
Sources: D. Delen, D. Cogdell, and N. Kasap, "A Comparative Analysis of Data Mining Methods in Predicting NCAA Bowl Outcomes," International Journal of Forecasting, 28, 2012, pp. 543–552; K. M. Freeman and R. M. Brewer, "The Politics of American College Football," Journal of Applied Business and Economics, 18(2), 2016, pp. 97–101.
core error measure, these three assessment methods emphasize different aspects of the error, with some penalizing larger errors more than others.
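As an illustration of one of the averaging methods and the three error measures, here is a Python sketch of simple exponential smoothing with one-step-ahead forecasts on made-up quarterly sales; the smoothing constant alpha = 0.3 is an arbitrary choice for the example:

```python
import numpy as np

def exp_smooth_forecast(series, alpha=0.3):
    """One-step-ahead simple exponential smoothing forecasts.
    forecast[t] is made for period t using data through period t-1."""
    f = [series[0]]                       # seed with the first observation
    for t in range(1, len(series)):
        f.append(alpha * series[t - 1] + (1 - alpha) * f[-1])
    return np.array(f)

# Hypothetical quarterly sales figures
sales = np.array([112.0, 118.0, 130.0, 125.0, 140.0, 152.0, 148.0, 160.0])
fc = exp_smooth_forecast(sales, alpha=0.3)
err = sales[1:] - fc[1:]                  # skip the seeded first period

mae = np.mean(np.abs(err))                # mean absolute error
mse = np.mean(err ** 2)                   # mean squared error (penalizes big misses)
mape = np.mean(np.abs(err / sales[1:])) * 100  # mean absolute percent error
print(f"MAE={mae:.2f}  MSE={mse:.2f}  MAPE={mape:.2f}%")
```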