#This assignment is a tutorial so have a bit of fun with it
#If you would like to explore some additional options give it a try
#Goal is to provide some meaningful info to the restaurant owner
#Some notes below
#Both PCA and FA provide useful summaries of multivariate data, but
#every original variable is still needed to calculate them, so
#the big question is: can we use them to find a subset of variables that
#predicts the overall score?
#Also, trying to give meaningful labels to components is really hard.
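#When the variables are on different scales you need to work with the
#correlation matrix. For this assignment they are on the same scale, so
#we will work with the raw data (sklearn's PCA centers each variable but
#does not rescale it, so this is a covariance-based PCA).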
#PCA only helps if the original variables are correlated. If they
#are independent, PCA will not help.
#Approach takes two steps
#First step find the dimensionality of the data, that is the
#number of original variables to be retained
#Second step find which ones, more on this below
# import packages for this example
import pandas as pd
import numpy as np # arrays and math functions
import matplotlib.pyplot as plt # static plotting
from sklearn.decomposition import PCA, FactorAnalysis
import statsmodels.formula.api as smf # R-like model specification
#Set some display options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 120)
#Read in the restaurant dataset
food_df = pd.read_csv('C:/Users/Jahee Koo/Desktop/MSPA/2018_Winter_410_regression/HW03 PCA/FACTOR1.csv')
#A good step to take is to convert all variable names to lower case
food_df.columns = [s.lower() for s in food_df.columns]
print(food_df)
print('')
print('----- Summary of Input Data -----')
print('')
# show the object is a DataFrame
print('Object type: ', type(food_df))
# show number of observations in the DataFrame
print('Number of observations: ', len(food_df))
# show variable names
print('Variable names: ', food_df.columns)
# show descriptive statistics
print(food_df.describe())
# show a portion of the beginning of the DataFrame
print(food_df.head())
#look at correlation structure
cdata = food_df.loc[:,['overall','taste','temp','freshness','wait','clean','friend','location','parking','view']]
corr = cdata.corr()
print(corr)
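# A minimal sketch (not part of the original assignment code) of one way to
# read the correlation matrix: rank each predictor by the absolute value of
# its correlation with the response. The helper name rank_by_correlation is
# hypothetical, and treating 'overall' as the response is an assumption
# based on the notes at the top of this file.
import pandas as pd

def rank_by_correlation(df, response):
    """Return predictor-response correlations sorted by absolute size."""
    corr_with_response = df.corr()[response].drop(response)
    order = corr_with_response.abs().sort_values(ascending=False).index
    return corr_with_response.loc[order]

# Possible usage with the DataFrame built above:
# print(rank_by_correlation(cdata, 'overall'))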
#Use the correlation matrix to help provide advice to the restaurant owner
#Look at five different models and compare them
#Which model do you think is best and why?
#Model 1 full regression model
#Model 2 my reduced regression model with taste, wait and location
#Model 3 Full PCA model
#Model 4 Reduced PCA model with parking, taste and clean
#Model 5 FA model
#First find the PCA
#Second find the FA
#Run the models
#Compare the models and show VIF for each model
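# A minimal sketch of the VIF calculation mentioned above, using numpy only.
# It relies on the identity that, for a regression with an intercept, the
# VIFs of a set of predictors are the diagonal elements of the inverse of
# their correlation matrix. The helper name vif_table is hypothetical;
# statsmodels also offers variance_inflation_factor in
# statsmodels.stats.outliers_influence as an alternative.
import numpy as np

def vif_table(X, names=None):
    """Return {name: VIF} for the columns of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # predictor correlation matrix
    vifs = np.diag(np.linalg.inv(corr))   # VIF_j = [inverse(corr)]_jj
    if names is None:
        names = range(X.shape[1])
    return dict(zip(names, vifs))

# Possible usage once food_df has been read in (column choice is illustrative):
# cols = ['taste', 'wait', 'location']
# print(vif_table(food_df[cols], cols))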
#PCA
print('')
print('----- Principal Component Analysis -----')
print('')
pca_data = food_df.loc[:,['taste','temp','freshness','wait','clean','friend','location','parking','view']]
pca = PCA()
P = pca.fit(pca_data)
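# A minimal sketch of step one (finding the dimensionality): count how many
# leading components are needed to reach a target share of total variance.
# The helper name components_needed and the 0.80 default are assumptions
# for illustration, not values taken from the assignment.
import numpy as np

def components_needed(explained_variance_ratio, target=0.80):
    """Smallest number of leading components whose cumulative ratio >= target."""
    cumulative = np.cumsum(np.asarray(explained_variance_ratio, dtype=float))
    # searchsorted returns the first position where cumulative >= target
    k = int(np.searchsorted(cumulative, target, side='left')) + 1
    # guard against rounding leaving the final cumulative sum just below target
    return min(k, cumulative.size)

# Possible usage with the fitted object above:
# print(pca.explained_variance_ratio_)
# print(components_needed(pca.explained_variance_ratio_))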
print(pca_data)
np.set_printoptions(threshold=np.inf)
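# print the loadings (rows are components, columns are variables);
# in a script the rounded array must be printed or it is discarded
print(np.around(pca.components_, decimals=3))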
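# A minimal sketch of step two (finding which variables): one common
# heuristic, assumed here and not necessarily the rule the note below refers
# to, is to take, for each retained component in turn, the variable with the
# largest absolute loading that has not already been picked. The helper name
# pick_variables is hypothetical.
import numpy as np

def pick_variables(components, names, n_keep):
    """Pick one variable per leading component by largest absolute loading."""
    loadings = np.abs(np.asarray(components, dtype=float))
    chosen = []
    for row in loadings[:n_keep]:
        # walk the variables from largest to smallest absolute loading
        for idx in np.argsort(row)[::-1]:
            if names[idx] not in chosen:
                chosen.append(names[idx])
                break
    return chosen

# Possible usage with the fitted object above (n_keep=3 is illustrative):
# print(pick_variables(pca.components_, list(pca_data.columns), 3))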
#Note per Everett p209 pick the three variable