ECON 318 Homework 6
Due 11/08 in class
Instructor: Yu-Wei Hsieh
library(readxl)
library(stargazer)
library(ggplot2)
library(dplyr)
Note: (1) Homework should be submitted in PDF/Word format generated from RMarkdown. (2) Please include your answers, analysis, code, reasoning, and the key steps, for instance the tables/plots produced by R. Simply writing down the solution earns 0 points. (3) Plagiarism is not accepted. Any similar homework will get zero points.
Q1 [15pt]
Suppose you want to estimate the seasonal effect on the revenue. There is a constant term included in the regression as usual. How many dummies are needed to perform such analysis?
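Three dummies are needed for the four seasons. With a constant in the regression, one season must be left out as the base category; including all four dummies would make them perfectly collinear with the constant (the dummy variable trap).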
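A quick way to see this is to let R build the design matrix. The sketch below uses a made-up `season` factor (hypothetical, not part of the homework data); `model.matrix()` keeps the intercept and drops the first level as the base category.

```r
# Hypothetical seasonal factor; the level names are illustrative only.
season <- factor(c("spring", "summer", "fall", "winter"),
                 levels = c("spring", "summer", "fall", "winter"))

# With an intercept, R creates a dummy for every level except the first (the base).
X <- model.matrix(~ season)
colnames(X)

# Number of dummy columns, excluding the intercept
n_dummies <- ncol(X) - 1
n_dummies
```

Here spring is absorbed by the intercept, and the three remaining columns measure each season's difference from spring.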
Q2 [20pt]
Use the data in gpa2 and GPA2_description for this exercise.
Q1_data <- read_excel(path = "gpa2.xls", sheet = 1, col_names = FALSE)
## New names:
## * `` -> `..1`
## * `` -> `..2`
## * `` -> `..3`
## * `` -> `..4`
## * `` -> `..5`
## * … and 7 more
colnames(Q1_data) <- c("sat", "tothrs", "colgpa", "athlete", "verbmath", "hsize", "hsrank", "hsperc", "female", "white", "black", "hsizesq")
1. Using all observations, regress colgpa on hsperc and sat.
Q1_model1 <- lm(colgpa ~ hsperc + sat, data = Q1_data)
stargazer(Q1_model1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               colgpa           
## -----------------------------------------------
## hsperc                       -0.014***         
##                               (0.001)          
## 
## sat                          0.001***          
##                              (0.0001)          
## 
## Constant                     1.392***          
##                               (0.072)          
## 
## -----------------------------------------------
## Observations                   4,137           
## R2                             0.273           
## Adjusted R2                    0.273           
## Residual Std. Error      0.562 (df = 4134)     
## F Statistic          777.917*** (df = 2; 4134) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
summary(Q1_model1)
## 
## Call:
## lm(formula = colgpa ~ hsperc + sat, data = Q1_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6007 -0.3581  0.0329  0.3963  1.7599 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.392e+00  7.154e-02   19.45   <2e-16 ***
## hsperc      -1.352e-02  5.495e-04  -24.60   <2e-16 ***
## sat          1.476e-03  6.531e-05   22.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5615 on 4134 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2731 
## F-statistic: 777.9 on 2 and 4134 DF,  p-value: < 2.2e-16
2. Reestimate the model using only the first 2,070 observations.
Q1_model2 <- lm(colgpa ~ hsperc + sat, data = Q1_data[1:2070, ])
stargazer(Q1_model2, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               colgpa           
## -----------------------------------------------
## hsperc                       -0.013***         
##                               (0.001)          
## 
## sat                          0.001***          
##                              (0.0001)          
## 
## Constant                     1.436***          
##                               (0.098)          
## 
## -----------------------------------------------
## Observations                   2,070           
## R2                             0.283           
## Adjusted R2                    0.282           
## Residual Std. Error      0.539 (df = 2067)     
## F Statistic          407.392*** (df = 2; 2067) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
summary(Q1_model2)
## 
## Call:
## lm(formula = colgpa ~ hsperc + sat, data = Q1_data[1:2070, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.28027 -0.34910  0.04051  0.38046  1.69464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.436e+00  9.778e-02   14.69   <2e-16 ***
## hsperc      -1.275e-02  7.185e-04  -17.74   <2e-16 ***
## sat          1.468e-03  8.858e-05   16.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5395 on 2067 degrees of freedom
## Multiple R-squared:  0.2827, Adjusted R-squared:  0.282 
## F-statistic: 407.4 on 2 and 2067 DF,  p-value: < 2.2e-16
3. Find the ratio of the standard errors on hsperc from 1. and 2. What do you find? Why?
summary(Q1_model1)$coefficients[2, 2]/summary(Q1_model2)$coefficients[2, 2]
## [1] 0.7647155
The ratio is below 1: the standard error on hsperc from 1. is smaller because the sample size is larger (4,137 versus 2,070 observations).
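As a rough check, if the error variance and the spread of the regressors were the same in both samples, standard errors would shrink at about the rate 1/sqrt(n). The sketch below compares that benchmark with the ratio implied by the two `summary()` outputs above. The numbers are copied from the printed output, and the equal-variance benchmark is an assumption of mine, not part of the assignment.

```r
n_full <- 4137   # observations in model 1
n_half <- 2070   # observations in model 2

# Benchmark ratio se(full) / se(half) under the 1/sqrt(n) rule
benchmark <- sqrt(n_half / n_full)

# Ratio of the rounded standard errors on hsperc printed by summary()
observed <- 5.495e-04 / 7.185e-04

round(c(benchmark = benchmark, observed = observed), 3)
```

The observed ratio sits somewhat above the benchmark, which is consistent with the first 2,070 observations having a slightly smaller residual standard error (0.5395 versus 0.5615), so the halved sample loses less precision than the pure sample-size rule suggests.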
4. Add female, verbmath and their interaction terms into the regression using all observations.
Q1_model3 <- lm(colgpa ~ hsperc + sat + female + verbmath + female * verbmath, data = Q1_data)
stargazer(Q1_model3, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               colgpa           
## -----------------------------------------------
## hsperc                       -0.013***         
##                               (0.001)          
## 
## sat                          0.002***          
##                              (0.0001)          
## 
## female                         0.143           
##                               (0.106)          
## 
## verbmath                      -0.064           
##                               (0.084)          
## 
## female:verbmath                0.011           
##                               (0.118)          
## 
## Constant                     1.243***          
##                               (0.102)          
## 
## -----------------------------------------------
## Observations                   4,137           
## R2                             0.285           
## Adjusted R2                    0.285           
## Residual Std. Error      0.557 (df = 4131)     
## F Statistic          330.074*** (df = 5; 4131) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
It appears that being a female student, the ratio of verbal to math scores on the SAT, and the interaction between the two have no individually significant effect on GPA after fall semester. The combined SAT score and the high school percentile do have a significant impact.
Q3 [20pt]
Load package ggplot2 and type data(diamonds) to load the data set. The definition of table and depth can be found in the following picture.
[Figure: diamond.jpg, showing a diamond's table and depth]
1. A diamond’s quality can be measured by cut, ordered by Ideal, Premium, Very Good, Good, and Fair. Create one dummy to represent Ideal and Premium, and another to represent Very Good and Good.
dia <- diamonds
dia <- mutate(dia,
              D1 = as.numeric(cut == "Premium" | cut == "Ideal"),
              D2 = as.numeric(cut == "Very Good" | cut == "Good"))
2. Regress price on carat, depth, table, the two dummies, and all interaction terms between the dummies and the quantitative variables (carat, depth and table). Interpret your result.
Q3_model1 <- lm(price ~ carat + depth + table + D1 + D2 +
                  carat * D1 + carat * D2 +
                  depth * D1 + depth * D2 +
                  table * D1 + table * D2,
                data = dia)
stargazer(Q3_model1, type = "text")
## 
## ==================================================
##                          Dependent variable:      
##                     ------------------------------
##                                 price             
## --------------------------------------------------
## carat                       6,002.377***          
##                               (73.086)            
## 
## depth                        -86.946***           
##                               (12.186)            
## 
## table                          -2.707             
##                               (11.185)            
## 
## D1                            738.068             
##                             (1,434.572)           
## 
## D2                          5,020.093***          
##                             (1,426.486)           
## 
## carat:D1                    1,997.748***          
##                               (75.073)            
## 
## carat:D2                    1,854.563***          
##                               (77.410)            
## 
## depth:D1                     54.620***            
##                               (15.176)            
## 
## depth:D2                      -17.224             
##                               (14.397)            
## 
## table:D1                     -82.709***           
##                               (12.073)            
## 
## table:D2                     -80.461***           
##                               (12.434)            
## 
## Constant                    3,807.443***          
##                             (1,258.219)           
## 
## --------------------------------------------------
## Observations                   53,940             
## R2                             0.858              
## Adjusted R2                    0.858              
## Residual Std. Error    1,501.182 (df = 53928)     
## F Statistic        29,728.630*** (df = 11; 53928) 
## ==================================================
## Note:                  *p<0.1; **p<0.05; ***p<0.01
We see no significant effect on a diamond’s price from being D1 (Ideal or Premium), from the table size, or from the interaction of D2 with a diamond’s depth, once the other characteristics and interactions are accounted for. The value of a diamond decreases with the size of the table for both D1 and D2 diamonds. For D1 diamonds the depth interaction is positive, although overall value still decreases with depth. A diamond’s carat has the largest impact on value, and the carat slope is steeper for D1 and D2 diamonds than for Fair ones. The D2 dummy itself is also significant and positive relative to Fair diamonds.
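Because every quantitative variable is interacted with both dummies, the slope for a given cut group is the base coefficient plus the matching interaction coefficient. The sketch below does that arithmetic with the estimates copied from the table above; the row labels and object names are my own shorthand.

```r
# Coefficients copied from the stargazer table above
base <- c(carat = 6002.377, depth = -86.946, table = -2.707)   # Fair (base group)
int1 <- c(carat = 1997.748, depth = 54.620,  table = -82.709)  # interactions with D1
int2 <- c(carat = 1854.563, depth = -17.224, table = -80.461)  # interactions with D2

# Implied slope of each quantitative variable within each cut group
slopes <- rbind(Fair = base, D2 = base + int2, D1 = base + int1)
round(slopes, 2)
```

For example, one extra carat is associated with roughly 8,000 more in price for Ideal/Premium diamonds versus about 6,000 for Fair diamonds, while the depth slope for D1 is still negative (about -32) despite the positive interaction.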
3. Create a random sample of size 1000 from the diamonds data. Draw the scatterplot of carat vs log(price), color coded by cut.
Q3_sample <- sample_n(diamonds, 1000)
ggplot(Q3_sample, aes(x = carat, y = log(price), colour = cut)) +
  geom_point()
[Figure: scatterplot of carat vs. log(price), color coded by cut]
4. List the distinct categories of color. What is their ordering?
unique(Q3_sample$color)
## [1] H G J D F E I
## Levels: D < E < F < G < H < I < J
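The seven categories are D, E, F, G, H, I and J, ordered D < E < F < G < H < I < J as shown in the Levels line. Since `unique()` on a random sample lists only the values that happen to be drawn, in order of appearance, reading the levels off the full data is a more direct check. A minimal sketch, assuming ggplot2 is installed:

```r
library(ggplot2)  # provides the diamonds data set

# color is stored as an ordered factor; levels() returns the categories in order
color_levels <- levels(diamonds$color)
color_levels

# Confirms that the factor carries an ordering rather than plain labels
is.ordered(diamonds$color)
```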
Q4 [45pt] (Just Answer the Question; No R Command)
According to the past series of Bond films, the average number of people killed by Bond varies substantially across Bond actors, as shown in the following graph. In particular, Pierce Brosnan ranks #1 on this list.
To study whether the revenue of a film is affected by the number of people that Bond killed, we performed regression analysis on the available data set. The data are based on 23 past Bond films with all 6 Bond actors. For each film, we have information on the adjusted worldwide gross (in 1000 dollars), the average rating (on a 1-10 basis with 10 being the best), rating, film budget, the number of people Bond killed and others killed in each film, Bond actors and the year of the film.
To start with, we build the following model to see if the number of people that Bond killed in the film affects the worldwide gross:
log(gross) = β0 + β1 · Bond kills + β2 · Pierce + u
where log(gross) is the logarithm of the worldwide gross, Bond kills is the number of people that Bond killed in the film, and Pierce is a dummy variable indicating whether the Bond actor is Pierce. The following table shows the estimation results.
===============================================
                        Dependent variable:
                    ---------------------------
                            log(Gross)
-----------------------------------------------
Bond kills                    0.02**
                          (0.002, 0.04)

Pierce                       -0.53**
                         (-1.03, -0.04)

Constant                     13.05***
                         (12.79, 13.31)

-----------------------------------------------
Observations                    23
R2                             0.21
Adjusted R2                    0.14
Residual Std. Error       0.32 (df = 20)
F Statistic             2.73* (df = 2; 20)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
Answer Q1-Q4 using the regression results above.
1. Does the number of people Bond killed significantly affect the worldwide gross at 5% level? Interpret the estimated coefficient of Bond kills.
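Yes. The coefficient on Bond kills is significant at the 5% level (two stars, and the reported interval (0.002, 0.04) excludes zero). Holding the Pierce dummy fixed, one more person killed by Bond is associated with an increase in worldwide gross of about 2%.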
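2. Interpret the estimated coefficient of Pierce.
Holding Bond kills fixed, the log of worldwide gross is 0.53 lower for movies with Pierce than for movies without him. The usual approximation reads this as 53% lower; because the coefficient is large, the exact difference is exp(-0.53) - 1, or about 41% lower.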
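For a dummy variable in a log-level model, reading the coefficient directly as a percentage is only an approximation, and it is a poor one for a value as large as 0.53 in absolute terms. A minimal sketch of the exact conversion, using the coefficient from the table above (the object names are mine):

```r
b_pierce <- -0.53  # coefficient on Pierce from the table above

# Exact percentage change in gross implied by a dummy in a log-level model
pct_change <- 100 * (exp(b_pierce) - 1)
round(pct_change, 1)
```

The result is a drop of roughly 41%, noticeably smaller in magnitude than the 53% suggested by the approximation.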
3. Is the regression overall significant at 5% level?
The regression overall is not significant at the 5% level. The F statistic of 2.73 carries only one star, so its p-value lies between 5% and 10%.
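The p-value behind that statement can be recovered from the reported F statistic and its degrees of freedom. The sketch below uses base R's `pf()` and `qf()` with the numbers from the table; the object names are mine.

```r
f_stat <- 2.73  # F statistic from the table
df1 <- 2        # number of slope coefficients tested
df2 <- 20       # residual degrees of freedom

# Upper-tail probability of the F(2, 20) distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# 5% critical value for comparison
crit_5 <- qf(0.95, df1, df2)

round(c(p_value = p_value, crit_5 = crit_5), 3)
```

The F statistic falls short of the 5% critical value, so the null that both slopes are zero is rejected at the 10% level but not at the 5% level.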
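4. What does the adjusted R^2 measure?
It measures how much of the variation in the log of gross revenue is explained by the number of people Bond killed and by Pierce being the Bond actor, adjusted for the number of regressors. Unlike R^2, adjusted R^2 penalizes additional variables, so it can fall when a new regressor adds little explanatory power.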
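The penalty can be made concrete with the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of regressors excluding the constant. The sketch below plugs in the rounded values from the table, so the result can differ from the reported 0.14 in the second decimal.

```r
r2 <- 0.21  # R2 from the table (rounded to two decimals)
n  <- 23    # observations
k  <- 2     # regressors excluding the constant

# Adjusted R2 shrinks R2 by the ratio of total to residual degrees of freedom
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 3)
```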
Suppose you believe that the decade of 1990’s is the booming age for Bond films, so you include a time dummy variable decade90 into the model.
## 
## ========================================================
##                             Dependent variable:         
##                     ------------------------------------
##                                  log(Gross)             
##                            (1)                (2)       
## --------------------------------------------------------
## `Bond kills`              0.02**            0.02**      
##                       (0.002, 0.04)      (0.002, 0.04)  
## 
## Pierce                   -0.53**             -0.42      
##                       (-1.03, -0.04)     (-1.15, 0.30)  
## 
## decade90                                     -0.16      
##                                          (-0.89, 0.58)  
## 
## Constant                 13.05***          13.05***     
##                       (12.79, 13.31)    (12.78, 13.31)  
## 
## --------------------------------------------------------
## Observations                23                23        
## R2                         0.21              0.22       
## Adjusted R2                0.14              0.10       
## Residual Std. Error   0.32 (df = 20)    0.32 (df = 19)  
## F Statistic         2.73* (df = 2; 20) 1.80 (df = 3; 19)
## ========================================================
## Note:                        *p<0.1; **p<0.05; ***p<0.01
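5. From the above results of model 2, why do you think the dummy Pierce becomes insignificant?
The estimate for decade90 is negative, suggesting that revenue in the 1990s was actually relatively low, and Pierce happened to be the Bond actor in the 1990s. The two dummies are therefore highly correlated and carry much of the same information. The effect previously attributed to Pierce is now split between ‘Pierce’ and ‘decade90’, the interval for Pierce widens from (-1.03, -0.04) to (-1.15, 0.30), and neither dummy is significant any more.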
Now suppose that you run a new model with the interaction term Bond Kills:Pierce, which equals the product of the dummy Pierce and the variable Bond kills.
## 
## ==========================================================================
##                                      Dependent variable:                  
##                     ------------------------------------------------------
##                                           log(Gross)                      
##                            (1)                (2)               (3)       
## --------------------------------------------------------------------------
## `Bond kills`              0.02**            0.02**            0.02**      
##                       (0.002, 0.04)      (0.002, 0.04)     (0.004, 0.04)  
## 
## Pierce                   -0.53**             -0.42             0.05       
##                       (-1.03, -0.04)     (-1.15, 0.30)     (-1.37, 1.47)  
## 
## decade90                                     -0.16                        
##                                          (-0.89, 0.58)                    
## 
## `Bond kills`:Pierce                                            -0.02      
##                                                            (-0.06, 0.02)  
## 
## Constant                 13.05***          13.05***          13.01***     
##                       (12.79, 13.31)    (12.78, 13.31)    (12.72, 13.29)  
## 
## --------------------------------------------------------------------------
## Observations                23                23                23        
## R2                         0.21              0.22              0.24       
## Adjusted R2                0.14              0.10              0.12       
## Residual Std. Error   0.32 (df = 20)    0.32 (df = 19)    0.32 (df = 19)  
## F Statistic         2.73* (df = 2; 20) 1.80 (df = 3; 19) 2.04 (df = 3; 19)
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01
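6. Interpret the estimated coefficient of Bond Kills:Pierce. Comparing model 1 and 3, do you think it is a good idea to include the interaction term? Why?
The effect of one more person killed by Bond on gross revenue is about 2 percentage points lower in movies with Pierce than in movies without him. In other movies an extra kill is associated with about 2% more revenue, while in movies with Pierce the implied effect is roughly 0.02 - 0.02 = 0. The interaction is not statistically significant, since its interval (-0.06, 0.02) includes zero.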
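I think it’s a good idea to include the interaction term, because model 3 suggests that Pierce by himself is not associated with lower revenue: his coefficient becomes 0.05 and insignificant. Rather, audiences seem to have disliked seeing Pierce kill more people in the movie, so model 1 is misleading. The evidence is weak, however, since the interaction is insignificant and adjusted R^2 falls from 0.14 in model 1 to 0.12 in model 3.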
Now we turn to study the effect of Bond kills on the average rating, which ranges from 1 to 10. Considering that each of the actors may appeal to specific group or specific generation of audiences, since each of them may represent different time and style, we include several dummy variables in the model for each actor. Moreover, we believe that not only does Bond kills matter, the number of people killed by others (for instance, the supporting actors) also matters.
## 
## ======================================================
##                                Dependent variable:    
##                            ---------------------------
##                                      Rating           
## ------------------------------------------------------
## `Bond kills`                         0.05**           
##                                      (0.02)           
## 
## `Others kills`                      -0.01**           
##                                      (0.004)          
## 
## `Bond actor`George Lazenby            0.16            
##                                      (0.66)           
## 
## `Bond actor`Pierce Brosnan          -1.82***          
##                                      (0.48)           
## 
## `Bond actor`Roger Moore              -0.76*           
##                                      (0.40)           
## 
## `Bond actor`Sean Connery              0.53            
##                                      (0.45)           
## 
## `Bond actor`Timothy Dalton           -0.69            
##                                      (0.48)           
## 
## Constant                             6.65***          
##                                      (0.42)           
## 
## ------------------------------------------------------
## Observations                           23             
## R2                                    0.68            
## Adjusted R2                           0.53            
## Residual Std. Error              0.51 (df = 15)       
## F Statistic                   4.48*** (df = 7; 15)    
## ======================================================
## Note:                      *p<0.1; **p<0.05; ***p<0.01
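7. Interpret the estimated coefficients of Bond kills and Others kills, and compare the two.
Holding the actor and the other kill count fixed, each additional person killed by Bond is associated with a rating about 0.05 points higher, while each additional person killed by others is associated with a rating about 0.01 points lower. Both coefficients are significant at the 5% level. The two effects have opposite signs, and the Bond kills effect is about five times larger in magnitude.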
8. Which actor is the base category?
Daniel Craig, the only one of the 6 Bond actors without a dummy in the table.
9. According to the estimates (ignoring significance at this moment), who was the best and who was the worst at boosting the ratings among the 6 Bond actors?
Sean Connery was the best (0.53 above Daniel Craig) and Pierce Brosnan was the worst (1.82 below Daniel Craig).
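The ranking follows from sorting the actor coefficients, with the base category entered as 0 because every other coefficient is measured relative to it. A small sketch using the estimates from the rating table (the vector name is mine):

```r
# Actor coefficients from the rating regression; Daniel Craig is the omitted base (0)
actor_effect <- c(
  "Daniel Craig"   = 0,
  "George Lazenby" = 0.16,
  "Pierce Brosnan" = -1.82,
  "Roger Moore"    = -0.76,
  "Sean Connery"   = 0.53,
  "Timothy Dalton" = -0.69
)

# Rank actors from highest to lowest estimated rating effect
ranking <- sort(actor_effect, decreasing = TRUE)
ranking
```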