Assignment 5
Due Thursday October 20 at 11:59pm on Blackboard
As before, the questions without solutions are an assignment: you need to do these questions yourself and hand them in (instructions below).
The assignment is due on the date shown above. An assignment handed in after the deadline is late, and may or may not be accepted (see course outline). My solutions to the assignment questions will be available when everyone has handed in their assignment.
You are reminded that work handed in with your name on it must be entirely your own work.
Assignments are to be handed in on Blackboard. Instructions are at http://www.utsc.utoronto.ca/~butler/c32/blackboard-assgt-howto.html, in case you forgot since last week. Markers’ comments and grades will be available on Blackboard as well.
1. I said before that obtaining a ggplot normal quantile plot with a line is not automatic, but let’s explore how it might be done, since the ideas are not difficult ones.
(a) This question assesses the normality of a chi-squared distribution. (Don’t worry if you’ve never heard of the chi-squared distribution before. It’s the one that is used to obtain P-values for any of the various chi-squared tests. If you’ve done any of those you might have used a chi-squared table. If not, it’s fine.) R has a function rchisq that generates random values from this distribution. It takes two things as input: the number of random values to generate, and the number of degrees of freedom. A chi-squared distribution with large degrees of freedom is more normal-like, in the same way that a t distribution with large degrees of freedom is almost indistinguishable from a normal distribution.
Generate 50 random values from a chi-squared distribution with 5 degrees of freedom, and save them into a data frame (containing just that one column of values).
Solution: First, I need to make sure that my answer is reproducible (so I can talk about it), plus I need the “tidyverse” for later:
set.seed(457299)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
The obvious way is to do it in two steps:
z=rchisq(50,5)
df=data.frame(z)
summary(df)
## z
## Min. : 0.9193
## 1st Qu.: 2.6898
## Median : 4.9005
## Mean : 5.9352
## 3rd Qu.: 8.1580
## Max. :20.2781
You can also do it in one shot:
df=data.frame(z=rchisq(50,5))
summary(df)
## z
## Min. : 0.7468
## 1st Qu.: 3.0280
## Median : 4.3933
## Mean : 4.7460
## 3rd Qu.: 5.9517
## Max. :18.5339
The variable names are of no consequence, but it will be less confusing to avoid x and y because of what is coming up.
(A chi-squared distribution has mean equal to its degrees of freedom, so the sample mean should be close to 5. That looks OK here: 5.94 for the first sample and 4.75 for the second.)
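To see that the gap between the sample mean and 5 is just sampling variability, here is a small sketch that draws a much larger sample. The seed, the sample size of 100000 and the name big are my own choices for illustration; they are not part of the question.

```r
set.seed(1)              # arbitrary seed, only so that this sketch is reproducible
big = rchisq(100000, 5)  # many more values than the 50 the question asks for
mean(big)                # should be very close to 5, the degrees of freedom
```

With only 50 values, a sample mean some distance from 5 is nothing to worry about; with 100000 values there is much less room for it to wander.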
(b) On a normal quantile plot such as the one produced by qqnorm-qqline, the line goes through the first and third quartiles of the data (on the y-axis) and the first and third quartiles of a standard normal distribution (on the x-axis). Calculate these, calling them y and x respectively.
Solution: For the data:
y=quantile(df$z,c(0.25,0.75))
y
## 25% 75%
## 3.028047 5.951711
For the standard normal distribution, the function qnorm produces the values of z that have the given probabilities of being less than them (the “inverse CDF”, if you will, or “reading the table backwards”):
x=qnorm(c(0.25,0.75))
x
## [1] -0.6744898 0.6744898
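If the “inverse CDF” idea is unfamiliar, one way to check it is to feed the output of qnorm back into pnorm, which goes the other way (from a value of z to the probability of being less than it). This is only a sketch; the names p, q and back are mine.

```r
p = c(0.25, 0.75)
q = qnorm(p)     # values of z with probability 0.25 and 0.75 of being less than them
back = pnorm(q)  # "reading the table forwards": should recover 0.25 and 0.75
back
```

Getting the original probabilities back confirms that qnorm and pnorm undo each other.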
(c) Work out the slope and intercept of the straight line joining these two points. (The slope is rise over run; the intercept is whatever it has to be to make a line with the slope you just calculated pass through one of the points; it doesn’t matter which one.) See if you can find a lazy way of getting the slope and intercept, and justify why it works.
Solution: Let’s suppose the line has the formula y = a + bx. We first get b as rise over run:
b=(y[2]-y[1])/(x[2]-x[1])
b
## 75%
## 2.167315
then we stop and think a bit. Pick one of the points, say the first one (and use that for x and y); we just found b, so the only unknown thing is a. A tiny amount of algebra produces a = y − bx:
a=y[1]-b*x[1]
a
## 25%
## 4.489879
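As a check on the arithmetic, the line with this intercept and slope ought to pass through both points, not just the one used to find a. Here is a self-contained sketch of that check; the data quartiles are typed in from the output in part (b) rather than recomputed, and the name fitted is mine.

```r
x = qnorm(c(0.25, 0.75))           # quartiles of the standard normal
y = c(3.028047, 5.951711)          # quartiles of the data, copied from part (b)
b = (y[2] - y[1]) / (x[2] - x[1])  # slope: rise over run
a = y[1] - b * x[1]                # intercept, using the first point
fitted = a + b * x                 # heights of the line at the two x values
fitted - y                         # both differences should be zero, up to rounding
mean(y)                            # compare with a
```

The last line is there because the two normal quartiles are equal and opposite, so the intercept comes out as the midpoint of the two data quartiles.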
I said there was a lazy way. For that, note that the regression line through two points must actually go through those two points (so that the sum of squares of residuals can be and is zero).1 In R, lm does regression, or you could even do it by hand if you were determined enough:
yy=lm(y~x)
summary(yy)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals: