Which of these events is not part of a decathlon

25/10/2021 Client: muhammad11 Deadline: 2 Day

R Language Assignment

---
title: "A study of the Decathlon dataset"
author: "A Student"
email: "StudentA@cardiff.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
fontsize: 11pt
fontfamily: times
geometry: margin=1in
output:
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    keep_tex: true
    citation_package: natbib
    fig_caption: true
    #toc_depth: 1
    highlight: haddock
    df_print: kable
    extra_dependencies:
      caption: ["labelfont={bf}"]
  #pdf_document:
  #  extra_dependencies: ["flafter"]
  #pdf_document2:
  #  extra_dependencies: ["float"]
bibliography: [refs.bib]
biblio-style: apalike
link-citations: yes
abstract: We demonstrate how various descriptive and inferential statistical methods can be applied to the `Decathlon` dataset, and how their results might be interpreted and presented. We study the evolution of athlete performance as a function of time, and show that while the best performances appear to increase with time, the mean and median performances appear to decrease over the same period. We illustrate the non-homogeneity of the current decathlon scoring scheme, and give some insight into the profile of the best-performing decathletes. We also perform a correlation analysis to explore relationships between the different events of the decathlon, and finally present the results of a logistic-regression analysis and demonstrate that to a certain extent it is possible to distinguish between French and German decathletes by the scores they achieve on a subset of the decathlon events.
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include=FALSE}
library(corrplot)
library(Hmisc)
library(car)
library(ppcor)
library(ggpubr)
library(MASS)
library(pROC)
# read data
Decathlon <- read.csv('decathlon.csv', header=TRUE)
Nobs <- nrow(Decathlon) # number of entries
```

# Introduction {#sec:intro}

The `Decathlon` data set records the performances of elite decathletes in international competitions over the period from 1985 to 2006.
The decathlon is a combined event in athletics consisting of ten track-and-field events; the current decathlon world-record holder is the French decathlete Kevin Mayer, who achieved a score of $9{,}126$ points in 2018. One may refer to the following Wikipedia page for more information on the decathlon: [https://en.wikipedia.org/wiki/Decathlon](https://en.wikipedia.org/wiki/Decathlon).

The data set consists of $7{,}968$ observations of $24$ variables. The names of the variables are listed hereafter:

```{r, echo = FALSE}
noquote('Variable names:')
print(names(Decathlon))
```

An entry of the data set consists of the total number of points scored by a decathlete, the name and nationality of the decathlete, and the year the performance was achieved. The raw performances for each of the 10 events are reported (with time in seconds and distance or height in meters), together with the number of points scored for these events. There are $2{,}709$ different decathletes of $107$ different nationalities in the data set.

**Remark**. For $435$ entries of the data set, we observed a difference of $1$ between the variable `Totalpoints`, corresponding to the total number of points scored by a decathlete for that performance, and the sum of the points scored during the $10$ events. We decided to apply a correction to the corresponding entries of the variable `Totalpoints`, so that they are all equal to the sum of the points scored during the $10$ events.

```{r}
# Correction of `Totalpoints`: columns 15 to 24 hold the points for the 10 events
Decathlon$Totalpoints <- rowSums(Decathlon[,15:24])
```

In our study we pay particular attention to the total number of points, the points scored during each event, the nationality of the decathletes, and the year the performances were achieved.
The Pearson correlations between the total number of points and the points scored during the various events are illustrated by the correlogram shown in Figure \@ref(fig:correlo-pts-evt), while scatter plots and histograms showing the points scored during the three throw events (shot-put, discus and javelin throw) are presented in Figure \@ref(fig:pairs-throw-evt) (see appendix).

```{r correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the Pearson correlations between the total number of points and the points scored during each event.", fig.align = "center"}
CorrMat <- cor(Decathlon[c(1,15:24)])
corrplot(CorrMat, method="circle")
```

# Performance across years {#sec:perf-year}

In this section, we investigate the evolution of the overall performance (variable `Totalpoints`) as a function of the year these performances were achieved (variable `yearEvent`). The number of observations for each year (or season) varies between $321$ and $399$. A graphical representation of the evolution of the best, mean and median performance as a function of the year is shown in Figure \@ref(fig:Evo-Perf-Year).

```{r Evo-Perf-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Evolution of the best, mean and median performance as a function of the year.
In each case, the regression line (dashed black line) is shown together with the corresponding $95\\%$ confidence intervals (purple dashed curves; prediction without noise term).", fig.align = "center"}
Ind_Best_Year <- rep(0,22)
Mean_Perf_Year <- rep(0,22)
Median_Perf_Year <- rep(0,22)
# Best, mean and median total score for each season from 1985 to 2006
for (i in 0:21){
  Ind_Best_Year[i+1] <- which.max(Decathlon$Totalpoints * as.numeric(Decathlon$yearEvent == (1985+i)))
  Mean_Perf_Year[i+1] <- mean(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
  Median_Perf_Year[i+1] <- median(Decathlon$Totalpoints[Decathlon$yearEvent == (1985+i)])
}
Top_Perf_Year <- Decathlon$Totalpoints[Ind_Best_Year]
Perf_Year <- data.frame(Year=1985:2006, Top=Top_Perf_Year, Mean=Mean_Perf_Year, Median=Median_Perf_Year)
# Simple linear regressions of the season summaries on the year
mod1 <- lm(Top ~ Year, data=Perf_Year)
mod2 <- lm(Mean ~ Year, data=Perf_Year)
mod3 <- lm(Median ~ Year, data=Perf_Year)
#summary(mod3)
par(mfrow=c(1,2))
xcoord <- seq(1985, 2006, length.out=101)
newData <- data.frame(Year=xcoord)
#pred.w.plim <- predict(mod1, newData,
#                       interval = "prediction", level=0.95)
pred.w.clim <- predict(mod1, newData, interval = "confidence", level=0.95)
matplot(xcoord,
        #cbind(pred.w.clim, pred.w.plim[,-1]),
        cbind(pred.w.clim),
        lty = c(2,2,2), lwd=3,
        #col = c('black', 'green3', 'green3', 'purple', 'purple'),
        col = c('black', 'purple', 'purple'),
        type = "l", xlab = 'Year', ylab = 'Total points',
        ylim = c(min(Top_Perf_Year),max(Top_Perf_Year))
        #main='prediction for Male'
        )
lines(1985:2006, Top_Perf_Year, type='o', pch=19, col='red', cex=1, lwd=2)
legend('topleft', c('Best'), col=c('red'), lty=1, lwd=2, pch=19, pt.cex=1)
#################################
#pred.w.plim <- predict(mod2, newData,
#                       interval = "prediction", level=0.95)
pred.w.clim <- predict(mod2, newData, interval = "confidence", level=0.95)
matplot(xcoord,
        #cbind(pred.w.clim, pred.w.plim[,-1]),
        cbind(pred.w.clim),
        lty = c(2,2,2), lwd=3,
        #col = c('black', 'green3', 'green3', 'purple', 'purple'),
        col = c('black', 'purple', 'purple'),
        type = "l", xlab =
        'Year', ylab = 'Total points',
        ylim = c(min(Median_Perf_Year), max(Mean_Perf_Year))
        #main='prediction for Male'
        )
lines(1985:2006, Mean_Perf_Year, type='o', pch=19, col='chartreuse3', cex=1, lwd=2)
#pred.w.plim <- predict(mod3, newData,
#                       interval = "prediction", level=0.95)
pred.w.clim <- predict(mod3, newData, interval = "confidence", level=0.95)
matlines(xcoord,
         #cbind(pred.w.clim, pred.w.plim[,-1]),
         cbind(pred.w.clim),
         lty = c(2,2,2), lwd=3,
         #col = c('black', 'green3', 'green3', 'purple', 'purple'),
         col = c('black', 'purple', 'purple'),
         type = "l", xlab = 'Year', ylab = 'Total points',
         ylim = c(min(Median_Perf_Year),max(Mean_Perf_Year))
         #main='prediction for Male'
         )
lines(1985:2006, Median_Perf_Year, type='o', pch=19, col='orange', cex=1, lwd=2)
legend('topright', c('Mean', 'Median'), col=c('chartreuse3', 'orange'), lty=1, lwd=2, pch=19, pt.cex=1)
```

## Evolution of performances {#sec:evo-perf}

Figure \@ref(fig:Evo-Perf-Year) suggests that the best season performance increases with time, while the mean and median season performances appear to decrease with time. This observation is confirmed by one-sided tests of *Spearman's rank correlation* between the year and the best, mean and median season performances. The results are shown in Table \@ref(tab:Evo-Perf-Corr-Test), where we observe

* for the best season performance a positive Spearman correlation of $0.398$, statistically significant at the significance level $\alpha=0.05$;
* for the mean season performance a negative Spearman correlation of $-0.551$, statistically significant at the significance level $\alpha=0.01$;
* for the median season performance a negative Spearman correlation of $-0.581$, statistically significant at the significance level $\alpha=0.01$.
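
As a reminder of what these estimates measure, Spearman's rank correlation is the Pearson correlation computed on ranks rather than on raw values. The following is only a sketch, with notation introduced here for illustration ($n$, $d_i$ and $r_s$ are not used elsewhere in the report): with $n = 22$ seasons and no tied values, it reduces to

$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\,(n^2-1)},
$$

where $d_i$ is the difference between the rank of season $i$ in time and the rank of its best, mean or median score. When ties are present, the rank-based Pearson formula applies instead of this shortcut.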
```{r Evo-Perf-Corr-Test, echo=FALSE}
# One-sided Spearman tests: increasing trend for the best score, decreasing for mean and median
CorrT1 <- cor.test(1985:2006, Top_Perf_Year, method='spearman', alternative='greater')
CorrT2 <- cor.test(1985:2006, Mean_Perf_Year, method='spearman', alternative='less')
CorrT3 <- cor.test(1985:2006, Median_Perf_Year, method='spearman', alternative='less', exact=FALSE)
Evo_Perf_Test_res <- data.frame(performance=c('Best', 'Mean', 'Median'),
                                alternative=c(CorrT1$alternative,CorrT2$alternative,CorrT3$alternative),
                                p.value=c(CorrT1$p.value,CorrT2$p.value,CorrT3$p.value),
                                estimate=c(CorrT1$estimate,CorrT2$estimate,CorrT3$estimate))
knitr::kable(
  Evo_Perf_Test_res,
  caption = 'Results of Spearman\'s rank correlation tests between the year and the season\'s best, mean and median performances.',
  align = 'cccc', booktabs = TRUE)%>%kable_styling(latex_options = "HOLD_position")
```

## Comparison of mean season performances {#sec:comp-season-perf}

Figure \@ref(fig:CI-Var-Mean-Year) gives an overview of the sample mean and sample variance of the performances achieved each year as represented by the `Totalpoints` variable; the corresponding confidence intervals (at the confidence level $70\%$) are also presented. We observe that 1988 has the largest sample mean and the second-largest sample variance among all years, while 1996 has the second-largest sample mean and the largest sample variance. Interestingly, 1988 and 1996 were both Olympic years, with the 1988 Games now infamous for many proven doping cases. We also observe a relatively strong overlap between the confidence intervals corresponding to these two years. The number of observations for each year is relatively large (at least $321$ observations), so by the central limit theorem the normal approximations used to compute these intervals are likely to be reasonably good.
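
The intervals shown in Figure \@ref(fig:CI-Var-Mean-Year) appear to be the classical normal-theory ones: the code that produces the figure takes the mean intervals from `t.test` and builds the variance intervals from chi-square quantiles. As a sketch, with notation introduced here for illustration, write $n$ for the number of observations in a season, $\bar{x}$ and $s^2$ for the sample mean and sample variance, and $\alpha = 0.3$ for the $70\%$ confidence level. The two intervals then read

$$
\bar{x} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}},
\qquad
\left[\frac{(n-1)\,s^2}{\chi^2_{n-1,\,1-\alpha/2}},\;\; \frac{(n-1)\,s^2}{\chi^2_{n-1,\,\alpha/2}}\right],
$$

where $t_{n-1,\,p}$ and $\chi^2_{n-1,\,p}$ denote the $p$-quantiles of the Student and chi-square distributions with $n-1$ degrees of freedom.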
```{r CI-Var-Mean-Year, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Sample mean and sample variance for the performances achieved each season, and corresponding confidence intervals at the confidence level $70\\%$. The horizontal intervals correspond to the means (in orange), and the vertical ones to the variances (in green).", fig.align = "center"}
# Chi-square confidence interval for a variance at level 1 - alpha
VARxCIxComp<-function(Obs, alpha){
  Nsamp<-length(Obs)
  sss2<- var(Obs)
  LowBound<- (Nsamp-1) * sss2 / qchisq(1-alpha/2, df=Nsamp-1)
  UpBound <- (Nsamp-1) * sss2 / qchisq(alpha/2, df=Nsamp-1)
  return(c(LowBound, UpBound))
}
alpha <- 0.3
Mean_TotPts_Year <- rep(0,22)
Var_TotPts_Year <- rep(0,22)
CI_mean_TotPts_Year <- matrix(0, 22, 2)
CI_var_TotPts_Year <- matrix(0, 22, 2)
Rec_numb_obs <- rep(0,22)
for (i in 0:21){
  obs <- Decathlon$Totalpoints[Decathlon$yearEvent == (1985 + i)]
  CI_mean <- t.test(obs, conf.level=1-alpha)$conf.int
  CI_mean_TotPts_Year[i+1,] <- CI_mean
  Mean_TotPts_Year[i+1] <- mean(obs)
  CI_var_TotPts_Year[i+1,] <- VARxCIxComp(obs, alpha)
  Var_TotPts_Year[i+1] <- var(obs)
}
# Empty canvas, then one horizontal (mean) and one vertical (variance) interval per season
plot(c(),c(), type='p', pch=3, cex=0.5, lwd=2, col='blue',
     xlab='Mean', ylab='Variance',
     xlim=c(7275, 7460), ylim=c(130000,215000))
for (i in 1:22){
  lines(CI_mean_TotPts_Year[i,], c(1,1)*Var_TotPts_Year[i], lty=1, lwd=3, col='orange')
  lines(c(1,1)*Mean_TotPts_Year[i], CI_var_TotPts_Year[i,], lty=1, lwd=3, col='chartreuse3')
}
# Label each point with the two-digit year
for (i in 1:22){
  text(Mean_TotPts_Year[i]+3, Var_TotPts_Year[i], sprintf('%02d',(1984+i) %% 100), pos=3, cex = 1.1)
}
points(Mean_TotPts_Year, Var_TotPts_Year, type='p', pch=3, cex=0.5, lwd=2, col='blue')
```

Three different tests for homogeneity of variances return p-values between $0.052$ and $0.102$, as shown in Table \@ref(tab:TotPts-Year-Homosced).
These results indicate that there is no strong statistical evidence to suggest that the variance of athlete performances differs across years: for each of the three tests we do not reject the null hypothesis of homoscedasticity at significance level $\alpha=0.05$.

```{r TotPts-Year-Homosced, echo=FALSE}
Decathlon$yearEvent_AsFactor <- factor(Decathlon$yearEvent)
Thmscd1 <- bartlett.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd2 <- fligner.test(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Thmscd3 <- leveneTest(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
Test_Tot_Year_Homosced <- data.frame(Test=c('Bartlett', 'Fligner-Killeen', 'Levene'),
                                     p.value=c(Thmscd1$p.value, Thmscd2$p.value, Thmscd3[1,3]))
knitr::kable(
  Test_Tot_Year_Homosced,
  caption = 'Tests for homogeneity of variance for the performances (i.e. scores achieved by the decathletes) achieved each year.',
  align = 'cc', booktabs = TRUE)%>%kable_styling(latex_options = "HOLD_position")
```

Next we apply a one-way ANOVA to see whether there is a significant difference between the mean performance achieved each year, and find that it returns a p-value smaller than $0.0002$ (see below), which suggests that there is a statistically significant difference between the mean performance for at least two years. Notice that the number of observations for each year is roughly the same, so the data are approximately balanced. Tests indicate that the data are not normally distributed; however, the number of observations in each year group is relatively large (at least $321$), so by the central limit theorem the conclusion of this ANOVA is likely to be valid. We can also report that an ANOVA performed with the `oneway.test` function (which does not assume equality of variances) returns a p-value of similar magnitude.
```{r}
AOV1 <- aov(Totalpoints ~ yearEvent_AsFactor, data=Decathlon)
print(summary(AOV1))
```

## Comparison between 1988 and 1996 {#sec:1988-1996}

In Figure \@ref(fig:Evo-Perf-Year) we observe that the sample mean of the overall performances for 1988 is the largest among all years. (A similar observation holds for the sample median, while 1988 interestingly also has the lowest season's best performance.) We now test to see whether there is a statistically significant difference between the mean performances for $1988$ and $1996$, the year having the second-largest sample mean of the overall performance.

```{r, echo=FALSE}
PvalFt <- (var.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
                    Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
                    alternative = "two.sided", conf.level=0.95))$p.value
PvalTt <- (t.test(Decathlon$Totalpoints[Decathlon$yearEvent == 1988],
                  Decathlon$Totalpoints[Decathlon$yearEvent == 1996],
                  alternative = "two.sided", conf.level=0.95))$p.value
noquote( sprintf('F-test (comparison of variances); p-value=%.4f', PvalFt) )
noquote( sprintf('t-test (comparison of means); p-value=%.4f', PvalTt) )
```

The $F$-test for equality of variances indicates that there is no statistical evidence to suggest a difference between the variances of the two samples. The two-sample $t$-test then indicates that there is no statistical evidence to suggest a difference between the means of the two samples. Note that although the two samples might not be normally distributed, the relatively large sample sizes ensure that the normal approximation will work well here.

# Differences between events {#sec:diff-evts}

Figure \@ref(fig:Diff-Pts-Evt) illustrates the sample mean, sample median and sample variance of the number of points scored in each of the $10$ decathlon events, where we observe that the scheme for awarding points does not appear to be homogeneous.
For example, decathletes on average seem to score more points for the 110m hurdles than for the javelin event, and the variance of the points scored for the pole vault appears to be significantly larger than the variance of the points scored for the 100m.

```{r Diff-Pts-Evt, echo = FALSE, fig.height = 3.5, fig.width = 9, fig.cap = "Sample mean, median and variance of the number of points scored during each event. Values for the mean and the median are read on the left-hand axis, and values for the variance on the right-hand axis.", fig.align = "center"}
EvtNames <- colnames(Decathlon)[15:24]
VarEvt <- apply(Decathlon[,15:24], 2, var)
MedianEvt <- apply(Decathlon[,15:24], 2, median)
MeanEvt <- apply(Decathlon[,15:24], 2, mean)
par(mar=c(3,3,1,3))
CEX1 <- 2
CEX2 <- 1.8
CEX3 <- 1.2
plot(1:10, MedianEvt, type='o', pch=17, lwd=3, cex=CEX2, col='orange',
     xlab='', ylab='', xaxt='n')
lines(1:10, MeanEvt, type='o', lty=2, pch=18, lwd=3, cex=CEX1, col='chartreuse3')
axis(1, at=seq(1,10,by=1), labels=EvtNames, las=0)
legend('topright', c('Mean', 'Median', 'Variance'),
       col=c('chartreuse3', 'orange', 'firebrick2'),
       lty=c(2,1,1), lwd=3, pch=c(18,17,19),
       pt.cex=c(CEX1,CEX2, CEX3), y.intersp=1.3)
# Allow a second plot on the same graph (variance, read on the right-hand axis)
par(xpd=FALSE)
par(new=TRUE)
colNsupp<-'blue'
plot(1:10, VarEvt, type='o', pch=19, lwd=3, cex=CEX3, col='firebrick2',
     xlab='', ylab='', xaxt='n', yaxt='n')
axis(4, at=seq(4000,10000,by=2000), labels=seq(4000,10000,by=2000),
     col='black', col.axis='black', las=0)
```

## Difference between the median number of points scored during each event {#sec:diff-med-evt}

To test the statistical significance of the apparent non-homogeneity of the decathlon scoring scheme suggested by Figure \@ref(fig:Diff-Pts-Evt), we perform a Kruskal-Wallis rank sum test to see whether there is a significant difference between the median number of points scored during each event (see below).
The obtained p-value is extremely small, so there is strong statistical evidence to indicate that the median number of points differs for at least two events.

```{r, echo=FALSE}
PtsEvtNames <- colnames(Decathlon)[15:24]
Event <- c()
Pts_Event <- c()
# Stack the 10 event-point columns into one long vector with a matching event label
for (i in 1:10){
  Event <- c(Event, rep(PtsEvtNames[i], Nobs))
  Pts_Event <- c(Pts_Event, unname(Decathlon[,14+i]))
}
Event <- factor(Event)
kruskal.test(Pts_Event ~ Event)
```

Next we perform a Wilcoxon signed-rank test (with continuity correction) to test whether the number of points scored for the pole-vault event is larger than for the javelin-throw event. We use the version of the test for *paired* observations, because the same decathlete achieved both scores. The results of the test are summarised below.

```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(20,23)])
noquote('Difference between the medians of Ppv and Pjt:')
Wil_Pv_Jt <- wilcox.test(Decathlon[,20], Decathlon[,23],
                         alternative='greater', paired=TRUE,
                         exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_Pv_Jt$p.value))
```

The obtained p-value is extremely small, so there is strong statistical evidence to suggest that the median number of points scored during the pole-vault event is larger than for the javelin-throw event.
To a lesser extent, we also observe a statistically significant difference between the median number of points scored for the 100m and long-jump events, even though these two look relatively close in Figure \@ref(fig:Diff-Pts-Evt):

```{r, echo=FALSE}
#noquote(colnames(Decathlon)[c(15,16)])
noquote('Difference between the medians of P100m and Plj:')
Wil_100_Lj <- wilcox.test(Decathlon[,15], Decathlon[,16],
                          alternative='two.sided', paired=TRUE,
                          exact=FALSE, correct=TRUE)
noquote(sprintf('wilcox.test p-value=%.4f', Wil_100_Lj$p.value))
```

## Profile of the season best performers {#sec:profil-best}

To identify which events appear to be the most decisive in determining the overall winner, for each year we compute the in-season ranking for each event of the decathlete with the best overall performance, illustrated in Figure \@ref(fig:in-season-rank-top).

```{r in-season-rank-top, echo = FALSE, fig.height = 5, fig.width = 9, fig.cap = "Boxplots of the in-season ranks achieved by the season best-performers for each event of the decathlon.", fig.align = "center"}
Year_Perf_rank <- matrix(0,22,10)
# NOTE: row 1 of each season subset is taken as that season's best performer,
# which assumes the rows are sorted by decreasing total score.
for (i in 0:21){
  Season_Rank <- (apply(-Decathlon[(Decathlon$yearEvent == (1985+i)),15:24], 2, rank))[1,]
  Year_Perf_rank[(i+1),] <- Season_Rank
}
colnames(Year_Perf_rank) <- colnames(Decathlon)[15:24]
rownames(Year_Perf_rank) <- 1985:2006
options(repr.plot.width=10, repr.plot.height=7)
boxplot(Year_Perf_rank, col='mediumorchid3', ylab='In-season ranking')
```

Although only descriptive, the rank-based analysis associated with Figure \@ref(fig:in-season-rank-top) seems to highlight some interesting features:

* The event appearing as the least decisive is the 1500m (in view of Figure \@ref(fig:in-season-rank-top)). This can in our opinion be at least partially explained by the following reasons.
  The 1500m is the last event of the $10$, occurring at the end of the second day, so that only the decathletes in close contention for victory, or trying to beat their personal record, have an incentive to perform well during this event (and they are certainly all tired after two days of competition). The 1500m is also the only pure-endurance event (the other event including an endurance component is the 400m), so that there is no real interest for a decathlete to specialise in this event. This observation is in total agreement with the fact that the 1500m appears to be the event which is the least correlated with the others and with the total score achieved by a decathlete; see Figure \@ref(fig:correlo-pts-evt).
* The event appearing as the most decisive is the 110m hurdles; in the data, the best season performer achieved the best or second-best season performance at the 110m hurdles in $50\%$ of the cases. This might be explained by the fact that the 110m hurdles requires very good explosiveness and speed, combined with strong technique and excellent coordination; the risk of falling during the race is also very high in comparison to the other track events.
* The second, third and fourth most decisive events appear to be the 100m, the long jump and the 400m, respectively. Although the long jump is a jump event, the qualities needed to perform well at it are very similar to those required to be a good sprinter (the correlation between the scores achieved during the long jump and the 100m is actually relatively strong, at $0.49$).

These observations suggest that top-performing decathletes excel in the "speed-related" events, and perform relatively well in the other events except for the 1500m.
## Partial correlation between events {#sec:partial-corr}

To further investigate the relationships between the different events of the decathlon, we perform a partial correlation analysis by computing the partial correlations between the points scored for every pair of events while controlling for all the other events. The computed partial correlations are illustrated in Figure \@ref(fig:partial-correlo-pts-evt).

```{r partial-correlo-pts-evt, echo = FALSE, fig.height = 4, fig.width = 4, fig.cap = "Graphical representation of the partial correlations between the points scored during every pair of events while controlling for all the other events.", fig.align = "center" }
AllPcor <- pcor(as.matrix(Decathlon[,c(15:24)]) )$estimate
corrplot(AllPcor, method="circle")
```

Interestingly, and in comparison to the correlations shown in Figure \@ref(fig:correlo-pts-evt), if we control for all the other events, we observe only a few relatively strong partial correlations between pairs of events (all of which are significant at $\alpha=0.01$). In particular we observe that:

* `P100m` and `P1500` are negatively correlated,
* `P400m` and `P1500` are positively correlated,
* `P400m` and `P100m` are positively correlated,
* `Psp` and `Pdt` are positively correlated.

The positive partial correlations for the pairs `P400m-P100m` and `P400m-P1500`, and the negative partial correlation for the pair `P1500-P100m`, might be a consequence of the fact that decathletes performing well at the 400m may either be very fast or have good endurance. The relatively strong partial correlation between `Psp` and `Pdt` is also not surprising, because the shot put and discus throw require very similar physical abilities and technical skills.
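
As a reminder of what is being computed in this section, the partial correlation between two events given all the others can be read off the inverse of the correlation matrix. The following is a sketch with notation introduced here for illustration: let $R$ be the $10 \times 10$ Pearson correlation matrix of the event scores and $P = R^{-1}$, with entries $p_{ij}$. Then, for $i \neq j$,

$$
\rho_{ij\,\cdot\,\mathrm{rest}} = -\,\frac{p_{ij}}{\sqrt{p_{ii}\,p_{jj}}}.
$$

Equivalently, $\rho_{ij\,\cdot\,\mathrm{rest}}$ is the correlation between the residuals obtained when the scores for events $i$ and $j$ are each regressed on the $8$ remaining events. Assuming `pcor` is run with its default Pearson method, its `estimate` matrix should agree with this expression.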
# Differences between French and German decathletes {#sec:french-german}

In this section, we explore the possibility of differentiating between French and German decathletes by using a logistic regression model based on the points scored during a selection of events. To this end, we extract the entries of the data set corresponding to the points scored during the 10 events by the French and German decathletes, and gather these entries in a data set named `data_FRAxGER`; a categorical variable `IsFrench` is added to this data set, taking the value $1$ when the entry corresponds to a French decathlete, and $0$ for a German decathlete. The resulting data set has $1{,}297$ entries, with $40.17\%$ corresponding to French decathletes. Notice that no decathlete appears more than $12$ times in the extracted data set.

```{r, echo=FALSE}
Ind_FRA <- (1:Nobs)[Decathlon$Nationality == 'FRA']
Ind_GER <- (1:Nobs)[Decathlon$Nationality == 'GER']
Ind_FRAxGER <- c(Ind_FRA,Ind_GER)
n_FRA <- length(Ind_FRA)
n_GER <- length(Ind_GER)
n_FRAxGER <- n_FRA + n_GER
noquote(sprintf('Number of French entries: %d', n_FRA))
noquote(sprintf('Number of German entries: %d', n_GER))
# Response: 1 for French entries (listed first), 0 for German entries
IsFrench <- rep(0,n_FRAxGER)
IsFrench[1:n_FRA] <- 1
IsFrench <- factor(IsFrench)
Diff_FRA <- length(unique(Decathlon$DecathleteName[Ind_FRA]))
Diff_GER <- length(unique(Decathlon$DecathleteName[Ind_GER]))
noquote(sprintf('Number of different French decathletes: %d', Diff_FRA))
noquote(sprintf('Number of different German decathletes: %d', Diff_GER))
data_FRAxGER <- Decathlon[Ind_FRAxGER, 15:24]
data_FRAxGER$IsFrench <- IsFrench
```

To select a set of relevant events, we use the `stepAIC` function of the `MASS` package. The stepwise procedure is initialised from the logistic model based on the points scored during the 10 events. The resulting model, named `step.model`, depends on $7$ events, as described below (where the coefficients of the resulting model are given).
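
For reference, a logistic model of the kind fitted here can be written as follows. This is only a sketch, and the symbols are introduced for illustration: $p$ is the probability that an entry belongs to a French decathlete, $x_1,\dots,x_k$ are the points scored in the $k$ retained events, $\beta_0,\dots,\beta_k$ are the coefficients, and $\hat{L}$ is the maximised likelihood.

$$
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j\,x_j,
\qquad
\mathrm{AIC} = 2\,(k+1) - 2\log\hat{L}.
$$

The stepwise procedure looks for a subset of events with a small AIC, starting from $k = 10$ and ending here at $k = 7$. A positive $\beta_j$ means that, with the other scores held fixed, a higher score in event $j$ increases the estimated odds that the entry is French.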
```{r, echo=FALSE}
full.model <- glm(IsFrench ~ Plj + Pdt + P400m + Ppv + Pjt + Phj + P110h + P1500 + Psp + P100m,
                  family=binomial(link='logit'), data=data_FRAxGER)
step.model <- full.model %>% stepAIC(trace = FALSE)
#summary(step.model)
noquote('Coefficients of step.model:')
print(coef(step.model)[1:5])
print(coef(step.model)[6:8])
```

The ROC curve corresponding to the model `step.model` is given in Figure \@ref(fig:ROC-curve). Although not especially impressive, the model appears to be able to identify some statistically significant differences between the French and German decathletes. Based on the coefficients of `step.model`, French decathletes appear, to a certain extent, to be better than their German counterparts in the discus throw, pole vault, high jump and 100m (positive coefficients), while German decathletes on average seem to perform better in the 400m, 110m hurdles and shot-put events (negative coefficients).

```{r ROC-curve, echo = FALSE, fig.height = 3, fig.width = 3, fig.cap = "ROC curve for the logistic regression model \\texttt{step.model}; the orange dashed lines correspond to the sensitivity and specificity of the resulting binary classifier for a decision threshold at $40\\%$.", fig.align = "center"}
# Confusion matrix (rows: predicted class, columns: actual class) with per-class error rates
confusion.glm <- function(data, model, tresh){
  prediction <- ifelse(predict(model, data, type='response') > tresh, TRUE, FALSE)
  confusion <- table(prediction, as.logical(model$y))
  confusion <- cbind(confusion,
                     c(1 - confusion[1,1]/(confusion[1,1]+confusion[2,1]),
                       1 - confusion[2,2]/(confusion[2,2]+confusion[1,2])))
  confusion <- as.data.frame(confusion)
  names(confusion) <- c('FALSE', 'TRUE', 'class.error')
  confusion
}
prob <- predict(step.model, type='response')
data_FRAxGER$prob <- prob
ROC_curve <- roc(IsFrench ~ prob, data=data_FRAxGER, levels=c(0,1), direction='<')
options(repr.plot.width=7, repr.plot.height=7)
plot(ROC_curve)
# Confusion matrix at a given decision threshold
Confu_T <- confusion.glm(data_FRAxGER, step.model, 0.4)
SensitivityT <- Confu_T[2,2]/sum(Confu_T[,2])
SpecificityT <- Confu_T[1,1]/sum(Confu_T[,1])
abline(v=SpecificityT, col='orange', lwd=2, lty=2)
abline(h=SensitivityT , col='orange', lwd=2, lty=2)
```

With a decision threshold set at $40\%$, the binary classifier based on the logistic model `step.model` correctly classifies $69.29\%$ of the French decathletes, and $67.27\%$ of the German decathletes, as reported below (see also Figure \@ref(fig:ROC-curve)). Note that the threshold was set at $40\%$ to obtain balanced percentages of classification errors.

```{r, echo=FALSE}
noquote('Confusion matrix for step.model with 40% decision threshold:')
print(Confu_T)
```

Note also that in view of their Z-values, the long jump and the 110m hurdles events appear less influential than the other variables. We therefore removed these two predictors from the model; however, the conclusions drawn from an analysis of the reduced model are similar to the conclusions based on `step.model`.

```{r, echo=FALSE}
noquote('p-values for the Z-values of step.model:')
print(coef(summary(step.model))[,4][1:5])
print(coef(summary(step.model))[,4][6:8])
```

# Conclusion

In this report, we have used various statistical tools to explore the `Decathlon` data set. We performed some *correlation analyses*, both parametric and non-parametric, and some *tests on means and variances* using the $t$-test, $F$-test and ANOVA, together with various tests for checking conditions. We also performed some *non-parametric tests* for the median of certain quantities, including the Wilcoxon and Kruskal-Wallis tests, and computed *confidence intervals* for certain means and variances. In addition, we computed and discussed some *linear regression* and *logistic regression* models, and used various *graphical representation tools* to illustrate the data and our findings. For the parametric tests, sample sizes were generally large enough to ensure the *validity of the normal approximation framework*.
Our main observations can be summarised as follows:

1. the best season performances appear to increase with the years, while the mean and median performances appear to decrease over the same period (Section \@ref(sec:perf-year));
