R Studio Project
-Find a classification of the various countries
-Use knn on the dataset.
Notes:
-Do not use (for loop) in this project. I do not want it to be complicated.
Population Development
In this report, we explore the population development indicators of the Middle East countries, which include Saudi Arabia, United Arab Emirates, Bahrain, Yemen, Egypt, Iran, Iraq, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Syria, Turkey, and Cyprus. The population development index depends on many factors; we have selected a few of them. The purpose of this study is to explore how these attributes relate to the population development index of the countries.
The list of attributes is shown below:
· HEALTH - Health Index
· POP - Population Index
· POSC - Peace and Order Index
· ECONSZSC - Size of Economy Score
· AI - Air Index
· LOWBWT - Low Birth Weight
· NTLWTHSC - National Wealth Index Score
· BLDTOT - Total Built Land
· IWI - Inland Water Index
· LANDDSC - Land Diversity Score
· LANDQSC - Land Quality Score
HEALTH is life expectancy at birth expressed as an index. POP is the population living within 100 km of a coast divided by the area of coastal land. The Peace and Order Score (POSC) is the average of two unweighted indicators: peace, represented by deaths from armed conflicts per year or military expenditure as a percentage of Gross Domestic Product, whichever gives the lower score; and crime, represented by the unweighted average of the homicide rate and other violent crimes. ECONSZSC is the size of economy score, based on GDP per person in current international purchasing power parity dollars. The Air Index (AI) is the lower of a global atmosphere score and a local air quality score. Low Birth Weight (LOWBWT) is the number of babies whose birth weight is less than 2500 grams, as a percentage of babies born alive. The National Wealth Index Score (NTLWTHSC) is the average of three weighted indicators: size of the economy (Size score), inflation and unemployment (IU score), and debt (Debt score). Built land (BLDTOT) is land that is "occupied by buildings, transport infrastructure (roads, railways, docks, airports, etc.) and other human structures, including mines and quarries, waste tips, derelict land, and urban and suburban parks and gardens." The Inland Water Index (IWI) is the lowest of three sub-elements: inland water diversity, water withdrawal, and inland water quality. The Land Diversity Score (LANDDSC) is the average of two weighted indicators: land modification and conversion, and land protection. The Land Quality Score (LANDQSC) consists of one indicator: the area of degraded land as a percentage of the area of cultivated and modified land, weighted according to the severity of degradation.
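Several of these definitions combine sub-scores by taking the lower value or an unweighted average. The following is a minimal sketch of those two rules in R. The sub-score column names and all numbers are hypothetical placeholders, not values from the dataset, and the dataset's actual scoring procedure may differ in detail.

```r
# Hypothetical sub-scores for three unnamed countries (placeholders, not real data).
sub <- data.frame(
  global_atmosphere = c(40, 55, 70),
  local_air_quality = c(60, 35, 70),
  peace             = c(80, 20, 50),
  crime             = c(60, 40, 50)
)

# Air Index: the lower of the two air scores, taken row by row.
sub$AI <- pmin(sub$global_atmosphere, sub$local_air_quality)

# Peace and Order Score: unweighted average of the peace and crime scores.
sub$POSC <- rowMeans(sub[, c("peace", "crime")])

print(sub[, c("AI", "POSC")])
```

Both pmin() and rowMeans() are vectorized, so the scores are computed for all rows at once without a for loop.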
Analysis
The relations among these indicators can be visualized with a scatter-plot matrix (the pairs() call in the R code). From these plots, we formulate a hypothesis comparing the size of economy score and the national wealth score in this region. These two indicators should be highly correlated, because the size of the economy is one of the three indicators averaged into the national wealth score. A scatter plot of the two indicators shows the relation more closely.
> View(df)
Hypothesis:
The plots suggest a relation between the national wealth score and the size of economy score. It can be investigated further using a statistical test.
The hypotheses are defined as:
H0: There is no correlation between NTLWTHSC (National Wealth Score) and ECONSZSC (Size of Economy Score).
H1: There is a significant correlation between NTLWTHSC (National Wealth Score) and ECONSZSC (Size of Economy Score).
We used Pearson's product-moment correlation test to analyze the correlation.
Test Results:
t = 8.1924, df = 11, p-value = 0.000005208
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.7683433 0.9782794
sample estimates: cor 0.9269206
Since the p-value (0.000005208) is far below the significance level alpha = 0.05, we reject the null hypothesis and conclude that there is a significant correlation between the National Wealth Score and the Size of Economy Score.
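The reported figures can be cross-checked from the correlation estimate alone. The sketch below recomputes the t statistic, the two-sided p-value, and the 95 percent confidence interval (via the Fisher z transformation) from the reported cor and df. The variable names are illustrative. Because df = n - 2, the test used 13 complete pairs of values, so three of the 16 listed countries did not enter the test, presumably because a value was missing or the country name did not match in the merge.

```r
r   <- 0.9269206   # sample correlation reported above
dof <- 11          # degrees of freedom reported above
n   <- dof + 2     # number of complete pairs used by the test

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), dof)

# 95 percent confidence interval via the Fisher z transformation
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]))
```

The recomputed values should agree with the cor.test() output up to rounding of the reported correlation.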
R Code:
> library(dplyr)
> library(reshape)
> library(ggplot2)
> middle_east <- c("Bahrain", "Cyprus", "Egypt", "Iran", "Iraq", "Jordan", "Kuwait", "Lebanon", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria", "Turkey", "United Arab Emirates", "Yemen")
> indicators <- c("Country_Standard", "HEALTH", "POP", "POSC", "ECONSZSC", "AI", "LOWBWT", "NTLWTHSC","BLDTOT","IWI","LANDDSC", "LANDQSC")
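> # Keep only the Middle East rows of world_dataset and the selected columns
> df <- data.frame(Country_Standard = middle_east)
> df_selected <- merge(world_dataset, df, by = "Country_Standard")
> df_selected <- df_selected[, indicators]
> # Convert every column to character, then the indicator columns to numeric
> df <- data.frame(sapply(df_selected, as.character), stringsAsFactors = FALSE)
> country <- df$Country_Standard
> df <- data.frame(sapply(df[, -1], as.numeric), stringsAsFactors = FALSE)
> df$Country <- country
> # Stack HEALTH and ECONSZSC into one long-format table
> d1 <- data.frame(Indicator = "HEALTH", VALUE = df$HEALTH)
> d2 <- data.frame(Indicator = "ECONSZSC", VALUE = df$ECONSZSC)
> d <- rbind(d1, d2)
> # Scatter-plot matrix of all indicators (column 12, Country, is dropped)
> pairs(df[, -12], pch = 21)
> # Scatter plot of national wealth score against size of economy score
> ggplot(data = df, mapping = aes(x = NTLWTHSC, y = ECONSZSC)) +
+ geom_point()
> # Pearson correlation test between the two scores
> cor.test(df$NTLWTHSC, df$ECONSZSC)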
Pearson's product-moment correlation
data: df$NTLWTHSC and df$ECONSZSC
t = 8.1924, df = 11, p-value = 5.208e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7683433 0.9782794
sample estimates:
cor
0.9269206
> ggplot(data = df, mapping = aes(x= Country, y=NTLWTHSC))+
+ geom_point()
> ggplot(data = df, mapping = aes(x= Country, y=ECONSZSC))+
+ geom_point()
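The project brief asks for a kNN classification without a for loop, which the code above does not yet cover. The following is only a minimal sketch of one way to do it, using knn.cv() from the class package (leave-one-out cross-validation, so every row is classified from the remaining rows). The data frame toy, its values, and the two class labels are made-up placeholders: the report does not define classes for the countries, so a real analysis would need labels chosen and justified by the author.

```r
library(class)  # recommended package shipped with standard R installations

# Illustrative placeholder values only; these are NOT taken from world_dataset.
# The "group" labels are a hypothetical class assignment.
toy <- data.frame(
  NTLWTHSC = c(20, 25, 30, 35, 60, 65, 70, 75),
  ECONSZSC = c(15, 22, 28, 30, 62, 68, 71, 80),
  group    = factor(rep(c("lower", "higher"), each = 4))
)

# Scale the predictors so that no single indicator dominates the distance.
x <- scale(toy[, c("NTLWTHSC", "ECONSZSC")])

# Leave-one-out kNN: each row is classified by its 3 nearest other rows.
pred <- knn.cv(train = x, cl = toy$group, k = 3)

# Cross-table of assigned versus predicted group.
print(table(actual = toy$group, predicted = pred))
```

With the real data, x would be built from the numeric indicator columns of df after removing rows with missing values, since knn.cv() does not accept NA values.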