d <- read.csv("county.csv")
View(d)
# The dataset is named "d".
# View the dataset. It describes US counties,
# including each county's state, population, income, etc.
# QUESTIONS
# 1) What percentage of the dataset is populated with NA values?
# Hint: Count the total NAs, divide by (number of rows x number of columns),
# and multiply by 100 to get a percentage.
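# A minimal sketch of the hint above on a toy data frame. The name "toy_na"
# and its values are made up for illustration; they are not from county.csv.
toy_na <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
# is.na() flags every missing cell and sum() counts the flags. The ratio is
# multiplied by 100 to express it as a percentage.
na_pct <- 100 * sum(is.na(toy_na)) / (nrow(toy_na) * ncol(toy_na))
na_pct  # 3 of the 6 cells are NA, so this is 50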
# 2) Show the county name and population in 2017 for the state of Connecticut
# where unemployment rate is greater than 5.0.
# 3) Show the county name, state, and per capita income for counties whose
# population change is positive and unemployment rate is less than 2.
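# A minimal sketch of filtering rows on two conditions and keeping selected
# columns, using a toy data frame. The column names "name" and "state" are
# assumptions; check names(d) for the real ones.
toy_sub <- data.frame(name = c("A County", "B County", "C County"),
                      state = c("Connecticut", "Connecticut", "Ohio"),
                      pop2017 = c(100, 200, 300),
                      unemployment_rate = c(5.5, 4.0, 6.1))
# Combine conditions with &. which() keeps only rows where the test is TRUE,
# so rows where a condition evaluates to NA are dropped.
keep <- which(toy_sub$state == "Connecticut" & toy_sub$unemployment_rate > 5.0)
toy_sub[keep, c("name", "pop2017")]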
# 4) We are wondering about population change from 2010 to 2017
# in counties where poverty rate is greater than 20 and
# unemployment rate is greater than 8.
# For this purpose, create a new variable using the following formula
# as a new column of the dataset d.
# popChng17_20 = (pop2017 - pop2010)/pop2010.
#
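# A sketch of the formula above on toy numbers ("toy_chg" is hypothetical).
toy_chg <- data.frame(pop2010 = c(100, 200), pop2017 = c(110, 190),
                      poverty = c(25, 10), unemployment_rate = c(9, 3))
# Column arithmetic is vectorised, so one assignment fills every row.
toy_chg$popChng17_20 <- (toy_chg$pop2017 - toy_chg$pop2010) / toy_chg$pop2010
# Keep only the rows that meet both conditions from question 4.
toy_chg[which(toy_chg$poverty > 20 & toy_chg$unemployment_rate > 8), ]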
# 5) The population in "Hoonah Angoon Census Area" of Alaska in 2017
# is missing (i.e. NA). However, a quick Google search showed that
# this population is actually 2139. Now replace the NA in this
# spot with 2139.
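# A sketch of replacing one missing cell by logical indexing. The toy rows
# and the column names "name" and "state" are assumptions for illustration.
toy_fix <- data.frame(name = c("X Area", "Y Area"),
                      state = c("Alaska", "Alaska"),
                      pop2017 = c(NA, 500))
# Select the single cell by row condition, then assign the known value.
toy_fix$pop2017[toy_fix$name == "X Area" & toy_fix$state == "Alaska"] <- 2139
toy_fix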
# 6) We are wondering about mean poverty level for metro and non-metro counties
# in the state of Connecticut. Calculate them. Which one is higher?
# (NOTE: To ignore NAs, use na.rm = TRUE when calculating means)
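# One possible way to get a mean per group is tapply(), sketched here on toy
# data. The metro labels "yes" and "no" are assumed, not confirmed from d.
toy_grp <- data.frame(metro = c("yes", "yes", "no", "no"),
                      poverty = c(10, NA, 20, 30))
# Extra arguments such as na.rm = TRUE are passed through to mean().
grp_means <- tapply(toy_grp$poverty, toy_grp$metro, mean, na.rm = TRUE)
grp_means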
# 7) Which year has the highest variation in terms of county populations?
# Is it 2000, 2010 or 2017? (NOTE: To ignore NAs, use na.rm = TRUE )
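# A sketch of comparing spread across several columns at once, using standard
# deviation as the measure of variation. The toy values are made up, and the
# column name "pop2000" is an assumption.
toy_var <- data.frame(pop2000 = c(1, 2, 3),
                      pop2010 = c(1, 5, 9),
                      pop2017 = c(2, NA, 4))
# sapply() applies sd() to each column; na.rm = TRUE skips missing values.
pop_sd <- sapply(toy_var, sd, na.rm = TRUE)
names(which.max(pop_sd))  # column with the largest standard deviation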
# 8) Create a histogram for the homeownership variable
# with 40 bars. Comment on the skewness of the distribution.
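# A sketch of controlling the number of bars in hist(), on simulated toy
# values. A single number passed to breaks is only a suggestion, because R
# rounds to "pretty" break points, so the bar count can differ from 40.
set.seed(1)
toy_home <- runif(200, min = 40, max = 90)
h <- hist(toy_home, breaks = 40, xlab = "homeownership", main = "Toy data")
length(h$counts)  # number of bars actually drawn
# For exactly 40 bars, supply 41 explicit break points instead.
h40 <- hist(toy_home,
            breaks = seq(min(toy_home), max(toy_home), length.out = 41),
            xlab = "homeownership", main = "Toy data, exactly 40 bars")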
# 9) Create a boxplot of the poverty variable for metro and non-metro areas
# (HINT: use y ~ x notation in the boxplot.) Comment on the plot:
# which location has higher poverty overall?
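# A sketch of the y ~ x formula notation in boxplot(), on toy data. The metro
# labels "yes" and "no" are assumed for illustration.
toy_box <- data.frame(metro = c("yes", "yes", "yes", "no", "no", "no"),
                      poverty = c(8, 10, 12, 15, 20, 25))
# The left side of ~ is the numeric variable, the right side is the grouping.
bp <- boxplot(poverty ~ metro, data = toy_box, ylab = "poverty")
bp$names  # one box per group, in alphabetical order of the labels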
# 10) Assume that you are investigating the variables that
# could be associated with 'poverty' variable. Create scatter
# plots on a 2 x 2 panel for poverty vs. unemployment_rate,
# homeownership, per_capita_income, and pop_change.
# Which variables seem to be associated with poverty?
# (NOTE: If you get "figure margins too large", enlarge the plotting
# window to the left and upward)
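# A sketch of a 2 x 2 plotting panel with par(mfrow = ...), on toy data.
# The values are made up; only the column names come from the question.
toy_sc <- data.frame(poverty = c(10, 15, 20, 25),
                     unemployment_rate = c(3, 4, 6, 8),
                     homeownership = c(80, 75, 70, 60),
                     per_capita_income = c(40000, 32000, 25000, 20000),
                     pop_change = c(2, 1, -1, -3))
old_par <- par(mfrow = c(2, 2))  # 2 rows x 2 columns, filled row by row
for (v in c("unemployment_rate", "homeownership",
            "per_capita_income", "pop_change")) {
  # poverty goes on the y-axis, the candidate variable on the x-axis
  plot(toy_sc[[v]], toy_sc$poverty, xlab = v, ylab = "poverty")
}
par(old_par)  # restore the previous single-plot layout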
#################################################
# In this part, use ggplot2 and dplyr packages
#################################################
library(ggplot2)
library(dplyr)
# 11) We wonder how the unemployment rate changes
# as education level changes.
# For this purpose, create a boxplot for unemployment_rate vs. median_edu.
# Label x-axis as "Education Level"
# Comment on the chart. How does the unemployment rate change?
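# A sketch of a ggplot2 boxplot with a custom x-axis label, on toy data.
# The education level labels below are made up for illustration.
library(ggplot2)
toy_edu <- data.frame(
  median_edu = c("hs_diploma", "hs_diploma", "some_college", "some_college"),
  unemployment_rate = c(6, 7, 4, 5)
)
# geom_boxplot() draws one box per x category; labs() renames the axis.
p_edu <- ggplot(toy_edu, aes(x = median_edu, y = unemployment_rate)) +
  geom_boxplot() +
  labs(x = "Education Level")
print(p_edu)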
# 12) Using dplyr and ggplot2, find population change from 2010 to 2017
# in counties where poverty rate is greater than 20 and
# unemployment rate is greater than 8.
# For this purpose, mutate a new variable
# called popChng17_20 = (pop2017 - pop2010)/pop2010.
# Then, create a ggplot boxplot showing popChng17_20 vs metro.
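# A sketch of the filter, mutate, and plot steps above as one pipeline, on
# toy data. The values and the metro labels "yes" and "no" are assumptions.
library(dplyr)
library(ggplot2)
toy_pipe <- data.frame(metro = c("yes", "no", "no", "yes"),
                       pop2010 = c(100, 200, 400, 50),
                       pop2017 = c(110, 180, 420, 60),
                       poverty = c(25, 30, 22, 10),
                       unemployment_rate = c(9, 10, 8.5, 3))
toy_res <- toy_pipe %>%
  filter(poverty > 20, unemployment_rate > 8) %>%       # both must hold
  mutate(popChng17_20 = (pop2017 - pop2010) / pop2010)  # relative change
p_chg <- ggplot(toy_res, aes(x = metro, y = popChng17_20)) +
  geom_boxplot()
print(p_chg)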
# 13) Group the dataset by State, and then summarize using
# count, mean unemployment rate, and mean per capita income.
# Sort the result by mean unemployment rate.
# (Hint: Remove NAs when calculating means: na.rm = TRUE)
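# A sketch of group_by(), summarise(), and arrange() on toy data. The column
# name "state" and the summary names (count, mean_unemp, mean_income) are
# assumptions chosen for illustration.
library(dplyr)
toy_state <- data.frame(state = c("A", "A", "B", "B"),
                        unemployment_rate = c(4, 6, NA, 3),
                        per_capita_income = c(30000, 34000, 28000, NA))
state_summary <- toy_state %>%
  group_by(state) %>%
  summarise(count = n(),  # rows per state
            mean_unemp = mean(unemployment_rate, na.rm = TRUE),
            mean_income = mean(per_capita_income, na.rm = TRUE)) %>%
  arrange(mean_unemp)     # ascending; use desc(mean_unemp) to reverse
state_summary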