Data Visualization with R

Rob Kabacoff

2018-09-03

Contents

Welcome

Preface
    How to use this book
    Prerequisites
    Setup

1 Data Preparation
    1.1 Importing data
    1.2 Cleaning data

2 Introduction to ggplot2
    2.1 A worked example
    2.2 Placing the data and mapping options
    2.3 Graphs as objects

3 Univariate Graphs
    3.1 Categorical
    3.2 Quantitative

4 Bivariate Graphs
    4.1 Categorical vs. Categorical
    4.2 Quantitative vs. Quantitative
    4.3 Categorical vs. Quantitative

5 Multivariate Graphs
    5.1 Grouping

6 Maps
    6.1 Dot density maps
    6.2 Choropleth maps

7 Time-dependent graphs
    7.1 Time series
    7.2 Dumbbell charts
    7.3 Slope graphs
    7.4 Area Charts

8 Statistical Models
    8.1 Correlation plots
    8.2 Linear Regression
    8.3 Logistic regression
    8.4 Survival plots
    8.5 Mosaic plots

9 Other Graphs
    9.1 3-D Scatterplot
    9.2 Biplots
    9.3 Bubble charts
    9.4 Flow diagrams
    9.5 Heatmaps
    9.6 Radar charts
    9.7 Scatterplot matrix
    9.8 Waterfall charts
    9.9 Word clouds

10 Customizing Graphs
    10.1 Axes
    10.2 Colors
    10.3 Points & Lines
    10.4 Legends
    10.5 Labels
    10.6 Annotations
    10.7 Themes

11 Saving Graphs
    11.1 Via menus
    11.2 Via code
    11.3 File formats
    11.4 External editing

12 Interactive Graphs
    12.1 leaflet
    12.2 plotly
    12.3 rbokeh
    12.4 rCharts
    12.5 highcharter

13 Advice / Best Practices
    13.1 Labeling
    13.2 Signal to noise ratio
    13.3 Color choice
    13.4 y-Axis scaling
    13.5 Attribution
    13.6 Going further
    13.7 Final Note

A Datasets
    A.1 Academic salaries
    A.2 Starwars
    A.3 Mammal sleep
    A.4 Marriage records
    A.5 Fuel economy data
    A.6 Gapminder data
    A.7 Current Population Survey (1985)
    A.8 Houston crime data
    A.9 US economic timeseries
    A.10 Saratoga housing data
    A.11 US population by age and year
    A.12 NCCTG lung cancer data
    A.13 Titanic data
    A.14 JFK Cuban Missile speech
    A.15 UK Energy forecast data
    A.16 US Mexican American Population

B About the Author

C About the QAC

Welcome

R is an amazing platform for data analysis, capable of creating almost any type of graph. This book helps you create the most popular visualizations - from quick and dirty plots to publication-ready graphs. The text relies heavily on the ggplot2 package for graphics, but other approaches are covered as well.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

My goal is to make this book as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.


Preface

How to use this book

You don’t need to read this book from start to finish in order to start building effective graphs. Feel free to jump to the section that you need and then explore others that you find interesting.

Graphs are organized by

• the number of variables to be plotted

• the type of variables to be plotted

• the purpose of the visualization

Chapter      Description

Ch 1         provides a quick overview of how to get your data into R and how to prepare it for analysis.

Ch 2         provides an overview of the ggplot2 package.

Ch 3         describes graphs for visualizing the distribution of a single categorical (e.g. race) or quantitative (e.g. income) variable.

Ch 4         describes graphs that display the relationship between two variables.

Ch 5         describes graphs that display the relationships among 3 or more variables. It is helpful to read chapters 3 and 4 before this chapter.

Ch 6         provides a brief introduction to displaying data geographically.

Ch 7         describes graphs that display change over time.

Ch 8         describes graphs that can help you interpret the results of statistical models.

Ch 9         covers graphs that do not fit neatly elsewhere (every book needs a miscellaneous chapter).

Ch 10        describes how to customize the look and feel of your graphs. If you are going to share your graphs with others, be sure to skim this chapter.

Ch 11        covers how to save your graphs. Different formats are optimized for different purposes.

Ch 12        provides an introduction to interactive graphics.

Ch 13        gives advice on creating effective graphs and where to go to learn more. It’s worth a look.

Appendices   describe each of the datasets used in this book, and provide a short blurb about the author and the Wesleyan Quantitative Analysis Center.

There is no one right graph for displaying data. Check out the examples, and see which type best fits your needs.


Prerequisites

It’s assumed that you have some experience with the R language and that you have already installed R and RStudio. If not, here are some resources for getting started:

• A (very) short introduction to R

• DataCamp - Introduction to R with Jonathon Cornelissen

• Quick-R

• Getting up to speed with R

Setup

In order to create the graphs in this guide, you’ll need to install some optional R packages. To install all of the necessary packages, run the following code in the RStudio console window.

pkgs <- c("ggplot2", "dplyr", "tidyr",
          "mosaicData", "carData", "VIM",
          "scales", "treemapify", "gapminder",
          "ggmap", "choroplethr", "choroplethrMaps",
          "CGPfunctions", "ggcorrplot", "visreg",
          "gcookbook", "forcats", "survival",
          "survminer", "ggalluvial", "ggridges",
          "GGally", "superheat", "waterfalls",
          "factoextra", "networkD3", "ggthemes",
          "hrbrthemes", "ggpol", "ggbeeswarm")

install.packages(pkgs)

Alternatively, you can install a given package the first time it is needed.

For example, if you execute

library(gapminder)

and get the message

Error in library(gapminder) : there is no package called ‘gapminder’

you know that the package has never been installed. Simply execute

install.packages("gapminder")

once and

library(gapminder)

will work from that point on.
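A middle path between the two approaches is to install only the packages that are not yet present. The following is a minimal sketch, not part of the original setup instructions: the helper name missing_pkgs and the short wanted vector are made up for illustration, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper (not from any package): return the entries of
# `pkgs` that are not currently installed
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# a short list for illustration; substitute the full pkgs vector if desired
wanted <- c("ggplot2", "dplyr", "gapminder")
to_install <- missing_pkgs(wanted)
to_install

# install only what is missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```

Running the check before a long session avoids discovering a missing package halfway through an analysis.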

https://cran.r-project.org/
https://www.rstudio.com/products/RStudio/#Desktop
https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
https://www.datacamp.com/courses/free-introduction-to-r
http://www.statmethods.net
Chapter 1

Data Preparation

Before you can visualize your data, you have to get it into R. This involves importing the data from an external source and massaging it into a useful format.

1.1 Importing data

R can import data from almost any source, including text files, Excel spreadsheets, statistical packages, and database management systems. We’ll illustrate these techniques using the Salaries dataset, containing the 9-month academic salaries of college professors at a single institution in 2008-2009.

1.1.1 Text files

The readr package provides functions for importing delimited text files into R data frames.

library(readr)

# import data from a comma delimited file
Salaries <- read_csv("salaries.csv")

# import data from a tab delimited file
Salaries <- read_tsv("salaries.txt")

These functions assume that the first line of data contains the variable names, that values are separated by commas or tabs respectively, and that missing data are represented by blanks. For example, the first few lines of the comma delimited file look like this.

"rank","discipline","yrs.since.phd","yrs.service","sex","salary"
"Prof","B",19,18,"Male",139750
"Prof","B",20,16,"Male",173200
"AsstProf","B",4,3,"Male",79750
"Prof","B",45,39,"Male",115000
"Prof","B",40,41,"Male",141500
"AssocProf","B",6,6,"Male",97000

Options allow you to alter these assumptions. See the documentation for more details.
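As a hedged illustration of such options, the sketch below writes a tiny file to a temporary location and reads it back with two read_csv arguments: na (which strings count as missing) and col_types (explicit column types). The file contents and the use of -99 as a missing-value code are assumptions invented for this example; they are not taken from the Salaries data.

```r
library(readr)

# write a small comma delimited file to a temporary location
# (contents are made up for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("rank,discipline,salary",
             "Prof,B,139750",
             "AsstProf,B,-99",
             "AssocProf,A,97000"),
           csv_path)

# treat blanks, NA, and -99 as missing, and state the column types
salaries_demo <- read_csv(csv_path,
                          na = c("", "NA", "-99"),
                          col_types = cols(rank = col_character(),
                                           discipline = col_character(),
                                           salary = col_double()))
salaries_demo
```

Declaring the column types up front makes the import fail loudly if the file does not match your expectations, rather than silently guessing.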

https://www.rdocumentation.org/packages/readr/versions/0.1.1/topics/read_delim

1.1.2 Excel spreadsheets

The readxl package can import data from Excel workbooks. Both xls and xlsx formats are supported.

library(readxl)

# import data from an Excel workbook
Salaries <- read_excel("salaries.xlsx", sheet=1)

Since workbooks can have more than one worksheet, you can specify the one you want with the sheet option. The default is sheet=1.

1.1.3 Statistical packages

The haven package provides functions for importing data from a variety of statistical packages.

library(haven)

# import data from Stata
Salaries <- read_dta("salaries.dta")

# import data from SPSS
Salaries <- read_sav("salaries.sav")

# import data from SAS
Salaries <- read_sas("salaries.sas7bdat")

1.1.4 Databases

Importing data from a database requires additional steps and is beyond the scope of this book. Depending on the database containing the data, the following packages can help: RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite, and RMongo. In the newest versions of RStudio, you can use the Connections pane to quickly access the data stored in database management systems.

1.2 Cleaning data

The process of cleaning your data can be the most time-consuming part of any data analysis. The most important steps are considered below. While there are many approaches, those using the dplyr and tidyr packages are some of the quickest and easiest to learn.

Package   Function    Use

dplyr     select      select variables/columns

dplyr     filter      select observations/rows

dplyr     mutate      transform or recode variables

dplyr     summarize   summarize data

dplyr     group_by    identify subgroups for further processing

tidyr     gather      convert wide format dataset to long format

tidyr     spread      convert long format dataset to wide format

https://db.rstudio.com/rstudio/connections/

Examples in this section will use the starwars dataset from the dplyr package. The dataset provides descriptions of 87 characters from the Star Wars universe on 13 variables. (I actually prefer Star Trek, but we work with what we have.)

1.2.1 Selecting variables

The select function allows you to limit your dataset to specified variables (columns).

library(dplyr)

# keep the variables name, height, and gender
newdata <- select(starwars, name, height, gender)

# keep the variables name and all variables
# between mass and species inclusive
newdata <- select(starwars, name, mass:species)

# keep all variables except birth_year and gender
newdata <- select(starwars, -birth_year, -gender)
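Variables can also be chosen by a pattern in their names. The sketch below is an addition to the examples above: it uses the ends_with selection helper, which dplyr makes available inside select, to keep name plus every variable whose name ends in "color".

```r
library(dplyr)

# keep name and all variables whose names end in "color"
newdata <- select(starwars, name, ends_with("color"))
names(newdata)
```

This can be convenient when a dataset has many similarly named columns and listing them one by one would be error-prone.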

1.2.2 Selecting observations

The filter function allows you to limit your dataset to observations (rows) meeting specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.

library(dplyr)

# select females
newdata <- filter(starwars,
                  gender == "female")

# select females that are from Alderaan
newdata <- filter(starwars,
                  gender == "female" &
                  homeworld == "Alderaan")

# select individuals that are from
# Alderaan, Coruscant, or Endor
newdata <- filter(starwars,
                  homeworld == "Alderaan" |
                  homeworld == "Coruscant" |
                  homeworld == "Endor")

# this can be written more succinctly as
newdata <- filter(starwars,
                  homeworld %in% c("Alderaan", "Coruscant", "Endor"))
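To check that the long and short forms of the last criterion really select the same rows, the two results can be compared directly. This is a small verification sketch added for illustration; the object names by_or and by_in are arbitrary.

```r
library(dplyr)

# rows selected with repeated | (OR) comparisons
by_or <- filter(starwars,
                homeworld == "Alderaan" |
                homeworld == "Coruscant" |
                homeworld == "Endor")

# rows selected with %in%
by_in <- filter(starwars,
                homeworld %in% c("Alderaan", "Coruscant", "Endor"))

# TRUE if both approaches keep the same characters in the same order
identical(by_or$name, by_in$name)
```

Note that filter drops rows where the condition evaluates to NA, so characters with a missing homeworld are excluded in both versions.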

1.2.3 Creating/Recoding variables

The mutate function allows you to create new variables or transform existing ones.


library(dplyr)

# convert height in centimeters to inches,
# and mass in kilograms to pounds
newdata <- mutate(starwars,
                  height = height * 0.394,
                  mass = mass * 2.205)

The ifelse function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).

library(dplyr)

# if height is greater than 180
# then heightcat = "tall",
# otherwise heightcat = "short"
newdata <- mutate(starwars,
                  heightcat = ifelse(height > 180,
                                     "tall",
                                     "short"))

# convert any eye color that is not
# black, blue or brown, to other
newdata <- mutate(starwars,
                  eye_color = ifelse(eye_color %in% c("black", "blue", "brown"),
                                     eye_color,
                                     "other"))

# set heights greater than 200 or
# less than 75 to missing
newdata <- mutate(starwars,
                  height = ifelse(height < 75 | height > 200,
                                  NA,
                                  height))
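It can help to try ifelse on a short vector before applying it to a whole dataset. The sketch below uses made-up heights (not rows from starwars) to show two behaviors: a missing test value yields a missing result, and calls can be nested to produce more than two categories. The 160 cut point for "medium" is an assumption chosen only for this example.

```r
# made-up heights, including a missing value
height <- c(150, 190, NA, 181)

# two categories; the NA stays NA
heightcat <- ifelse(height > 180, "tall", "short")
heightcat

# three categories using a nested ifelse
heightcat3 <- ifelse(height > 180, "tall",
                     ifelse(height > 160, "medium", "short"))
heightcat3
```

Because missing values pass through unchanged, recoding with ifelse does not quietly assign cases with unknown height to one of the categories.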

1.2.4 Summarizing data

The summarize function can be used to reduce multiple values down to a single value (such as a mean). It is often used in conjunction with the group_by function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating the means.

library(dplyr)

# calculate mean height and mass
newdata <- summarize(starwars,
                     mean_ht = mean(height, na.rm=TRUE),
                     mean_mass = mean(mass, na.rm=TRUE))
newdata

## # A tibble: 1 x 2
##   mean_ht mean_mass
##     <dbl>     <dbl>
## 1    174.      97.3

# calculate mean height and weight by gender
newdata <- group_by(starwars, gender)
newdata <- summarize(newdata,
                     mean_ht = mean(height, na.rm=TRUE),
                     mean_wt = mean(mass, na.rm=TRUE))
newdata

## # A tibble: 5 x 3
##   gender        mean_ht mean_wt
##   <chr>           <dbl>   <dbl>
## 1 female           165.    54.0
## 2 hermaphrodite    175.  1358.
## 3 male             179.    81.0
## 4 none             200.   140.
## 5 <NA>             120.    46.3
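A group mean based on very few cases can be misleading (note the mean weight of 1358 in the output above). One way to guard against this, sketched here as an addition to the original example, is to report the group sizes alongside the means. The n function counts rows per group; the column names n and n_height are arbitrary choices.

```r
library(dplyr)

# group sizes and number of non-missing heights, alongside the mean
bygender <- group_by(starwars, gender)
newdata <- summarize(bygender,
                     n = n(),
                     n_height = sum(!is.na(height)),
                     mean_ht = mean(height, na.rm = TRUE))
newdata
```

The group labels you see depend on the installed dplyr version, since the coding of gender in starwars has changed over time.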

1.2.5 Using pipes

Packages like dplyr and tidyr allow you to write your code in a compact format using the pipe %>% operator. Here is an example.

library(dplyr)

# calculate the mean height for women by species
newdata <- filter(starwars,
                  gender == "female")
newdata <- group_by(newdata, species)
newdata <- summarize(newdata,
                     mean_ht = mean(height, na.rm = TRUE))

# this can be written as
newdata <- starwars %>%
  filter(gender == "female") %>%
  group_by(species) %>%
  summarize(mean_ht = mean(height, na.rm = TRUE))

The %>% operator passes the result on the left to the first parameter of the function on the right.
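The equivalence between nested calls and piped calls can be seen on a plain numeric vector. This is a toy sketch added for illustration; the values are arbitrary.

```r
library(dplyr)

x <- c(1, 4, 9)

# nested form: read from the inside out
nested <- sum(sqrt(x))

# piped form: read from left to right
piped <- x %>% sqrt() %>% sum()

# both give the same value
identical(nested, piped)

# extra arguments follow the piped value:
# this is the same as round(sum(x / 3), digits = 1)
rounded <- (x / 3) %>% sum() %>% round(digits = 1)
rounded
```

The piped form tends to be easier to read once a sequence grows beyond two or three steps, because the operations appear in the order in which they are carried out.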

1.2.6 Reshaping data

Some graphs require the data to be in wide format, while some graphs require the data to be in long format.

You can convert a wide dataset to a long dataset using

library(tidyr)
long_data <- gather(wide_data,
                    key="variable",
                    value="value",
                    sex:income)

Table 1.2: Wide data

id   name   sex      age   income
01   Bill   Male     22    55000
02   Bob    Male     25    75000
03   Mary   Female   18    90000

Table 1.3: Long data

id   name   variable   value
01   Bill   sex        Male
02   Bob    sex        Male
03   Mary   sex        Female
01   Bill   age        22
02   Bob    age        25
03   Mary   age        18
01   Bill   income     55000
02   Bob    income     75000
03   Mary   income     90000

Conversely, you can convert a long dataset to a wide dataset using

library(tidyr)
wide_data <- spread(long_data, variable, value)
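The round trip can be tried end to end by entering the three rows of Table 1.2 by hand. The sketch below is an illustration rather than code from the original text. Because sex, age, and income are stacked into a single value column, gather stores them all as character; the convert = TRUE option of spread is used here to turn age and income back into numbers.

```r
library(tidyr)

# the wide data from Table 1.2, entered by hand
wide_data <- data.frame(id = c("01", "02", "03"),
                        name = c("Bill", "Bob", "Mary"),
                        sex = c("Male", "Male", "Female"),
                        age = c(22, 25, 18),
                        income = c(55000, 75000, 90000),
                        stringsAsFactors = FALSE)

# wide to long: one row per person per variable
long_data <- gather(wide_data,
                    key = "variable",
                    value = "value",
                    sex:income)

# long back to wide, converting column types where possible
wide_again <- spread(long_data, variable, value, convert = TRUE)
str(wide_again)
```

After the round trip the columns may appear in a different order than in the original wide data, so compare by column name rather than by position.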

1.2.7 Missing data

Real data are likely to contain missing values. There are three basic approaches to dealing with missing data: feature selection, listwise deletion, and imputation. Let’s see how each applies to the msleep dataset from the ggplot2 package. The msleep dataset describes the sleep habits of mammals and contains missing values on several variables.

1.2.7.1 Feature selection

In feature selection, you delete variables (columns) that contain too many missing values.

data(msleep, package="ggplot2")

# what is the proportion of missing data for each variable?
pctmiss <- colSums(is.na(msleep))/nrow(msleep)
round(pctmiss, 2)

##         name        genus         vore        order conservation
##         0.00         0.00         0.08         0.00         0.35
##  sleep_total    sleep_rem  sleep_cycle        awake      brainwt
##         0.00         0.27         0.61         0.00         0.33
##       bodywt
##         0.00

Sixty-one percent of the sleep_cycle values are missing. You may decide to drop it.
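Dropping such variables can be automated with a cutoff. The sketch below keeps only the variables with at most 50% missing values; the 0.5 threshold is an arbitrary assumption for illustration, not a recommendation from the text.

```r
data(msleep, package = "ggplot2")

# proportion of missing values per variable
pctmiss <- colSums(is.na(msleep)) / nrow(msleep)

# keep variables with at most 50% missing (threshold is an assumption)
keep <- names(pctmiss)[pctmiss <= 0.5]
newdata <- msleep[, keep]
names(newdata)
```

With this cutoff only sleep_cycle is removed, which matches the proportions shown above.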


1.2.7.2 Listwise deletion

Listwise deletion involves deleting observations (rows) that contain missing values on any of the variables of interest.

# Create a dataset containing genus, vore, and conservation.
# Delete any rows containing missing data.
newdata <- select(msleep, genus, vore, conservation)
newdata <- na.omit(newdata)

1.2.7.3 Imputation

Imputation involves replacing missing values with “reasonable” guesses about what the values would have been if they had not been missing. There are several approaches, as detailed in such packages as VIM, mice, Amelia and missForest. Here we will use the kNN function from the VIM package to replace missing values with imputed values.

# Impute missing values using the 5 nearest neighbors
library(VIM)
newdata <- kNN(msleep, k=5)

Basically, for each case with a missing value, the k most similar cases not having a missing value are selected. If the missing value is numeric, the mean of those k cases is used as the imputed value. If the missing value is categorical, the most frequent value from the k cases is used. The process iterates over cases and variables until the results converge (become stable). This is a bit of an oversimplification - see Imputation with R Package VIM for the actual details.
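To make the nearest-neighbor idea concrete, here is a deliberately simplified base R sketch. It is not the algorithm implemented in VIM: it imputes a single numeric variable, measures similarity on one fully observed predictor only, and does not iterate. The function name impute_knn_1d and the data values are made up for illustration.

```r
# Toy illustration only (hypothetical helper, not part of VIM):
# fill each missing target value with the mean of the k donors
# whose predictor values are closest
impute_knn_1d <- function(target, predictor, k = 2) {
  out <- target
  donors <- which(!is.na(target))          # cases with an observed target
  for (i in which(is.na(target))) {
    d <- abs(predictor[donors] - predictor[i])
    nearest <- donors[order(d)][seq_len(min(k, length(donors)))]
    out[i] <- mean(target[nearest])
  }
  out
}

# made-up data: the third brain weight is missing
brain <- c(1.0, 1.2, NA, 5.0, 5.4)
body  <- c(10, 12, 11, 50, 54)

imputed <- impute_knn_1d(brain, body, k = 2)
imputed
```

Here the missing value is replaced using the two cases with the most similar body weight (10 and 12), so the imputed value is the mean of 1.0 and 1.2.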

Important caveat: Missing values can bias the results of studies (sometimes severely). If you have a significant amount of missing data, it is probably a good idea to consult a statistician or data scientist before deleting cases or imputing missing values.

https://www.jstatsoft.org/article/view/v074i07/v74i07.pdf

Chapter 2

Introduction to ggplot2

This section provides a brief overview of how the ggplot2 package works. If you are simply seeking code to make a specific type of graph, feel free to skip this section. However, the material can help you understand how the pieces fit together.

2.1 A worked example

The functions in the ggplot2 package build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time.

The example uses data from the 1985 Current Population Survey to explore the relationship between wages (wage) and experience (exper).

# load data
data(CPS85, package = "mosaicData")

In building a ggplot2 graph, only the first two functions described below are required. The other functions are optional and can appear in any order.

2.1.1 ggplot

The first function in building a graph is the ggplot function. It specifies

• the data frame containing the data to be plotted

• the mapping of the variables to visual properties of the graph. The mappings are placed within the aes function (where aes stands for aesthetics).

# specify dataset and mapping
library(ggplot2)
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage))

Why is the graph empty? We specified that the exper variable should be mapped to the x-axis and that the wage should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph.
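One way to see this is to store the plot in an object and count its layers. The sketch below assumes that a ggplot object exposes its layers as a list element named layers, which is how ggplot2 objects have commonly been inspected; treat that detail as an assumption if you are on a very different version of the package.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# data and mapping only: nothing has been drawn yet
p <- ggplot(data = CPS85,
            mapping = aes(x = exper, y = wage))
length(p$layers)

# adding a geom adds a layer
p2 <- p + geom_point()
length(p2$layers)
```

The first plot has no layers, which is why only the axes appear; each geom_ function adds one layer to be drawn.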

https://ggplot2.tidyverse.org/
Figure 2.1: Map variables (empty plot; x-axis: exper, y-axis: wage)

2.1.2 geoms

Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are added using functions that start with geom_. In this example, we’ll add points using the geom_point function, creating a scatterplot.

In ggplot2 graphs, functions are chained together using the + sign to build a final plot.

# add points
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage)) +
  geom_point()

(Figure: scatterplot of wage against exper)

The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.

# delete outlier
library(dplyr)
plotdata <- filter(CPS85, wage < 40)

# redraw scatterplot
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point()

Figure 2.2: Remove outlier (x-axis: exper, y-axis: wage)

A number of parameters (options) can be specified in a geom_ function. Options for the geom_point function include color, size, and alpha. These control the point color, size, and transparency, respectively. Transparency ranges from 0 (completely transparent) to 1 (completely opaque). Adding a degree of transparency can help visualize overlapping points.

# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3)

Figure 2.3: Modify point color, transparency, and size (x-axis: exper, y-axis: wage)

Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression (method = lm) line (where lm stands for linear model).

# add a line of best fit.
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm")

Wages appear to increase with experience.
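The direction of the fitted line can be checked numerically by fitting the same kind of model with the lm function from base R. This sketch assumes the default smoothing formula (y regressed on x), so that lm(wage ~ exper) describes the same straight line; it is an added illustration, not part of the original example.

```r
library(dplyr)
data(CPS85, package = "mosaicData")

# same data as in the plot: outlier removed
plotdata <- filter(CPS85, wage < 40)

# linear regression of wage on experience
fit <- lm(wage ~ exper, data = plotdata)
coef(fit)
```

A positive coefficient for exper corresponds to the upward slope of the line in the graph. As with the plot, this is descriptive only and does not by itself establish that the relationship is statistically reliable.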

Figure 2.4: Add line of best fit (x-axis: exper, y-axis: wage)

2.1.3 grouping

In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.

Let’s add sex to the plot and represent it by color.

# indicate sex using color
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              size = 1.5)

(Figure: scatterplot of wage against exper, with points and fit lines colored by sex)

The color = sex option is placed in the aes function, because we are mapping a variable to an aesthetic. The geom_smooth option (se = FALSE) was added to suppress the confidence intervals.

It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
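The difference between mapping a variable to color and setting a fixed color can be seen by building both versions side by side. The sketch below is an added illustration; the object names p_mapped and p_fixed are arbitrary.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

# mapping: color is inside aes(), so it varies with sex
# and a legend is produced
p_mapped <- ggplot(data = plotdata,
                   mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7)

# setting: color is outside aes(), so every point gets
# the same color and no legend is needed
p_fixed <- ggplot(data = plotdata,
                  mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7)

# print either object to draw it
p_mapped
```

A common mistake is to put a constant such as "cornflowerblue" inside aes, which makes ggplot2 treat the string as data rather than as a color name.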

Figure 2.5: Change colors and axis labels (x-axis: exper, 0 to 50; y-axis: wage, in dollars; color: sex)

2.1.4 scales

Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling, and the colors employed.

# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              size = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue"))

We’re getting there. The numbers on the x and y axes are better, the y axis uses dollar notation, and the colors are more attractive (IMHO). Here is a question. Is the relationship between experience, wages and sex the same for each job sector? Let’s repeat this graph once for each job sector in order to explore this.
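Before moving on, note that the pieces handed to the scale functions in this section are ordinary R values and can be inspected on their own. The short sketch below, added for illustration, prints the y-axis break positions and shows how the dollar function from the scales package turns them into labels.

```r
# the y-axis break positions: 0 to 30 in steps of 5
breaks_y <- seq(0, 30, 5)
breaks_y

# the same values formatted as dollar amounts
dollar_labels <- scales::dollar(breaks_y)
dollar_labels
```

Trying a labelling function on a few values like this is a quick way to preview axis labels without redrawing the whole plot.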

2.1.5 facets

Facets reproduce a graph for each level of a given variable (or combination of variables). Facets are created using functions that start with facet_. Here, facets will be defined by the eight levels of the sector variable.

# reproduce plot for each level of job sector
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector)

It appears that the differences between men and women depend on the job sector under consideration.
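The arrangement of the panels can be controlled as well. As a hedged variation on the plot above, the sketch below uses the ncol argument of facet_wrap to lay the eight sectors out in four columns (and therefore two rows); the choice of four columns is arbitrary.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

# one panel per job sector, arranged in 4 columns
p <- ggplot(data = plotdata,
            mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7) +
  facet_wrap(~sector, ncol = 4)
p
```

A wider layout like this can make it easier to compare wage levels across sectors, since more panels share the same row and y-axis.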

2.1.6 labels

Graphs should be easy to interpret and informative labels are a key element in achieving this goal. The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.

# add informative labels
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")

Figure 2.6: Add job sector, using faceting (one panel per sector: clerical, const, manag, manuf, other, prof, sales, service; x-axis: exper; y-axis: wage; color: sex)

(Figure: the faceted plot with the title "Relationship between wages and experience", the subtitle "Current Population Survey", the caption "source: http://mosaic-web.org/", the x-axis label "Years of Experience", the y-axis label "Hourly Wage", and the legend title "Gender")

Now a viewer doesn’t need to guess what the labels exper and wage mean, or where the data come from.

2.1.7 themes

Finally, we can fine-tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, grid lines, legend placement, and other non-data-related features of the graph. Let’s use a cleaner theme.

# use a minimalist theme
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     label = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()

Figure 2.7: Use a simpler theme (same panels, title, subtitle, caption, axis labels, and legend as the previous graph)

Now we have something. It appears that men earn more than women in management, manufacturing, sales, and the “other” category. They are most similar in clerical, professional, and service positions. The data contain no women in the construction sector. For management positions, wages appear to be related to experience for men, but not for women (this may be the most interesting finding). This also appears to be true for sales.

Of course, these findings are tentative. They are based on a limited sample size and do not involve statistical testing to assess whether differences may be due to chance variation.
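
As a minimal sketch of how theme functions combine, the block below applies two complete themes to the same plot and then moves the legend with theme(). The built-in mtcars data set and the object names (base_plot, p_minimal, p_classic, p_bottom) are assumptions chosen so the block runs on its own; they are not part of the worked example above.

```r
library(ggplot2)

# Stand-in data: the built-in mtcars data set (an assumption made so the
# block is self-contained; the book's examples use plotdata from CPS85).
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(alpha = .7) +
  labs(color = "Cylinders")

# Complete themes (functions starting with theme_) restyle all
# non-data elements of the graph at once.
p_minimal <- base_plot + theme_minimal()
p_classic <- base_plot + theme_classic()

# theme() adjusts individual elements such as legend placement.
# It is added after the complete theme, because a complete theme
# added later would overwrite the adjustment.
p_bottom <- base_plot +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p_bottom)
```

The data layers are identical in all three plots; only the non-data styling differs, which is why a theme can be swapped at the end of a plot without touching the rest of the code.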

2.2 Placing the data and mapping options

Plots created with ggplot2 always start with the ggplot function. In the examples above, the data and mapping options were placed in this function. In this case they apply to each geom_ function that follows.

You can also place these options directly within a geom. In that case, they apply only to that specific geom.

Consider the following graph.

Figure 2.8: Color mapping in ggplot function (x-axis: exper; y-axis: wage; legend: sex)

# placing color mapping in the ggplot function
ggplot(plotdata,
       aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x,2),
              se = FALSE,
              size = 1.5)

Since the mapping of sex to color appears in the ggplot function, it applies to both geom_point and geom_smooth. The color of the point indicates the sex, and a separate colored trend line is produced for men and women. Compare this to

# placing color mapping in the geom_point function
ggplot(plotdata,
       aes(x = exper, y = wage)) +
  geom_point(aes(color = sex), alpha = .7, size = 3) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x,2),
              se = FALSE,
              size = 1.5)

Figure 2.9: Color mapping in geom_point function (x-axis: exper; y-axis: wage; legend: sex)

Since the sex to color mapping only appears in the geom_point function, it is only used there. A single trend line is created for all observations.

Most of the examples in this book place the data and mapping options in the ggplot function. Additionally, the phrases data= and mapping= are omitted since the first option always refers to data and the second option always refers to mapping.
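
The equivalence of the named and positional forms can be checked with a short sketch. The built-in mtcars data set and the names p_named and p_short are assumptions used only for illustration.

```r
library(ggplot2)

# Fully named arguments, as in the earlier examples.
p_named <- ggplot(data = mtcars,
                  mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Positional arguments: in ggplot() the first argument is the data
# and the second is the aesthetic mapping.
p_short <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Both plot objects carry the same data and the same mapped aesthetics.
same_data <- identical(p_named$data, p_short$data)
same_aes  <- identical(names(p_named$mapping), names(p_short$mapping))
print(c(same_data = same_data, same_aes = same_aes))
```

One caveat: in the geom_ functions the positional order is reversed (mapping comes first, data second). When placing options inside a geom, it is therefore safer to pass aes(...) as the first argument, as the geom_point example above does, or to name the arguments explicitly.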

2.3 Graphs as objects

A ggplot2 graph can be saved as a named R object (like a data frame), manipulated further, and then printed or saved to disk.

# prepare data
data(CPS85 , package = "mosaicData")
plotdata <- CPS85[CPS85$wage < 40,]

# create scatterplot and save it
myplot <- ggplot(data = plotdata,
                 aes(x = exper, y = wage)) +
  geom_point()
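
The following sketch illustrates the three operations named above: manipulating a saved graph further, printing it, and saving it to disk. It is an illustration under stated assumptions, not the book's own continuation: the built-in mtcars data set stands in for plotdata so the block runs without mosaicData, and myplot2 and outfile are hypothetical names.

```r
library(ggplot2)

# Stand-in for the saved scatterplot (assumption: mtcars replaces plotdata).
myplot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Manipulate further: add a layer and a title to the stored object.
# The original myplot is left unchanged.
myplot2 <- myplot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Scatterplot with a linear fit")

# Print: draws the graph on the active graphics device.
print(myplot2)

# Save to disk with ggsave(). The file name is hypothetical; a temporary
# directory is used so the sketch does not write into the working directory.
outfile <- file.path(tempdir(), "myplot.pdf")
ggsave(filename = outfile, plot = myplot2, width = 5, height = 4)
```

Because the graph is an ordinary R object, the same base plot can be reused to build several variants without repeating the data and mapping code.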
