Chapter #1:
Beginning of the End … Or the End of the Beginning?
The past few years have been challenging for Good Tunes & More (GT&M), a
business that traces its roots to Good Tunes, a store that exclusively sold music
CDs and vinyl records.
GT&M first broadened its merchandise to include home entertainment
and computer systems (the “More”), and then undertook an expansion to take
advantage of prime locations left empty by bankrupt former competitors. Today,
GT&M finds itself at a crossroads. Hoped-for increases in revenues that have
failed to occur and declining profit margins due to the competitive pressures of
online sellers have led management to reconsider the future of the business.
While some investors in the business have argued for an orderly retreat, closing
stores and limiting the variety of merchandise, GT&M CEO Emma Levia
has decided to “double down” and expand the business by purchasing Whitney
Wireless, a successful three-store chain that sells smartphones and other mobile
devices.
Levia foresees creating a brand new “A-to-Z” electronics retailer but
first must establish a fair and reasonable price for the privately held Whitney
Wireless.
To do so, she has asked a group of analysts to identify the data that
would be helpful in setting a price for the wireless business. As part of that
group, you quickly realize that you need the data that would help to verify the
contents of the wireless company’s basic financial statements.
You focus on data associated with the company’s profit and loss statement
and quickly realize the need for sales and expense-related variables.
You begin to think about what the data for such variables would look
like and how to collect those data. You realize that you are
starting to apply the DCOVA framework to the objective
of helping Levia acquire Whitney Wireless.
Chapter 1 Defining and Collecting Data
contents
1.1 Defining Variables
1.2 Collecting Data
1.3 Types of Sampling Methods
1.4 Types of Survey Errors
Think About This: New Media
Surveys/Old Sampling Problems
Using Statistics: Beginning of
the End … Revisited
Chapter 1 Excel Guide
Chapter 1 Minitab Guide
Objectives
Understand issues that arise
when defining variables
How to define variables
How to collect data
Identify the different ways to
collect a sample
Understand the types of
survey errors
Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.
Copyright © 2016 by Pearson Education, Inc.
ISBN: 978-1-323-26258-0
1.1 Defining Variables
When Emma Levia decides to purchase Whitney Wireless, she has defined a new
goal or business objective for GT&M. Business objectives can arise from any
level of management and can be as varied as the following:
• A marketing analyst needs to assess the effectiveness of a new online advertising campaign.
• A pharmaceutical company needs to determine whether a new drug is more effective
than those currently in use.
• An operations manager wants to improve a manufacturing or service process.
• An auditor needs to review a company’s financial transactions to determine whether the
company is in compliance with generally accepted accounting principles.
Establishing an objective marks the end of a problem definition process. This end triggers
the new process of identifying the correct data to support the objective. In the GT&M scenario,
having decided to buy Whitney Wireless, Levia needs to identify the data that would be helpful
in setting a price for the wireless business. This process of identifying the correct data triggers
the start of applying the tasks of the DCOVA framework. In other words, the end of problem
definition marks the beginning of applying statistics to business decision making.
Identifying the correct data to support a business objective is a two-part job that requires
defining variables and collecting the data for those variables. These are the first two tasks
of the DCOVA framework, first defined in Section GS.1, and can be restated here as:
• Define the variables that you want to study to solve a problem or meet an objective.
• Collect the data for those variables from appropriate sources.
This chapter discusses these two tasks, which must always be done before the Organize, Visualize,
and Analyze tasks.
Defining variables at first may seem to be the simple process of making the list of things one
needs to help solve a problem or meet an objective. However, consider the GT&M scenario.
Most would quickly agree that yearly sales of Whitney Wireless would be part of the data
needed to meet Levia’s objective, but just placing “yearly sales” on a list could lead to confusion
and miscommunication: Does this variable refer to sales per year for the entire chain or
for individual stores? Does the variable refer to net or gross sales? Are the yearly sales values
expressed in number of units or as currency amounts such as U.S. dollar sales?
These questions illustrate that for each variable of interest that you identify you must supply
an operational definition, a universally accepted meaning that is clear to all associated
with an analysis. Operational definitions should also classify the variable, as explained in the
next section, and may include additional facts such as units of measure, allowed range of
values, and definitions of specific variable values, depending on how the variable is classified.
Classifying Variables by Type
When you operationally define a variable, you must classify the variable as being either categorical
or numerical. Categorical variables (also known as qualitative variables) take categories
as their values. Numerical variables (also known as quantitative variables) have values
that represent a counted or measured quantity. Classification also affects a variable’s operational
definition, and getting the classification correct is important because certain statistical methods
can be applied correctly only to one type or the other, while other methods may need a specific mix
of variable types.
Categorical variables can take the form of yes-and-no questions such as “Do you have a
Twitter account?” (in which yes and no form the variable’s two categories) or describe a trait
or characteristic that has many categories such as undergraduate class standing (which might
have the defined categories freshman, sophomore, junior, and senior). When defining a categorical
variable, the list of permissible category values must be included and each category
value should be defined, too, e.g., that a “freshman” is a student who has completed fewer
than 32 credit hours. Overlooking these requirements can lead to confusion and incorrect data
collection. In one famous example, when persons were asked by researchers to fill in a value
for the categorical variable sex, many answered yes and not male or female, the values that the
researchers intended. (Perhaps this is the reason that gender has replaced sex on many data collection
forms: gender’s operational definition is more self-apparent.)
Student Tip
Providing operational definitions for concepts is important, too, when writing a textbook! The
end-of-chapter Key Terms gives you an index of operational definitions and the most fundamental
definitions are presented in boxes such as the page 3 box that defines variable and data.
The operational definitions of numerical variables are affected by whether the variable being
defined is discrete or continuous. Discrete variables such as “number of items purchased”
or “total amount paid” are numerical values that arise from a counting process. Continuous
variables such as “time spent on checkout line” or “distance from home to store” have numerical
values that arise from a measuring process and those values depend on the precision of the
measuring instrument used. For example, “time spent on checkout line” might be 2, 2.1, 2.14,
or 2.143 minutes, depending on the precision of the timing instrument being used. Units of
measure and the level of precision should be part of the operational definitions of continuous
variables, e.g., “tenths of a second” for “time spent on checkout line.” The definition of any
numerical variable can include the allowed range of values, such as “must be greater than 0”
for “number of items purchased.”
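One way to make operational definitions concrete is to record them as data that can be checked against collected values. The following Python sketch is only an illustration: the `VariableDefinition` class, its field names, and the example wording of each `meaning` are hypothetical, not part of the DCOVA framework or of any statistical package.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VariableDefinition:
    """Hypothetical record of one variable's operational definition."""
    name: str
    kind: str                             # "categorical", "discrete", or "continuous"
    meaning: str                          # agreed-upon wording shared by everyone involved
    units: Optional[str] = None           # units of measure, for numerical variables
    precision: Optional[float] = None     # smallest recorded increment, for continuous variables
    greater_than: Optional[float] = None  # exclusive lower bound of the allowed range, if any
    categories: Tuple[str, ...] = ()      # permissible values, for categorical variables

    def accepts(self, value) -> bool:
        """Return True if value is permitted by this definition."""
        if self.kind == "categorical":
            return value in self.categories
        # Numerical variables: reject non-numbers (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        # Discrete variables arise from counting, so only whole numbers qualify.
        if self.kind == "discrete" and not isinstance(value, int):
            return False
        return self.greater_than is None or value > self.greater_than


items = VariableDefinition(
    name="number of items purchased", kind="discrete",
    meaning="count of items on one sales receipt",
    units="items", greater_than=0)

checkout = VariableDefinition(
    name="time spent on checkout line", kind="continuous",
    meaning="elapsed time from joining the line to reaching the register",
    units="seconds", precision=0.1, greater_than=0)

standing = VariableDefinition(
    name="class standing", kind="categorical",
    meaning="undergraduate class standing based on completed credit hours",
    categories=("freshman", "sophomore", "junior", "senior"))

print(items.accepts(3), items.accepts(0), standing.accepts("yes"))
```

Writing the definition down in this form forces the questions raised earlier (which units, which range, which categories) to be answered before any data are collected.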
When defining variables for survey collection (discussed in Section 1.2), thinking about
the responses you seek helps classify variables as Table 1.1 demonstrates. Thinking about how
a variable will be used to solve a problem or meet an objective can also be helpful when you
define a variable. The variable age might be a numerical (discrete) variable in some cases or
might be categorical with categories such as child, young adult, middle-aged, and retirement
aged in other contexts.
Problems for Section 1.1
Learning the Basics
1.1 Four different beverages are sold at a fast-food restaurant:
soft drinks, tea, coffee, and bottled water. Explain why the
type of beverage sold is an example of a categorical variable.
1.2 U.S. businesses are listed by size: small, medium, and large. Explain
why business size is an example of a categorical variable.
1.3 The time it takes to download a video from the Internet is
measured. Explain why the download time is a continuous
numerical variable.
Applying the Concepts
SELF Test
1.4 For each of the following variables, determine
whether the variable is categorical or numerical. If the
variable is numerical, determine whether the variable is discrete or
continuous.
a. Number of cellphones in the household
b. Monthly data usage (in MB)
c. Number of text messages exchanged per month
d. Voice usage per month (in minutes)
e. Whether the cellphone is used for email
1.5 The following information is collected from students upon
exiting the campus bookstore during the first week of classes.
a. Amount of time spent shopping in the bookstore
b. Number of textbooks purchased
c. Academic major
d. Gender
Classify each of these variables as categorical or numerical. If the
variable is numerical, determine whether the variable is discrete or
continuous.
1.6 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Name of Internet service provider
b. Time, in hours, spent surfing the Internet per week
c. Whether the individual uses a mobile phone to connect to the
Internet
d. Number of online purchases made in a month
e. Whether the individual uses social networks to find sought-after
information
Learn More
Read the Short Takes for Chapter 1 for more examples of classifying variables
as either categorical or numerical.
Table 1.1
Identifying Types of Variables
Question                                                        Responses         Variable Type
Do you have a Facebook profile?                                 ❑ Yes ❑ No        Categorical
How many text messages have you sent in the past three days?    ______            Numerical (discrete)
How long did the mobile app update take to download?            ______ seconds    Numerical (continuous)
1.7 For each of the following variables, determine whether the
variable is categorical or numerical. If the variable is numerical,
determine whether the variable is discrete or continuous.
a. Amount of money spent on clothing in the past month
b. Favorite department store
c. Most likely time period during which shopping for clothing
takes place (weekday, weeknight, or weekend)
d. Number of pairs of shoes owned
1.8 Suppose the following information is collected from Robert
Keeler on his application for a home mortgage loan at the Metro
County Savings and Loan Association.
a. Monthly payments: $2,227
b. Number of jobs in past 10 years: 1
c. Annual family income: $96,000
d. Marital status: Married
Classify each of the responses by type of data.
1.9 One of the variables most often included in surveys is income.
Sometimes the question is phrased “What is your income
(in thousands of dollars)?” In other surveys, the respondent is
asked to “Select the circle corresponding to your income level”
and is given a number of income ranges to choose from.
a. In the first format, explain why income might be considered
either discrete or continuous.
b. Which of these two formats would you prefer to use if you
were conducting a survey? Why?
1.10 If two students score a 90 on the same examination,
what arguments could be used to show that the underlying
variable—test score—is continuous?
1.11 The director of market research at a large department store
chain wanted to conduct a survey throughout a metropolitan area
to determine the amount of time working women spend shopping
for clothing in a typical month.
a. Indicate the type of data the director might want to collect.
b. Develop a first draft of the questionnaire needed in (a) by writing
three categorical questions and three numerical questions
that you feel would be appropriate for this survey.
1.2 Collecting Data
After defining the variables that you want to study, you can proceed with the data collection
task. Collecting data is a critical task because if you collect data that are flawed by biases,
ambiguities, or other types of errors, the results you will get from using such data with even
the most sophisticated statistical methods will be suspect or in error. (For a famous example of
flawed data collection leading to incorrect results, read the Think About This essay on page 21.)
Data collection consists of identifying data sources, deciding whether the data you collect
will be from a population or a sample, cleaning your data, and sometimes recoding variables.
The rest of this section explains these aspects of data collection.
Data Sources
You collect data from either primary or secondary data sources. You are using a primary data
source if you collect your own data for analysis. You are using a secondary data source if the
data for your analysis have been collected by someone else.
You collect data by using any of the following:
• Data distributed by an organization or individual
• The outcomes of a designed experiment
• The responses from a survey
• The results of conducting an observational study
• Data collected by ongoing business activities
Market research companies and trade associations distribute data pertaining to specific industries
or markets. Investment services provide business and financial data on publicly listed
companies. Syndicated services such as The Nielsen Company provide consumer research data to
telecom and mobile media companies. Print and online media companies also distribute data that
they may have collected themselves or may be republishing from other sources.
The outcomes of a designed experiment are a second data source. For example, a consumer
electronics company might conduct an experiment that compares the sales of mobile
electronics merchandise for different store locations. Note that developing a proper experimental
design is mostly beyond the scope of this book, but Chapter 10 discusses some of the
fundamental experimental design concepts.
Survey responses represent a third type of data source. People being surveyed are asked
questions about their beliefs, attitudes, behaviors, and other characteristics. For example,
people could be asked which store location for mobile electronics merchandise is preferable.
(Such a survey could lead to data that differ from the data collected from the outcomes of the
designed experiment of the previous paragraph.) Surveys can be affected by any of the four
types of errors that are discussed in Section 1.4.
Observational study results are a fourth data source. A researcher collects data by directly
observing a behavior, usually in a natural or neutral setting. Observational studies are a common
tool for data collection in business. For example, market researchers use focus groups
to elicit unstructured responses to open-ended questions posed by a moderator to a target audience.
Observational studies are also commonly used to enhance teamwork or improve the
quality of products and services.
Data collected by ongoing business activities are a fifth data source. Such data can be
collected from operational and transactional systems that exist in both physical “bricks-and-mortar”
and online settings but can also be gathered from secondary sources such as third-party
social media networks and online apps and website services that collect tracking and usage data.
For example, a bank might analyze a decade’s worth of financial transaction data to identify
patterns of fraud, and a marketer might use tracking data to determine the effectiveness of a
website.
Sources for big data (see Section GS.3) tend to be a mix of primary and secondary sources
of this last type. For example, a retailer interested in increasing sales might mine Facebook and
Twitter accounts to identify sentiment about certain products or to pinpoint top influencers and
then match those data to its own data collected during customer transactions.
Populations and Samples
You collect your data from either a population or a sample. A population consists of all the
items or individuals about which you want to reach conclusions. All the GT&M sales transactions
for a specific year, all the full-time students enrolled in a college, and all the registered
voters in Ohio are examples of populations. In Chapter 3, you will learn that when you analyze
data from a population you compute parameters.
A sample is a portion of a population selected for analysis. The results of analyzing a
sample are used to estimate characteristics of the entire population. From the three examples
of populations just given, you could select a sample of 200 GT&M sales transactions randomly
selected by an auditor for study, a sample of 50 full-time students selected for a marketing
study, and a sample of 500 registered voters in Ohio contacted via telephone for a political
poll. In each of these examples, the transactions or people in the sample represent a portion of
the items or individuals that make up the population. In Chapter 3, you will learn that when
you analyze data from a sample you compute statistics.
You collect data from a sample when any of the following applies:
• Selecting a sample is less time consuming than selecting every item in the population.
• Selecting a sample is less costly than selecting every item in the population.
• Analyzing a sample is less cumbersome and more practical than analyzing the entire
population.
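The distinction between a population and a sample, and between a parameter and a statistic, can be sketched in a few lines of Python. The transaction amounts below are randomly generated placeholders, not GT&M data, and the sample is drawn with the standard library's `random.sample`, which selects items without replacement.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: sale amounts (in dollars) for 5,000 transactions.
population = [round(random.uniform(5, 500), 2) for _ in range(5000)]

# A sample of 200 transactions, randomly selected without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)  # computed from the population: a parameter
sample_mean = statistics.mean(sample)          # computed from the sample: a statistic

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean uses only 200 of the 5,000 values, which is why sampling can be less time consuming and less costly, yet it serves as an estimate of the population mean.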
Structured Versus Unstructured Data
The data you collect may be formatted in a variety of ways, some of which add to the data
collection task. For example, suppose that you wanted to collect electronic financial data
about a sample of companies. That data might exist as tables of data, the contents of standardized
documents such as fill-in-the-blank surveys, a continuous stream of data such as a
stock ticker, or text messages or emails delivered from email systems or social media websites.
Some of these forms, such as a set of text messages, have very little or no repeating
structure and are examples of unstructured data. Although unstructured data can form a
part of a big data collection, collecting data in unstructured forms for the statistical methods
discussed in this book requires conversion of the data to a structured form. For example,
after collecting text messages, you could convert their contents to a structured form by defining
a set of variables that might include a numerical variable that counts the number of
words in the message and various categorical variables that help classify the content of the
message.
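The conversion of text messages to a structured form might look like the following Python sketch. The messages, the variable names, and the simple keyword rules used to classify content are all hypothetical; real classification of message content would usually need more careful rules.

```python
# Hypothetical unstructured data: three customer text messages.
messages = [
    "Is the new phone in stock at the downtown store?",
    "Great service today, thanks!",
    "My order arrived damaged. Please call me.",
]


def to_record(text: str) -> dict:
    """Convert one message into a row of structured variables."""
    lowered = text.lower()
    return {
        # Numerical (discrete): number of whitespace-separated words.
        "word_count": len(text.split()),
        # Categorical (yes/no): does the message ask a question?
        "is_question": "yes" if "?" in text else "no",
        # Categorical (yes/no): does the message report a problem?
        "mentions_problem": "yes" if "damaged" in lowered else "no",
    }


records = [to_record(m) for m in messages]
for row in records:
    print(row)
```

Each message becomes one row with the same three variables, which is the kind of repeating structure that the statistical methods in this book require.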
Learn More
Read the Short Takes for Chapter 1 for a further discussion about data sources.
Student Tip
To help remember the difference between a sample and a population, think of a pie. The
entire pie represents the population, and the pie slice that you select is the sample.
Electronic Formats and Encodings
The same form of data can exist in more than one electronic format, with some formats more
immediately usable than others. For example, a table of data might exist as a scanned image
or as data in a worksheet file. The worksheet data could be immediately used in a statistical
analysis, but the scanned image would need to be first converted to worksheet data using a
character-scanning program that can recognize numbers in an image.
Data can also be encoded in more than one way, as you may have learned in an information
systems course. Different encodings may affect the recorded precision of values for
continuous variables and lead to values that are less precise or values that convey a false sense of
precision, such as a time measurement that gets encoded in ten-thousandths of a second when
the original measurement was only in tenths of a second. This changed precision can violate
the operational definition of a continuous variable and sometimes affect calculated results.
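A small Python sketch can illustrate how re-encoding changes recorded precision. The measurement value is hypothetical, and the three encodings shown are only examples of what different systems might do with the same value.

```python
from decimal import Decimal

measured = "12.3"  # seconds, recorded to tenths of a second (hypothetical value)

# Decimal keeps exactly the digits that were recorded.
as_recorded = Decimal(measured)

# Re-encoding with four decimal places shows digits that were never measured.
overstated = f"{float(measured):.4f}"

# Re-encoding as a whole number discards information that was measured.
understated = round(float(measured))

print(as_recorded, overstated, understated)
```

Only the first encoding agrees with an operational definition that specifies tenths of a second; the second conveys a false sense of precision and the third is less precise than the original measurement.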
Data Cleaning
Whatever ways you choose to collect data, you may find irregularities in the values you collect
such as undefined or impossible values. For a categorical variable, an undefined value would
be a value that does not represent one of the categories defined for the variable. For a numerical
variable, an impossible value would be a value that falls outside a defined range of possible
values for the variable. For a numerical variable without a defined range of possible values,
you might also find outliers, values that seem excessively different from most of the rest of the
values. Such values may or may not be errors, but they demand a second review.
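The three kinds of irregularities just described can each be flagged with a short check. The Python sketch below uses hypothetical data and function names. The outlier rule (values more than 1.5 interquartile ranges beyond the quartiles) is an assumption of this sketch, a common rule of thumb rather than a definition given in this section.

```python
import statistics

ALLOWED_STANDING = {"freshman", "sophomore", "junior", "senior"}


def undefined_categories(values, allowed):
    """Return values that match none of the defined categories."""
    return [v for v in values if v not in allowed]


def impossible_values(values, low, high):
    """Return values that fall outside the defined range low..high."""
    return [v for v in values if not (low <= v <= high)]


def possible_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


standing = ["junior", "senior", "yes", "freshman"]
items_purchased = [3, 1, 0, 7, -2]  # hypothetical defined range: 1 to 100 items
order_totals = [20, 22, 19, 21, 23, 20, 22, 250]  # no defined range

print(undefined_categories(standing, ALLOWED_STANDING))
print(impossible_values(items_purchased, 1, 100))
print(possible_outliers(order_totals))
```

The checks only flag values for a second review. Whether a flagged value is an error is a judgment that depends on the variable's operational definition and on how the data were collected.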
Values that are missing are another type of irregularity. A missing value is a value that was
not able to be collected (and therefore not available to be analyzed). For example, you would
record a nonresponse to a survey question as a missing value. You can represent missing values
in Minitab by using an asterisk value for a numerical variable or by using a blank value for a
categorical variable, and such values will be properly excluded from analysis. The more limited
Excel has no special values that represent a missing value. When using Excel, you must
find and then exclude missing values manually.
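The manual step of finding and excluding missing values can be sketched in Python as follows. Using `None` to mark a nonresponse is a convention chosen for this sketch, and the survey responses are hypothetical.

```python
import statistics

# None marks a survey nonresponse (a convention assumed for this sketch).
responses = [4, None, 5, 3, None, 4]

# Exclude missing values before any calculation.
collected = [r for r in responses if r is not None]
missing_count = len(responses) - len(collected)

print(f"{missing_count} missing of {len(responses)}")
print(f"mean of collected values: {statistics.mean(collected)}")
```

Reporting the number of missing values alongside the result makes clear how many responses the calculation is actually based on.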
When you spot an irregularity in the data you have collected, you may have to “clean” the
data. Although a full discussion of data cleaning is beyond the scope of this book (see reference
8), you can learn more about the ways you can use Excel or Minitab for data cleaning in
the Short Takes for Chapter 1.
Recoding Variables
After you have collected data, you may discover that you need to reconsider the categories that
you have defined for a categorical variable or that you need to transform a numerical variable
into a categorical variable by assigning the individual numeric data values to one of several
groups. In either case, you can define a recoded variable that supplements or replaces the
original variable in your analysis.
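As an illustration, the numerical variable age could be recoded into the categories mentioned in Section 1.1. In the Python sketch below, the cut points (18, 40, and 65 years) and the function name are hypothetical; an actual analysis would take them from the recoded variable's operational definition.

```python
def recode_age(age: int) -> str:
    """Assign an age in years to one of four age groups (hypothetical cut points)."""
    if age < 0:
        raise ValueError("age must be 0 or greater")
    if age < 18:
        return "child"
    if age < 40:
        return "young adult"
    if age < 65:
        return "middle-aged"
    return "retirement aged"


ages = [9, 23, 41, 67, 18]
age_groups = [recode_age(a) for a in ages]  # recoded variable; ages is kept as well

print(age_groups)
```

Because the groups are defined by non-overlapping ranges that cover every allowed age, each value is assigned to exactly one category. Keeping the original `ages` list means the recoded variable supplements the original variable rather than replacing it.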