The Quantitative Analysis Approach & Examples of Quantitative Analysis in Research
Research is a formal or informal process to answer questions where the answer is not known. Sounds weird – if you have a question, isn’t the answer obviously unknown? Not really. While you may not know, for example, whether males and females at the national level are paid the same for equal work, a lot of others have done work in this area and do know the answer. So, if you looked up their work and found out the answer, you would not really be doing research – since someone already knows the answer. You would technically be doing a literature review – finding out what is already known about the topic. If that information answers your inquiry, then you are done.
If not, then we get into research. Using what is already known and unknown about a topic, we frame a research question and set out to find the information needed to answer it. This process – whether done formally for a research paper or informally to answer a question for the boss – can be simple or complex. At a minimum, it will involve identifying and collecting some data that can be used to answer your question. How to set up this process, how to identify the people or things you need to collect the data on, how to select a representative sample, etc. are all parts of a research course.
The basic research model involves:
· Recognizing a problem or issue,
· Reviewing what is already known about the issue,
· Clearly defining the question to be answered,
· Identifying the information needed to answer the question,
· Developing the sampling process/approach to obtain the data,
· Collecting the data,
· Analyzing the data (turning data into information), and
· Using the analysis results to answer the research question.
Note that in most formal research studies, everything is planned before the first piece of data is collected. Even how the data will be analyzed is defined. This is often quite important, as when planning the analysis it may become clear that additional information is needed. One company research study on customer needs had to be tossed: when the results were reported, it became clear that not all customer groups responded in the same way, and the study did not allow the company to identify which groups had which reactions – so no actions could be taken that were tailored to each group.
Our course is Statistics in Research. This implies we are focused on the planning of how to analyze the data we will be collecting. As we examine statistical tools and approaches, we will look at published research studies to see how these tools are used; and how these tools can be used to answer research questions.
So, what does statistics in research involve? Before answering this, we need to look at how we get the data used in studies. In general, research – whether formal academic studies or less formal organizational projects – collects a sample of all possible data (called the population) that relates to the question being studied. This occurs for a variety of reasons, including the cost or time involved and practical limitations that make gathering everything impossible (destructive testing, for example, destroys the product, so we cannot test everything).
Since we are limited to only a portion of the information/data, our decisions are made with some degree of uncertainty; our sample data results will be “close” to the population values but not the exact value. In some courses, this idea is presented as decision making under uncertainty; but this is the reality for most research and managerial decision making. Statistics has the ability to recognize and quantify this uncertainty – a way of “improving the odds” in our decision making, so to speak.
Once we know the data we will be working with (either through the plan or having the actual values), we have two distinct activities. Describing the data through graphs/tables and summary descriptive statistics starts our journey. We finish it with inferential statistics – making judgments about what populations look like based on sample results (much like the opinion polls we are exposed to constantly). We will deal with inferential statistics in a few weeks, as we first need to describe and summarize the data we will be working with.
A couple of definitions we need to know:
· Population: includes all the “things” we are interested in; for example, the population of the US would include everyone living in the country.
· Sample: involves only a selected sub-group of the population
· Random Sample: a sample where every member of the population has an initial equal chance of being selected; the only way to obtain a sample that is truly representative of the entire population.
· Parameter: a characteristic of the population; the average age of everyone in the US would be a parameter.
· Statistic: a characteristic of a sample; the average age of everyone you know who attends school would be a statistic as the group is a sub-group of all students.
· Descriptive Statistics: measures that summarize characteristics of a group.
· Inferential Statistics: measures that summarize the characteristics of a random sample and are used to infer population parameters.
· Statistical test: a quantitative technique to make a judgement about population parameters based upon random sample outcomes (statistics).
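To make the parameter/statistic distinction concrete, here is a minimal Python sketch (the ages, population size, and sample size are invented for illustration, not course data): the mean of the whole group is a parameter, while the mean of a random sample drawn from it is a statistic.

```python
import random
from statistics import mean

random.seed(1)  # so the example is repeatable

# Hypothetical population: ages of 1,000 people (made-up values)
population_ages = [random.randint(18, 80) for _ in range(1000)]

# Parameter: a characteristic of the whole population
population_mean = mean(population_ages)

# Random sample: every member had an equal chance of selection
sample_ages = random.sample(population_ages, 50)

# Statistic: the same characteristic, but computed from the sample
sample_mean = mean(sample_ages)

print(f"Population mean (parameter): {population_mean:.1f}")
print(f"Sample mean (statistic):     {sample_mean:.1f}")
```

Note that the sample mean will be close to, but usually not exactly equal to, the population mean – the uncertainty discussed above.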
What kind of data do we have?
All data has some value, but not all data is equally useful. In general, we have four kinds of data (NOIR):
· Nominal: these are basically names or labels. Examples of nominal data include gender (Male or Female), names of cars (Ford, Chevrolet, Dodge, etc.), cities and states, flowers, etc. Anything where the name/label just indicates a difference from something else that is similar is nominal level data. Nominal level data are used in two ways. First, we can count them – how many males and females exist in a group, for example. Second, we can use them as group labels to identify different groups, and list other characteristics in each group; for example, a compensation study might want a list of all male and female salaries.
· Ordinal: these variables add a sense of order to the difference, but where the differences are not the same between levels. Often, these variables are based on judgement calls creating labels that can be placed in a rank order, such as good, better, best. Ordinal data are also frequently used as grouping variables. There are some statistical techniques that are used to identify differences in these variables, but they will not be covered in this class.
· Interval: these variables have a constant difference between successive values. Temperature is a common example – the difference between, say, 45 and 46 degrees is the same amount of heat as between 87 and 88 degrees. Interval level data are the first level that we can do manipulation on, such as determining averages, differences, etc. Note: analysts often assume that personal judgment scores (by definition ordinal), such as scores in Olympic events like skating or responses on a 1-to-5 questionnaire scale, are interval for analysis purposes, even though it cannot be proven that the differences are constant.
· Ratio: these are interval measures that add a 0 point that means none. For example, 0 dollars in your wallet means no money, while a temperature of 0 degrees does not mean no heat. Ratio level variables include length, time, volume, etc. For analysis purposes, ratio and interval data are treated the same.
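As a rough illustration of how the measurement level limits what we can do with a variable, the short Python sketch below (all values invented) counts nominal labels, orders ordinal labels, and averages ratio values.

```python
from collections import Counter
from statistics import mean, median

# Nominal: labels -- we can only count them or use them as group identifiers
cars = ["Ford", "Chevrolet", "Ford", "Dodge", "Ford"]
print(Counter(cars))  # how many of each label

# Ordinal: ranked labels -- counting and ordering make sense, averaging does not
ratings = ["good", "better", "best", "good", "best"]
rank = {"good": 1, "better": 2, "best": 3}
print(sorted(ratings, key=rank.get))  # placed in rank order

# Interval/ratio: numeric values -- means, differences, etc. are meaningful
salaries = [48000, 52000, 61000, 45000]
print(mean(salaries), median(salaries))
```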
Descriptive Statistics
OK, let’s start making sense out of what might initially appear to be big, messy, and overwhelming: a data set. Generally, the first thing we want to do with our data – develop clues as to what the data is hiding from us – is to summarize the data into descriptive statistics. In general, descriptive statistics provide information in four areas:
· Location: these show central tendencies and include mean, mode, and median;
· Consistency: these show the variability in the data and include range, variance, and standard deviation;
· Position: these measures show relative placement of data, where a particular data point lies within the data set, and include measures such as z-score, percentile, quartile; and
· Likelihood: these show how common or rare a particular outcome is, and involve probability estimates such as empirical, theoretical, and subjective probabilities.
Note that this is not the complete list of possible descriptive statistics. Excel’s Descriptive Statistics function includes a couple of measures that focus on the shape of the data distribution. These have some specialized uses that we will not be getting into.
Location or Center
Perhaps the most often asked question about data sets is what is the average? Unfortunately, average is a somewhat imprecise term that could refer to any of three measures of location (AKA central tendency). So, as researchers/analysts we need to be more precise and use mean, median, and mode.
· Mean, AKA the most typical meaning of average, is the sum of all the values divided by the count. For example, the mean of 1, 2, 3, 4, and 5 = (1+2+3+4+5)/5 = 15/5 = 3. The mean is generally the best measure for any data set as it uses all the data values; it requires interval or ratio level data.
· The median is the middle value in an ordered (listed from low to high) data set. For example, the median of 1, 2, 3, 4, and 5 is 3, the middle value. If we have an even number of values, the median is the average of the middle two values. Medians can be found on ordinal, interval, or ratio level data.
· The mode is the most frequently occurring value. A data set may have no mode, one mode, or several modes. Modes may occur with any level of data. The data set 1, 1, 2, 2, 2, 2, 3, 8, 8, 9 has a primary mode of 2, and two secondary modes of 1 and 8.
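All three measures are available in Python’s standard statistics module; here is a quick check on the small data sets used above (the comments show the expected results):

```python
from statistics import mean, median, mode, multimode

data = [1, 2, 3, 4, 5]
print(mean(data))    # (1+2+3+4+5)/5 = 3
print(median(data))  # middle value of the ordered list = 3

values = [1, 1, 2, 2, 2, 2, 3, 8, 8, 9]
print(mode(values))       # single most frequent value = 2
print(multimode(values))  # all values tied for the highest count: [2]
```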
While these all tell us something about where the data might be clustered, they can provide very different views of the data. Consider an example heard back in High School. At that time, the mean (average) per capita income for citizens of Kuwait was about $25,000; the median (middle) income was around $125; and the mode (most common) was $25! The very high (due to oil revenues) income of the Royal family accounted for much of this difference, but just consider the different impressions we get about the country depending on which value we look at.
Consistency/Variation
While they do not have the popularity of their location cousins, knowing the variation within the data is as important as – some say even more important than – knowing the central tendency if we are to understand what the data is trying to tell us. Very consistent data, with little variation, has a mean that is very representative of the data and is unlikely to change much if we resample the population. Data with a large amount of variation tends to have unstable means – values that change a lot across multiple samples. Inconsistent data is often a problem for businesses, particularly manufacturing operations, as it means the results they produce differ, and often will not meet the quality specifications. Predictions based on data with large variations are rarely useful. Consider attempting to estimate how long it would take you to get to work if your route had frequent traffic accidents that made the travel time different every day.
The key measures of variation are:
· Range, which equals the maximum value minus the minimum value. For our example data set of 1, 2, 3, 4, and 5, the range is 5 – 1 = 4.
· Variance, which is the average of the squared differences between each value in the data set and the mean. To get the variance, find the mean of the data, subtract it from each data point, square each result (to get rid of the negative differences), add them up, and divide by the total count. For our example data set, this would look like:
Value   Mean   Difference   Squared
1       3      -2           4
2       3      -1           1
3       3       0           0
4       3       1           1
5       3       2           4
               Sum = 0      Sum = 10
Variance = 10/5 = 2
The problem with variance is that it is expressed in units squared. So, if our data set were dollars, the variance would be 2 dollars squared – how should we interpret dollars squared?
· Standard Deviation is the positive square root of the variance. It returns the dispersion measure to the same units as the original data, so we can compare it to the data values. For our example, the standard deviation is the square root of 2 dollars squared, or about 1.4 dollars. This much easier to understand measure tells us that values typically lie about 1.4 dollars away from the mean value of 3 dollars. Later we will use this measure in additional ways.
The variance and standard deviation calculation formulas differ depending on whether we have a population or a sample. When we find these values for a population, the entire group we are interested in, we divide the numerator by the population size (count). However, when we have a sample of the entire group, as we have with our group, we divide the numerator by (sample size – 1). This adjustment increases the sample estimates to account for the fact that we most likely do not have the extreme low and extreme high values of the population in our sample, so its variation is less than that of the group we want to describe (the population). Fortunately, Excel recognizes this and lets us choose the population or sample versions of the variance and standard deviation calculations.
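Here is a minimal Python check of the hand calculation above, and of the population-versus-sample distinction; the statistics module, like Excel, provides both versions:

```python
from statistics import pvariance, pstdev, variance, stdev

data = [1, 2, 3, 4, 5]

# Population formulas: divide the sum of squared differences by N
print(pvariance(data))  # 10 / 5 = 2
print(pstdev(data))     # sqrt(2), about 1.41

# Sample formulas: divide by (n - 1), which increases the estimates
print(variance(data))   # 10 / 4 = 2.5
print(stdev(data))      # sqrt(2.5), about 1.58
```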
Distributions. Location and variation measures are important for summarizing the data set. Important as they are, they do not always give us all the information we need. At times we need to examine the distribution of the data. Also called shape, this shows us how all the data values relate to the other values within the sample.
One important tool in analyzing data sets is graphical analysis – looking at how data sets are distributed when graphed. One example will show how powerful these techniques can be. One very common graph is a histogram – a count of how many times a certain value occurs. For example, if you tossed a pair of dice 50 times, you might get the results shown in the table below.
Outcomes from tossing a pair of dice
Total showing     2   3   4   5   6   7   8   9   10   11   12
Frequency seen    1   2   4   3   9  12   7   5    4    1    2
Distributions are studied to see whether the shape/counts fit expected patterns. This analysis is often accompanied by graphical analysis, which will be covered next week.
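A frequency table like this can also be produced by simulation. The Python sketch below is illustrative only; the counts from any one run of 50 random tosses will differ from the table above.

```python
import random
from collections import Counter

random.seed(42)  # repeatable example

# Simulate 50 tosses of a pair of dice and total the faces
totals = [random.randint(1, 6) + random.randint(1, 6) for _ in range(50)]

# Count how many times each total (2 through 12) occurred
frequencies = Counter(totals)
for total in range(2, 13):
    print(total, frequencies.get(total, 0))
```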
Position Measures
Central tendency and variation are group descriptive measures – particularly the mean and standard deviation, which use all the values in the data set in their calculation. At times, however, we are concerned with specific values within the distribution, such as:
· Quartiles,
· Percentiles, or centiles,
· The 5-number summary, or
· Z-score.
Quartiles and Percentiles. These measures divide the data into groups, four with the quartile and 100 with the percentile. The general percentile formula lets us find percentiles, deciles (the 10% divisions), and/or quartiles, although Excel will do this for us. The formula is:
Lp = (n+1) * P/100; where
Lp is the location of the desired percentile within the ordered data set (using P = 25 gives the location of the first quartile, for example), n is the size/count of the data set, and P is the desired percentile; using P = 25, 50, or 75 gives the quartile points, while using 10, 20, etc. gives the decile points.
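As a sketch of the arithmetic (Excel’s percentile functions handle this for us and may use a slightly different interpolation), the hypothetical helper below applies Lp = (n + 1) * P/100 to the small example data set and interpolates between neighboring ordered values when Lp is not a whole number.

```python
def percentile_location(data, p):
    """Value at the Pth percentile using the location formula Lp = (n + 1) * P / 100."""
    ordered = sorted(data)
    n = len(ordered)
    lp = (n + 1) * p / 100
    lower = int(lp)            # whole-number part of the location
    fraction = lp - lower      # fractional part, used to interpolate
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [1, 2, 3, 4, 5]
print(percentile_location(data, 25))  # location (5+1)*0.25 = 1.5, giving 1.5
print(percentile_location(data, 50))  # location 3, giving the median value 3
```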
5-Number Summary. As its name suggests, this identifies five key values in a data set: the minimum value, 1st quartile, median (2nd quartile), 3rd quartile, and maximum value. The following example shows how we can compare different groups, even if we are not sure of what is being measured. We can compare three data distributions with their 5-number summaries:
· Males: 0.871, 1.018, 1.057, 1.134, 1.175
· Females: 0.957, 1.025, 1.069, 1.129, 1.211
· Overall: 0.871, 1.018, 1.063, 1.133, 1.211
In comparing these, we see that, on this measure, the males start and end a bit lower than do the females.
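A small Python helper can generate a 5-number summary; the sketch below uses the simple 1-to-5 example data set rather than the compa-ratio data, and follows one common quartile convention (excluding the median from each half), so Excel’s quartile functions may give slightly different values.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a data set."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    lower_half = ordered[:mid]                                  # values below the median
    upper_half = ordered[mid + 1:] if n % 2 else ordered[mid:]  # values above the median
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

print(five_number_summary([1, 2, 3, 4, 5]))  # (1, 1.5, 3, 4.5, 5)
```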
Z-score. What is often of more value is looking at where specific measures lie within each range. The z-score shows how far from the mean a specific data point lies, measured in standard deviation units. These are specific measures that use the data distribution. One of the more commonly observed distributions in real life is the normal, or bell-shaped, curve. The mean = mode = median in the middle, and values away from the mean fall off in a similar pattern in each tail. A picture of 3 normal curves with overlapping data ranges is shown below. Many naturally occurring outcomes fit this distribution.
A z-score for a specific data value within a normal curve distribution is found by subtracting the mean from that score and dividing the result by the standard deviation of the data set. For example, in our example data set (1, 2, 3, 4, and 5), the z-score for 2 would be (2-3)/1.4 = -1/1.4 = -0.71. The negative value means that 2 is below (or less than) the mean, and is 0.71 standard deviation units away from the mean (0.71 times the standard deviation of 1.4 ≈ 1).
The Z-score provides a measure of how many standard deviations a particular score lies from the mean, and in what direction (above or below). The Z-score formula is:
Z = (individual score – mean) / (standard deviation)
Using this measure, we can easily examine the relative placement of scores. For example, a compa-ratio of 1.06 would have z-scores of 0.04 for males, -0.13 for females, and -0.03 for the overall group. Thus, we can see that a person with this compa-ratio is slightly above average for males, but below average for the overall group and for females. Note: this will be discussed in more detail in chapter 7 (Week 3); for now, we are merely noting a location measure.
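Here is a quick Python check of the z-score formula, using the earlier 1-to-5 example data set and the population standard deviation, as in the hand calculation above:

```python
from statistics import mean, pstdev

def z_score(x, data):
    """How many standard deviations x lies from the mean of the data set."""
    return (x - mean(data)) / pstdev(data)

data = [1, 2, 3, 4, 5]
print(round(z_score(2, data), 2))  # (2 - 3) / 1.41, about -0.71
```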
Likelihood/Probability
Probability is the likelihood that an event will happen. For example, if we toss a fair coin, we have a 50/50 chance, or a probability of .5, of getting a head. If we pick a date between the 1st and 7th of the current month, we have a 1 out of 7 chance (a probability of 1/7 = .14, or 14%) that it will be a Wednesday. Statisticians recognize three types of probabilities:
· Theoretical – based on a theory; for example, since a die (half of a pair of dice) has 6 sides, and our theory says each face is equally likely to show up when we toss it, we expect to see a 1 on 1/6th of the tosses (assuming we toss it a lot).
· Empirical – count based; if we see that an accident happens on our way to work 5 times (days) in every 4 weeks, we can say the probability of an accident today is 5/20, or 25%, since there are 20 work days in a 4-week period. An empirical probability equals the number of successes we see divided by the number of times we could have seen the outcome.
· Subjective – a guess based on some experience or feeling.
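A short simulation can contrast theoretical and empirical probability for a fair die; the number of rolls below is arbitrary, and the empirical value will drift toward 1/6 as the number of rolls grows.

```python
import random

random.seed(7)  # repeatable example

rolls = 6000  # arbitrary number of tosses
ones = sum(1 for _ in range(rolls) if random.randint(1, 6) == 1)

print("Theoretical probability:", 1 / 6)         # about 0.167
print("Empirical probability:  ", ones / rolls)  # successes / attempts
```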
There are some basic probability rules that will be helpful during the course. The probability
· of something (an event) happening is called P(event),
· of two things happening together – called joint probability: P(A and B),
· of either one or the other but not both events occurring – P(A or B),
· of something occurring given that something else has occurred, conditional probability: P(A|B) (read as probability of A given B).
· Complement rule: P(not A) = 1 – P(A).
Two other terms are needed to understand probability. Mutually exclusive means that the elements of one data set do not belong to another – for example, male and pregnant are mutually exclusive data sets. The other term we frequently hear with probability is collectively exhaustive – this simply means that all members of the data set (all possible outcomes) are listed.
Some rules, which apply for both theoretical and empirical based probabilities, for dealing with these different probability situations include:
· P(event) = (number of success)/(number of attempts or possible outcomes)
· P(A and B) = P(A)*P(B) for independent events, or P(A)*P(B|A) for dependent events (this last term, P(B|A), is a conditional probability – the probability of B occurring given that A has occurred).
· P(A or B) = P(A) + P(B) – P(A and B); if A and B cannot occur together (such as the example of male and pregnant) then P(A and B) = 0.
· P(A|B) = P(A and B)/P(B).
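Here is a small numerical sketch of these rules in Python using a single fair die; the events are invented for illustration (A = the roll is even, B = the roll is greater than 3), and each rule can be verified by counting outcomes.

```python
from fractions import Fraction

outcomes = range(1, 7)                   # one fair die
A = {o for o in outcomes if o % 2 == 0}  # event A: roll is even -> {2, 4, 6}
B = {o for o in outcomes if o > 3}       # event B: roll is greater than 3 -> {4, 5, 6}

def p(event):
    """P(event) = number of outcomes in the event / number of possible outcomes."""
    return Fraction(len(event), 6)

print(p(A), p(B))              # 1/2 and 1/2
print(p(A & B))                # P(A and B) = 1/3
print(p(A) + p(B) - p(A & B))  # P(A or B) via the addition rule = 2/3
print(p(A | B))                # P(A or B) counted directly (set union) = 2/3
print(p(A & B) / p(B))         # conditional probability P(A given B) = 2/3
print(1 - p(A))                # complement rule: P(not A) = 1/2
```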
Clearly, even describing data can be more complex than many of us initially thought. It is this wide range of options that makes planning what we will do with the data so important. In this week’s discussions, we will look at these issues in two ways. First, in discussion 2, we will discuss the content of this lecture – what is unclear, what is new, surprising, etc.