Discussion Board Forum 4
Topic: Generalizing from Survey Findings
Application of Course Concepts
Using your O'Sullivan et al. text, answer the following exercises from Chapter 9:
Exercise 9.2 Section A: Getting Started
Exercise 9.2 Section B: Small Group Exercise
Exercise 9.4 Section A: Getting Started
Textbook
CHAPTER 9
Generalizing from Survey Findings
Applying Inferential Statistics
You may be so used to hearing what a poll found that you rarely stop and think about how a sample of roughly 1,000 people can accurately represent the country’s population. Similarly, you may have heard reports that researchers analyzing survey data found that “people who drink moderate amounts of red wine have fewer heart attacks” or that “senior citizens who play bridge are less likely to suffer from dementia.” In presenting the findings from sample data, pollsters and researchers rely on inferential statistics. The statistics we have presented thus far are descriptive statistics, and you can use them to summarize and describe any set of data. Inferential statistics estimate parameters and indicate if a relationship probably occurred by chance. To use inferential statistics correctly you must have data from a probability sample. Otherwise, you have to rely on descriptive statistics to present your findings and make decisions.
In this chapter we cover sampling statistics and tests of statistical significance. Sampling statistics guide decisions that allow us to infer who will win an election and how Americans feel about the economy. Next we discuss tests of significance, which help us decide if a relationship that exists in data from a sample probably occurred by chance. Tests of statistical significance cannot tell us that red wine will decrease heart attacks or that playing bridge will prevent dementia. Tests of significance simply eliminate chance as an explanation for the relationships; however, the statistical findings may stimulate more rigorous research to explain why a relationship exists and to eliminate alternative explanations.
SAMPLING STATISTICS
You should have a basic knowledge of sampling statistics to determine sample size and to interpret findings from a probability sample. To apply and interpret sampling statistics correctly you need to understand the terms parameter, sampling error, standard error, confidence interval, and confidence level. A parameter is a characteristic of the population you are studying. An example of a parameter is the percentage of all food pantry users who are unemployed. A statistic refers to a characteristic of a sample, such as the percentage of sampled pantry users who are unemployed. With statistics from probability samples, you can estimate parameters. We are sure that you do not expect a sample of 400, 1,000, or any other size to indicate the exact percentage of unemployed. Sampling error refers to the difference between a parameter and a statistic. You can use the sampling error to mathematically estimate a parameter. The sampling error is based on sample size and the sample’s standard deviation, which estimates the population’s variability.
The standard error is the standard deviation of a theoretical distribution of values of a variable. Think of it this way: if you drew an infinite number of random samples of the same size, 95 percent of them would yield a statistic, such as a mean, that is within ±1.96 standard errors of the parameter. Although we cannot draw an infinite number of samples, we know that 95 out of 100 random samples will fall within ±1.96 standard errors of the parameter; 5 out of 100 will fall outside this interval.
The confidence level is the probability that the parameter will fall within a given range. With a 95 percent confidence level, the level commonly used in social science research, you know that in 95 random samples out of 100 a parameter will fall within ± 1.96 standard errors of a statistic. We refer to this characteristic as the 95 percent confidence level. Occasionally you may see a 99 percent confidence level, which would be ±2.58 standard errors. At a 99 percent confidence level, the parameter is less likely to fall outside the confidence interval, but the confidence interval is wider.
Confidence intervals identify the range where the parameter probably falls. To determine the confidence interval you have to know the value of the standard error. Let’s imagine sampling food pantry users to estimate the percentage unemployed. If you drew a large number of probability samples of the same size and graphed the percentage unemployed for each sample, the graph would take on a bell shape. Most of the samples should cluster near its center. A few samples, however, will be located at the far ends of the graph. These samples indicate a much lower percentage of unemployed or a markedly higher percentage of unemployed. From looking at the graph you would feel confident in estimating that the parameter, the actual percentage unemployed, is somewhere in the interval where the samples are clustered. The interval is the confidence interval.
Let’s use an example to review our discussion and to help you understand the various terms and what they tell you.
Problem: An agency needs to report the average earnings of participants in a job training program after they complete the program.
Population: All participants who have completed the job training program within the past 2 years.
Sampling strategy: Construct and contact a probability sample of 100 participants and ask for each person’s current annual salary.
Finding: The annual average (mean) salary of the sampled participants is $25,000. The standard error, estimated from sample data, is $250.
Interpretation: The agency can be 95 percent confident that the average salary of all recent training graduates is between $24,510 and $25,490 (1.96 standard errors below the mean and 1.96 standard errors above the mean). In other words, the confidence interval is between $24,510 and $25,490. There is a 5 percent probability that the parameter lies outside the confidence interval.
Explanation: If all possible random samples each consisting of 100 job training participants were drawn, the average salaries found in 95 percent of the samples would fall between 1.96 standard errors below and 1.96 standard errors above the parameter, which is the true average. We do not know if any given sample falls within this confidence interval.
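To make the arithmetic concrete, here is a minimal Python sketch (an illustration, not part of the text) that computes the confidence interval from the example’s mean and standard error:

```python
# 95 percent confidence interval from a sample mean and its standard error,
# using the job training example: mean = $25,000, standard error = $250.
mean = 25_000
se = 250
z = 1.96  # multiplier for the 95 percent confidence level

lower, upper = mean - z * se, mean + z * se
print(f"95% CI: ${lower:,.0f} to ${upper:,.0f}")  # $24,510 to $25,490
```

The same two lines work for any mean and standard error; for a 99 percent confidence level, replace 1.96 with 2.58.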
Pollsters typically refer to the sampling error when they report proportional data, that is, data reporting the percentage of cases for each value or category. By convention the sampling error at the 95 percent confidence level equals 1.96 times the standard error. For example, if 60 percent of the sample had finished high school and the sampling error is 3 percent (1.96 times the standard error of 1.5 percent), you could report that you are 95 percent confident that between 57 and 63 percent of all trainees completed high school. Since you are most likely to work with proportional data, let’s examine the accuracy of the statement, “The results from the full survey of 1,000 randomly selected adults have a margin of sampling error of plus or minus 3 percentage points.”
Let’s start by calculating the sampling error for a 95 percent confidence level. The equation is

sampling error = 1.96 × √[p(1 − p)/n]

where

p = the proportion in a given category;
n = the sample size.
The largest possible sampling error assumes maximum variability, that is, a 50-50 split. So if p = 0.5, 1 − p = 0.5, and n = 1,000, the sampling error is 3.1 percent. Depending on how you will use the data, you can decide whether you estimate the sampling error for all the proportional findings using a 50-50 split or if you should calculate the error for each finding. If we used the sample’s finding of 60 percent high school graduates as the value of p, the sampling error would be 3 percent. For nominal and ordinal variables with more than two categories you can treat the values as dichotomies. For example, assume that the sample of food pantry users includes full-time workers, part-time workers, unemployed workers, and retirees. To estimate the parameter for the percentage unemployed, have p equal the percentage unemployed in the sample, and (1 − p) equal the percentage in all other categories.
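The sampling-error calculations above can be checked with a short Python sketch (an illustration, not from the text):

```python
import math

# Sampling error at the 95 percent confidence level: 1.96 * sqrt(p(1 - p)/n)
def sampling_error(p, n):
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Maximum variability (50-50 split) with n = 1,000: about 3.1 percent
print(round(100 * sampling_error(0.5, 1000), 1))

# Using the sample's 60 percent high school graduates: about 3 percent
print(round(100 * sampling_error(0.6, 1000), 1))
```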
The sampling error of 3 percent applies to the entire sample. It does not apply to a specific group within the sample. Let’s assume that you have a sample of 500 food pantry users. Let’s also assume that 75 percent of them are high school graduates. To estimate the percentage of pantry users who are high school graduates you would decide on one of the following equations to calculate the sampling error.

Equation 1: sampling error = 1.96 × √[(0.5)(0.5)/500] = 4.4 percent

Equation 2: sampling error = 1.96 × √[(0.75)(0.25)/500] = 3.8 percent

Equation 1 assumes maximum variability and gives us the largest sampling error for a sample of 500. You may prefer this equation if you are examining several characteristics of food pantry users and a rough estimate is adequate. This is often the case. On the other hand if you need a more precise estimate you may prefer Equation 2. In both cases there is a 5 percent probability that your estimate is wrong and the parameter is outside the confidence interval.
Let’s step back and use a simple example to underscore some key points. A study found that 86 percent of the 245 working mothers surveyed reported feeling stress.1 For a 95 percent confidence level the sampling error is 4.4 percent. Using the sampling error, you would estimate that between 81.6 and 90.4 percent of working mothers feel stress. The 86 percent is the statistic. You estimated that the parameter, the actual percentage of stressed working moms, falls within the confidence interval, that is, between 81.6 and 90.4 percent. With a 95 percent confidence level there is a 5 percent chance that your estimate is wrong; 5 times out of 100 a sample of 245 will either underestimate or overestimate the location of the parameter.
Sample Size
A characteristic of samples that you may wish to remember is that the larger the sample the smaller the sampling error and vice versa. To decide on the sample size, you need to decide how accurate you want the sample to be. Of course, practical considerations such as the amount of time and money available may limit sample size. The following principles highlight how accuracy, confidence level, and population variability guide decisions about sample size.
Accuracy: The greater the accuracy desired, the larger the sample needs to be.
Confidence level: The more confidence desired, the larger the sample needs to be.
Population variability: The more diverse the population, the larger the sample needs to be.
Although the accuracy improves with larger samples, the amount of improvement may become less and less. It is a classic case of diminishing returns. When the sample size is small—say 100—increasing it to 400 will greatly improve its accuracy. However, an increase from 2,000 to 2,300 will bring little improvement, although the additional cost of adding 300 respondents is likely to be the same in both cases. For this reason, a sampling error of 3 to 5 percent and a 95 percent confidence level are common in social science research.
To illustrate how to compute the sample size, assume we want to see what percentage of residents has used a food pantry within the past year. We have decided to accept a sampling error of 4 percent. The equation to find the sample size for proportional data is

n = p(1 − p) × (1.96/sampling error)²

Just as we did with estimating the sampling error we will assume maximum variability, that is, 50 percent in one category and 50 percent in the other:

n = (0.5)(0.5) × (1.96/0.04)² ≈ 600

If you have little or no information about the population, the easiest and most conservative approach is to assume that the population is split 50-50. In the above example 600 residents is the maximum sample needed for a 4 percent sampling error. If you are familiar with the population you may prefer to make a less conservative estimate of the variability. If you estimated that no more than 30 percent of the population has used a food pantry, your calculation to determine sample size with a 4 percent sampling error would be

n = (0.3)(0.7) × (1.96/0.04)² ≈ 504
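A small Python sketch (an illustration, not from the text) reproduces both sample-size calculations, assuming the standard formula n = p(1 − p)(1.96/E)², where E is the acceptable sampling error:

```python
# Sample size for proportional data at the 95 percent confidence level.
def sample_size(p, error):
    return p * (1 - p) * (1.96 / error) ** 2

print(round(sample_size(0.5, 0.04)))  # 50-50 split: about 600 residents
print(round(sample_size(0.3, 0.04)))  # 30 percent estimate: about 504 residents
```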
Another important factor in determining sample size is how much analysis you plan to do. Small samples will not withstand extensive analysis. With a small sample and several independent or control variables, you can find yourself studying individual cases—nobody wants to generalize from one or two cases. Similarly, as you study groups within a sample, subsamples will be smaller. For example, a statewide sample of 600 may be inadequate for examining differences between counties.
TESTS OF STATISTICAL SIGNIFICANCE
Students often find tests of statistical significance difficult to grasp the first time they encounter them. The word significance implies that the tests are very important. In a sense they are, yet the information they provide is modest. Let’s begin by imagining that investigators randomly selected 100 people at a rock concert, took their blood pressure, and asked questions about their lifestyle. (We will leave you to imagine the logistics required to sample and collect data from attendees at a rock concert.) If the wine drinkers had lower blood pressure, you need to ask, “Is the relationship between drinking wine and blood pressure a coincidence, and if we collected data from the entire population of concert attendees, would we have found the same relationship?”
Using the language of researchers we might ask if wine drinking and blood pressure are independent of each other or if the two variables are randomly related (same as not related). Asking if variables are randomly related or independent of one another is the same as asking “Could this relationship have occurred by chance?” If a statistical test suggests that in the population—the rock concert attendees—the relationship between wine drinking and blood pressure is nonrandom, the relationship is said to be statistically significant.
At the core of statistical significance is hypothesis testing, which has its own terminology. The terminology, which is what students often find confusing, reflects careful statistical and epistemological thinking underlying hypothesis testing.
Before we go into more detail we should let you know that respected social scientists consider statistical significance overrated, misunderstood, and frequently misused.2 We agree. Nevertheless, we cannot in good conscience completely ignore the topic: The word significance, the time spent on hypothesis testing in statistics courses, and the frequent appearances of significance tests in research reports all suggest that tests of statistical significance are extremely important. But, a statistically significant relationship may not be strong, important, or valuable. Large samples may show that trivial relationships are statistically significant. A finding of statistical significance does not tell us that the research was conducted correctly. The measures may have been unreliable or the data may be from a nonprobability sample.
Our goal is to help you correctly interpret what a test of statistical significance tells you, and what it doesn’t tell you. As part of our discussion we include the calculations for two common tests. Although you may never do the calculations yourself they may help you better understand the tests. Even though our discussion may seem terminology-heavy we will focus only on the most relevant terms.
The process for determining if two variables have a nonrandom relationship in the population has four steps:
1. State the null and alternative hypotheses.
2. Select an alpha level, that is, the amount of risk you are willing to accept that you are wrong if you reject the null hypothesis.
3. Select and compute a test statistic.
4. Make a decision.
Stating the Null Hypothesis
A hypothesis states a relationship between two variables. The null hypothesis (H0) postulates no relationship, or a random relationship, between the same two variables. The alternative hypothesis (HA), also called the research hypothesis, postulates a relationship between the variables. Let’s go back to our blood pressure study at the rock concert. Had the investigators examined the relationships between blood pressure and other variables such as exercise, diet, and medications, some relationships would have been strong, others weak. The first question they would want to ask is, “What is the probability we found a relationship by chance?” To answer this question they state the null hypothesis and the alternative hypothesis.
H0:
There is not a relationship between drinking wine and blood pressure.
HA:
The blood pressure of wine drinkers is lower than that of non–wine drinkers.
Remember the subjects represent only one sample. You must allow for the possibility that sampling error may account for the lower blood pressure of wine drinkers. Different samples may yield different results. Another sample from the same population may show no relationship between drinking wine and blood pressure.
To decide if the null hypothesis is probably true in the population, you should use a test of statistical significance. If the test suggests that the null hypothesis is probably untrue, you would reject your null hypothesis and accept the alternative hypothesis. In doing so, you risk making a mistake and accepting an untrue alternative hypothesis. Rejecting a true null hypothesis is called a Type I error. A Type I error occurs if, based on the sample, you decide that the alternative hypothesis is true in the population, when in fact it is untrue. A Type I error may be thought of as a false alarm; in other words it calls attention to a relationship that does not exist.
Alternatively, you may fail to reject an untrue null hypothesis and assume that a relationship does not exist in the population when in fact it does. In making this decision your mistake is to discount a true alternative hypothesis. Failing to reject an untrue null hypothesis is called a Type II error: based on the sample, you decided that the alternative hypothesis was untrue, when in fact it was true. The following illustrates our discussion of the two types of error.
Type I error: Based on sample data we reject the null hypothesis and accept the alternative hypothesis that drinking wine decreases blood pressure; in reality drinking wine may not be related to a decrease in blood pressure.
Type II error: Based on sample data we fail to reject the null hypothesis and assume that drinking wine is not related to blood pressure when in reality it is.
The bottom line is that a test of statistical significance helps you to decide whether your alternative hypothesis is true or untrue in the population. Based on your initial analysis, you already know if it is true or not in your sample. We should note that all a test of significance does is to say that the observed relationships probably did not occur by chance. It does not tell you if the relationship is stronger, weaker, or the same in the population. Nor does it prove that wine drinking will lower blood pressure. Factors other than drinking wine could cause the lower blood pressure. All the test of significance does is to help eliminate chance as one of the factors in accepting a hypothesis.
Selecting an Alpha Level
Traditionally, researchers select a criterion for rejecting the null hypothesis prior to starting their analysis. This criterion, referred to as the alpha (α) level, is the maximum probability you are willing to accept of rejecting a true null hypothesis. The alpha level is a number between 0 and 1. Common alpha levels for hypothesis testing are 0.05, 0.01, and 0.001.
If you decrease the probability of a Type I error you increase the probability of a Type II error. If you change α from 0.05 to 0.01 your chance of a Type I error goes from 5 percent to 1 percent, but at the same time the probability of missing a true alternative hypothesis (Type II error) increases. The only way to decrease both types of error at the same time is to increase the sample size.
In selecting an alpha level you should consider the sample size, the strength of the relationship, and the practical consequences of committing a Type I error or a Type II error. If alpha is set at 0.05 a sample of 1,300 will detect a slight difference or effect 95 percent of the time, and a sample of 50 will detect a moderate effect only 46 percent of the time.3 Tables 9.1a, 9.1b, and 9.1c are three hypothetical tables linking wine drinking and blood pressure. Given 0.05 as the alpha level, the relationship between the variables in Tables 9.1a and 9.1c is statistically significant; the one in Table 9.1b is not.
TABLE 9.1A
Blood Pressure by Wine Drinking

Blood Pressure   Number of Respondents That Drink Wine   Number of Respondents That Do Not Drink Wine
High                             20                                        22
Normal or low                    40                                        18

TABLE 9.1B
Blood Pressure by Wine Drinking

Blood Pressure   Number of Respondents That Drink Wine   Number of Respondents That Do Not Drink Wine
High                             20                                        20
Normal or low                    40                                        20

TABLE 9.1C
Blood Pressure by Wine Drinking

Blood Pressure   Number of Respondents That Drink Wine   Number of Respondents That Do Not Drink Wine
High                            100                                       100
Normal or low                   200                                       100
How do these relationships differ? The sample in Table 9.1c is five times larger than that in Table 9.1b. Table 9.1a shows a stronger relationship than Table 9.1b. All three tables would be statistically significant if we had selected 0.10 as the alpha level. The following summarizes the factors that affect whether a relationship is found to be statistically significant.
The larger the sample the more likely you are to find statistical significance.
The smaller the sample the less likely you are to find statistical significance.
The stronger the relationship the more likely you are to find statistical significance.
The higher the alpha level the more likely you are to find statistical significance.
The lower the alpha level the less likely you are to find statistical significance.
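To see how sample size and strength of relationship play out, the chi-square statistics for Tables 9.1a–9.1c can be computed by hand; the Python sketch below (an illustration, not from the text) does the arithmetic for a 2 × 2 table:

```python
# Chi-square for a 2x2 table: compare observed cell counts with the counts
# expected if the row and column variables were independent.
def chi_square_2x2(a, b, c, d):
    # Cell layout: [[a, b], [c, d]] (rows: high / normal-low blood pressure;
    # columns: drink wine / do not drink wine)
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,  # row 1, column 1
        (a + b) * (b + d) / n,  # row 1, column 2
        (c + d) * (a + c) / n,  # row 2, column 1
        (c + d) * (b + d) / n,  # row 2, column 2
    ]
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Critical values for 1 degree of freedom: 3.84 (alpha = 0.05), 2.71 (alpha = 0.10)
print(round(chi_square_2x2(20, 22, 40, 18), 2))      # Table 9.1a
print(round(chi_square_2x2(20, 20, 40, 20), 2))      # Table 9.1b
print(round(chi_square_2x2(100, 100, 200, 100), 2))  # Table 9.1c
```

Table 9.1b’s statistic (about 2.78) exceeds the 0.10 critical value of 2.71 but not the 0.05 critical value of 3.84, which is why it is statistically significant only at the higher alpha level.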
Selecting and Computing a Test Statistic
In this text we examine two common tests of statistical significance: chi-square (χ²), a statistic for nominal-level data usually applied to contingency tables,4 and the t-test, a statistic that compares the differences between the means of two groups.
Chi-square: A chi-square test compares the frequencies in a contingency table with the frequencies that would be expected if the relationship between variables is random in the population. Table 9.2a tests a hypothesis that the job training programs have different outcomes. The table compares participants in three job training programs and their outcomes. Each cell contains the frequencies observed (fo) in the collected data.
TABLE 9.2A
Frequencies Observed: Outcomes by Type of Training Program
TABLE 9.2B
Frequencies Expected: Outcomes if Type of Training Program Has No Effect
Table 9.2b shows what the data would look like if there was no relationship between a program and trainee outcomes; each cell contains the frequencies expected (fe) if the null hypothesis were true.
The frequency, fe, for each cell is the column total multiplied by the row total divided by the sample size. For example 31.9 = (64 × 292)/586.
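The expected-frequency rule can be verified directly with a one-line Python check (an illustration, not from the text):

```python
# Expected cell frequency under the null hypothesis:
# fe = (column total * row total) / sample size
col_total, row_total, n = 64, 292, 586  # values from the chapter's example
fe = col_total * row_total / n
print(round(fe, 1))  # 31.9
```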
Typically software programs calculate chi-square using the equation

χ² = Σ [(fo − fe)² / fe]

and report the value of p, the associated probability. The associated probability is the actual level at which the value of a statistical test is significant. The associated probability may be lower, higher, or the same as alpha. In this example χ² = 50.67. The associated probability may be reported as 0.0000. There is less than 1 chance in 1,000 that you would obtain a χ² equal to 50.67 if the variables were independent, that is, not related. You might report your finding in a table footnote, “χ² = 50.67, p < 0.001,” or in a sentence, “The type of job training is related to what trainees are currently doing (χ² = 50.67, p < 0.001).” The important piece of information is “p < 0.001,” which provides strong evidence that the relationship is probably not random. Citing the statistic (χ²) allows readers trained in statistics to determine whether you used the appropriate statistical test. Some researchers also report degrees of freedom (df), which enables reviewers to visualize how the data were analyzed.5
Chi-square has two characteristics that you want to keep in mind. First, as a nominal statistic, it does not provide information on the direction of any relationship. In our example, chi-square indicates that the relationship between type of training program attended and current status is probably nonrandom. It does not indicate which program is the most effective. Second, the numerical value of chi-square tends to increase as the sample size increases. Thus the chi-square value is partially a product of sample size. It does not directly measure the strength of the association between variables and should not be used as a measure of association.
t-tests: The t-test is a statistic for ratio data. It tests hypotheses that compare the means of two groups. You can test hypotheses that one group’s mean is higher than another group’s, in which case you use a one-tailed test. Alternatively, you can test hypotheses that the group means are different, in which case you use a two-tailed test. In our example, if we measured actual blood pressure the hypothesis that wine drinkers have lower average blood pressure than non–wine drinkers would require a one-tailed test. The following example hypotheses illustrate two- and one-tailed t-tests.
HA:
The average salary of male trainees and female trainees differs (use a two-tailed test).
H0:
The average salary of male and female trainees is the same.
HA:
The average salary of male trainees is higher than the average salary of female trainees (use a one-tailed test).
H0:
The average salary of male trainees is the same as or less than the average salary of female trainees.
A two-tailed test does not specify whether men or women earn more. If you found that the average salary of female trainees is greater than that of male trainees, you would reject the null hypothesis. In a one-tailed test, the null hypothesis is expanded to include a finding in the “wrong” direction. Thus, if the t-test implied that the average salary of female trainees was more than that of male trainees, you would not reject the null hypothesis.
For t-tests comparing two groups you can assume either equal or unequal variation in the population; the population variation is estimated by the standard deviation. The equation for unequal variation can serve as a default option; it is more conservative and produces slightly higher associated probabilities. To illustrate a t-test, we assume unequal variations of males’ and females’ salaries.
For the Male Sample:   n1 = 403   Mean1 = $17,095   Standard deviation (s1) = $6,329
For the Female Sample: n2 = 132   Mean2 = $14,885   Standard deviation (s2) = $4,676
A software program may calculate the value of t using the equation

t = (Mean1 − Mean2) / √(s1²/n1 + s2²/n2)

and report its significance level. In this example, t = 4.28. Its associated probability may be reported as 0.0000. Reports normally include the value of t. (If t is at least 2, then p ≤ 0.05 for either a one-tailed or a two-tailed test.) You may report the findings in a sentence, “The average salary of male trainees is higher than that of female trainees (t = 4.28, p < 0.001),” in a table with the value of t in one column and the associated probability in the next, or in a table footnote. In the case of table footnotes a common practice is to place asterisks next to the t values and give the alpha level in a table footnote. The relationship between the number of asterisks and the value of p varies from author to author. Usually, the more asterisks, the lower the value of p (you will see an example of this in Exercise 9.4). Students who have studied statistics may wonder why we do not use the normal distribution (z-scores) to test hypotheses about sample means. To use z-scores properly, the population variance must be known. If the population variance is not known, it is estimated by the standard deviation, in which case t-tests rather than z-scores are appropriate. Although using z-scores introduces little error with larger samples (n > 60), social scientists tend to rely on t-tests.
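The t value can be reproduced from the summary statistics; the Python sketch below (an illustration, not from the text) applies the unequal-variance formula to the trainee data:

```python
import math

# t statistic assuming unequal variances, using the trainee salary data:
# t = (Mean1 - Mean2) / sqrt(s1^2/n1 + s2^2/n2)
n1, mean1, s1 = 403, 17_095, 6_329   # male sample
n2, mean2, s2 = 132, 14_885, 4_676   # female sample

t = (mean1 - mean2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(t, 2))  # close to the 4.28 reported in the text
```

The small difference from the reported 4.28 comes from rounding in the published summary statistics.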
Making a Decision
If your significance statistic yields a low associated probability you can assume that the null hypothesis of no relationship is untrue. In our examples you would reject the null hypotheses and accept the alternative hypotheses that different job-training programs achieved different outcomes and that male trainees earned more than female trainees. The associated probability does not indicate the probability of a Type I error, nor does it imply that a relationship is more significant or stronger. Its contribution is more modest. It indicates the probability of a specific χ2 or t-value occurring if the null hypothesis is true. The relationship may be weaker, or stronger, than the one found in the sample data.
The findings result from many decisions, including whom to sample, how to sample them, what information to gather, how to gather it, and when to gather it. Remember that you are applying a statistical test to a set of numbers. Statistical significance cannot make up for a flawed design. You will obtain an answer even if your sample was biased and your measures unreliable. Even if the methodological decisions are sound and the statistics are applied correctly, the data still represent a single sample out of all the possible samples from the population. A test of significance alone should not bear the burden of demonstrating the truth of a hypothesis. It is far more realistic and reasonable to consider each statistical finding as part of a body of evidence supporting the truth or error of a hypothesis.
If you fail to reject your null hypothesis you do not have irrefutable evidence that the null hypothesis is true—a different sample or a larger sample might yield different results.
ALTERNATIVES TO TESTS OF STATISTICAL SIGNIFICANCE
Social scientists have recommended four alternatives to the traditional significance test. First, the null hypothesis can include a specific difference. Let’s go back to an earlier example. We will assume that administrators do not want to fund more on-the-job training programs unless they place at least 10 percent more of their trainees than other training programs. To test this preference the null hypothesis and the alternative hypothesis would be stated as
HA:
On-the-job training programs place at least 10 percent more of their trainees than other training programs.
H0:
On-the-job training programs do not place at least 10 percent more of their trainees than other training programs.
Second, you should consider reporting confidence intervals. Proponents argue that confidence intervals are more informative than a test of significance. They provide information on the value of parameters, differences between them, and the direction of the differences. The confidence interval avoids implying that the sample means, for example, provide a precise estimate of the differences. Rather the population means probably fall somewhere within the range indicated by the confidence intervals.
Third, you may include the measure of association. Statistics such as r and r2, not tests of statistical significance, measure the strength of a relationship.
The fourth alternative is replicating studies. The importance of replication is underscored by the following quote, “the results from an unreplicated study, no matter how statistically significant … are necessarily speculative…. Replications play a vital role in safeguarding the empirical literature from contamination from specious results.”6 Replications do not have to duplicate previous research exactly. Investigators may implement the research using a different population or another setting and see if the findings generalize to other populations or settings. Findings that go in the same direction, whether or not they are statistically significant, or that have overlapping confidence intervals provide more support than a simple significance test.
With the exception of measuring the size of the effect of the independent variable on the dependent variable, each alternative has constraints. You may have inadequate knowledge to specify a relationship beyond anticipating some difference. Confidence intervals work well with interval data, but confidence intervals for proportional data are less informative. Replication requires the opportunity and resources to repeat a study.
CONCLUDING OBSERVATIONS
Sampling statistics and tests of statistical significance are inferential statistics; that is, they let us infer something about a population from a probability sample. Sampling statistics allow us to estimate the value of a population characteristic, the parameter. Tests of statistical significance, when applied to relationships between variables, simply indicate the probability of finding the observed relationship in sample data even if no relationship exists in the population. Sampling statistics are straightforward and less subject to misinterpretation. Consider the statement “a survey of River Dale residents, which had a 6 percent sampling error, found that 36 percent of the respondents used a food pantry last year.” We would estimate that between 30 and 42 percent of River Dale residents used a food pantry last year. Because the surveyors used a 95 percent confidence level, which is almost always used in social science research, 5 times out of 100 the true percentage will lie outside the 30 to 42 percent range.
Tests of statistical significance are trickier. The term significance may imply that a relationship is important or strong. One may erroneously assume that a strong sample relationship indicates a strong relationship in the population or that an intervention caused the observed outcome. None of these conclusions necessarily follows: the relationship may be unimportant or weak in the population, and the intervention may not have caused the outcome. The current trend is to put less emphasis on a single statistic; instead, researchers report confidence intervals and effect sizes (such as r2).
Both tests of statistical significance and sampling statistics may give a false impression of certainty. Neither can compensate for invalid or unreliable measures or sloppy data collection. Even properly drawn samples are subject to error. However, tests of significance do have value in that they provide evidence to interpret research findings.
RECOMMENDED RESOURCES
Fink, Arlene, How to Conduct Surveys: A Step-by-Step Guide, Fourth Edition (Thousand Oaks, CA: Sage Publications, Inc., 2009). See especially Chapter 6.
O’Sullivan, E., G. Rassel, and M. Berner, Research Methods for Public Administrators, Fifth Edition (New York: Pearson/Longman, 2008). Chapter 12.
Rumsey, D., Statistics for Dummies (Hoboken, NJ: Wiley Publishing, Inc., 2009).
Salkind, Neal, Statistics for People Who (Think They) Hate Statistics, Fourth Edition (Thousand Oaks, CA: Sage Publications, Inc., 2010).
CHAPTER 9 EXERCISES
There are four sets of exercises for Chapter 9.
• Exercise 9.1 Reviewing Polling Data reports a poll on racial discrimination. This exercise is designed to give you practice in computing and interpreting sampling statistics.
• Exercise 9.2 Attitudes toward Corporal Punishment: Are Men and Women Different? examines data to compare men’s and women’s attitudes about spanking. This exercise is designed to give you practice in interpreting a contingency table and a test of statistical significance.
• Exercise 9.3 What Is Going On in the Schools? considers a study to see if African American and Hispanic students are more likely to be suspended than other students. This exercise is designed to give you practice in interpreting tests of statistical significance.
• Exercise 9.4 How Groups Work Together presents data comparing perceived characteristics of collaborations formed around women’s issues and environmental issues. This exercise is designed to give you practice in interpreting tables that report the results of t-tests.
EXERCISE 9.1 Reviewing Polling Data
Scenario
You are a member of a community action group that focuses on various issues that affect community life. As you are perusing the Web for data on incidents of racial discrimination, you find a Washington Post–ABC poll of 1,079 randomly selected Americans. Two hundred and four of the poll respondents were African Americans.
Section A: Getting Started
1. The following data were reported in answer to the question “How big a problem is racism in our society today?”
Of all respondents: 26% a “big problem,” 22% a “small problem”
Of African American respondents: 44% a “big problem,” 11% a “small problem”
Of White respondents: 22% a “big problem,” 23% a “small problem”
a. Use the reported statistics, for example, 26%, to estimate p and 1 – p. Compute the sampling error for the entire sample, the African American sample, and the White sample. (Assume that the number of Whites is the same as the number of non–African Americans.)
b. Use the sampling error and report the confidence interval for the percentage of African Americans and Whites who believed that racism was a big problem.
c. What is the probability that your estimates in 1b are wrong? How did you arrive at this estimate?
2. Assume maximum variability (50-50 split).
a. Compute the sampling errors for the entire sample and for the 204-member African American sample.
b. As a general practice, would you analyze survey data using a 50-50 split or would you use the statistical analysis (question 1) to come up with more precise estimates? Justify your answer.
3. The survey also reported that among African American respondents 60 percent had personally felt that “a shopkeeper or sales clerk was trying to make” them feel unwelcome. You are curious if the same thing is true in your community. In trying to decide a value of p would you use 0.60, 0.50, or something else? Justify your answer.
4. The community action group considers replicating parts of a survey. What size sample is needed to have 2 percent, 5 percent, or 10 percent accuracy? (Note that accuracy is the same as sampling error.)
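Questions 1, 2, and 4 above all turn on two standard formulas: the sampling error of a proportion and the sample size needed for a given accuracy. A minimal sketch in Python, assuming the conventional 95 percent z-value of 1.96 throughout:

```python
from math import ceil, sqrt

Z95 = 1.96  # z-value for the 95 percent confidence level

def sampling_error(p, n, z=Z95):
    """Margin of error for a sample proportion p with sample size n."""
    return z * sqrt(p * (1 - p) / n)

def sample_size(error, p=0.5, z=Z95):
    """Smallest n giving the desired margin of error (p = 0.5 is the
    worst case); rounding guards against floating-point noise."""
    return ceil(round(p * (1 - p) * (z / error) ** 2, 6))

# Question 1a: 26% of all 1,079 respondents called racism a "big problem"
print(f"{sampling_error(0.26, 1079):.1%}")   # about 2.6%

# Question 2a: maximum variability (50-50 split) for the 204 African
# American respondents
print(f"{sampling_error(0.50, 204):.1%}")    # about 6.9%

# Question 4: sample sizes for 2%, 5%, and 10% accuracy
print([sample_size(e) for e in (0.02, 0.05, 0.10)])  # [2401, 385, 97]
```

Note how the required sample size grows roughly with the square of the desired accuracy: halving the sampling error takes about four times as many respondents.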
EXERCISE 9.2 Attitudes toward Corporal Punishment: Are Men and Women Different?
Scenario
A child care organization commissioned a random survey to identify attitudes toward child-raising. A topic of interest was the difference between the beliefs of men and women regarding discipline. The following table reports data on men’s and women’s attitudes toward spanking.
Attitudes toward Spanking by Respondent Gender

Favor Spanking      Male    Female
Strongly             115       107
Somewhat             212       221
Oppose                73       109
Strongly oppose       18        42

Chi-square = 13.1, degrees of freedom = 3, significance = 0.004.
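The reported chi-square statistic can be reproduced from the cell counts alone. A minimal sketch in Python:

```python
# Observed counts from the spanking table: (male, female) per attitude
observed = {
    "Strongly":        (115, 107),
    "Somewhat":        (212, 221),
    "Oppose":          (73, 109),
    "Strongly oppose": (18, 42),
}

male_total = sum(m for m, f in observed.values())     # 418
female_total = sum(f for m, f in observed.values())   # 479
grand_total = male_total + female_total               # 897

chi_square = 0.0
for male, female in observed.values():
    row_total = male + female
    for obs, col_total in ((male, male_total), (female, female_total)):
        # Expected count if gender and attitude were independent
        exp = row_total * col_total / grand_total
        chi_square += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (2 - 1)   # (rows - 1) x (columns - 1)
print(f"chi-square = {chi_square:.1f}, df = {df}")   # 13.1, df = 3
```

The statistic sums, over every cell, the squared gap between observed and expected counts relative to the expected count; large values mean the pattern is unlikely under independence.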
Section A: Getting Started
1. State the alternative hypothesis and the null hypothesis that could be tested with the data in this table.
2. Identify the independent and the dependent variables.
3. Calculate percentages and include them in a table. Write a sentence to describe the relationship shown in the table. Do the data in the table support or contradict your hypothesis? Explain.
4. Based on the chi-square evidence what would you do, that is, would you reject the null hypothesis?
Section B: Small Group Exercise
1. What are the implications of the findings presented in exercise 9.2? Do you consider the table an interesting observation, a question for further study, or something else?
2. A finding of statistical significance can be persuasive. What other evidence should the child care organization present so that parents and other stakeholders are able to evaluate the findings?
EXERCISE 9.3 What Is Going On in the Schools?
Scenario
A community action group has heard complaints that African American and Hispanic students are more likely to be suspended (either in-school or out-of-school) than other students. The superintendent of schools offers to review the files of students in grades 9–12. The school system has 39,000 students in grades 9–12: 58 percent African American, 12 percent Hispanic, and 25 percent White.
Section A: Getting Started
1. What target population would you recommend the superintendent use? Why did you recommend this population? (Note that target population refers to the specific population that the data will represent.)
2. State the alternative hypothesis and the null hypothesis the superintendent should test.
3. If the superintendent tests the hypothesis and makes a Type I error, explain what has happened.
4. If the superintendent tests the hypothesis and makes a Type II error, explain what has happened.
5. Should the superintendent be more concerned about a Type I error or a Type II error? Justify your answer.
6. The superintendent originally set α = 0.05. How can she further decrease the probability of a Type I error? How can she further decrease the probability of a Type II error?
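The trade-off raised in question 6 can be made concrete with a power calculation for a one-sided z-test of a proportion. The numbers below (a hypothetical true effect of 5 percentage points and samples of 400 and 1,600) are invented for illustration, not the school district's data.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def type2_error(alpha, n, effect=0.05, p=0.5):
    """Beta (Type II error probability) for a one-sided z-test of a
    proportion, assuming the true value is p + effect."""
    se = sqrt(p * (1 - p) / n)
    z_crit = norm.inv_cdf(1 - alpha)   # cutoff for rejecting H0
    # Probability the test statistic falls below the cutoff
    # even though the effect is real
    return norm.cdf(z_crit - effect / se)

# Lowering alpha (fewer Type I errors) raises beta (more Type II errors)
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  beta = {type2_error(alpha, n=400):.2f}")

# Increasing the sample size lowers beta at the same alpha
print(f"n = 1600 at alpha = 0.05: beta = {type2_error(0.05, n=1600):.2f}")
```

The pattern is general: tightening alpha alone shifts risk from Type I to Type II errors, while a larger sample reduces the Type II error rate without loosening alpha.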
Section B: Small Group Exercise
1. Decide on a target population and suggest a sample size for the superintendent’s study. Justify your recommendation.
2. Discuss what actions the superintendent might take if the null hypothesis is rejected.
3. List arguments for the position
a. Committing a Type I error is the more serious concern.
b. Committing a Type II error is the more serious concern.
4. Based on what you have observed in Exercises 9.2 and 9.3 draft a memo “What you want to know about statistical significance: A guide for citizens.”
EXERCISE 9.4 How Groups Work Together
Scenario
The following table was created as part of a study of collaborations formed around women’s issues and environmental issues. Members answered a series of questions to see if the coalitions were different. Each question was answered along a scale ranging from 1 = Not at all true to 7 = To a great extent true.
Two-tailed t-test: * p < 0.05, ** p < 0.01, *** p < 0.001.
Section A: Getting Started
1. In plain English explain what information the table contains.
2. In plain English interpret the statistical information for the last line (“My organization can count on each partner to meet its obligations.”).
3. How do collaborations focused on women’s issues differ from collaborations focused on environmental issues? What criteria did you use to make your choices?
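For readers who want to see where the t statistics in such a table come from, here is a minimal sketch using Welch's two-sample t on invented 1-to-7 ratings (not the study's data).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical 1-7 ratings (invented) for one survey item, e.g.
# "My organization can count on each partner to meet its obligations."
women_issues = [6, 5, 7, 6, 5, 6]
environment = [4, 5, 3, 4, 5, 4]

def welch_t(a, b):
    """Two-sample t statistic allowing unequal variances (Welch)."""
    return (mean(a) - mean(b)) / sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )

t = welch_t(women_issues, environment)
print(f"t = {t:.2f}")   # larger |t| -> difference less likely due to chance
```

Each t value in such a table is the gap between the two group means divided by the standard error of that gap; the asterisks translate the t value into a p-value for the corresponding degrees of freedom.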
NOTES
1Parker, Kim, “The harried life of the working mother,” Social and Demographic Trends, Pew Research Center, October 1, 2009. Posted at http://pewsocialtrends.org/pubs/745/the-harried-life-of-the-working-mother#prc-jump. Accessed January 27, 2010.
2Cohen, J., “Things I have learned (so far),” American Psychologist (1990), 45:1304–1312 provides an accessible discussion of the limitations of tests of significance. Gill, J., “The insignificance of null hypothesis significance testing,” Political Research Quarterly (1999), 52:647–674, and Henson, R. K., “Book Review: Beyond significance testing: Reforming data analysis methods in behavioral research,” Applied Psychological Measurement (2006), 30:452–455, both summarize the basic arguments and give extensive references. We have avoided addressing the arguments because both our position and theirs should lead you to avoid overvaluing the results and to instead focus on the strength of relationships and the quality of the research methodology.
3Cohen, J., “Things I Have Learned (So Far)” (p. 1308). This discussion is based on the concept of power, the probability of correctly rejecting a false null hypothesis. To determine a study’s power one needs to have specified the sample size, alpha level, and effect size. We believe that our discussion, which underscores the roles of sample size and effect size, is adequate for most readers.
4Chi-square can also be used to determine goodness of fit; here, we have limited our discussion to the use of chi-square to test for independence between variables.
5The value of most tests of statistical significance is affected by sample size, degrees of freedom, or both. For chi-square the degrees of freedom (df) are based on the number of rows and columns in the table.
6Hubbard, R., and Ryan, P. A., “The historical growth of statistical significance testing in psychology—And its future prospects,” Educational and Psychological Measurement (2000), 60:661–681.