Histograms And Descriptive Statistics
Your first IBM SPSS assignment includes two sections in which you will:
1. Create two histograms and provide interpretations.
2. Calculate measures of central tendency and dispersion and provide interpretations.
Key Details and Instructions
• Submit your assignment as a Word document.
• Begin your assignment by creating a properly formatted APA title page. Include a reference list at the
end of the document if necessary. On page 2, begin Section 1.
• Write your report in narrative format, integrating your SPSS output charts and tables with your
responses to the specific requirements listed for this assignment. (See the Copy/Export Output
Instructions in the Resources area.)
• Label all tables and graphs in a manner consistent with APA style and formatting guidelines. Citations,
if needed, should be included in the text as well as in a reference section at the end of the report.
• Refer to the IBM SPSS Step-By-Step Guide: Histograms and Descriptive Statistics (in the Resources
area) for additional help in completing this assignment.
Section 1: Histograms and Visual Interpretation
Section 1 will include one histogram of total scores for all the males in the data set, and one histogram of
total scores for all the females in the data set.
Using the total and gender variables in your grades.sav data set, create two histograms:
• A histogram for male students.
• A histogram for female students.
Copy the histogram output from SPSS and paste it into a Word document. Below the histograms in your
Word document, provide an interpretation based on your visual inspection. Correctly use all of the following
terms in your discussion:
• Skew.
• Kurtosis.
• Outlier.
• Symmetry.
• Modality.
Comment on any differences between males and females regarding their total scores. Analyze the strengths
and limitations of visually interpreting histograms.
Section 2: Calculate and Interpret Measures of Central Tendency and Dispersion
Using the grades.sav file, compute descriptive statistics, including mean, standard deviation, skewness, and
kurtosis for the following variables:
• id
• gender
• ethnicity
• gpa
• quiz3
• total
Copy the descriptives output from SPSS and paste it into your Word document. Below the descriptives
output table in your Word document:
• Indicate which variables are meaningless to interpret in terms of mean, standard deviation, skewness,
and kurtosis. Justify your decision.
• Next, indicate which variables are meaningful to interpret. Justify your decision.
• For the meaningful variables, do the following:
◦ Specify any variables that are in the ideal range for both skewness and kurtosis.
◦ Specify any variables that are acceptable but not excellent.
◦ Specify any variables that are unacceptable.
◦ Explain your decisions.
• For all meaningful variables, report and interpret the descriptive statistics (mean, standard deviation,
skewness, and kurtosis).
Submit both sections of your assignment as an attached Word document.
Resources
Histograms and Descriptive Statistics Scoring Guide.
• Includes all relevant output; no irrelevant output is included. No errors in SPSS output.
• Evaluates the concepts of skew, kurtosis, outliers, symmetry, and modality for two histograms.
• Analyzes the strengths and limitations of examining a distribution of scores with a histogram.
• Includes all relevant output; no irrelevant output is included. No errors in SPSS output.
• Evaluates meaningful versus meaningless variables reported in descriptive statistics.
• Evaluates descriptive statistics.
• Without exception, communicates in a manner that is scholarly, professional, and consistent with the expectations for members
in the identified field of study.
IBM SPSS Step-by-Step Guide: Histograms and Descriptive Statistics.
Copy/Export Output Instructions.
In Unit 1, you read about the difference between descriptive statistics and inferential statistics in Chapter 1 of your Warner text. For the next two units, we will focus on the theory, logic, and application of descriptive statistics. This introduction focuses on scales of measurement, measures of central tendency and dispersion, the visual inspection of histograms, and the detection and processing of outliers.
An important concept in understanding descriptive statistics is the scales of measurement. The Warner (2013) text defines four scales of measurement—nominal, ordinal, interval, and ratio:
• Nominal data refer to numbers arbitrarily assigned to represent group membership, such as gender (male = 1; female = 2). Nominal data are useful in comparing groups, but they are meaningless in terms of measures of central tendency and dispersion.
• Ordinal data represent ranked data, such as coming in first, second, or third in a marathon. However, ordinal data do not tell us how much of a difference there is between measurements. The first-place and second-place finishers could finish 1 second apart, whereas the third-place finisher arrives 2 minutes later. Ordinal data lack equal intervals.
• Interval data refer to equal intervals between data points. An example is degrees measured in Fahrenheit. Interval data lack a "true zero" value: 0 degrees Fahrenheit does not represent an absence of temperature (water freezes at 32 degrees Fahrenheit, not at 0).
• Ratio data do have a true zero, such as heart rate, where "0" represents a heart that is not beating. This is often seen as "count" data in social research. For example, how many days did an employee miss from work? Zero is a meaningful unit in this example.
These four scales of measurement are routinely reviewed in introductory statistics textbooks as the classic way of differentiating measurements. However, the boundaries between the measurement scales are fuzzy. For example, is intelligence quotient (IQ) measured on the ordinal or interval scale? Recently, researchers have argued for a simpler dichotomy in terms of selecting an appropriate statistic: categorical versus continuous measures.
• A categorical variable is a nominal variable. It simply categorizes things according to group membership (for example, apple = 1, banana = 2, grape = 3).
• A continuous measure represents a difference in magnitude of something, such as a continuum of "low to high" statistics anxiety. In contrast to categorical variables designated by arbitrary values, a continuous measure allows for a variety of comparisons and arithmetic operations, including equal (=), less than (<), greater than (>), addition (+), subtraction (−), multiplication (* or ×), and division (/ or ÷). These operations generate a variety of descriptive statistics discussed next.
Measures of Central Tendency and Dispersion
Chapter 2 of Warner (2013) reviews descriptive statistics that measure central tendency (mean, median, mode) and dispersion (range, sum of squares, variance, standard deviation). To visualize central tendency and dispersion, refer to Figure 2.5 on page 46 of the Warner text for an illustration of how heart rate data are represented in a histogram. The horizontal axis represents heart rate ("hr"). The vertical axis represents the total number of people who were recorded at a particular heart rate ("Frequency"). Measures of centrality summarize where data clump together at the center of a distribution of scores. (For example, in Figure 2.5 this occurs around hr = 74.)
Unit 2 - Descriptive Statistics: Theory and Logic INTRODUCTION
To simplify, consider the following measured heart rates: 65, 70, 75, 75, 130.
The simplest measure of central tendency is the mode. It is the most frequent score within a distribution of scores (for example, two scores of hr = 75). Technically, in a distribution of scores, you can have two or more modes. An advantage of the mode is that it can be applied to categorical data. It is also not sensitive to extreme scores.
The median is the geometric center of a distribution because of how it is calculated. All scores are arranged in ascending order. The score in the middle is the median. In the five heart rates above, the middle score is a 75. If you have an even number of scores, the average of the two middle scores is used. The median also has the advantage of not being sensitive to extreme scores.
The mean is probably what most people consider to be an average score. In the example above, the mean heart rate is (65 + 70 + 75 + 75 + 130) ÷ 5 = 83. Although the mean is more sensitive to extreme scores (such as 130) relative to the mode and median, it can be more stable across samples, and it is the best estimate of the population mean. It is also used in many of the inferential statistics studied in this course, such as t tests and analysis of variance (ANOVA).
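The three measures of central tendency for the five heart rates can be checked with a short script. This is a minimal sketch in Python rather than SPSS; the variable names are illustrative only and do not come from the course materials.

```python
from statistics import mean, median, multimode

# The five heart rates used in the example above.
heart_rates = [65, 70, 75, 75, 130]

# multimode returns every most-frequent score, so a distribution
# with two or more modes would yield a longer list.
modes = multimode(heart_rates)
middle = median(heart_rates)   # middle score after sorting
average = mean(heart_rates)    # sum of scores divided by N

print("mode(s):", modes)
print("median:", middle)
print("mean:", average)
```

Note how the extreme score of 130 pulls the mean above the mode and the median, which both stay at 75.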
In contrast to measures of central tendency, measures of dispersion summarize how far apart data are spread on a distribution of scores. The range is a basic measure of dispersion quantifying the distance between the lowest score and the highest score in a distribution (for example, 130 − 65 = 65). A deviance represents the difference between an individual score and the mean. For example, the deviance for the first heart rate score (65) is 65 − 83, which is −18. By calculating the deviance for each score above from a mean of 83, we arrive at −18, −13, −8, −8, and +47. Summing all of the deviances equals 0, which is not a very informative measure of dispersion.
A somewhat more informative measure of dispersion is sum of squares (SS), which you will see again in Units 9 and 10 in the study of analysis of variance (ANOVA). To get around the problem of summing to zero, the sum of squares involves calculating the square of each deviation and then summing those squares. In the example above, SS = [(−18)² + (−13)² + (−8)² + (−8)² + (+47)²] = [(324) + (169) + (64) + (64) + (2209)] = 2830. The problem with SS is that it increases as data points increase (Field, 2013), and it still is not a very informative measure of dispersion.
This problem is solved by next calculating the sample variance (s²), which is the average distance between the mean and a particular score (squared). Instead of dividing SS by 5 for the example above, we divide by N − 1, or 4; see pages 56–57 of your Warner text for an explanation. The variance is therefore SS ÷ (N − 1), or 2830 ÷ 4 = 707.5. The problem with interpreting variance is that it is the average distance of "squared units" from the mean. What is, for example, a "squared" heart rate score?
The final step is calculating the sample standard deviation (s), which is simply the square root of the sample variance, or in our example, √707.5 = 26.60. The sample standard deviation represents, roughly, the average deviation of scores from the mean. In other words, the typical distance of heart rate scores from the mean is about 26.6 beats per minute. If the extreme score of 130 is replaced with a score closer to the mean, such as 90, then s = 9.35. Thus, small standard deviations (relative to the mean) represent a small amount of dispersion; large standard deviations (relative to the mean) represent a large amount of dispersion (Field, 2013). The standard deviation is an important component of the normal distribution.
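The chain from deviations to sum of squares, sample variance, and sample standard deviation can be reproduced step by step. The following is a sketch in Python, not SPSS output; the variable names are illustrative, and the last lines cross-check the hand calculation against the standard library's stdev function, which also divides by N − 1.

```python
from math import sqrt
from statistics import mean, stdev

heart_rates = [65, 70, 75, 75, 130]
m = mean(heart_rates)

# Deviation of each score from the mean; these always sum to zero.
deviations = [x - m for x in heart_rates]

# Sum of squares (SS): square each deviation, then add them up.
ss = sum(d ** 2 for d in deviations)

# Sample variance divides SS by N - 1 rather than N.
sample_variance = ss / (len(heart_rates) - 1)

# Sample standard deviation is the square root of the variance.
sample_sd = sqrt(sample_variance)

print("deviations:", deviations)
print("SS:", ss)
print("variance:", sample_variance)
print("SD:", round(sample_sd, 2))

# Replacing the extreme score of 130 with 90 shrinks the SD sharply.
adjusted_sd = stdev([65, 70, 75, 75, 90])
print("SD with 90 instead of 130:", round(adjusted_sd, 2))
```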
Visual Inspection of a Distribution of Scores
An assumption of the statistical tests that you will study in this course is that the scores for a dependent variable are normal (or approximately normal) in shape. This assumption is first checked by examining a histogram of the distribution. Figure 4.19 in the Warner text (p. 147) represents a distribution of heart rate scores that are approximately normal in shape and visualized in terms of a bell-shaped curve. Notice that the tails of the distribution are approximately symmetrical, meaning that they are near mirror images to the left and right of the mean. This distribution technically has two modes at hr = 70 and hr = 76, but the close proximity of these modes suggests a unimodal distribution.
Departures from normality and symmetry are assessed in terms of skew and kurtosis. Skewness is the extent to which a distribution deviates from symmetry around the mean. A distribution that is positively skewed has a longer tail extending to the right (the "positive" side of the distribution) as shown in Figure 4.20 of the Warner text (p. 148). A distribution that is negatively skewed has a longer tail extending to the left (the "negative" side of the distribution) as shown in Figure 4.21 of the Warner text (p. 149). In contrast to skewness, kurtosis is defined as the peakedness of a distribution of scores. Figure 4.22 of the Warner text (p. 150) illustrates a distribution with normal kurtosis, negative kurtosis (a "flat" distribution; platykurtic), and positive kurtosis (a "sharp" peak; leptokurtic).
The use of these terms is not limited to your description of a distribution following a visual inspection. Skewness and kurtosis are also included in your list of descriptive statistics and should be reported when analyzing your distribution of scores. Skewness and kurtosis values near zero indicate a shape that is symmetric or close to normal, respectively. Values of −1 to +1 are considered ideal, whereas values ranging from −2 to +2 are considered acceptable for psychometric purposes.
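The ranges above can be expressed as a simple decision rule. The sketch below is in Python, not SPSS. The skewness and kurtosis functions use the bias-adjusted sample formulas commonly documented for SPSS and spreadsheet software; treat that correspondence as an assumption and rely on your actual SPSS output for the assignment. The function names (skewness, kurtosis, rate) are hypothetical.

```python
from math import sqrt

def _z_scores(scores):
    """Standardize scores using the sample SD (N - 1 in the denominator)."""
    n = len(scores)
    m = sum(scores) / n
    s = sqrt(sum((x - m) ** 2 for x in scores) / (n - 1))
    return [(x - m) / s for x in scores]

def skewness(scores):
    """Adjusted sample skewness; needs at least 3 scores."""
    n = len(scores)
    z = _z_scores(scores)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)

def kurtosis(scores):
    """Adjusted sample excess kurtosis (0 for a normal curve); needs at least 4 scores."""
    n = len(scores)
    z = _z_scores(scores)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second

def rate(value):
    """Apply the -1 to +1 (ideal) and -2 to +2 (acceptable) ranges to one statistic."""
    if abs(value) <= 1:
        return "ideal"
    if abs(value) <= 2:
        return "acceptable"
    return "unacceptable"

heart_rates = [65, 70, 75, 75, 130]
print("skewness:", round(skewness(heart_rates), 2), rate(skewness(heart_rates)))
print("kurtosis:", round(kurtosis(heart_rates), 2), rate(kurtosis(heart_rates)))
```

A variable counts as ideal or acceptable only when both its skewness and its kurtosis fall inside the corresponding range, so rate each statistic separately and report the weaker of the two results.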
Outliers
Outliers are defined as extreme scores on either the left or right tail of a distribution, and they can influence the overall shape of that distribution. There are a variety of methods for identifying and adjusting for outliers. Outliers can be detected by calculating z scores (reviewed in Unit 4) or by inspection of a box plot. Once an outlier is detected, the researcher must determine how to handle it. The outlier may represent a data entry error that should be corrected, or the outlier may be a valid extreme score. The outlier can be left alone, deleted, or transformed. Whatever decision is made regarding an outlier, the researcher must be transparent and justify his or her decision.
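As a preview of the z score approach, the sketch below converts each heart rate from the earlier example into a z score (the distance from the mean in standard deviation units) and lists scores beyond a cutoff. It is written in Python, not SPSS. The function names and the cutoff of 1.5 are hypothetical choices made only for this illustration; the text above does not specify a cutoff, and an appropriate criterion depends on the sample and is addressed when z scores are reviewed.

```python
from statistics import mean, stdev

def z_scores(scores):
    """Return (score - mean) / sample SD for every score."""
    m = mean(scores)
    s = stdev(scores)
    return [(x - m) / s for x in scores]

def flag_outliers(scores, cutoff):
    """List the scores whose absolute z score exceeds the cutoff.

    The cutoff is supplied by the caller; no value is prescribed here.
    """
    return [x for x, z in zip(scores, z_scores(scores)) if abs(z) > cutoff]

heart_rates = [65, 70, 75, 75, 130]

for score, z in zip(heart_rates, z_scores(heart_rates)):
    print(score, round(z, 2))

# Illustrative cutoff only (an assumption, not a course guideline).
print("flagged:", flag_outliers(heart_rates, cutoff=1.5))
```

Flagging a score is only the first step; whether to correct, keep, delete, or transform it remains a decision that must be justified.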
References
Field, A. (2013). Discovering statistics using IBM SPSS (4th ed.). Thousand Oaks, CA: Sage.
Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Thousand Oaks, CA: Sage.
OBJECTIVES
To successfully complete this learning unit, you will be expected to:
1. Analyze the strengths and limitations of descriptive statistics.
2. Identify previous experience with and future applications of descriptive statistics.
3. Analyze the purpose and reporting of confidence intervals.
4. Discuss standard error and confidence intervals.
Unit 2 Study 1 - Readings
Use your Warner text, Applied Statistics: From Bivariate Through Multivariate Techniques, to complete the following:
• Read Chapter 2, "Basic Statistics, Sampling Error, and Confidence Intervals," pages 41–80. This reading addresses the following topics:
◦ Sample mean (M).
◦ Sum of squared deviations (SS).
◦ Sample variance (s²).
◦ Sample standard deviation (s).
◦ Sample standard error (SE).
◦ Confidence intervals (CIs).
• Read Chapter 4, "Preliminary Data Screening," pages 125–184. This reading addresses the following topics:
◦ Problems in real data.
◦ Identification of errors and inconsistencies.
◦ Missing values.
◦ Data screening for individual variables.
◦ Data screening for bivariate analysis.
◦ Data transformations.
◦ Reporting preliminary data screening.
SOE Learners – Suggested Readings
Young, J. R., Young, J. L., & Hamilton, C. (2014). The use of confidence intervals as a meta-analytic lens to
summarize the effects of teacher education technology courses on preservice teacher TPACK. Journal of Research on Technology in Education, 46(2), 149–172.
Unit 2 Study 2 - Assignment Preparation
This unit provides context for an upcoming assignment on histograms and descriptive statistics in Unit 3. Look ahead at the instructions and scoring guide for the Unit 3 assignment so that you have it in mind as you study the materials and complete the activities in this unit.
Software Installation
Make sure that IBM SPSS Statistics Standard GradPack is fully licensed, installed on your computer, and running properly. It is important that you have either the Standard or Premium version of SPSS that includes the full range of statistics. Proper software installation is required in order to complete your first SPSS data assignment in Unit 3.
Next, click grades.sav to download the file to your computer.
• Important: Do not use the original George and Mallery grades.sav file, as the course room grades.sav is modified for 7864.
You will use grades.sav throughout the course. The definitions of the variables in the grades.sav data set are found in Chapter 1 of your IBM SPSS Statistics Step by Step text. Understanding these variable definitions is necessary for interpreting SPSS output.
Next week, you will define values and scales of measurement for all variables in your grades.sav file.