DATA STRUCTURES AND ALGORITHMS USING JAVA™
William McAllister
St. Joseph's College
JONES AND BARTLETT PUBLISHERS Sudbury, Massachusetts
BOSTON TORONTO LONDON SINGAPORE
World Headquarters Jones and Bartlett Publishers 40 Tall Pine Drive Sudbury, MA 01776 978-443-5000 info@jbpub.com www.jbpub.com
Jones and Bartlett Publishers Canada 6339 Ormindale Way Mississauga, Ontario L5V 1J2 Canada
Jones and Bartlett Publishers International Barb House, Barb Mews London W6 7PA United Kingdom
Jones and Bartlett's books and products are available through most bookstores and online booksellers. To contact Jones and Bartlett Publishers directly, call 800-832-0034, fax 978-443-8000, or visit our website, www.jbpub.com. Substantial discounts on bulk quantities of Jones and Bartlett's publications are available to corporations, professional associations, and other qualified organizations. For details and specific discount information, contact the special sales department at Jones and Bartlett via the above contact information or send an email to specialsales@jbpub.com. Copyright © 2009 by Jones and Bartlett Publishers, LLC
All rights reserved. No part of the material protected by this copyright may be reproduced or utilized in any form, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.
Production Credits
Publisher: David Pallai
Acquisitions Editor: Timothy Anderson
Editorial Assistant: Melissa Potter
Production Director: Amy Rose
Production Editor: Katherine Macdonald
Senior Marketing Manager: Andrea DeFronzo
V.P., Manufacturing and Inventory Control: Therese Connell
Composition: Northeast Compositors, Inc. and International Typesetting and Composition
Cover Design: Kristin E. Parker
Cover Image: © Vitezslav Halamka/ShutterStock, Inc.
Printing and Binding: Malloy, Inc.
Cover Printing: Malloy, Inc.
Library of Congress Cataloging-in-Publication Data McAllister, William (William James) Data structures and algorithms using Java / William McAllister.—
1st ed. p. cm. Includes index. ISBN-13: 978-0-7637-5756-4 (pbk.) ISBN-10: 0-7637-5756-X (pbk.)
1. Computer algorithms. 2. Data structures (Computer science) 3. Java (Computer program language) I. Title. QA76.9.A43M373 2008 005.13'3–dc22
2008013120
6048 Printed in the United States of America
12 11 10 09 08 10 9 8 7 6 5 4 3 2 1
Dedication
To my best friend and wife, Gretchen (a.k.a. Maggie),
who supplied endless encouragement and hundreds of commas
Contents
Preface
Chapter 1 Overview and Java Review
1.1 Data Structures
1.1.1 What Is Data?
1.1.2 What Is a Data Structure?
1.2 Selecting a Data Structure
1.2.1 The Data Structure's Impact on Performance
1.2.2 Determining the Performance of a Data Structure
1.2.3 The Trade-Off Process
1.3 Fundamental Concepts
1.3.1 Terminology
1.3.2 Access Modes
1.3.3 Linear Lists
1.3.4 Data Structure Operations
1.3.5 Implementing a Programmer-Defined Data Structure
1.3.6 Procedural Abstractions and Abstract Data Types (ADTs)
1.3.7 Encapsulation
1.4 Calculating Speed (Time Complexity)
1.4.1 Big-O Analysis (O Standing for Order of Magnitude)
1.4.2 Algorithm Speed
1.4.3 Relative Speed of Algorithms
1.4.4 Absolute Speed of an Algorithm
1.4.5 Data Structure Speed
1.5 Calculating Memory Overhead (Space Complexity)
1.6 Java Review
1.6.1 Arrays of Primitive Variables
1.6.2 Definition of a Class
1.6.3 Declaration of an Object
1.6.4 Accessing Objects
1.6.5 Standard Method Name Prefixes
1.6.6 Shallow and Deep Copies
1.6.7 Declaration of an Array of Objects
1.6.8 Objects that Contain Objects as Data Members
1.6.9 Classes that Extend Classes, Inheritance
1.6.10 Parent and Child References
1.6.11 Generic Types
Knowledge Exercises
Programming Exercises
Chapter 2 Array-Based Structures
2.1 The Built-in Structure Array
2.1.1 Multidimensional Arrays
2.2 Programmer-Defined Array Structures
2.2.1 Unsorted Array
2.2.2 Sorted Array
2.2.3 Unsorted-Optimized Array
2.2.4 Error Checking
2.3 Implementation of the Unsorted-Optimized Array Structure
2.3.1 Baseline Implementation
2.3.2 Utility Methods
2.4 Expandable Array-Based Structures
2.5 Generic Data Structures
2.5.1 Design Considerations
2.5.2 Generic Implementation of the Unsorted-Optimized Array
2.5.3 Client-Side Use of Generic Structures
2.5.4 Heterogeneous Generic Data Structures
2.6 Java's ArrayList Class
Knowledge Exercises
Programming Exercises
Chapter 3 Restricted Structures
3.1 Restricted Structures
3.2 Stack
3.2.1 Stack Operations, Terminology, and Error Conditions
3.2.2 Classical Model of a Stack
3.2.3 A Stack Application: Evaluation of Arithmetic Expressions
3.2.4 Expanded Model of a Stack
3.3 Queue
3.3.1 Queue Operations, Terminology, and Error Conditions
3.3.2 Classical Model of a Queue
3.3.3 Queue Applications
3.3.4 Expanded Model of a Queue
3.4 Generic Implementation of the Classic Stack, a Methodized Approach
3.4.1 Generic Conversion Methodology
3.5 Priority Queues
3.6 Java's Stack Class
Knowledge Exercises
Programming Exercises
Chapter 4 Linked Lists and Iterators
4.1 Noncontiguous Structures
4.2 Linked Lists
4.3 Singly Linked Lists
4.3.1 Basic Operation Algorithms
4.3.2 Implementation
4.3.3 Performance of the Singly Linked List
4.3.4 A Stack Implemented as a Singly Linked List
4.4 Other Types of Linked Lists
4.4.1 Circular Singly Linked List
4.4.2 Double-Ended Singly Linked List
4.4.3 Sorted Singly Linked List
4.4.4 Doubly Linked List
4.4.5 Multilinked List
4.5 Iterators
4.5.1 Implementation of an Iterator
4.5.2 Multiple Iterators
4.6 Java's LinkedList Class and ListIterator Interface
Knowledge Exercises
Programming Exercises
Chapter 5 Hashed Data Structures
5.1 Hashed Data Structures
5.2 Hashing Access Algorithms
5.2.1 A Hashing Example
5.3 Perfect Hashed Data Structures
5.3.1 Direct Hashed Structure
5.4 Nonperfect Hashed Structures
5.4.1 Primary Storage Area Size
5.4.2 Preprocessing Algorithms
5.4.3 Hashing Functions
5.4.4 Collision Algorithms
5.4.5 The Linear Quotient (LQHashed) Data Structure Implementation
5.4.6 Dynamic Hashed Structures
Knowledge Exercises
Programming Exercises
Chapter 6 Recursion
6.1 What Is Recursion?
6.2 Understanding Recursive Algorithms
6.2.1 n Factorial
6.2.2 The Code of a Recursive Algorithm
6.2.3 Tracing a Recursive Method's Execution Path
6.3 Formulating a Recursive Algorithm
6.3.1 Definitions
6.3.2 Methodology
6.3.3 Practice Problems
6.4 Problems with Recursion
6.4.1 Dynamic Programming Applied to Recursion
6.5 Backtracking, an Application of Recursion
6.5.1 A Generalized Backtracking Algorithm
6.5.2 Algorithm Adaptation Methodology
Knowledge Exercises
Programming Exercises
Chapter 7 Trees
7.1 Trees
7.1.1 Graphics and Terminology of Trees
7.2 Binary Trees
7.2.1 Terminology
7.2.2 Mathematics
7.3 Binary Search Trees
7.3.1 Basic Operation Algorithms
7.3.2 Performance
7.3.3 Implementation
7.3.4 Standard Tree Traversals
7.3.5 Balanced Search Trees
7.3.6 Array Implementation of a Binary Search Tree
7.3.7 Performance
7.3.8 Java's TreeMap Data Structure
Knowledge Exercises
Programming Exercises
Chapter 8 Sorting
8.1 Sorting
8.2 Sorting Algorithm Speed
8.2.1 Minimum Sort Effort
8.2.2 An Implementation Issue Affecting Algorithm Speed
8.3 Sorting Algorithms
8.3.1 The Binary Tree Sort
8.3.2 The Bubble Sort
8.3.3 The Heap Sort
8.3.4 The Merge Sort
8.3.5 Quicksort
Knowledge Exercises
Programming Exercises
Chapter 9 Graphs
9.1 Introduction
9.1.1 Graphics and Terminology of Graphs
9.2 Representing Graphs
9.2.1 Representing Vertices
9.2.2 Representing Edges
9.3 Operations Performed on Graphs
9.4 Implementing Graphs in the Vertex Number Mode
9.5 Traversing Graphs
9.5.1 Depth-First Traversal
9.5.2 Breadth-First Traversal
9.6 Connectivity and Paths
9.6.1 Connectivity of Undirected Graphs
9.6.2 Connectivity of Directed Graphs
9.6.3 Spanning Trees
9.6.4 Shortest Paths
Knowledge Exercises
Programming Exercises
Appendices
Appendix A ASCII Table
Appendix B Derivation of the Average Search Length of a Nondirect Hashed Data Structure
Appendix C Proof That If an Integer, P, Is Not Evenly Divisible by an Integer Less Than the Square Root of P, It Is a Prime Number
Appendix D Calculations to Show That (n + 1)(log2(n + 1) − 2) Is the Minimum Sort Effort for the Binary Tree Sort
Glossary
Index
Preface
Introduction
Data Structures and Algorithms Using Java is an undergraduate-level textbook for a computer science curriculum. It covers the entire syllabus of the Association for Computing Machinery standard curriculum courses CS103i and CS103o, “Data Structures and Algorithms” (CS2001 evolutions of the traditional CS2 course). As such, it is intended to be used as the primary text for these courses. The book is built upon my academic and industry experience, combining the pedagogical techniques I've developed and refined during 29 years of teaching with my practical knowledge of the computer field acquired during 28 years in the industry. The resulting integration of theory and application will help students at all levels to understand core concepts that will enhance their success in later courses, and ultimately, in their careers.
Why Another Data Structures Book
My primary reason for writing this book was to produce a text that
was more readable, engaging, and instructive than those currently in print without compromising the scope of the ACM CS103 course material. I wanted the text to engage students outside the classroom in the process of investigative discovery. The motivation for this was based partially on the findings from two National Science Foundation grants my colleagues and I received, whose purpose was to investigate ways of attracting students into technical areas of national need (including computer professionals) and retaining them through graduation. “Data Structures and Algorithms,” a sophomore-level course, is usually the “make-or-break” course for computer science majors. As a co-principal investigator on both of these grants, I realized that a highly accessible Data Structures and Algorithms text, with improved pedagogy, would help retain students majoring in computer science.
Advantages of This Text
I've endeavored to present the minimal amount of Java code necessary to illustrate the implementation of the learned concepts. This minimal code set leaves plenty of room for meaningful programming assignments, while being thorough enough that a computer professional could use the text as a source code reference book.
The book is more comprehensive in some topic areas than current books in print, which makes it more appealing to highly capable students. The clearer language, simple examples, and abundance of instructional figures, including nearly 300 illustrations and more than 50 tables, make it more accessible to the majority of students who might otherwise struggle to grasp these and other, more challenging, concepts.
The text is aimed at both the computer scientist who implements, refines, or improves the classic data structures and the software engineer who selects and integrates library implementations of these data structures into particular applications. From the computer scientist's viewpoint, the design of each of the classic data structures is discussed and their algorithms' pseudocode developed and implemented. In addition, the characteristics of each structure that give rise to its speed and space complexity advantages (or disadvantages) are examined. From the software engineer's viewpoint, the text presents the relative merits of the classic data structures to the extent that an optimum choice could be made for a particular application. This is accomplished by deriving the speed and memory requirements of each of the classic data structures in a quantified manner, with the performance of each structure compared in a summary table at the end of each chapter. In addition, multivariable graphs are presented to demonstrate how to locate optimum design points from a memory requirement point of view. Finally, the use of the Java API implementation of the data structures is demonstrated.
Other significant advantages of this book are: a methodized approach to recursive algorithm development aimed at teaching students to “think” recursively; an introduction to the techniques and benefits of dynamic programming; and line-by-line discussions of the significant portions of the implementation code, which include the techniques for encapsulating data within a structure and the techniques for implementing generic structures.
Programming Language Background
The implementation language used in the text is Java, but since the first chapter includes a review of objects and the Java constructs and syntax necessary to understand the book's code examples, students with a programming background in other high-level languages should find the book approachable. The Java review guides students through an understanding of the fundamental concepts essential to data structures that are particular to object-oriented languages, such as Java's use of reference variables, its memory allocation model, arrays of objects, shallow and deep object copies, parent-to-child references, composition (containership), and the generic features of Java 5.0.
Organization
The text employs a unique, student-friendly pedagogical approach and organizational structure. The chapters that present the classic data structures (Chapters 2, 3, 4, 5, 7, and 9) use a common template, rooted in the object paradigm, as the organizational basis of each chapter. The use of this template, embellished with topics particular to each structure, permits the student to “anticipate” the material presented in each chapter, which in turn makes the text more readable. The template begins by stating the shortcomings of the structures studied to that point and identifies the ways in which the new structure addresses some of these shortcomings. Then the structure is visually introduced, and its initialization and operation algorithms are developed in pseudocode. These algorithms are then fully implemented in Java as an encapsulated homogeneous structure, and the structure's use is then demonstrated in a telephone information request application. Next, the performance of the structure is quantified, discussed, and compared to the performance of the previously studied structures. Finally, the use of the Java API implementation of the structure is discussed and demonstrated.
Chapter 1 is an introduction to the basic concepts of data structures and a review of Java. The first part of the chapter introduces the student to the concept of a data structure and its impact on the performance of a software product. The terminology and fundamental concepts common to all data structures are discussed, including encapsulation, abstraction, and algorithm performance. This section concludes with a discussion of speed complexity, space complexity, and Big-O analysis. The second part of the chapter is a review of the Java concepts and constructs necessary to understand the implementations presented in the text. The topics include array declarations, class definitions and objects, containership, inheritance, shallow and deep copies, and the use of the generic features of Java 5.0 to write generic methods and classes.
Chapter 2 discusses the implementation of the built-in structure array and presents three programmer-defined, array-based structures. A major
pedagogical advantage of this chapter is that it utilizes the simplicity of arrays and the students' familiarity with them to facilitate a discussion of the concepts common to all data structures. In the first part of the chapter, the student is encouraged to view arrays from a new perspective: a data structures perspective. Once viewed from that perspective, they can be used as a case study to illustrate the techniques of memory modeling and performance trade-offs that were employed in their design. The second part of the chapter begins the book's study of programmer-defined data structures by presenting three data structures built upon arrays. The simplicity of these structures allows the student to focus on learning the concepts common to all data structures. Then, having gained an early understanding of these concepts, students are free to concentrate on the more complicated conceptual details of the structures presented in subsequent chapters. The chapter also presents techniques for expanding arrays at run-time and for converting the array-based implementation developed in the chapter to a fully generic data structure using the generic features of Java 5.0. It concludes with a discussion of the use of the ArrayList class included in the Java API.
Chapter 3 presents the restricted structures Stack and Queue. The first part of the chapter discusses what they have in common with other data structures and encourages the student to consider what about them is restricted. Thus the student understands that Stack and Queue are part of the broader data structure landscape, rather than two isolated computer science topics. The second part of the chapter presents the details of Stack and Queue, including a full implementation of each structure. The Queue implementation is used as the basis of presenting a methodized approach to converting a data structure class to a fully generic class using the generic features of Java 5.0. The chapter concludes with a discussion of priority queues and the use of the Stack class included in the Java API.
Chapter 4 introduces the student to linked lists, their use in the implementation of a stack, and iterators. The first part of the chapter discusses singly linked lists and their implementation. It also includes a discussion of several variations of a singly linked list, doubly linked lists, and multi-linked lists. The second part of the chapter introduces the concept of an iterator, a discussion of the advantages of its use, and a full implementation of an iterator class. As part of this implementation, the concept of an inner class is introduced, which includes a discussion of why classes are coded as inner classes. The chapter concludes with a discussion of the use of the LinkedList and ListIterator classes included in the Java
API. Chapter 5 deals with the topic of hashing, but it significantly extends
the traditional treatment of this topic. It treats the pervasive data structures that use hashing algorithms as a separate data structure category, Hashed Structures. The first part of the chapter illustrates the advantages of hashing via an example, and then it presents the algorithms of a data structure based on a perfect hashing function. The second part of the chapter discusses nonperfect hashing functions, collision algorithms and their performance, and string preprocessing algorithms used by data structures that utilize nonperfect hashing functions. Included is a discussion of optimum hash table size and a 4k + 3 prime number generator. All these concepts are then applied in the implementation of a data structure based on the division hashing function and a high-performance collision algorithm. The chapter concludes with a discussion of expandable hashed structures and the use of the Hashtable class included in Java's API.
Chapter 6 presents the topic of recursion. The first part of the chapter introduces recursion and provides the student with an understanding of how a recursive method works. The student is taught to think recursively in the second part of the chapter via a methodized approach to recursive algorithm development. Several examples are presented whose recursive solutions can be discovered by directly applying the methodology. Once comfortable with the methodology, the student is presented with other examples that require modification in order to discover their recursive solutions. The chapter concludes with a discussion of the speed and space complexity problems of recursive algorithms and the application of dynamic programming to alleviate these problems. Other applications of recursion appear in the chapters on trees, sorting, and graphs.
Chapter 7 introduces the student to the topic of trees. After a discussion of the terminology of trees in general and binary trees in particular, including the mathematics of binary trees, the algorithms of the binary search tree are introduced. All three cases of the delete algorithm are presented. The structure is implemented using a linked implementation, and a subsequent analysis of its performance leads to a discussion of self-balancing trees, specifically the AVL and red-black trees. The second part of the chapter presents the array-based implementation of a binary tree, which includes the pseudocode of its Insert and Fetch algorithms. The poor performance of the Delete algorithm under this implementation is discussed along with a suggested remedy. The chapter concludes with a discussion of the use of the red-black tree
implementation TreeMap included in Java's API. Chapter 8 presents the topic of sorting. It begins with a discussion of
the motivation for sorting and demonstrates the need for efficient sorting algorithms. Included is a discussion of the parameters and techniques used to determine the speed of these algorithms. The first algorithm presented is the binary tree sort, which uses the student's knowledge of search trees as a stepping stone to the analysis of sorting algorithms. The remainder of the chapter presents four other classic sorting algorithms, and it implements two of the better performers—the merge sort and quicksort—recursively. As each algorithm is discussed, its speed and space complexity are determined and summarized in a table for comparative purposes.
Chapter 9 presents the topic of graphs. In the first part of the chapter, they are presented as part of the data structure landscape. After a discussion of the terminology and the mathematics of graphs, the techniques for representing them are introduced and compared, and the algorithms used to operate on them are developed and implemented. These algorithms include both depth-first and breadth-first traversals. The first part of the chapter concludes with an implementation of a graph structure that includes a depth-first traversal. The second part of the chapter discusses the issues of connectivity and paths along with the associated algorithms. These algorithms include the determination of connectivity, spanning trees, and shortest paths based on Warshall's, Dijkstra's, and Floyd's algorithms.
Supplemental Materials
The pedagogical features of the text are significantly enhanced by animation courseware that can be run on any Java-enabled browser. These animations demonstrate the functionality of the algorithms associated with each data structure presented in the text by presenting the changes that take place in main memory as the algorithms execute. Other ancillary materials provided with the text are PowerPoint lecture outlines, the text's source code, spreadsheets that can be used to perform design performance trade-offs between various data structures, and solutions to selected exercises. These supplements can be found at Jones and Bartlett's catalog page, http://www.jbpub.com/catalog/9780763757564.
Acknowledgments
I would like to thank the team at Jones and Bartlett that aided me in
the preparation of this book and guided me through the publication process: my acquisitions editor, Tim Anderson; his editorial assistant, Melissa Potter; Melissa Elmore, Associate Production Editor; Kat Macdonald, Production Editor; my copy editor, Diana Coe; and the entire production staff.
I'd also like to thank the reviewers for taking the time to review the manuscript and providing many valuable suggestions, most of which have been incorporated into the textbook: Tom Murphy, Contra Costa College, and Victor Shtern, Boston University.
Most importantly, I'd like to thank my family, friends, and colleagues who not only offered encouragement but also endless patience and understanding during the times when I was too busy to be me. Finally, I'd like to thank St. Joseph's College for the sabbatical that allowed me to get this textbook off the ground.
To the Students
It's my hope that you enjoy reading this book, and that you will
experience the joy of accomplishment as you learn material that is fundamental to the remarkable discipline in which we have chosen to immerse ourselves. The pedagogy on which it is based evolved through the classroom experiences I've had with hundreds of students much like you. By the time you finish this book, you will be way ahead of my son-in-law (a lawyer) who says he'd rather wait for Data Structures: The Movie . I imagine each of you reading and learning from this book (even before the movie version is released), and I wish you much success in your future careers.
William McAllister
CHAPTER 1
Overview and Java Review
OBJECTIVES
The objectives of this chapter are to familiarize the student with the
concepts common to all data structures and to review the Java constructs used to implement them. More specifically, the student will be able to
Understand the basic terminology of data structures, including data abstraction, encapsulation, linear lists, complexity, homogeneity, and the four basic operations performed on structures in both the key field and the node number mode.
Explain the difference between programmer-defined and built-in data structures, shallow and deep copies, absolute and relative algorithm speed, and static and nonstatic methods.
Understand the performance of a data structure, the factors that affect it, and the techniques used to quantify it, including the application of a Big-O analysis.
Understand the binary search algorithm and be able to explain and verify its O(log2 n ) performance.
Understand the significance of a data structure's density and be able to calculate it.
Understand and be able to declare primitive and reference variables in Java, arrays of primitive and reference variables, and understand their underlying memory models.
Understand and implement methods, classes, declarations, and operations on objects, and the passing of arguments by value.
Implement a node definition class and an application class that creates nodes and operates on them.
Understand and be able to implement inheritance and containership.
Generalize a method or class using the generic features of Java.
1.1 Data Structures
A thorough knowledge of the topic of Data Structures is essential to both software design and implementation. Its subject matter embodies techniques for writing software that performs efficiently, as opposed to software that merely produces correct results. Properly designed data structures can increase the speed of a program and reduce the amount of memory required to store the data that the program processes. In addition, the subject serves as a bridge between a basic programming course and more advanced topics such as operating systems, networking, and computer organization.
When asked to reflect on their undergraduate careers, most computer scientists consider the data structures course to be the most challenging one in the undergraduate curriculum. Not only is the sheer volume of the material burdensome, but the concepts and algorithms introduced in the course are quite novel. Furthermore, implementing the concepts in functional software often taxes students' novice programming abilities. It is not unusual for students to struggle at the beginning of this course until their coding skills improve, at which point they can focus all of their attention on the underlying concepts and algorithms.
After a brief discussion of data, data structures, and their roles and importance in the field of computer science, this chapter will present topics common to all data structures. It concludes with a review of the Java language, including the topics of objects, inheritance, and generics, which are necessary to implement the concepts presented in subsequent chapters.
1.1.1 What Is Data?
Data is information.
Your name, your address, and your age are all examples of data. The source of the information can be any device attached to a computer system, such as a keyboard, a mouse, a modem, a network line, a disk drive, or some other mass storage device. Sometimes the program itself can be the source of the data if, during processing, it generates data (Figure 1.1).
Figure 1.1 Some of the Sources of a Program's Input Data
All programs, when viewed on a macro level, are the same: when
executed they accept input data, process the data, and then output a result (Figure 1.2). In some special cases, the input or output data set is empty, but more often it is not. Occasionally, although the input or output information appears to be empty, it is not. For example, consider a program that, when executed, simply generates a random number. It appears to have no input data; however, random number algorithms require a seed value, and although the value is not input by the program user, it still must be supplied for the algorithm to function properly. This seed value is often the current time supplied to the program by the computer's system clock.
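The role of the clock-supplied seed can be illustrated with a short Java sketch; the class name SeedDemo and its variables are illustrative assumptions, not code from the text.

import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // The "hidden" input: a seed taken from the computer's system clock,
        // expressed in milliseconds since January 1, 1970.
        long seed = System.currentTimeMillis();

        Random generator = new Random(seed); // an explicitly seeded generator
        System.out.println("Seed supplied by the clock: " + seed);
        System.out.println("Random number produced: " + generator.nextInt(100));

        // Invoking new Random() with no argument lets the library choose a
        // seed internally, so the user still never types one in.
    }
}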
Studies show that programs spend 80% of their execution time searching through memory to locate the data they process. When the data set is small, this time is usually insignificant. For example, a program that outputs the largest of 10 numbers stored in memory will identify the largest number after 10 memory accesses. Assuming a memory access takes 100 nanoseconds (100 × 10⁻⁹ seconds), the 10 memory accesses will be complete and the result will be available in 1,000 nanoseconds. It is difficult to imagine a user growing impatient in one millionth of a second. However, if the information were needed to make a critical mid-course correction to a space probe, or to shut down a runaway nuclear reactor, the time to locate the datum could be significant, even though the data set is small.
Figure 1.2 Macro View of a Program
When programs deal with large data sets, the time spent searching for
a datum is more likely to be significant. Consider a program that searches sequentially through 400 million social security records. If the information requested is the 400 millionth record, and each memory access takes 100 nanoseconds, it will take 40 seconds to locate the information (400 × 10⁶ accesses × 100 × 10⁻⁹ seconds per access). Studies show users grow impatient in one to two seconds; they find 40 seconds intolerable (Figure 1.3). Thus, the speed at which a program locates the data to be processed is usually a concern when designing programs that operate on large data sets, and it must also be considered when designing programs that operate on small data sets under stringent time constraints.
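The sequential search just described, and the arithmetic behind the 40-second estimate, can be sketched in a few lines of Java. The record count and the 100-nanosecond access time come from the discussion above; the class, method, and variable names are illustrative assumptions.

public class SequentialSearchCost {
    // Examines the array front to back and returns the index of target,
    // or -1 if it is absent. In the worst case every element is examined.
    public static int sequentialSearch(long[] records, long target) {
        for (int i = 0; i < records.length; i++) {
            if (records[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Worst case from the text: the requested record is the 400 millionth one.
        double accesses = 400e6;          // 400 × 10⁶ memory accesses
        double secondsPerAccess = 100e-9; // 100 nanoseconds per access
        System.out.println("Worst-case search time: "
                + accesses * secondsPerAccess + " seconds"); // prints 40.0 seconds
    }
}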
Another program design concern is efficient use of main memory. Programs execute most rapidly when both the data that they operate on and their processing instructions reside simultaneously in main memory. However, even on modern computer systems, main memory is a limited resource. Therefore, it is important that the amount of storage required for the data being processed (as well as for the program instructions) be kept to a minimum. Techniques for minimizing the data storage requirements of a program and for maximizing the speed at which the program locates the data it processes are fundamental to the topic of data structures. Data Structures is the area of computer science that addresses these issues in order to ensure that the storage requirements and the speed at which the data is accessed are consistent with the requirements of the application.
Figure 1.3 A User's Reaction to a Slow Program
1.1.2 What Is a Data Structure?
The National Institute of Standards and Technology defines a data structure as: “A data structure is an organization of information, usually in memory, for
better algorithm efficiency.” Therefore, the study of data structures is the study of how to organize
the information that a program processes in a way that improves the program's performance. Specifically, the information is organized in a way that facilitates the operations the program's algorithm performs on the data.
For example, suppose salespersons used a program to look up the price of an item in a store's inventory. A good organization of the data would be to place the most popular items at the beginning of the data set, since a sequential search would then rapidly locate them. However, this organization of the data would not be as attractive if the function of the program were to output the store's inventory in alphabetic order.
To improve the program's efficiency, the organization scheme often requires additional storage over and above the size of the program's data set. This extra storage is commonly referred to as overhead. Returning to our store example, if we wished to perform rapid price checks and rapidly produce inventory printouts, an organization scheme might store the data set twice: once ordered by an item's popularity and once in alphabetic order. Although this would facilitate the operations performed on the data set (price check and inventory output), it would double the data memory required by the application.
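A minimal Java sketch of this doubled-storage scheme appears below. The item class, its fields, and the sample values are illustrative assumptions; only the idea of keeping two orderings of the same data set comes from the discussion above.

import java.util.Arrays;
import java.util.Comparator;

public class InventoryViews {
    // A simple inventory item; the fields are hypothetical.
    static class Item {
        String name;
        double price;
        int popularity; // higher means requested more often

        Item(String name, double price, int popularity) {
            this.name = name;
            this.price = price;
            this.popularity = popularity;
        }

        public String toString() {
            return name + " $" + price;
        }
    }

    public static void main(String[] args) {
        Item[] byPopularity = {
            new Item("milk", 3.49, 12000),
            new Item("batteries", 4.99, 9500),
            new Item("shampoo", 6.25, 4300)
        };

        // The scheme described in the text keeps the data set in two orders:
        // by popularity for fast sequential price checks, and alphabetically
        // for inventory printouts. Here only the array of references is
        // duplicated; duplicating the listings themselves, as the text
        // describes, is the overhead that doubles the data memory.
        Item[] alphabetical = byPopularity.clone();
        Arrays.sort(byPopularity, Comparator.comparingInt((Item i) -> i.popularity).reversed());
        Arrays.sort(alphabetical, Comparator.comparing((Item i) -> i.name));

        System.out.println("Price-check order: " + Arrays.toString(byPopularity));
        System.out.println("Printout order:    " + Arrays.toString(alphabetical));
    }
}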
Standard for Goodness
A good data structure is one that organizes the data in a way that facilitates
the operations performed on the data, while keeping its total storage requirement at, or close to, the size of the data set.
There are two types of data structures: built-in data structures and programmer-defined data structures. Built-in data structures are schemes for storing data that are part of a programming language. Examples of these are memory cell (variable) declarations, arrays, and Java's String class.
Programmer-defined data structures are schemes for storing data that are conceived and implemented by the programmer of a particular program. These data structures are constructed from the built-in data structures. Examples of these are parallel arrays, class definitions, hashing schemes, linked lists, trees, stacks, and queues. As programming languages
evolve, the built-in data structures have expanded to include some of the programmer-defined data structures. For instance, Java's Application Programming Interface includes implementations of hashing schemes, linked lists, trees, stacks, and queues. Many of these data structures are also implemented in the C++ Standard Template Library.
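The snippet below contrasts a built-in structure (an array) with a few of the Java API classes just mentioned. The variable names and sample values are illustrative assumptions, not examples from the text.

import java.util.HashMap;
import java.util.LinkedList;
import java.util.Stack;
import java.util.TreeMap;

public class BuiltInVersusApi {
    public static void main(String[] args) {
        // Built-in structure: the array is part of the language itself.
        int[] grades = {88, 92, 75};

        // Library structures: once purely programmer-defined schemes, these
        // now ship with the Java API.
        HashMap<String, String> phoneBook = new HashMap<>(); // hashing scheme
        LinkedList<String> waitList = new LinkedList<>();    // linked list
        TreeMap<String, Double> prices = new TreeMap<>();    // tree
        Stack<Integer> history = new Stack<>();              // stack

        phoneBook.put("Al", "555-1212");
        waitList.add("Barbara");
        prices.put("milk", 3.49);
        history.push(grades[0]);

        System.out.println(phoneBook + " " + waitList + " " + prices + " " + history);
    }
}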
1.2 Selecting a Data Structure
The selection of a data structure for a particular application can have
a significant effect on the performance of the program. Techniques borrowed from other engineering disciplines are used to ensure that the structure selected satisfies the performance criteria of the application and has a minimal impact on the overall implementation cost of the application.
1.2.1 The Data Structure's Impact on Performance
In an introductory programming course, the programs students write
operate on small data sets and execution time constraints are not considered. As such, these programs are evaluated using the following two criteria (standards for goodness):
• The “penmanship” of the program is good.
• For all sets of valid inputs, the program produces the correct outputs.
Here, we will consider penmanship to include such issues as an absence of syntax errors, good variable naming, proper indentation, good organization of the code (including the use of subprograms), and proper use of the language constructs (e.g., appropriate use of switch statements vs. if-else-if statements).
These criteria are adequate for programs that operate on small data sets. However, as discussed, programs that process large sets of data, or that have stringent speed constraints imposed on them, must also be evaluated against two additional criteria:
• Efficient use of storage—both main memory and external storage.
• Speed of execution.
It is these additional criteria that bring us to the study of Data Structures, because the scheme that is selected for storing large data sets can have a significant impact on both speed and memory utilization.
To illustrate this point, consider a program that would service
telephone number information requests for a large city. Aside from information requests, listings will be both added to, and deleted from, the database as residents move into and out of town. In addition, listings will be updated as customers relocate within the town. The conditions and assumptions of the application are given in Table 1.1, as is the operation speed requirement, 0.10 seconds.
Suppose that four software firms have developed a program for this application, each of which uses a different data structure to store the data, as specified in Table 1.2 .
Table 1.1 Conditions and Assumptions for a Telephone Information Request Application
Condition: Problem Assumption
Number of phone listings: 100,000,000
Size of each person's name and each listing: 16 bytes and 50 bytes
Time to fetch 4 bytes from memory: 100 nanoseconds (100 × 10⁻⁹ seconds)
Time to execute an instruction: 2 nanoseconds (2 × 10⁻⁹ seconds)
Maximum time to perform an operation: 0.10 seconds
Table 1.2 Data Structures Used by Four Telephone Information Request Programs
Program 1: Unsorted Array
Program 2: Sorted Array
Program 3: Hashing
Program 4: Perfect Hashing
Based on the first set of criteria, all the programs perform equally well: for valid inputs, each program produces valid outputs, and they all are written with good penmanship. However, since they operate on a large data set with a specified time constraint—perform an operation within 0.10 seconds—we must also consider the two additional evaluation criteria: efficient use of memory and speed of execution.
To evaluate the speed and memory utilization of the programs,
assume each program is put into service for several days. The average time needed to perform four common operations on the data set is monitored, as is the additional storage required by the programs beyond that used to store the 100,000,000 listings. The results, summarized in Table 1.3, show that the performances of the candidate programs differ greatly. The shaded cells in the table indicate unacceptable performance. Only programs 3 and 4 meet the speed requirement of the problem, and program 4 requires an unacceptable amount of additional memory—memory above that required to store the 100,000,000 listings. (The techniques used to calculate the data in Table 1.3 are discussed in another section of this chapter.) By examining the data in the table, we can clearly see that the choice of a data structure can have a large effect on the execution speed and memory requirements of a program.
1.2.2 Determining the Performance of a Data Structure
In our hypothetical telephone listing problem, we indicated that the
four programs were actually put into service to evaluate their performance. However, this is most often not the way the merits of candidate data structures (i.e., the alternative structures being considered) are evaluated. It would be much too costly to code an entire program just to determine whether the performance of a candidate data structure is acceptable. Rather, this evaluation must take place during the early stages of the design of the program, before any code is written.
To do this, we borrow tools from other engineering disciplines. Consider a group of civil engineers responsible for evaluating several candidate designs for a bridge. Certainly, they would never consider fabricating each design so that they can be tested to determine which design is the best. Cost issues aside, this build-and-test approach to design
trade-offs would be too time-consuming. Instead, civil engineers perform detailed calculations early in the design process to evaluate candidate designs. Based on the results of these calculations, they select the lowest cost design that satisfies the performance criteria.
Software engineers have adopted this design trade-off technique to evaluate candidate data structures early in the program design process. Consistent with the definition of a data structure, two sets of calculations are performed on each candidate data structure during the trade-off process:
• Calculations to determine the speed of the operations to be performed on the data.
• Calculations to determine the amount of extra storage (overhead) associated with the data structure.
These two calculations are considered to be a measure of the performance of a data structure. A high-performing data structure is one that
• Rapidly operates on the data.
• Minimizes storage requirements.
Unfortunately, due to the architecture of modern computer systems, rapid operation and minimal storage are usually mutually exclusive. Data structures that are very fast normally require high storage overhead; this was the case with our fourth telephone program's data structure. Data structures that minimize memory overhead can be slow; this was the case with the data structures of our first and second telephone programs. Thus, the selection of the best structure for a particular application is usually a compromise, or trade-off, among speed, overhead, and one other very important factor: cost.
1.2.3 The Trade-Off Process
Once the performance of the candidate data structures has been calculated (i.e., their speed and memory requirements), the trade-off process begins, aimed at selecting the best data structure for the application. The selection of the best data structure should always be based on the following guideline:
Select the least expensive data structure that satisfies the speed requirements and storage constraints of the application.
Thus, there are three factors to consider in the trade-off: cost, speed,
and memory overhead. The process is illustrated in Figure 1.4 . Speed requirements can vary widely from one application to another.
A program that is monitoring the temperature of a nuclear reactor may have to operate on its data within a few hundred nanoseconds to prevent a meltdown, while a program that updates a bank account balance may have several seconds to perform its operation. When the data processing is performed to update a display viewed by humans, an operation time of 0.1 seconds is more than adequate. Studies show that faster response times are imperceptible to humans. Whatever the speed requirements for a particular problem are, good software engineering practices mandate that they be specified before the program is designed and that they be documented in the project's Requirements Document .
The cost of a data structure is primarily the labor cost associated with developing the code to implement the data structure. This includes its design, coding, testing, and documentation. Some data structures can be implemented in a few lines of code, while others take thousands of lines. Software engineering studies indicate that the cost of software is directly proportional to the number of lines of code that the software contains. Typical software costs for large software projects are illustrated in Figure 1.5 for various burdened labor rates.¹ The cost shown includes the cost of design, implementation, testing, and documentation. Thus, the most cost-effective data structures are those whose implementations require a minimal amount of code and utilize built-in data structures in their design.
Figure 1.4 Data Structure Selection Process Involving Four Candidate Structures
Figure 1.5 Cost of Software that is Part of a Large Project
Using data similar to that presented in Figure 1.5, the designer of a
data structure can estimate its cost by simply estimating the number of lines of code required to implement the data structure. Once this is done, the most inexpensive structure is selected from those that meet the speed criteria and demonstrate an acceptable level of memory overhead.
To illustrate the trade-off process depicted in Figure 1.4 , we will return to our telephone listing application. During the Requirements Phase of the project, our system analysis team has met with the customer and has consolidated their findings in the project's Requirements Document. Assume that the conditions specified in Table 1.1 have been reproduced from that document. As previously mentioned, four candidate data structures have been proposed for the project (see Table 1.2 ). These will be passed through the trade-off process illustrated in Figure 1.4 .