[Content_Types].xml
_rels/.rels
word/_rels/document.xml.rels
word/document.xml
CHAPTER PREVIEW Businesses of every size organize data records into collections called databases. At one extreme, small businesses use databases to keep track of customers; at the other extreme, huge corporations such as Boeing and Verizon use databases to support complex sales, marketing, and operations activities. In between are businesses such as All Road Parts that use databases as a crucial part of their operations. Such businesses have a small staff of professionals and can’t always support special needs, like those of Addison and Drew at All Road Parts. To obtain the one-of-a-kind reports they need, Addison and Drew need to be creative and adaptable.This chapter discusses the why, what, and how of database processing. We begin by describing the purpose of databases and then explain the important components of database systems. We then overview the process of creating a database system and summarize your role as a future user of such systems.Users have a crucial role in the development of database applications. Specifically, the structure and content of the database depend entirely on how users view their business activity. To build the database, the developers will create a model of that view using a tool called the entity-relationship model. You need to understand how to interpret such models because the development team might ask you to validate the correctness of such a model when building a system for your use. Finally, we describe the various database administration tasks.This chapter focuses on database technology. Here we consider the basic components of a database and the functions of database applications. You will learn how Addison used database reporting to solve the All Road Parts problem in Chapter 9.Q1 What Is the Purpose of a Database? The purpose of a database is to keep track of things. When most students learn that, they wonder why we need a special technology for such a simple task. Why not just use a list? If the list is long, put it into a spreadsheet.In fact, many professionals do keep track of things using spreadsheets. If the structure of the list is simple enough, there is no need to use database technology. The list of student grades in Figure 5-1, for example, works perfectly well in a spreadsheet.Figure 5-1 A List of Student Grades Presented in a SpreadsheetFigure 5-2 Student Data Shown in a Form from a Database lefttopSuppose, however, that the professor wants to track more than just grades. Say that the professor wants to record email messages as well. Or perhaps the professor wants to record both email messages and office visits. There is no place in Figure 5-1 to record that additional data. Of course, the professor could set up a separate spreadsheet for email messages and another one for office visits, but that awkward solution would be difficult to use because it does not provide all of the data in one place.Instead, the professor wants a form like that in Figure 5-2. With it, the professor can record student grades, emails, and office visits all in one place. A form like the one in Figure 5-2 is difficult, if not impossible, to produce from a spreadsheet. Such a form is easily produced, however, from a database.The key distinction between Figures 5-1 and 5-2 is that the data in Figure 5-1 is about a single theme or concept. It is about student grades only. The data in Figure 5-2 has multiple themes; it shows student grades, student emails, and student office visits. We can make a general rule from these examples: Lists of data involving a single theme can be stored in a spreadsheet; lists that involve data with multiple themes require a database. We will say more about this general rule as this chapter proceeds.As you will see, databases can be more difficult to develop than spreadsheets; this difficulty causes some people to prefer to work with spreadsheets—or at least pretend to—as described in the Security Guide on pages 186–187. Q2 What Is a Database? A database is a self-describing collection of integrated records. To understand the terms in this definition, you first need to understand the terms illustrated in Figure 5-3. As you learned in Chapter 4, a byte is a character of data. In databases, bytes are grouped into columns, such as Student Number and Student Name. Columns are also called fields. Columns or fields, in turn, are grouped into rows, which are also called records. In Figure 5-3, the collection of data for all columns (Student Number, Student Name, HW1, HW2, and MidTerm) is called a row or a record. Finally, a group of similar rows or records is called a table or a file. From these definitions, you can see that there is a hierarchy of data elements, as shown in Figure 5-4.It is tempting to continue this grouping process by saying that a database is a group of tables or files. This statement, although true, does not go far enough. As shown in Figure 5-5, a database is a collection of tables plus relationships among the rows in those tables, plus special data, called metadata, that describes the structure of the database. By the way, the cylindrical symbol labeled “database” in Figure 5-5 represents a computer disk drive. It is used like this because databases are stored on disks.lefttopFigure 5-3 Student Table (also called a file)Figure 5-4 Hierarchy of Data ElementsFigure 5-5 Components of a DatabaselefttopConsider the terms on the left-hand side of Figure 5-5. You know what tables are. To understand what is meant by relationships among rows in tables, examine Figure 5-6. It shows sample data from the three tables Email, Student, and Office Visit. Notice the column named Student Number in the Email table. That column indicates the row in Student to which a row of Email is connected. In the first row of Email, the Student Number value is 1325. This indicates that this particular email was received from the student whose Student Number is 1325. If you examine the Student table, you will see that the row for Andrea Baker has this value. Thus, the first row of the Email table is related to Andrea Baker.lefttopFigure 5-6 Example of Relationships Among Rows Now consider the last row of the Office Visit table at the bottom of the figure. The value of Student Number in that row is 4867. This value indicates that the last row in Office Visit belongs to Adam Verberra.From these examples, you can see that values in one table relate rows of that table to rows in a second table. Several special terms are used to express these ideas. A key (also called a primary key) is a column Now consider the last row of the office Visit table at the bottom of the figure. The value of Student Number in that row is 4867. This value indicates that the last row in office Visit belongs to Adam Verberra.From these examples, you can see that values in one table relate rows of that table to rows in a second table. Several special terms are used to express these ideas. A key (also called a primary key) is a column or group of columns that identifies a unique row in a table. Student Number is the key of the Student table. Given a value of Student Number, you can determine one and only one row in Student. Only one student has the number 1325, for example.Every table must have a key. The key of the Email table is EmailNum, and the key of the office Visit table is Visit ID. Sometimes more than one column is needed to form a unique identifier. In a table called City, for example, the key would consist of the combination of columns (City, State) because a given city name can appear in more than one state.Student Number is not the key of the Email or the office Visit tables. We know that about Email because there are two rows in Email that have the Student Number value 1325. The value 1325 does not identify a unique row; therefore, Student Number cannot be the key of Email.Nor is Student Number a key of office Visit, although you cannot tell that from the data in Figure 5-6. If you think about it, however, there is nothing to prevent a student from visiting a professor more than once. If that were to happen, there would be two rows in Office Visit with the same value of Student Number. It just happens that no student has visited twice in the limited data in Figure 5-6.In both Email and Office Visit, Student Number is a key, but it is a key of a different table, namely Student. Hence, the columns that fulfill a role like that of Student Number in the Email and Office Visit tables are called foreign keys. This term is used because such columns are keys, but they are keys of a different (foreign) table than the one in which they reside.Before we go on, databases that carry their data in the form of tables and that represent relationships using foreign keys are called relational databases. (The term relational is used because another, more formal name for a table like those we’re discussing is relation.) You’ll learn about another kind of database, or data store, in Q8 and in Case Study 5.Metadata Recall the definition of database: A database is a self-describing collection of integrated records. The records are integrated because, as you just learned, rows can be tied together by their key/foreign key relationship. Relationships among rows are represented in the database. But what does self-describing mean?It means that a database contains, within itself, a description of its contents. Think of a library. A library is a self-describing collection of books and other materials. It is self-describing because the library contains a catalog that describes the library’s contents. The same idea also pertains to a database. Databases are self-describing because they contain not only data, but also data about the data in the database.Metadata is data that describes data. Figure 5-7 shows metadata for the Email table. The format of metadata depends on the software product that is processing the database. Figure 5-7 shows the metadata as it appears in Microsoft Access. Each row of the top part of this form describes a column of the Email table. The columns of these descriptions are Field Name, Data Type, and Description. Field Name contains the name of the column, Data Type shows the type of data the column may hold, and Description contains notes that explain the source or use of the column. As you can see, there is one row of metadata for each of the four columns of the Email table: EmailNum, Date, Message, and Student Number.11430020955Figure 5-7 Sample Metadata (in Access) The bottom part of this form provides more metadata, which Access calls Field Properties, for each column. In Figure 5-7, the focus is on the Date column (note the light rectangle drawn around the Date row). Because the focus is on Date in the top pane, the details in the bottom pane pertain to the Date column. The Field Properties describe formats, a default value for Access to supply when a new row is created, and the constraint that a value is required for this column. It is not important for you to remember these details. Instead, just understand that metadata is data about data and that such metadata is always a part of a database.Ethics Guide: QUERYING INEQUALITY? MaryAnn Baker works as a data analyst in human relations at a large, multinational corporation. As part of its compensation program, her company defines job categories and assigns salary ranges to each category. For example, the category M1 is used for first-line managers and is assigned the salary range of $75,000 to $95,000. Every job description is assigned to one of these categories, depending on the knowledge and skills required to do that job. Thus, the job titles Manager of Customer Support, Manager of Technical Writing, and Manager of Product Quality Assurance are all judged to involve about the same level of expertise and are all assigned to category M1.One of MaryAnn’s tasks is to analyze company salary data and determine how well actual salaries conform to established ranges. When discrepancies are noted, human relations managers meet to determine whether the discrepancy indicates a need to:• Adjust the category’s salary range; • Move the job title to a different category; • Define a new category; or • Train the manager of the employee with the discrepancy on the use of salary ranges in setting employee compensation. MaryAnn is an expert in creating database queries. Initially, she used Microsoft Access to produce reports, but much of the salary data she needs resides in the organization’s Oracle database. At first, she would ask the IS Department to extract certain data and move it into Access, but over time she learned that it was faster to ask IS to move all employee data from the operational Oracle database into another Oracle database created just for HR data analysis. Although Oracle provides a graphical query interface like that in Access, she found it easier to compose complex queries directly in SQL, so she learned it and, within a few months, became an SQL expert.“I never thought I’d be doing this,” she said. “But it turns out to be quite fun, like solving a puzzle, and apparently I’m good at it.”left643890One day, after a break, MaryAnn signed into her computer and happened to glance at the results of a query that she’d left running while she was gone. “That’s odd,” she thought, “all the people with Hispanic surnames have lower salaries than the others.” She wasn’t looking for that pattern; it just happened to jump out at her as she glanced at the screen.As she examined the data, she began to wonder if she was seeing a coincidence or if there was a discriminatory pattern within the organization. Unfortunately for MaryAnn’s purposes, the organization did not track employee race in its database, so she had no easy way of identifying employees of Hispanic heritage other than reading through the list of surnames. But, as a skilled problem solver, that didn’t stop MaryAnn. She realized that many employees having Hispanic origins were born in certain cities in Texas, New Mexico, Arizona, and California. Of course, this wasn’t true for all employees; many non-Hispanic employees were born in those cities, too, and many Hispanic employees were born in other cities. This data was still useful, however, because MaryAnn’s sample queries revealed that the proportion of employees with Hispanic surnames who were also born in those cities was very high. “OK,” she thought, “I’ll use those cities as a rough surrogate.”Using birth city as a query criterion, MaryAnn created queries that determined employees who were born in the selected cities earned, on average, 23 percent less than those who were not. “Well, that could be because they work in lower-pay-grade jobs.” After giving it a bit of thought, MaryAnn realized that she needed to examine wages and salaries within job categories. “Where,” she wondered, “do people born in those cities fall in the ranges of their job categories?” So, she constructed SQL to determine where within a job category the compensation for people born in the selected cities fell. “Wow!” she said to herself, “almost 80 percent of the employees born in those cities fall into the bottom half of their salary range.”MaryAnn scheduled an appointment with her manager for the next day.The presence of metadata makes databases much more useful. Because of metadata, no one needs to guess, remember, or even record what is in the database. To find out what a database contains, we just look at the metadata inside the database.Q3 What Is a Database Management System (DBMS)? A database management system (DBMS) is a program used to create, process, and administer a database. As with operating systems, almost no organization develops its own DBMS. Instead, companies license DBMS products from vendors such as IBM, Microsoft, Oracle, and others. Popular DBMS products are DB2 from IBM, Access and SQL Server from Microsoft, and Oracle Database from the Oracle Corporation. Another popular DBMS is MySQL, an open source DBMS product that is license-free for most applications.1 Other DBMS products are available, but these five process the great bulk of databases today.Note that a DBMS and a database are two different things. For some reason, the trade press and even some books confuse the two. A DBMS is a software program; a database is a collection of tables, relationships, and metadata. The two are very different concepts.Creating the Database and Its Structures Database developers use the DBMS to create tables, relationships, and other structures in the database. The form in Figure 5-7 can be used to define a new table or to modify an existing one. To create a new table, the developer just fills the new table’s metadata into the form.To modify an existing table—say, to add a new column—the developer opens the metadata form for that table and adds a new row of metadata. For example, in Figure 5-8 the developer has added a new column called Response?. This new column has the data type Yes/No, which means that the column can contain only one value—Yes or No. The professor will use this column to indicate whether he has responded to the student’s email. A column can be removed by deleting its row in this table, though doing so will lose any existing data.Figure 5-8 Adding a New Column to a Table (in Access) Source: Microsoft Access 2013 1MySQL was supported by the MySQL company. In 2008, that company was acquired by Sun Microsystems, which was, in turn, acquired by Oracle later that year. However, because MySQL is open source, Oracle does not own the source code. A GROUP EXERCISE Using MIS in Class 5: How Much Is a Database Worth? The Firm, Minneapolis (www.TheFirmMpls.com), is a workout studio that realizes more than 15,000 person-visits per month, an average of 500 visits per day. Neil Miyamoto, one of the two business partners, believes that the database is The Firm’s single most important asset. According to Neil:Take away anything else—the building, the equipment, the inventory—anything else, and we’d be back in business 6 months or less. Take away our customer database, however, and we’d have to start all over. It would take us another 8 years to get back where we are.2 Why is the database so crucial? It records everything the company’s customers do. If the Firm decides to offer an early morning kickboxing class featuring a particular trainer, it can use its database to offer that class to everyone who ever took an early morning class, a kickboxing class, or a class by that trainer. Customers receive targeted solicitations for offerings they care about and, maybe equally important, they don’t receive solicitations for those they don’t care about. Clearly, The Firm database has value and, if it wanted to, The Firm could sell that data.In this exercise, you and a group of your fellow students will be asked to consider the value of a database to organizations other than The Firm.1. Many small business owners have found it financially advantageous to purchase their own building. As one owner remarked upon his retirement, “We did well with the business, but we made our real money by buying the building.” Explain why this might be so. 2. To what extent does the dynamic you identified in your answer to item 1 pertain to databases? Do you think it likely that, in 2050, some small business owners will retire and make statements like, “We did well with the business, but we made our real money from the database we generated?” Why or why not? In what ways is real estate different from database data? Are these differences significant to your answer? 3. Suppose you had a national database of student data. Assume your database includes the name, email address, university, grade level, and major for each student. Name five companies that would find that data valuable and explain how they might use it. (For example, Pizza Hut could solicit orders from students during finals week.) 4. Describe a product or service that you could develop that would induce students to provide the data in item 3. 5. Considering your answers to items 1 through 4, identify two organizations in your community that could generate a database that would potentially be more valuable than the organization itself. Consider businesses, but also think about social organizations and government offices. For each organization, describe the content of the database and how you could entice customers or clients to provide that data. Also, explain why the data would be valuable and who might use it.6. Relate what you have learned in this exercise to the 3D printing discussion at All Road Parts. 7. Prepare a 1-minute statement of what you have learned from this exercise that you could use in a job interview to illustrate your ability to innovate the use of technology in business. 8. Present your answers to items 1–6 to the rest of the class. 2Personal conversation with the author, 2012. Reprinted by permission. Processing the Database The second function of the DBMS is to process the database. Such processing can be quite complex, but, fundamentally, the DBMS provides applications for four processing operations: to read, insert, modify, or delete data. These operations are requested in application calls upon the DBMS. From a form, when the user enters new or changed data, a computer program behind the form calls the DBMS to make the necessary database changes. From a Web application, a program on the client or on the server calls the DBMS directly to make the change.Structured Query Language (SQL) is an international standard language for processing a database. All five of the DBMS products mentioned earlier accept and process SQL (pronounced “see-quell”) statements. As an example, the following SQL statement inserts a new row into the Student table:INSERT INTO Student ([Student Number], [Student Name], HW1, HW2, MidTerm) VALUES (1000, ‘Franklin, Benjamin’, 90, 95, 100); As stated, statements like this one are issued “behind the scenes” by programs that process forms and reports. Alternatively, they can be issued directly to the DBMS by an application program.You do not need to understand or remember SQL language syntax. Instead, just realize that SQL is an international standard for processing a database. SQL can also be used to create databases and database structures. You will learn more about SQL if you take a database management class.Administering the Database A third DBMS function is to provide tools to assist in the administration of the database. Database administration involves a wide variety of activities. For example, the DBMS can be used to set up a security system involving user accounts, passwords, permissions, and limits for processing the database. To provide database security, a user must sign on using a valid user account before she can process the database.Permissions can be limited in very specific ways. In the Student database example, it is possible to limit a particular user to reading only Student Name from the Student table. A different user could be given permission to read the entire Student table, but limited to update only the HW1, HW2, and MidTerm columns. Other users can be given still other permissions.In addition to security, DBMS administrative functions include backing up database data, adding structures to improve the performance of database applications, removing data that are no longer wanted or needed, and similar tasks.For important databases, most organizations dedicate one or more employees to the role of database administration. Figure 5-9 summarizes the major responsibilities for this function. You will learn more about this topic if you take a database management course.Q4 How Do Database Applications Make Databases More Useful? A set of database tables, by itself, is not very useful; the tables in Figure 5-6 contain the data the professor wants, but the format is awkward at best. The data in database tables can be made more useful, or more available for the conception of information, when it is placed into forms like that in Figure 5-2 or other formats.A database application is a collection of forms, reports, queries, and application programs3 that serves as an intermediary between users and database data. Database applications reformat database table data to make it more informative and more easily updated. Application programs also have features that provide security, maintain data consistency, and handle special cases.3Watch out for confusion between a database application and a database application program. A database application includes forms, reports, queries, and database application programs. Figure 5-9 Summary of Database Administration (DBA) TasksThe specific purposes of the four elements of a database application are:Forms View data; insert new, update existing, and delete existing dataReports Structured presentation of data using sorting, grouping, filtering, and other operationsQueries Search based upon data values provided by the userApplication programs Provide security, data consistency, and special purpose processing, e.g., handle out-of-stock situationsDatabase applications came into prominence in the 1990s and were based on the technology that was available at that time. Many existing systems today are long-lived extensions to those applications; the ERP system SAP (discussed in Chapter 7) is a good example of this concept. You should expect to see these kinds of applications during the early years of your career.Today, however, many database applications are based on newer technology that employs browsers, the Web, and related standards. These browser-based applications can do everything the older ones do, but they are more dynamic and better suited to today’s world. To see why, consider each type.Figure 5-10 Components of a Database Application System Traditional Forms, Queries, Reports, and Applications In most cases, a traditional database is shared among many users. In that case, the application shown in Figure 5-10 resides on the users’ computers and the DBMS and database reside on a server computer. A network, in most cases not the Internet, is used to transmit traffic back and forth between the users’ computers and the DBMS server computer.Database technology puts unprecedented ability to conceive information into the hands of users. But what do you do with that information when you find something objectionable? See the Ethics Guide on pages 164–165 for an example case. Single-user databases like those in Microsoft Access are an exception. With such databases, the application, the DBMS, and the database all reside on the user’s computer.Traditional forms appeared in window-like displays like that in Figure 5-2. They serve their purpose; users can view, insert, modify, and delete data with them, but by today’s standards, they look clunky.lefttop00Figure 5-11 Example of a Student ReportFigure 5-11 shows a traditional report, which is a static display of data, placed into a format that is meaningful to the user. In this report, each of the emails for a particular student is shown after the student’s name and grade data. Figure 5-12 shows a traditional query. The user specifies query criteria in a window-like box (Figure 5-12a), and the application responds with data that fit those criteria (Figure 5-12b).lefttopFigure 5-12a Sample Query Form Used to Enter Phrase for SearchFigure 5-12b Sample Query Results of Query Operation Source: Microsoft Access 2013 Traditional database application programs are written in object-oriented languages such as C++ and VisualBasic (and even in earlier languages like COBOL). They are thick applications that need to be installed on users’ computers. In some cases, all of the application logic is contained in a program on users’ computers and the server does nothing except run the DBMS and serve up data. In other cases, some application code is placed on both the users’ computers and the database server computer.As stated, in the early years of your career, you will still see traditional applications, especially for enterprise-wide applications like ERP and CRM. Most likely, you will also be concerned, as a user if not in a more involved way, with the transition from such traditional applications into browser-based applications.Browser Forms, Reports, Queries, and Applications lefttopFigure 5-13 Four Application Programs on a Web Server Computer The databases in browser-based applications are nearly always shared among many users. As shown in Figure 5-13, the users’ browsers connect over the Internet to a Web server computer, which in turn connects to a database server computer (often many computers are involved on the server side of the Internet).Figure 5-14 Account Creation Browser FormBrowser applications are thin-client applications that need not be pre-installed on the users’ computers. In most cases, all of the code for generating and processing the application elements is shared between the users’ computers and the servers. JavaScript is the standard language for user-side processing. Languages like C# and Java are used for server-side code, though JavaScript is starting to be used on the server with an open source product named Node.js (all of this is discussed further in Chapter 6).Browser database application forms, reports, and queries are displayed and processed using html and, most recently, using html5, css3, and JavaScript as you learned in Chapter 4. Figure 5-14 shows a browser form that is used to create a new user account in Office 365. The form’s content is dynamic; the user can click on the blue arrow next to Additional Details to see more data. Also, notice the steps in the left-hand side that outline the process that administrator will follow when creating the new account. The current step is shown in color. Compare and contrast this form with that in Figure 5-2; it is cleaner, with much less chrome.Figure 5-15 illustrates a browser report that shows the content of a SharePoint site. The content is dynamic; many of the items can be clicked to produce other reports or take other actions. The user can select a criterion in the box in the upper-right-hand corner to filter the report to display only a specific type of content.Browser-based applications can support traditional queries, but more exciting are graphical queries, in which query criteria are created when the user clicks on a graphic. Figure 5-16 shows a car jack for an off-road vehicle like a Jeep. All Road Parts might use a photo like this to show available parts to customers. Users click on parts of the jack and, in browser code behind the scene, query criteria are sent to the database application to display part order data for that particular part. In this way, users need not specify part numbers and so on when ordering. (See the opening dialogue of Chapter 4 regarding the impact of sales costs on that situation.)Security requirements are more stringent for browser-based applications than for traditional ones. Most traditional applications run within a corporate network that is protected from the wild and woolly Internet. Browser-based applications are normally open to the public, over the Internet, and as such are far more vulnerable. Thus, protecting security is a major function for browser-based application programs. Like traditional database application programs, they need to provide for data consistency and to handle special conditions as well. As an example of the need for data consistency, consider the problems introduced by multi-user processing.Figure 5-15 Browser Report lefttopMulti-User Processing Most traditional and browser-based applications involve multiple users processing the same database. While such multi-user processing is common, it does pose unique problems that you, as a future manager, should know about. To understand the nature of those problems, consider the following scenario, which could occur on either a traditional or browser-based application.Two All Road Parts customers, Andrea and Jeffrey, are both attempting to buy the last two pedal sets for a particular trail bike. Andrea uses her browser to access the All Road Web site and finds that two sets are available. She places both of them in her shopping cart. She doesn’t know it, but when she opened the order form, she invoked an application program on All Road’s server that read the database to find that two sets are available. Before she checks out, she takes a moment to verify with her spouse that she should buy both sets.Meanwhile, Jeffrey uses his browser and also finds that two sets are available because his browser activates that same application that reads the database and finds (because Andrea has not yet checked out) that two are available. He places both in his cart and checks out.Meanwhile, Andrea and her spouse decide to buy both, so she checks out. Clearly, we have a problem. Both Andrea and Jeffrey have purchased the same two pedal sets. One of them is going to be disappointed.This problem, known as the lost-update problem, exemplifies one of the special characteristics of multi-user database processing. To prevent this problem, some type of locking must be used to coordinate the activities of users who know nothing about one another. Locking brings its own set of problems, however, and those problems must be addressed as well. We will not delve further into this topic here, however.Be aware of possible data conflicts when you manage business activities that involve multiuser processing. If you find inaccurate results that seem not to have a cause, you may be experiencing multi-user data conflicts. Contact your IS department for assistance.Q5 How Are Data Models Used for Database Development? In Chapter 10, we will describe the process for developing information systems in detail. However, business professionals have such a critical role in the development of database applications that we need to anticipate part of that discussion here by introducing two topics—data modeling and database design.Because the design of the database depends entirely on how users view their business environment, user involvement is critical for database development. Think about the Student database. What data should it contain? Possibilities are: Students, Classes, Grades, Emails, Office Visits, Majors, Advisers, Student Organizations—the list could go on and on. Further, how much detail should be included in each? Should the database include campus addresses? Home addresses? Billing addresses?In fact, there are dozens of possibilities, and the database developers do not and cannot know what to include. They do know, however, that a database must include all the data necessary for the users to perform their jobs. Ideally, it contains that amount of data and no more. So, during database development the developers must rely on the users to tell them what to include in the database.Figure 5-17 Database Development Process Database structures can be complex, in some cases very complex. So, before building the database the developers construct a logical representation of database data called a data model. It describes the data and relationships that will be stored in the database. It is akin to a blueprint. Just as building architects create a blueprint before they start building, so, too, database developers create a data model before they start designing the database.Figure 5-17 summarizes the database development process. Interviews with users lead to database requirements, which are summarized in a data model. Once the users have approved (validated) the data model, it is transformed into a database design. That design is then implemented into database structures. We will consider data modeling and database design briefly in the next two sections. Again, your goal should be to learn the process so that you can be an effective user representative for a development effort.What Is the Entity-Relationship Data Model? The entity-relationship (E-R) data model is a tool for constructing data models. Developers use it to describe the content of a data model by defining the things (entities) that will be stored in the database and the relationships among those entities. A second, less popular tool for data modeling is the Unified Modeling Language (UML). We will not describe that tool here. However, if you learn how to interpret E-R models, with a bit of study you will be able to understand UML models as well.Entities An entity is some thing that the users want to track. Examples of entities are Order, Customer, Salesperson, and Item. Some entities represent a physical object, such as Item or Salesperson; others represent a logical construct or transaction, such as Order or Contract. For reasons beyond this discussion, entity names are always singular. We use Order, not Orders; Salesperson, not Salespersons.Entities have attributes that describe characteristics of the entity. Example attributes of Order are Order Number, Order Date, Subtotal, Tax, Total, and so forth. Example attributes of Salesperson are Salesperson Name, Email, Phone, and so forth.Entities have an identifier, which is an attribute (or group of attributes) whose value is associated with one and only one entity instance. For example, Order Number is an identifier of Order because only one Order instance has a given value of Order Number. For the same reason, Customer Number is an identifier of Customer. If each member of the sales staff has a unique name, then Salesperson Name is an identifier of Salesperson.Before we continue, consider that last sentence. Is the salesperson’s name unique among the sales staff? Both now and in the future? Who decides the answer to such a question? Only the users know whether this is true; the database developers cannot know. This example underlines why it is important for you to be able to interpret data models because only users like you will know for sure.Figure 5-18 Student Data Model Entities Figure 5-18 shows examples of entities for the Student database. Each entity is shown in a rectangle. The name of the entity is just above the rectangle, and the identifier is shown in a section at the top of the entity. Entity attributes are shown in the remainder of the rectangle. In Figure 5-18, the Adviser entity has an identifier called AdviserName and the attributes Phone, Campus Address, and EmailAddress.Observe that the entities Email and Office Visit do not have an identifier. Unlike Student or Adviser, the users do not have an attribute that identifies a particular email. We could make one up. For example, we could say that the identifier of Email is Email Number, but if we do so we are not modeling how the users view their world. Instead, we are forcing something onto the users. Be aware of this possibility when you review data models about your business. Do not allow the database developers to create something in the data model that is not part of your business world.Relationships Entities have relationships to each other. An Order, for example, has a relationship to a Customer entity and also to a Salesperson entity. In the Student database, a Student has a relationship to an Adviser, and an Adviser has a relationship to a Department.Figure 5-19 shows sample Department, Adviser, and Student entities and their relationships. For simplicity, this figure shows just the identifier of the entities and not the other attributes. For this sample data, Accounting has three professors—Jones, Wu, and Lopez—and Finance has two professors—Smith and Greene.Figure 5-19 Example of Department, Adviser, and Student Entities and Relationships Figure 5-20 Sample Relationships Version 1 The relationship between Advisers and Students is a bit more complicated because in this example an adviser is allowed to advise many students and a student is allowed to have many advisers. Perhaps this happens because students can have multiple majors. In any case, note that Professor Jones advises students 100 and 400 and that student 100 is advised by both Professors Jones and Smith.Diagrams like the one in Figure 5-19 are too cumbersome for use in database design discussions. Instead, database designers use diagrams called entity-relationship (E-R) diagrams. Figure 5-20 shows an E-R diagram for the data in Figure 5-19. In this figure, all of the entities of one type are represented by a single rectangle. Thus, there are rectangles for the Department, Adviser, and Student entities. Attributes are shown as before in Figure 5-18.Additionally, a line is used to represent a relationship between two entities. Notice the line between Department and Adviser, for example. The forked lines on the right side of that line signify that a department may have more than one adviser. The little lines, which are referred to as crow’s feet, are shorthand for the multiple lines between Department and Adviser in Figure 5-19. Relationships like this one are called 1:N, or one-to-many relationships, because one department can have many advisers, but an adviser has at most one department.Now examine the line between Adviser and Student. Notice the short lines that appear at each end of the line. These lines are the crow’s feet, and this notation signifies that an adviser can be related to many students and that a student can be related to many advisers, which is the situation in Figure 5-19. Relationships like this one are called N:M, or many-to-many relationships, because one adviser can have many students and one student can have many advisers.Students sometimes find the notation N:M confusing. Interpret the N and M to mean that a variable number, greater than one, is allowed on each side of the relationship. Such a relationship is not written N:N because that notation would imply that there are the same number of entities on each side of the relationship, which is not necessarily true. N:M means that more than one entity is allowed on each side of the relationship and that the number of entities on each side can be different.Figure 5-21 shows the same entities with different assumptions. Here, advisers may advise in more than one department, but a student may have only one adviser, representing a policy that students may not have multiple majors.Which, if either, of these versions is correct? Only the users know. These alternatives illustrate the kinds of questions you will need to answer when a database designer asks you to check a data model for correctness.Figures 5-20 and 5-21 are typical examples of an entity-relationship diagram. Unfortunately, there are several different styles of entity-relationship diagrams. This one is called, not surprisingly, a crow’s-foot diagram version. You may learn other versions if you take a database management class.Figure 5-21 Sample Relationships Version 2Figure 5-22 Sample Relationships Showing Both Maximum and Minimum Cardinalities The crow’s-foot notation shows the maximum number of entities that can be involved in a relationship. Accordingly, they are called the relationship’s maximum cardinality. Common examples of maximum cardinality are 1:N, N:M, and 1:1 (not shown).Another important question is, “What is the minimum number of entities required in the relationship?” Must an adviser have a student to advise, and must a student have an adviser? Constraints on minimum requirements are called minimum cardinalities.Figure 5-22 presents a third version of this E-R diagram that shows both maximum and minimum cardinalities. The vertical bar on a line means that at least one entity of that type is required. The small oval means that the entity is optional; the relationship need not have an entity of that type.Thus, in Figure 5-22 a department is not required to have a relationship to any adviser, but an adviser is required to belong to a department. Similarly, an adviser is not required to have a relationship to a student, but a student is required to have a relationship to an adviser. Note, also, that the maximum cardinalities in Figure 5-22 have been changed so that both are 1:N.Is the model in Figure 5-22 a good one? It depends on the policy of the university. Again, only the users know for sure.Q6 How Is a Data Model Transformed into a Database Design? Database design is the process of converting a data model into tables, relationships, and data constraints. The database design team transforms entities into tables and expresses relationships by defining foreign keys. Database design is a complicated subject; as with data modeling, it occupies weeks in a database management class. In this section, however, we will introduce two important database design concepts: normalization and the representation of two kinds of relationships. The first concept is a foundation of database design, and the second will help you understand important design considerations.Normalization Normalization is the process of converting a poorly structured table into two or more well-structured tables. A table is such a simple construct that you may wonder how one could possibly be poorly structured. In truth, there are many ways that tables can be malformed—so many, in fact, that researchers have published hundreds of papers on this topic alone.Consider the Employee table in Figure 5-23a. It lists employee names, hire dates, email addresses, and the name and number of the department in which the employee works. This table seems innocent enough. But consider what happens when the Accounting department changes its name to Accounting and Finance. Because department names are duplicated in this table, every row that has a value of “Accounting” must be changed to “Accounting and Finance.”Figure 5-23 A Poorly Designed Employee Table Data Integrity Problems Suppose the Accounting name change is correctly made in two rows, but not in the third. The result is shown in Figure 5-23b. This table has what is called a data integrity problem: Some rows indicate that the name of Department 100 is “Accounting and Finance,” and another row indicates that the name of Department 100 is “Accounting.”This problem is easy to spot in this small table. But consider a table like the Customer table in the Amazon.com database or the eBay database. Those databases have millions of rows. Once a table that large develops serious data integrity problems, months of labor will be required to remove them.Data integrity problems are serious. A table that has data integrity problems will produce incorrect and inconsistent results. Users will lose confidence in the data, and the system will develop a poor reputation. Information systems with poor reputations become serious burdens to the organizations that use them.Normalizing for Data Integrity The data integrity problem can occur only if data are duplicated. Because of this, one easy way to eliminate the problem is to eliminate the duplicated data. We can do this by transforming the table design in Figure 5-23a into two tables, as shown in Figure 5-24. Here the name of the department is stored just once; therefore, no data inconsistencies can occur.Of course, to produce an employee report that includes the department name, the two tables in Figure 5-24 will need to be joined back together. Because such joining of tables is common, DBMS products have been programmed to perform it efficiently, but it still requires work. From this example, you can see a trade-off in database design: Normalized tables eliminate data duplication, but they can be slower to process. Dealing with such trade-offs is an important consideration in database design.The general goal of normalization is to construct tables such that every table has a single topic or theme. In good writing, every paragraph should have a single theme. This is true of databases as well; every table should have a single theme. The problem with the table design in Figure 5-23 is that it has two independent themes: employees and departments. The way to correct the problem is to split the table into two tables, each with its own theme. In this case, we create an Employee table and a Department table, as shown in Figure 5-24.Figure 5-24 Two Normalized Tables As mentioned, there are dozens of ways that tables can be poorly formed. Database practitioners classify tables into various normal forms according to the kinds of problems they have. Transforming a table into a normal form to remove duplicated data and other problems is called normalizing the table.4 Thus, when you hear a database designer say, “Those tables are not normalized,” she does not mean that the tables have irregular, not-normal data. Instead, she means that the tables have a format that could cause data integrity problems.Summary of Normalization As a future user of databases, you do not need to know the details of normalization. Instead, understand the general principle that every normalized (well-formed) table has one and only one theme. Further, tables that are not normalized are subject to data integrity problems.Be aware, too, that normalization is just one criterion for evaluating database designs. Because normalized designs can be slower to process, database designers sometimes choose to accept non-normalized tables. The best design depends on the users’ processing requirements.Representing Relationships Figure 5-25 shows the steps involved in transforming a data model into a relational database design. First, the database designer creates a table for each entity. The identifier of the entity becomes the key of the table. Each attribute of the entity becomes a column of the table. Next, the resulting tables are normalized so that each table has a single theme. Once that has been done, the next step is to represent relationship among those tables.Figure 5-25 Transforming a Data Model into a Database Design For example, consider the E-R diagram in Figure 5-26a. The Adviser entity has a 1:N relationship to the Student entity. To create the database design, we construct a table for Adviser and a second table for Student, as shown in Figure 5-26b. The key of the Adviser table is AdviserName, and the key of the Student table is Student Number.Figure 5-26 Representing a 1: N Relationship Further, the EmailAddress attribute of the Adviser entity becomes the EmailAddress column of the Adviser table, and the Student Name and MidTerm attributes of the Student entity become the Student Name and MidTerm columns of the Student table.The next task is to represent the relationship. Because we are using the relational model, we know that we must add a foreign key to one of the two tables. The possibilities are: (1) place the foreign key Student Number in the Adviser table or (2) place the foreign key AdviserName in the Student table.The correct choice is to place AdviserName in the Student table, as shown in Figure 5-26 c. To determine a student’s adviser, we just look into the AdviserName column of that student’s row. To determine the adviser’s students, we search the AdviserName column in the Student table to determine which rows have that adviser’s name. If a student changes advisers, we simply change the value in the AdviserName column. Changing Jackson to Jones in the first row, for example, will assign student 100 to Professor Jones.Figure 5-27 Representing an N:M Relationship For this data model, placing Student Number in Adviser would be incorrect. If we were to do that, we could assign only one student to an adviser. There is no place to assign a second adviser.This strategy for placing foreign keys will not work for N:M relationships, however. Consider the data model in Figure 5-27a; here advisers and students have a many-to-many relationship. An adviser may have many students, and a student may have multiple advisers (for multiple majors).To see why the foreign key strategy we used for 1:N relationships will not work for N:M relationships, examine Figure 5-27b. If student 100 has more than one adviser, there is no place to record second or subsequent advisers.To represent an N:M relationship, we need to create a third table, as shown in Figure 5-27c. The third table has two columns, AdviserName and Student Number. Each row of the table means that the given adviser advises the student with the given number.As you can imagine, there is a great deal more to database design than we have presented here. Still, this section should give you an idea of the tasks that need to be accomplished to create a database. You should also realize that the database design is a direct consequence of decisions made in the data model. If the data model is wrong, the database design will be wrong as well.Q7 What Is the Users’ Role in the Development of Databases? As stated, a database is a model of how the users view their business world. This means that the users are the final judges as to what data the database should contain and how the records in that database should be related to one another.The easiest time to change the database structure is during the data modeling stage. Changing a relationship from one-to-many to many-to-many in a data model is simply a matter of changing the 1:N notation to N:M. However, once the database has been constructed and loaded with data and forms, reports, queries, and application programs have been created, changing a one-to-many relationship to many-to-many means weeks of work.You can glean some idea of why this might be true by contrasting Figure 5-26c with Figure 5-27c. Suppose that instead of having just a few rows, each table has thousands of rows; in that case, transforming the database from one format to the other involves considerable work. Even worse, however, is that someone must change application components as well. For example, if students have at most one adviser, then a single text box can be used to enter AdviserName. If students can have multiple advisers, then a multiple-row table will need to be used to enter AdviserName and a program will need to be written to store the values of AdviserName into the Adviser Student Intersection table. There are dozens of other consequences, consequences that will translate into wasted labor and wasted expense.Thus, user review of the data model is crucial. When a database is developed for your use, you must carefully review the data model. If you do not understand any aspect of it, you should ask for clarification until you do. Entities must contain all of the data you and your employees need to do your jobs, and relationships must accurately reflect your view of the business. If the data model is wrong, the database will be designed incorrectly, and the applications will be difficult to use, if not worthless. Do not proceed unless the data model is accurate.As a corollary, when asked to review a data model, take that review seriously. Devote the time necessary to perform a thorough review. Any mistakes you miss will come back to haunt you, and by then the cost of correction may be very high with regard to both time and expense. This brief introduction to data modeling shows why databases can be more difficult to develop than spreadsheets.Q8 2024? With ever cheaper data storage and data communications, we can be sure that the volume of database data will continue to grow, probably exponentially, through 2024. All that data contains patterns that can be used to conceive information to help businesses and organizations achieve their strategies, as you’ll learn when you study business intelligence in Chapter 9. Furthermore, as databases become bigger and bigger, they’re more attractive as targets for theft or mischief, a subject you’ll consider in Chapter 12.Setting these ideas aside, what else can we imagine for database technology by 2024? We can get a glimpse into that future by recognizing that the major principles of the relational model—the fixed-sized tables, the relationships among tables via foreign keys, and the theory of normalization—all came about because of limited storage space and limited processing speeds back in the 1960s and early 1970s.5 At some point, maybe the mid-1990s, these limitations were removed by improved storage and processing technology, and today they do not exist. Today the relational model is not needed.Furthermore, the relational model was never a natural fit with business documents. For example, users want to store sales orders; they do not want to break up sales orders via normalization and store the data in separate tables. It’s like taking your car into a parking garage and having the attendant break it up into pieces, store the pieces in separate piles, and then reassemble it from the pieces when you come back to get it. And why? For the efficiency and convenience of the management of the parking garage.This is not to say that relational databases will be replaced anytime soon. Organizations have created thousands of relational databases with millions of lines of application code that process SQL statements against relational data structures. There is also a strong social trend among older technologists to hang onto the relational model. But the primary reason for the relational model’s existence is gone, and document piece-making via normalization is no longer necessary.Also, organizations today want to store new types of data such as images, audios, and videos. Those files are large collections of bits, and they don’t fit into relational structures. Collections of such files still need metadata; we need such data to record when, where, how, and for what purpose the files exist, but we don’t need to put it into relational databases just to obtain metadata. All Road Parts’ desire to store images for customers’ image query provides an excellent example.MongoDB is an open source document-oriented DBMS that All Road Parts could use to store its nonstructured data. MongoDB does not require normalized data; instead, it manages collections of documents where those documents can have a variety of structures, including large bit files for image, audio, and video data. MongoDB can also store documents like sales orders without requiring that they be normalized. It is used by companies like Craigslist and foursquare; the name MongoDB is a play on the adjective humongous.But MongoDB is not alone. A few years ago, Amazon.com determined that relational database technology wouldn’t meet its needs, and it developed a nonrelational data store called Dynamo.6 Meanwhile, for many of the same reasons, Google developed a nonrelational data store called Bigtable.7 Facebook took concepts from both of these systems and developed a third nonrelational data store called Cassandra.8 In 2008, Facebook turned Cassandra over to the open source community, and now Apache has dubbed it a Top Level Project (TLP), which is the height of respectability among open source projects.Such nonrelational databases have come to be called NoSQL databases, where NoSQL means nonrelational databases that support very high transaction rates processing relatively simple data structures, replicated on many servers in the cloud. NoSQL is not the best term; Not Relational Databases would have been better, but the die has been cast. You can learn more about the rationale for NoSQL products and some of their most intriguing features in Case Study 5, page 194.Use of these NoSQL products has led to the definition of a new type of data store. Big Data (also spelled Big Data) is used to describe data collections that are characterized by huge volume, rapid velocity, and great variety. Considering volume, Big Data refers to data sets that are at least a petabyte in size, and usually larger. A data set containing all Google searches in the United States on a given day is Big Data in size. Additionally, Big Data has high velocity, meaning that it is generated rapidly. (If you know physics, you know that speed would be a more accurate term, but speed doesn’t start with a v, and the vvv description has become a common way to describe Big Data.)