
Computer architecture a quantitative approach appendix solutions

19/11/2021 Client: muhammad11 Deadline: 2 Day

Computer Architecture Formulas

1. CPU time = Instruction count × Clock cycles per instruction × Clock cycle time

2. X is n times faster than Y: $n = \dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = \dfrac{\text{Performance}_X}{\text{Performance}_Y}$

3. Amdahl’s Law: $\text{Speedup}_{\text{overall}} = \dfrac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \dfrac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$

4. $\text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$

5. $\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$

6. $\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}$

7. Availability = Mean time to fail / (Mean time to fail + Mean time to repair)

8. $\text{Die yield} = \text{Wafer yield} \times \dfrac{1}{(1 + \text{Defects per unit area} \times \text{Die area})^{N}}$

where Wafer yield accounts for wafers that are so bad they need not be tested and N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. N ranges from 11.5 to 15.5 in 2011.

9. Arithmetic (AM), weighted arithmetic (WAM), and geometric (GM) means:

$\text{AM} = \dfrac{1}{n}\sum_{i=1}^{n} \text{Time}_i \qquad \text{WAM} = \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i \qquad \text{GM} = \sqrt[n]{\prod_{i=1}^{n} \text{Time}_i}$

where $\text{Time}_i$ is the execution time for the ith program of a total of n in the workload, and $\text{Weight}_i$ is the weighting of the ith program in the workload.

10. Average memory-access time = Hit time + Miss rate × Miss penalty

11. Misses per instruction = Miss rate × Memory accesses per instruction

12. Cache index size: $2^{\text{index}}$ = Cache size / (Block size × Set associativity)

13. Power Utilization Effectiveness (PUE) of a Warehouse Scale Computer = $\dfrac{\text{Total Facility Power}}{\text{IT Equipment Power}}$

Rules of Thumb

1. Amdahl/Case Rule: A balanced computer system needs about 1 MB of main memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU performance.

2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its code.

3. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency.

4. 2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as a two-way set-associative cache of size N/2.

5. Dependability Rule: Design with no single point of failure.

6. Watt-Year Rule: The fully burdened cost of a Watt per year in a Warehouse Scale Computer in North America in 2011, including the cost of amortizing the power and cooling infrastructure, is about $2.
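As a rough illustration of how several of the formulas above are applied, here is a minimal Python sketch covering Amdahl's Law (formula 3), die yield (formula 8), the geometric mean (formula 9), average memory-access time (formula 10), and the cache index size (formula 12). The function names and all sample numbers are hypothetical, chosen only for illustration; they do not come from the book.

```python
import math


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Formula 3: overall speedup when only a fraction of execution is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def die_yield(wafer_yield, defects_per_unit_area, die_area, n):
    """Formula 8: n is the process-complexity factor N."""
    return wafer_yield / (1.0 + defects_per_unit_area * die_area) ** n


def geometric_mean(times):
    """Formula 9 (GM): nth root of the product, computed via logarithms."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Formula 10: all times must use the same unit (for example, clock cycles)."""
    return hit_time + miss_rate * miss_penalty


def cache_index_bits(cache_size, block_size, associativity):
    """Formula 12: index bits, assuming the number of sets is a power of two."""
    sets = cache_size // (block_size * associativity)
    return sets.bit_length() - 1


if __name__ == "__main__":
    # Hypothetical inputs: 40% of the run time is sped up by 10x.
    print(amdahl_speedup(0.4, 10))
    # Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
    print(average_memory_access_time(1, 0.05, 100))
    # Hypothetical 32 KiB, 4-way set-associative cache with 64-byte blocks.
    print(cache_index_bits(32 * 1024, 64, 4))
```

The Amdahl example shows why the unenhanced fraction dominates: even with a 10x improvement on 40% of the execution time, the overall speedup stays well below 2.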

In Praise of Computer Architecture: A Quantitative Approach Sixth Edition

“Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA.”

—from the foreword by Norman P. Jouppi, Google

“Computer Architecture: A Quantitative Approach is a classic that, like fine wine, just keeps getting better. I bought my first copy as I finished up my undergraduate degree and it remains one of my most frequently referenced texts today.”

—James Hamilton, Amazon Web Services

“Hennessy and Patterson wrote the first edition of this book when graduate students built computers with 50,000 transistors. Today, warehouse-size computers contain that many servers, each consisting of dozens of independent processors and billions of transistors. The evolution of computer architecture has been rapid and relentless, but Computer Architecture: A Quantitative Approach has kept pace, with each edition accurately explaining and analyzing the important emerging ideas that make this field so exciting.”

—James Larus, Microsoft Research

“Another timely and relevant update to a classic, once again also serving as a window into the relentless and exciting evolution of computer architecture! The new discussions in this edition on the slowing of Moore's law and implications for future systems are must-reads for both computer architects and practitioners working on broader systems.”

—Parthasarathy (Partha) Ranganathan, Google

“I love the ‘Quantitative Approach’ books because they are written by engineers, for engineers. John Hennessy and Dave Patterson show the limits imposed by mathematics and the possibilities enabled by materials science. Then they teach through real-world examples how architects analyze, measure, and compromise to build working systems. This sixth edition comes at a critical time: Moore’s Law is fading just as deep learning demands unprecedented compute cycles. The new chapter on domain-specific architectures documents a number of promising approaches and prophesies a rebirth in computer architecture. Like the scholars of the European Renaissance, computer architects must understand our own history, and then combine the lessons of that history with new techniques to remake the world.”

—Cliff Young, Google


Computer Architecture A Quantitative Approach

Sixth Edition

John L. Hennessy is a Professor of Electrical Engineering and Computer Science at Stanford University, where he has been a member of the faculty since 1977 and was, from 2000 to 2016, its 10th President. He currently serves as the Director of the Knight-Hennessy Fellowship, which provides graduate fellowships to potential future leaders. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering, the National Academy of Sciences, and the American Philosophical Society, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received 10 honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. As of 2017, over 5 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups, both as an early-stage advisor and an investor.

David A. Patterson became a Distinguished Engineer at Google in 2016 after 40 years as a UC Berkeley professor. He joined UC Berkeley immediately after graduating from UCLA. He still spends a day a week in Berkeley as an Emeritus Professor of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the President of the United States, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM, CRA, and SIGARCH. He is currently Vice-Chair of the Board of Directors of the RISC-V Foundation.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. His current interests are in designing domain-specific architectures for machine learning, spreading the word on the open RISC-V instruction set architecture, and in helping the UC Berkeley RISELab (Real-time Intelligent Secure Execution).

Computer Architecture A Quantitative Approach

Sixth Edition

John L. Hennessy Stanford University

David A. Patterson University of California, Berkeley

With Contributions by

Krste Asanović, University of California, Berkeley
Jason D. Bakos, University of South Carolina
Robert P. Colwell, R&E Colwell & Assoc. Inc.
Abhishek Bhattacharjee, Rutgers University
Thomas M. Conte, Georgia Tech
José Duato, Proemisa
Diana Franklin, University of Chicago
David Goldberg, eBay
Norman P. Jouppi, Google
Sheng Li, Intel Labs
Naveen Muralimanohar, HP Labs
Gregory D. Peterson, University of Tennessee
Timothy M. Pinkston, University of Southern California
Parthasarathy Ranganathan, Google
David A. Wood, University of Wisconsin–Madison
Cliff Young, Google
Amr Zaky, University of Santa Clara

Morgan Kaufmann is an imprint of Elsevier 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library

ISBN: 978-0-12-811905-1

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher Acquisition Editor: Stephen Merken Developmental Editor: Nate McFadden Production Project Manager: Stalin Viswanathan Cover Designer: Christian J. Bilbow

Typeset by SPi Global, India

To Andrea, Linda, and our four sons


Foreword

by Norman P. Jouppi, Google

Much of the improvement in computer performance over the last 40 years has been provided by computer architecture advancements that have leveraged Moore’s Law and Dennard scaling to build larger and more parallel systems. Moore’s Law is the observation that the maximum number of transistors in an integrated circuit doubles approximately every two years. Dennard scaling refers to the reduction of MOS supply voltage in concert with the scaling of feature sizes, so that as transistors get smaller, their power density stays roughly constant. With the end of Dennard scaling a decade ago, and the recent slowdown of Moore’s Law due to a combination of physical limitations and economic factors, the sixth edition of the preeminent textbook for our field couldn’t be more timely. Here are some reasons.

First, because domain-specific architectures can provide equivalent performance and power benefits of three or more historical generations of Moore’s Law and Dennard scaling, they now can provide better implementations than may ever be possible with future scaling of general-purpose architectures. And with the diverse application space of computers today, there are many potential areas for architectural innovation with domain-specific architectures. Second, high-quality implementations of open-source architectures now have a much longer lifetime due to the slowdown in Moore’s Law. This gives them more opportunities for continued optimization and refinement, and hence makes them more attractive. Third, with the slowing of Moore’s Law, different technology components have been scaling heterogeneously. Furthermore, new technologies such as 2.5D stacking, new nonvolatile memories, and optical interconnects have been developed to provide more than Moore’s Law can supply alone. To use these new technologies and nonhomogeneous scaling effectively, fundamental design decisions need to be reexamined from first principles. Hence it is important for students, professors, and practitioners in the industry to be skilled in a wide range of both old and new architectural techniques. All told, I believe this is the most exciting time in computer architecture since the industrial exploitation of instruction-level parallelism in microprocessors 25 years ago.
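The two scaling laws the foreword relies on can be made concrete with a small numerical sketch. The Python below assumes idealized behavior: transistor counts double exactly every two years, and under Dennard scaling a shrink by a factor k reduces capacitance and voltage by 1/k, raises frequency by k, and reduces area by 1/k², with leakage ignored. It uses the dynamic power relation Power ∝ 1/2 × Capacitive load × Voltage² × Frequency switched; the function names and input values are hypothetical.

```python
def transistors_after(years, initial_count, doubling_period_years=2.0):
    """Idealized Moore's Law: the count doubles every doubling_period_years."""
    return initial_count * 2 ** (years / doubling_period_years)


def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power, taking the proportionality constant as 1."""
    return 0.5 * capacitance * voltage ** 2 * frequency


def dennard_scaled_power_density(capacitance, voltage, frequency, area, k):
    """Power density after an ideal Dennard shrink by a factor k (assumption)."""
    scaled_power = dynamic_power(capacitance / k, voltage / k, frequency * k)
    scaled_area = area / k ** 2
    return scaled_power / scaled_area


if __name__ == "__main__":
    # Hypothetical starting point: 1,000 transistors, 10 years of doubling.
    print(transistors_after(10, 1000))
    # Normalized device (all parameters 1.0) before and after a shrink with k = 2.
    before = dynamic_power(1.0, 1.0, 1.0) / 1.0
    after = dennard_scaled_power_density(1.0, 1.0, 1.0, 1.0, 2.0)
    print(before, after)
```

Under these assumptions power per transistor falls by 1/k² while area also falls by 1/k², so power density is unchanged. Once supply voltage can no longer be reduced, that cancellation fails, which is the "end of Dennard scaling" the foreword refers to.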

The largest change in this edition is the addition of a new chapter on domain-specific architectures. It’s long been known that customized domain-specific architectures can have higher performance, lower power, and require less silicon area than general-purpose processor implementations. However, when general-purpose processors were increasing in single-threaded performance by 40% per year (see Fig. 1.11), the extra time to market required to develop a custom architecture vs. using a leading-edge standard microprocessor could cause the custom architecture to lose much of its advantage. In contrast, today single-core performance is improving very slowly, meaning that the benefits of custom architectures will not be made obsolete by general-purpose processors for a very long time, if ever. Chapter 7 covers several domain-specific architectures. Deep neural networks have very high computation requirements but lower data precision requirements; this combination can benefit significantly from custom architectures. Two example architectures and implementations for deep neural networks are presented: one optimized for inference and a second optimized for training. Image processing is another example domain; it also has high computation demands and benefits from lower-precision data types. Furthermore, since it is often found in mobile devices, the power savings from custom architectures are also very valuable. Finally, by nature of their reprogrammability, FPGA-based accelerators can be used to implement a variety of different domain-specific architectures on a single device. They also can benefit more irregular applications that are frequently updated, like accelerating internet search.

Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA.

On a personal note, after enjoying the privilege of working with John as a graduate student, I am now enjoying the privilege of working with Dave at Google. What an amazing duo!

Contents

Foreword

Preface

Acknowledgments

Chapter 1 Fundamentals of Quantitative Design and Analysis

1.1 Introduction
1.2 Classes of Computers
1.3 Defining Computer Architecture
1.4 Trends in Technology
1.5 Trends in Power and Energy in Integrated Circuits
1.6 Trends in Cost
1.7 Dependability
1.8 Measuring, Reporting, and Summarizing Performance
1.9 Quantitative Principles of Computer Design
1.10 Putting It All Together: Performance, Price, and Power
1.11 Fallacies and Pitfalls
1.12 Concluding Remarks
1.13 Historical Perspectives and References
Case Studies and Exercises by Diana Franklin

Chapter 2 Memory Hierarchy Design

2.1 Introduction
2.2 Memory Technology and Optimizations
2.3 Ten Advanced Optimizations of Cache Performance
2.4 Virtual Memory and Virtual Machines
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700
2.7 Fallacies and Pitfalls
2.8 Concluding Remarks: Looking Ahead
2.9 Historical Perspectives and References
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li

Chapter 3 Instruction-Level Parallelism and Its Exploitation

3.1 Instruction-Level Parallelism: Concepts and Challenges
3.2 Basic Compiler Techniques for Exposing ILP
3.3 Reducing Branch Costs With Advanced Branch Prediction
3.4 Overcoming Data Hazards With Dynamic Scheduling
3.5 Dynamic Scheduling: Examples and the Algorithm
3.6 Hardware-Based Speculation
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
3.9 Advanced Techniques for Instruction Delivery and Speculation
3.10 Cross-Cutting Issues
3.11 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
3.12 Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53
3.13 Fallacies and Pitfalls
3.14 Concluding Remarks: What’s Ahead?
3.15 Historical Perspective and References
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

4.1 Introduction
4.2 Vector Architecture
4.3 SIMD Instruction Set Extensions for Multimedia
4.4 Graphics Processing Units
4.5 Detecting and Enhancing Loop-Level Parallelism
4.6 Cross-Cutting Issues
4.7 Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
4.8 Fallacies and Pitfalls
4.9 Concluding Remarks
4.10 Historical Perspective and References
Case Study and Exercises by Jason D. Bakos

Chapter 5 Thread-Level Parallelism

5.1 Introduction
5.2 Centralized Shared-Memory Architectures
5.3 Performance of Symmetric Shared-Memory Multiprocessors
5.4 Distributed Shared-Memory and Directory-Based Coherence
5.5 Synchronization: The Basics
5.6 Models of Memory Consistency: An Introduction
5.7 Cross-Cutting Issues
5.8 Putting It All Together: Multicore Processors and Their Performance
5.9 Fallacies and Pitfalls
5.10 The Future of Multicore Scaling
5.11 Concluding Remarks
5.12 Historical Perspectives and References
Case Studies and Exercises by Amr Zaky and David A. Wood

Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism

6.1 Introduction
6.2 Programming Models and Workloads for Warehouse-Scale Computers
6.3 Computer Architecture of Warehouse-Scale Computers
6.4 The Efficiency and Cost of Warehouse-Scale Computers
6.5 Cloud Computing: The Return of Utility Computing
6.6 Cross-Cutting Issues
6.7 Putting It All Together: A Google Warehouse-Scale Computer
6.8 Fallacies and Pitfalls
6.9 Concluding Remarks
6.10 Historical Perspectives and References
Case Studies and Exercises by Parthasarathy Ranganathan

Chapter 7 Domain-Specific Architectures

7.1 Introduction
7.2 Guidelines for DSAs
7.3 Example Domain: Deep Neural Networks
7.4 Google’s Tensor Processing Unit, an Inference Data Center Accelerator
7.5 Microsoft Catapult, a Flexible Data Center Accelerator
7.6 Intel Crest, a Data Center Accelerator for Training
7.7 Pixel Visual Core, a Personal Mobile Device Image Processing Unit
7.8 Cross-Cutting Issues
7.9 Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators
7.10 Fallacies and Pitfalls
7.11 Concluding Remarks
7.12 Historical Perspectives and References
Case Studies and Exercises by Cliff Young

Appendix A Instruction Set Principles

A.1 Introduction
A.2 Classifying Instruction Set Architectures
A.3 Memory Addressing
A.4 Type and Size of Operands
A.5 Operations in the Instruction Set
A.6 Instructions for Control Flow
A.7 Encoding an Instruction Set
A.8 Cross-Cutting Issues: The Role of Compilers
A.9 Putting It All Together: The RISC-V Architecture
A.10 Fallacies and Pitfalls
A.11 Concluding Remarks
A.12 Historical Perspective and References
Exercises by Gregory D. Peterson

Appendix B Review of Memory Hierarchy

B.1 Introduction
B.2 Cache Performance
B.3 Six Basic Cache Optimizations
B.4 Virtual Memory
B.5 Protection and Examples of Virtual Memory
B.6 Fallacies and Pitfalls
B.7 Concluding Remarks
B.8 Historical Perspective and References
Exercises by Amr Zaky

Appendix C Pipelining: Basic and Intermediate Concepts

C.1 Introduction
C.2 The Major Hurdle of Pipelining: Pipeline Hazards
C.3 How Is Pipelining Implemented?
C.4 What Makes Pipelining Hard to Implement?
C.5 Extending the RISC-V Integer Pipeline to Handle Multicycle Operations
C.6 Putting It All Together: The MIPS R4000 Pipeline
C.7 Cross-Cutting Issues
C.8 Fallacies and Pitfalls
C.9 Concluding Remarks
C.10 Historical Perspective and References
Updated Exercises by Diana Franklin

Online Appendices

Appendix D Storage Systems

Appendix E Embedded Systems by Thomas M. Conte

Appendix F Interconnection Networks by Timothy M. Pinkston and José Duato

Appendix G Vector Processors in More Depth by Krste Asanović

Appendix H Hardware and Software for VLIW and EPIC

Appendix I Large-Scale Multiprocessors and Scientific Applications

Appendix J Computer Arithmetic by David Goldberg

Appendix K Survey of Instruction Set Architectures

Appendix L Advanced Concepts on Address Translation by Abhishek Bhattacharjee

Appendix M Historical Perspectives and References

References

Index

Preface

Why We Wrote This Book

Through six editions of this book, our goal has been to describe the basic principles underlying what will be tomorrow’s technological developments. Our excitement about the opportunities in computer architecture has not abated, and we echo what we said about the field in the first edition: “It is not a dreary science of paper machines that will never work. No! It’s a discipline of keen intellectual interest, requiring the balance of marketplace forces to cost-performance-power, leading to glorious failures and some notable successes.”

Our primary objective in writing our first book was to change the way people learn and think about computer architecture. We feel this goal is still valid and important. The field is changing daily and must be studied with real examples and measurements on real computers, rather than simply as a collection of definitions and designs that will never need to be realized. We offer an enthusiastic welcome to anyone who came along with us in the past, as well as to those who are joining us now. Either way, we can promise the same quantitative approach to, and analysis of, real systems.

As with earlier versions, we have strived to produce a new edition that will continue to be as relevant for professional engineers and architects as it is for those involved in advanced computer architecture and design courses. Like the first edition, this edition has a sharp focus on new platforms (personal mobile devices and warehouse-scale computers) and new architectures (specifically, domain-specific architectures). As much as its predecessors, this edition aims to demystify computer architecture through an emphasis on cost-performance-energy trade-offs and good engineering design. We believe that the field has continued to mature and move toward the rigorous quantitative foundation of long-established scientific and engineering disciplines.

This Edition

The ending of Moore’s Law and Dennard scaling is having as profound an effect on computer architecture as did the switch to multicore. We retain the focus on the extremes in size of computing, with personal mobile devices (PMDs) such as cell phones and tablets as the clients and warehouse-scale computers offering cloud computing as the server. We also maintain the other theme of parallelism in all its forms: data-level parallelism (DLP) in Chapters 1 and 4, instruction-level parallelism (ILP) in Chapter 3, thread-level parallelism in Chapter 5, and request-level parallelism (RLP) in Chapter 6.

The most pervasive change in this edition is switching from MIPS to the RISC-V instruction set. We suspect this modern, modular, open instruction set may become a significant force in the information technology industry. It may become as important in computer architecture as Linux is for operating systems.

The newcomer in this edition is Chapter 7, which introduces domain-specific architectures with several concrete examples from industry.

As before, the first three appendices in the book give basics on the RISC-V instruction set, memory hierarchy, and pipelining for readers who have not read a book like Computer Organization and Design. To keep costs down but still supply supplemental material that is of interest to some readers, available online at https://www.elsevier.com/books-and-journals/book-companion/9780128119051 are nine more appendices. There are more pages in these appendices than there are in this book!

This edition continues the tradition of using real-world examples to demonstrate the ideas, and the “Putting It All Together” sections are brand new. The “Putting It All Together” sections of this edition include the pipeline organizations and memory hierarchies of the ARM Cortex A8 processor, the Intel Core i7 processor, the NVIDIA GTX-280 and GTX-480 GPUs, and one of the Google warehouse-scale computers.

Topic Selection and Organization

As before, we have taken a conservative approach to topic selection, for there are many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles. We have steered away from a comprehensive survey of every architecture a reader might encounter. Instead, our presentation focuses on core concepts likely to be found in any new machine. The key criterion remains that of selecting ideas that have been examined and utilized successfully enough to permit their discussion in quantitative terms.

Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content wherever possible. Indeed, there are several systems here whose descriptions cannot be found in the literature. (Readers interested strictly in a more basic introduction to computer architecture should read Computer Organization and Design: The Hardware/Software Interface.)
An Overview of the Content

Chapter 1 includes formulas for energy, static power, dynamic power, integrated circuit costs, reliability, and availability. (These formulas are also found on the front inside cover.) Our hope is that these topics can be used through the rest of the book. In addition to the classic quantitative principles of computer design and performance measurement, it shows the slowing of performance improvement of general-purpose microprocessors, which is one inspiration for domain-specific architectures.

Our view is that the instruction set architecture is playing less of a role today than in 1990, so we moved this material to Appendix A. It now uses the RISC-V architecture. (For quick review, a summary of the RISC-V ISA can be found on the back inside cover.) For fans of ISAs, Appendix K was revised for this edition and covers 8 RISC architectures (5 for desktop and server use and 3 for embedded use), the 80x86, the DEC VAX, and the IBM 360/370.

We then move on to memory hierarchy in Chapter 2, since it is easy to apply the cost-performance-energy principles to this material, and memory is a critical resource for the rest of the chapters. As in the past edition, Appendix B contains an introductory review of cache principles, which is available in case you need it. Chapter 2 discusses 10 advanced optimizations of caches. The chapter includes virtual machines, which offer advantages in protection, software management, and hardware management, and play an important role in cloud computing. In addition to covering SRAM and DRAM technologies, the chapter includes new material both on Flash memory and on the use of stacked die packaging for extending the memory hierarchy. The “Putting It All Together” (PIAT) examples are the ARM Cortex A8, which is used in PMDs, and the Intel Core i7, which is used in servers.

Chapter 3 covers the exploitation of instruction-level parallelism in high-performance processors, including superscalar execution, branch prediction (including the new tagged hybrid predictors), speculation, dynamic scheduling, and simultaneous multithreading. As mentioned earlier, Appendix C is a review of pipelining in case you need it. Chapter 3 also surveys the limits of ILP. Like Chapter 2, the PIAT examples are again the ARM Cortex A8 and the Intel Core i7. While the third edition contained a great deal on Itanium and VLIW, this material is now in Appendix H, indicating our view that this architecture did not live up to the earlier claims.

The increasing importance of multimedia applications such as games and video processing has also increased the importance of architectures that can exploit data level parallelism. In particular, there is a rising interest in computing using graph- ical processing units (GPUs), yet few architects understand howGPUs really work. We decided to write a new chapter in large part to unveil this new style of computer architecture. Chapter 4 starts with an introduction to vector architectures, which acts as a foundation on which to build explanations of multimedia SIMD instruc- tion set extensions and GPUs. (Appendix G goes into even more depth on vector architectures.) This chapter introduces the Roofline performance model and then uses it to compare the Intel Core i7 and the NVIDIAGTX 280 andGTX 480GPUs. The chapter also describes the Tegra 2 GPU for PMDs.
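For readers unfamiliar with the Roofline model mentioned above, its central relation is that attainable floating-point throughput is bounded by the lower of the peak compute rate and the product of peak memory bandwidth and arithmetic intensity (floating-point operations per byte moved). The Python sketch below illustrates that relation only; the function names and machine numbers are hypothetical and are not measurements of the Core i7 or of any GPU discussed in the book.

```python
def roofline_gflops(peak_gflops, peak_bandwidth_gb_per_s, arithmetic_intensity):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof.

    arithmetic_intensity is in floating-point operations per byte.
    """
    memory_bound = peak_bandwidth_gb_per_s * arithmetic_intensity
    return min(peak_gflops, memory_bound)


def ridge_point(peak_gflops, peak_bandwidth_gb_per_s):
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / peak_bandwidth_gb_per_s


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak compute, 20 GB/s memory bandwidth.
    for intensity in (0.5, 2.0, 5.0, 10.0):
        print(intensity, roofline_gflops(100.0, 20.0, intensity))
    print(ridge_point(100.0, 20.0))
```

Kernels whose arithmetic intensity lies below the ridge point are limited by memory bandwidth; those above it are limited by peak compute.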

Chapter 5 describes multicore processors. It explores symmetric and distributed-memory architectures, examining both organizational principles and performance. The primary additions to this chapter include more comparison of multicore organizations, including the organization of multicore-multilevel caches, multicore coherence schemes, and on-chip multicore interconnect. Topics in synchronization and memory consistency models are next. The example is the Intel Core i7. Readers interested in more depth on interconnection networks should read Appendix F, and those interested in larger scale multiprocessors and scientific applications should read Appendix I.

Chapter 6 describes warehouse-scale computers (WSCs). It was extensively revised based on help from engineers at Google and Amazon Web Services. This chapter integrates details on design, cost, and performance of WSCs that few architects are aware of. It starts with the popular MapReduce programming model before describing the architecture and physical implementation of WSCs, including cost. The costs allow us to explain the emergence of cloud computing, whereby it can be cheaper to compute using WSCs in the cloud than in your local datacenter. The PIAT example is a description of a Google WSC that includes information published for the first time in this book.

The new Chapter 7 motivates the need for Domain-Specific Architectures (DSAs). It draws guiding principles for DSAs based on the four examples of DSAs. Each DSA corresponds to chips that have been deployed in commercial settings. We also explain why we expect a renaissance in computer architecture via DSAs given that single-thread performance of general-purpose microprocessors has stalled.

This brings us to Appendices A through M. Appendix A covers principles of ISAs, including RISC-V, and Appendix K describes 64-bit versions of RISC-V, ARM, MIPS, Power, and SPARC and their multimedia extensions. It also includes some classic architectures (80x86, VAX, and IBM 360/370) and popular embedded instruction sets (Thumb-2, microMIPS, and RISC-V C). Appendix H is related, in that it covers architectures and compilers for VLIW ISAs.

As mentioned earlier, Appendix B and Appendix C are tutorials on basic caching and pipelining concepts. Readers relatively new to caching should read Appendix B before Chapter 2, and those new to pipelining should read Appendix C before Chapter 3.

Appendix D, “Storage Systems,” has an expanded discussion of reliability and availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely found failure statistics of real systems. It continues to provide an introduction to queuing theory and I/O performance benchmarks. We evaluate the cost, performance, and reliability of a real cluster: the Internet Archive. The “Putting It All Together” example is the NetApp FAS6000 filer.

Appendix E, by Thomas M. Conte, consolidates the embedded material in one place.

Appendix F, on interconnection networks, is revised by Timothy M. Pinkston and José Duato. Appendix G, written originally by Krste Asanović, includes a description of vector processors. We think these two appendices are some of the best material we know of on each topic.

Appendix H describes VLIW and EPIC, the architecture of Itanium. Appendix I describes parallel processing applications and coherence protocols for larger-scale, shared-memory multiprocessing. Appendix J, by David Goldberg, describes computer arithmetic.

Appendix L, by Abhishek Bhattacharjee, is new and discusses advanced techniques for memory management, focusing on support for virtual machines and design of address translation for very large address spaces. With the growth in cloud processors, these architectural enhancements are becoming more important.

Appendix M collects the “Historical Perspective and References” from each chapter into a single appendix. It attempts to give proper credit for the ideas in each chapter and a sense of the history surrounding the inventions. We like to think of this as presenting the human drama of computer design. It also supplies references that the student of architecture may want to pursue. If you have time, we recommend reading some of the classic papers in the field that are mentioned in these sections. It is both enjoyable and educational to hear the ideas directly from the creators. “Historical Perspective” was one of the most popular sections of prior editions.

Navigating the Text

There is no single best order in which to approach these chapters and appendices, except that all readers should start with Chapter 1. If you don’t want to read everything, here are some suggested sequences:

■ Memory Hierarchy: Appendix B, Chapter 2, and Appendices D and M

■ Instruction-Level Parallelism: Appendix C, Chapter 3, and Appendix H

■ Data-Level Parallelism: Chapters 4, 6, and 7, Appendix G

■ Thread-Level Parallelism: Chapter 5, Appendices F and I

■ Request-Level Parallelism: Chapter 6

■ ISA: Appendices A and K

Appendix E can be read at any time, but it might work best if read after the ISA and cache sequences. Appendix J can be read whenever arithmetic moves you. You should read the corresponding portion of Appendix M after you complete each chapter.

Chapter Structure

The material we have selected has been stretched upon a consistent framework that is followed in each chapter. We start by explaining the ideas of a chapter. These ideas are followed by a “Crosscutting Issues” section, a feature that shows how the ideas covered in one chapter interact with those given in other chapters. This is followed by a “Putting It All Together” section that ties these ideas together by showing how they are used in a real machine.

Next in the sequence is “Fallacies and Pitfalls,” which lets readers learn from the mistakes of others. We show examples of common misunderstandings and architectural traps that are difficult to avoid even when you know they are lying in wait for you. The “Fallacies and Pitfalls” section is one of the most popular sections of the book. Each chapter ends with a “Concluding Remarks” section.

Case Studies With Exercises

Each chapter ends with case studies and accompanying exercises. Authored by experts in industry and academia, the case studies explore key chapter concepts and verify understanding through increasingly challenging exercises. Instructors should find the case studies sufficiently detailed and robust to allow them to create their own additional exercises.

Brackets for each exercise indicate the text sections of primary relevance to completing the exercise. We hope this helps readers to avoid exercises for which they haven’t read the corresponding section, in addition to providing the source for review. Exercises are rated, to give the reader a sense of the amount of time required to complete an exercise:

[10] Less than 5 min (to read and understand)

[15] 5–15 min for a full answer

[20] 15–20 min for a full answer

[25] 1 h for a full written answer

[30] Short programming project: less than 1 full day of programming

[40] Significant programming project: 2 weeks of elapsed time

[Discussion] Topic for discussion with others

Solutions to the case studies and exercises are available for instructors who register at textbooks.elsevier.com.

Supplemental Materials

A variety of resources are available online at https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1, including the following:

■ Reference appendices, some guest authored by subject experts, covering a range of advanced topics

■ Historical perspectives material that explores the development of the key ideas presented in each of the chapters in the text

■ Instructor slides in PowerPoint

■ Figures from the book in PDF, EPS, and PPT formats

■ Links to related material on the Web

■ List of errata

New materials and links to other resources available on the Web will be added on a regular basis.

Helping Improve This Book

Finally, it is possible to make money while reading this book. (Talk about cost-performance!) If you read the Acknowledgments that follow, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining resilient bugs, please contact the publisher by electronic mail (ca6bugs@mkp.com).

We welcome general comments to the text and invite you to send them to a separate email address at ca6comments@mkp.com.

Concluding Remarks

Once again, this book is a true co-authorship, with each of us writing half the chapters and an equal share of the appendices. We can’t imagine how long it would have taken without someone else doing half the work, offering inspiration when the task seemed hopeless, providing the key insight to explain a difficult concept, supplying over-the-weekend reviews of chapters, and commiserating when the weight of our other obligations made it hard to pick up the pen.

Thus, once again, we share equally the blame for what you are about to read.

John Hennessy ■ David Patterson

Acknowledgments

Although this is only the sixth edition of this book, we have actually created ten different versions of the text: three versions of the first edition (alpha, beta, and final) and two versions of the second, third, and fourth editions (beta and final). Along the way, we have received help from hundreds of reviewers and users. Each of these people has helped make this book better. Thus, we have chosen to list all of the people who have made contributions to some version of this book.

Contributors to the Sixth Edition

Like prior editions, this is a community effort that involves scores of volunteers. Without their help, this edition would not be nearly as polished.

Reviewers

Jason D. Bakos, University of South Carolina; Rajeev Balasubramonian, University of Utah; Jose Delgado-Frias, Washington State University; Diana Franklin, The University of Chicago; Norman P. Jouppi, Google; Hugh C. Lauer, Worcester Polytechnic Institute; Gregory Peterson, University of Tennessee; Bill Pierce, Hood College; Parthasarathy Ranganathan, Google; William H. Robinson, Vanderbilt University; Pat Stakem, Johns Hopkins University; Cliff Young, Google; Amr Zaky, University of Santa Clara; Gerald Zarnett, Ryerson University; Huiyang Zhou, North Carolina State University.

Members of the University of California-Berkeley Par Lab and RAD Lab who gave frequent reviews of Chapters 1, 4, and 6 and shaped the explanation of GPUs and WSCs: Krste Asanović, Michael Armbrust, Scott Beamer, Sarah Bird, Bryan Catanzaro, Jike Chong, Henry Cook, Derrick Coetzee, Randy Katz, Yunsup Lee, Leo Meyervich, Mark Murphy, Zhangxi Tan, Vasily Volkov, and Andrew Waterman.

Appendices

Krste Asanović, University of California, Berkeley (Appendix G); Abhishek Bhattacharjee, Rutgers University (Appendix L); Thomas M. Conte, North Carolina State University (Appendix E); José Duato, Universitat Politècnica de València and Simula (Appendix F); David Goldberg, Xerox PARC (Appendix J); Timothy M. Pinkston, University of Southern California (Appendix F).

José Flich of the Universidad Politécnica de Valencia provided significant contributions to the updating of Appendix F.

Case Studies With Exercises

Jason D. Bakos, University of South Carolina (Chapters 3 and 4); Rajeev Balasubramonian, University of Utah (Chapter 2); Diana Franklin, The University of Chicago (Chapter 1 and Appendix C); Norman P. Jouppi, Google (Chapter 2); Naveen Muralimanohar, HP Labs (Chapter 2); Gregory Peterson, University of Tennessee (Appendix A); Parthasarathy Ranganathan, Google (Chapter 6); Cliff Young, Google (Chapter 7); Amr Zaky, University of Santa Clara (Chapter 5 and Appendix B).

Jichuan Chang, Junwhan Ahn, Rama Govindaraju, and Milad Hashemi assisted in the development and testing of the case studies and exercises for Chapter 6.

Additional Material

John Nickolls, Steve Keckler, and Michael Toksvig of NVIDIA (Chapter 4 NVIDIA GPUs); Victor Lee, Intel (Chapter 4 comparison of Core i7 and GPU); John Shalf, LBNL (Chapter 4 recent vector architectures); Sam Williams, LBNL (Roofline model for computers in Chapter 4); Steve Blackburn of Australian National University and Kathryn McKinley of University of Texas at Austin (Intel performance and power measurements in Chapter 5); Luiz Barroso, Urs Hölzle, Jimmy Clidaris, Bob Felderman, and Chris Johnson of Google (the Google WSC in Chapter 6); James Hamilton of Amazon Web Services (power distribution and cost model in Chapter 6).

Jason D. Bakos of the University of South Carolina updated the lecture slides for this edition.

This book could not have been published without a publisher, of course. We wish to thank all the Morgan Kaufmann/Elsevier staff for their efforts and support. For this sixth edition, we particularly want to thank our editors Nate McFadden and Steve Merken, who coordinated surveys, development of the case studies and exercises, manuscript reviews, and the updating of the appendices.

We must also thank our university staff, Margaret Rowland and Roxana Infante, for countless express mailings, as well as for holding down the fort at Stanford and Berkeley while we worked on the book.

Our final thanks go to our wives for their suffering through increasingly early mornings of reading, thinking, and writing.

Contributors to Previous Editions

Reviewers

George Adams, Purdue University; Sarita Adve, University of Illinois at Urbana-Champaign; Jim Archibald, Brigham Young University; Krste Asanović, Massachusetts Institute of Technology; Jean-Loup Baer, University of Washington; Paul Barr, Northeastern University; Rajendra V. Boppana, University of Texas, San Antonio; Mark Brehob, University of Michigan; Doug Burger, University of Texas, Austin; John Burger, SGI; Michael Butler; Thomas Casavant; Rohit Chandra; Peter Chen, University of Michigan; the classes at SUNY Stony Brook, Carnegie Mellon, Stanford, Clemson, and Wisconsin; Tim Coe, Vitesse Semiconductor; Robert P. Colwell; David Cummings; Bill Dally; David Douglas; José Duato, Universitat Politècnica de València and Simula; Anthony Duben, Southeast Missouri State University; Susan Eggers, University of Washington; Joel Emer; Barry Fagin, Dartmouth; Joel Ferguson, University of California, Santa Cruz; Carl Feynman; David Filo; Josh Fisher, Hewlett-Packard Laboratories; Rob Fowler, DIKU; Mark Franklin, Washington University (St. Louis); Kourosh Gharachorloo; Nikolas Gloy, Harvard University; David Goldberg, Xerox Palo Alto Research Center; Antonio González, Intel and Universitat Politècnica de Catalunya; James Goodman, University of Wisconsin-Madison; Sudhanva Gurumurthi, University of Virginia; David Harris, Harvey Mudd College; John Heinlein; Mark Heinrich, Stanford; Daniel Helman, University of California, Santa Cruz; Mark D. Hill, University of Wisconsin-Madison; Martin Hopkins, IBM; Jerry Huck, Hewlett-Packard Laboratories; Wen-mei Hwu, University of Illinois at Urbana-Champaign; Mary Jane Irwin, Pennsylvania State University; Truman Joe; Norm Jouppi; David Kaeli, Northeastern University; Roger Kieckhafer, University of Nebraska; Lev G. Kirischian, Ryerson University; Earl Killian; Allan Knies, Purdue University; Don Knuth; Jeff Kuskin, Stanford; James R. Larus, Microsoft Research; Corinna Lee, University of Toronto; Hank Levy; Kai Li, Princeton University; Lori Liebrock, University of Alaska, Fairbanks; Mikko Lipasti, University of Wisconsin-Madison; Gyula A. Mago, University of North Carolina, Chapel Hill; Bryan Martin; Norman Matloff; David Meyer; William Michalson, Worcester Polytechnic Institute; James Mooney; Trevor Mudge, University of Michigan; Ramadass Nagarajan, University of Texas at Austin; David Nagle, Carnegie Mellon University; Todd Narter; Victor Nelson; Vojin Oklobdzija, University of California, Berkeley; Kunle Olukotun, Stanford University; Bob Owens, Pennsylvania State University; Greg Papadapoulous, Sun Microsystems; Joseph Pfeiffer; Keshav Pingali, Cornell University; Timothy M. Pinkston, University of Southern California; Bruno Preiss, University of Waterloo; Steven Przybylski; Jim Quinlan; Andras Radics; Kishore Ramachandran, Georgia Institute of Technology; Joseph Rameh, University of Texas, Austin; Anthony Reeves, Cornell University; Richard Reid, Michigan State University; Steve Reinhardt, University of Michigan; David Rennels, University of California, Los Angeles; Arnold L. Rosenberg, University of Massachusetts, Amherst; Kaushik Roy, Purdue University; Emilio Salgueiro, Unysis; Karthikeyan Sankaralingam, University of Texas at Austin; Peter Schnorf; Margo Seltzer; Behrooz Shirazi, Southern Methodist University; Daniel Siewiorek, Carnegie Mellon University; J. P. Singh, Princeton; Ashok Singhal; Jim Smith, University of Wisconsin-Madison; Mike Smith, Harvard University; Mark Smotherman, Clemson University; Gurindar Sohi, University of Wisconsin-Madison; Arun Somani, University of Washington; Gene Tagliarin, Clemson University; Shyamkumar Thoziyoor, University of Notre Dame; Evan Tick, University of Oregon; Akhilesh Tyagi, University of North Carolina, Chapel Hill; Dan Upton, University of Virginia; Mateo Valero, Universidad Politécnica de Cataluña, Barcelona; Anujan Varma, University of California, Santa Cruz; Thorsten von Eicken, Cornell University; Hank Walker, Texas A&M; Roy Want, Xerox Palo Alto Research Center; David Weaver, Sun Microsystems; Shlomo Weiss, Tel Aviv University; David Wells; Mike Westall, Clemson University; Maurice Wilkes; Eric Williams; Thomas Willis, Purdue University; Malcolm Wing; Larry Wittie, SUNY Stony Brook; Ellen Witte Zegura, Georgia Institute of Technology; Sotirios G. Ziavras, New Jersey Institute of Technology.

Appendices

The vector appendix was revised by Krste Asanović of the Massachusetts Institute of Technology. The floating-point appendix was written originally by David Goldberg of Xerox PARC.

Exercises

George Adams, Purdue University; Todd M. Bezenek, University of Wisconsin-Madison (in remembrance of his grandmother Ethel Eshom); Susan Eggers; Anoop Gupta; David Hayes; Mark Hill; Allan Knies; Ethan L. Miller, University of California, Santa Cruz; Parthasarathy Ranganathan, Compaq Western Research Laboratory; Brandon Schwartz, University of Wisconsin-Madison; Michael Scott; Dan Siewiorek; Mike Smith; Mark Smotherman; Evan Tick; Thomas Willis.

Case Studies With Exercises

Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison; Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison; Robert P. Colwell, R&E Colwell & Assoc., Inc.; Diana Franklin, California Polytechnic State University, San Luis Obispo; Wen-mei W. Hwu, University of Illinois at Urbana-Champaign; Norman P. Jouppi, HP Labs; John W. Sias, University of Illinois at Urbana-Champaign; David A. Wood, University of Wisconsin-Madison.

Special Thanks

Duane Adams, Defense Advanced Research Projects Agency; Tom Adams; Sarita Adve, University of Illinois at Urbana-Champaign; Anant Agarwal; Dave Albonesi, University of Rochester; Mitch Alsup; Howard Alt; Dave Anderson; Peter Ashenden; David Bailey; Bill Bandy, Defense Advanced Research Projects Agency; Luiz Barroso, Compaq’s Western Research Lab; Andy Bechtolsheim; C. Gordon Bell; Fred Berkowitz; John Best, IBM; Dileep Bhandarkar; Jeff Bier, BDTI; Mark Birman; David Black; David Boggs; Jim Brady; Forrest Brewer; Aaron Brown, University of California, Berkeley; E. Bugnion, Compaq’s Western Research Lab; Alper Buyuktosunoglu, University of Rochester; Mark Callaghan; Jason F. Cantin; Paul Carrick; Chen-Chung Chang; Lei Chen, University of Rochester; Pete Chen; Nhan Chu; Doug Clark, Princeton University; Bob Cmelik; John Crawford; Zarka Cvetanovic; Mike Dahlin, University of Texas, Austin; Merrick Darley; the staff of the DEC Western Research Laboratory; John DeRosa; Lloyd Dickman; J. Ding; Susan Eggers, University of Washington; Wael El-Essawy, University of Rochester; Patty Enriquez, Mills; Milos Ercegovac; Robert Garner; K. Gharachorloo, Compaq’s Western Research Lab; Garth Gibson; Ronald Greenberg; Ben Hao; John Henning, Compaq; Mark Hill, University of Wisconsin-Madison; Danny Hillis; David Hodges; Urs Hölzle, Google; David Hough; Ed Hudson; Chris Hughes, University of Illinois at Urbana-Champaign; Mark Johnson; Lewis Jordan; Norm Jouppi; William Kahan; Randy Katz; Ed Kelly; Richard Kessler; Les Kohn; John Kowaleski, Compaq Computer Corp; Dan Lambright; Gary Lauterbach, Sun Microsystems; Corinna Lee; Ruby Lee; Don Lewine; Chao-Huang Lin; Paul Losleben, Defense Advanced Research Projects Agency; Yung-Hsiang Lu; Bob Lucas, Defense Advanced Research Projects Agency; Ken Lutz; Alan Mainwaring, Intel Berkeley Research Labs; Al Marston; Rich Martin, Rutgers; John Mashey; Luke McDowell; Sebastian Mirolo, Trimedia Corporation; Ravi Murthy; Biswadeep Nag; Lisa Noordergraaf, Sun Microsystems; Bob Parker, Defense Advanced Research Projects Agency; Vern Paxson, Center for Internet Research; Lawrence Prince; Steven Przybylski; Mark Pullen, Defense Advanced Research Projects Agency; Chris Rowen; Margaret Rowland; Greg Semeraro, University of Rochester; Bill Shannon; Behrooz Shirazi; Robert Shomler; Jim Slager; Mark Smotherman, Clemson University; the SMT research group at the University of Washington; Steve Squires, Defense Advanced Research Projects Agency; Ajay Sreekanth; Darren Staples; Charles Stapper; Jorge Stolfi; Peter Stoll; the students at Stanford and Berkeley who endured our first attempts at creating this book; Bob Supnik; Steve Swanson; Paul Taysom; Shreekant Thakkar; Alexander Thomasian, New Jersey Institute of Technology; John Toole, Defense Advanced Research Projects Agency; Kees A. Vissers, Trimedia Corporation; Willa Walker; David Weaver; Ric Wheeler, EMC; Maurice Wilkes; Richard Zimmerman.

John Hennessy ■ David Patterson

1.1 Introduction
1.2 Classes of Computers
1.3 Defining Computer Architecture
1.4 Trends in Technology
1.5 Trends in Power and Energy in Integrated Circuits
1.6 Trends in Cost
1.7 Dependability
1.8 Measuring, Reporting, and Summarizing Performance
1.9 Quantitative Principles of Computer Design
1.10 Putting It All Together: Performance, Price, and Power
1.11 Fallacies and Pitfalls
1.12 Concluding Remarks
1.13 Historical Perspectives and References
Case Studies and Exercises by Diana Franklin

1 Fundamentals of Quantitative Design and Analysis

An iPod, a phone, an Internet mobile communicator… these are NOT three separate devices! And we are calling it iPhone! Today Apple is going to reinvent the phone. And here it is.

Steve Jobs, January 9, 2007

New information and communications technologies, in particular high-speed Internet, are changing the way companies do business, transforming public service delivery and democratizing innovation. With 10 percent increase in high speed Internet connections, economic growth increases by 1.3 percent.

The World Bank, July 28, 2009

Computer Architecture. https://doi.org/10.1016/B978-0-12-811905-1.00001-8 © 2019 Elsevier Inc. All rights reserved.

1.1 Introduction

Computer technology has made incredible progress in the roughly 70 years since the first general-purpose electronic computer was created. Today, less than $500 will purchase a cell phone that has as much performance as the world’s fastest computer bought in 1993 for $50 million. This rapid improvement has come both from advances in the technology used to build computers and from innovations in computer design.

Although technological improvements historically have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution, delivering performance improvement of about 25% per year. The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of performance improvement, roughly 35% growth per year.

This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to succeed commercially with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture.

These changes made it possible to develop successfully a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).

The RISC-based computers raised the performance bar, forcing prior architectures to keep up or disappear. The Digital Equipment VAX could not, and so it was replaced by a RISC architecture. Intel rose to the challenge, primarily by translating 80x86 instructions into RISC-like instructions internally, allowing it to adopt many of the innovations first pioneered in the RISC designs. As transistor counts soared in the late 1990s, the hardware overhead of translating the more complex x86 architecture became negligible. In low-end applications, such as cell phones, the cost in power and silicon area of the x86-translation overhead helped lead to a RISC architecture, ARM, becoming dominant.

Figure 1.1 shows that the combination of architectural and organizational enhancements led to 17 years of sustained growth in performance at an annual rate of over 50%—a rate that is unprecedented in the computer industry.

The effect of this dramatic growth rate during the 20th century was fourfold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors outperformed the supercomputer of less than 20 years earlier.

[Figure 1.1 plot: performance relative to the VAX-11/780 (log scale, 1 to 100,000) versus year (1978 to 2018), with growth regions labeled 25%/year, 52%/year, 23%/year, 12%/year, and 3.5%/year. Labeled data points rise from 1 for the VAX-11/780 (5 MHz) to roughly 50,000 for the most recent Intel Core i7 and Xeon systems.]

Figure 1.1 Growth in processor performance over 40 years. This chart plots program performance relative to the VAX 11/780 as measured by the SPEC integer benchmarks (see Section 1.8). Prior to the mid-1980s, growth in processor performance was largely technology-driven and averaged about 22% per year, or doubling performance every 3.5 years. The increase in growth to about 52% starting in 1986, or doubling every 2 years, is attributable to more advanced architectural and organizational ideas typified in RISC architectures. By 2003 this growth led to a difference in performance of an approximate factor of 25 versus the performance that would have occurred if it had continued at the 22% rate. In 2003 the limits of power due to the end of Dennard scaling and the available instruction-level parallelism slowed uniprocessor performance to 23% per year until 2011, or doubling every 3.5 years. (The fastest SPECintbase performance since 2007 has had automatic parallelization turned on, so uniprocessor speed is harder to gauge. These results are limited to single-chip systems with usually four cores per chip.) From 2011 to 2015, the annual improvement was less than 12%, or doubling every 8 years in part due to the limits of parallelism of Amdahl’s Law. Since 2015, with the end of Moore’s Law, improvement has been just 3.5% per year, or doubling every 20 years! Performance for floating-point-oriented calculations follows the same trends, but typically has 1% to 2% higher annual growth in each shaded region. Figure 1.11 on page 27 shows the improvement in clock rates for these same eras. Because SPEC has changed over the years, performance of newer machines is estimated by a scaling factor that relates the performance for different versions of SPEC: SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006. There are too few results for SPEC2017 to plot yet.
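The doubling times quoted in the caption follow from compound growth: at a fixed annual rate r, performance doubles after ln 2 / ln(1 + r) years. The sketch below applies that formula to the rates named in the caption; the function name `doubling_time_years` is illustrative and does not come from the book. The caption's figures are rounded, so the computed values only approximate them, and the 8-year figure corresponds to a rate somewhat below 12%, in keeping with the caption's "less than 12%".

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a fixed compound annual growth rate.

    annual_growth is a fraction, e.g. 0.52 for 52% per year.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Growth rates quoted in the Figure 1.1 caption.
    for rate in (0.22, 0.52, 0.23, 0.12, 0.035):
        years = doubling_time_years(rate)
        print(f"{rate:.1%} per year -> doubles every {years:.1f} years")
```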

Second, this dramatic improvement in cost-performance led to new classes of computers. Personal computers and workstations emerged in the 1980s with the availability of the microprocessor. The past decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platforms instead of PCs. These mobile client devices are increasingly using the Internet to access warehouses containing 100,000 servers, which are being designed as if they were a single gigantic computer.

Third, improvement of semiconductor manufacturing as predicted by Moore’s law has led to the dominance of microprocessor-based computers across the entire range of computer design. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, were replaced by servers made by using microprocessors. Even mainframe computers and high-performance supercomputers are all collections of microprocessors.

The preceding hardware innovations led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This rate of growth compounded so that by 2003, high-performance microprocessors were 7.5 times as fast as what would have been obtained by relying solely on technology, including improved circuit design, that is, 52% per year versus 35% per year.
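The factor of 7.5 can be checked by compounding the two rates over the same span. This is a minimal sketch, assuming the 52%-per-year era runs from 1986 (the start year given in the Figure 1.1 caption) to 2003, that is, 17 years; `relative_gain` is an illustrative name, not something defined in the book.

```python
def relative_gain(fast_rate: float, slow_rate: float, years: int) -> float:
    """Ratio of two compound-growth curves after `years`, both starting equal.

    Rates are fractions per year, e.g. 0.52 for 52% per year.
    """
    return ((1 + fast_rate) / (1 + slow_rate)) ** years


if __name__ == "__main__":
    years = 2003 - 1986  # assumed span of the 52%-per-year era (17 years)
    gain = relative_gain(0.52, 0.35, years)
    print(f"52%/year vs. 35%/year over {years} years: {gain:.1f}x")
```

Under these assumptions the ratio comes out at about 7.5, which agrees with the figure in the text.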

This hardware renaissance led to the fourth impact, which was on software development. This 50,000-fold performance improvement since 1978 (see Figure 1.1) allowed modern programmers to trade performance for productivity. In place of performance-oriented languages like C and C++, much more programming today is done in managed programming languages like Java and Scala. Moreover, scripting languages like JavaScript and Python, which are even more productive, are gaining in popularity along with programming frameworks like AngularJS and Django. To maintain productivity and try to close the performance gap, interpreters with just-in-time compilers and trace-based compiling are replacing the traditional compiler and linker of the past. Software deployment is changing as well, with Software as a Service (SaaS) used over the Internet replacing shrink-wrapped software that must be installed and run on a local computer.

The nature of applications is also changing. Speech, sound, images, and video are becoming increasingly important, along with predictable response time that is so critical to the user experience. An inspiring example is Google Translate. This application lets you hold up your cell phone to point its camera at an object, and the image is sent wirelessly over the Internet to a warehouse-scale computer (WSC) that recognizes the text in the photo and translates it into your native language. You can also speak into it, and it will translate what you said into audio output in another language. It translates text in 90 languages and voice in 15 languages.

Alas, Figure 1.1 also shows that this 17-year hardware renaissance is over. The fundamental reason is that two characteristics of semiconductor processes that were true for decades no longer hold.

In 1974 Robert Dennard observed that power density was constant for a given area of silicon even as you increased the number of transistors because of smaller dimensions of each transistor. Remarkably, transistors could go faster but use less power. Dennard scaling ended around 2004 because current and voltage couldn’t keep dropping and still maintain the dependability of integrated circuits.

This change forced the microprocessor industry to use multiple efficient processors or cores instead of a single inefficient processor. Indeed, in 2004 Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors. This milestone signaled a historic switch from relying solely on instruction-level parallelism (ILP), the primary focus of the first three editions of this book, to data-level parallelism (DLP) and thread-level parallelism (TLP), which were featured in the fourth edition and expanded in the fifth edition. The fifth edition also added WSCs and request-level parallelism (RLP), which is expanded in this edition. Whereas the compiler and hardware conspire to exploit ILP implicitly without the programmer’s attention, DLP, TLP, and RLP are explicitly parallel, requiring the restructuring of the application so that it can exploit explicit parallelism. In some instances, this is easy; in many, it is a major new burden for programmers.

Amdahl’s Law (Section 1.9) prescribes practical limits to the number of useful cores per chip. If 10% of the task is serial, then the maximum performance benefit from parallelism is 10 no matter how many cores you put on the chip.
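The limit of 10 can be seen by evaluating Amdahl’s Law for a growing number of cores. The sketch below assumes the parallel 90% of the task divides perfectly across the cores, which is an idealization; the function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Overall speedup when only the parallel part of the task is spread across `cores`.

    Assumes the parallel part scales perfectly (an idealization).
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 10% of the task serial, the speedup approaches but never reaches 1 / 0.10 = 10.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.10, cores), 2))
```

Even at 1024 cores the speedup stays below 10, because the serial 10% of the original run time is never shortened.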

The second observation that ended recently is Moore’s Law. In 1965 Gordon Moore famously predicted that the number of transistors per chip would double every year, which was amended in 1975 to every two years. That prediction lasted for about 50 years, but no longer holds. For example, in the 2010 edition of this book, the most recent Intel microprocessor had 1,170,000,000 transistors. If Moore’s Law had continued, we could have expected microprocessors in 2016 to have 18,720,000,000 transistors. Instead, the equivalent Intel microprocessor has just 1,750,000,000 transistors, or off by a factor of 10 from what Moore’s Law would have predicted.
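The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the three transistor counts quoted above; it does not assume a doubling period, since the text does not say which one the projection used.

```python
import math

transistors_2010 = 1_170_000_000   # most recent Intel microprocessor cited in the 2010 edition
projected_2016 = 18_720_000_000    # what the text says Moore's Law would have predicted
actual_2016 = 1_750_000_000        # the equivalent Intel microprocessor in 2016

# Number of doublings implied by the projection, and how far reality fell short of it.
doublings = math.log2(projected_2016 / transistors_2010)
shortfall = projected_2016 / actual_2016

print(f"implied doublings: {doublings:.1f}")
print(f"shortfall factor: {shortfall:.1f}")
```

The projection corresponds to four doublings of the 2010 count, and the actual 2016 count falls short of it by a factor of a little over 10, matching the text's "factor of 10".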

The combination of

■ transistors no longer getting much better because of the slowing of Moore’s Law and the end of Dennard scaling,

■ the unchanging power budgets for microprocessors,

■ the replacement of the single power-hungry processor with several energy-efficient processors, and

■ the limits to multiprocessing imposed by Amdahl’s Law

caused improvements in processor performance to slow down, that is, to double every 20 years, rather than every 1.5 years as it did between 1986 and 2003 (see Figure 1.1).
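Doubling periods like these follow directly from a constant annual growth rate. A minimal sketch, assuming steady compound growth and using the 3.5% and 52% rates from Figure 1.1 (the function name `doubling_time_years` is illustrative):

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


print(f"{doubling_time_years(0.035):.1f}")  # 3.5% per year, the post-2015 rate
print(f"{doubling_time_years(0.52):.1f}")   # 52% per year, the 1986-2003 rate
```

The results are roughly 20 years for 3.5% growth and between 1.5 and 2 years for 52% growth, in line with the approximate figures quoted in the text and in Figure 1.1.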

The only path left to improve energy-performance-cost is specialization. Future microprocessors will include several domain-specific cores that perform only one class of computations well, but they do so remarkably better than general-purpose cores. The new Chapter 7 in this edition introduces domain-specific architectures.

This text is about the architectural ideas and accompanying compiler improvements that made the incredible growth rate possible over the past century, the reasons for the dramatic change, and the challenges and initial promising approaches to architectural ideas, compilers, and interpreters for the 21st century. At the core is a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. The purpose of this chapter is to lay the quantitative foundation on which the following chapters and appendices are based.

This book was written not only to explain this design style but also to stimulate you to contribute to this progress. We believe this approach will serve the computers of the future just as it worked for the implicitly parallel computers of the past.

1.2 Classes of Computers

These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets in this new century. Not since the creation of the personal computer have we seen such striking changes in the way computers appear and in how they are used. These changes in computer use have led to five diverse computing markets, each characterized by different applications, requirements, and computing technologies. Figure 1.2 summarizes these mainstream classes of computing environments and their important characteristics.

| Feature | Personal mobile device (PMD) | Desktop | Server | Clusters/warehouse-scale computer | Internet of things/embedded |
|---|---|---|---|---|---|
| Price of system | $100–$1000 | $300–$2500 | $5000–$10,000,000 | $100,000–$200,000,000 | $10–$100,000 |
| Price of microprocessor | $10–$100 | $50–$500 | $200–$2000 | $50–$250 | $0.01–$100 |
| Critical system design issues | Cost, energy, media performance, responsiveness | Price-performance, energy, graphics performance | Throughput, availability, scalability, energy | Price-performance, throughput, energy proportionality | Price, energy, application-specific performance |

Figure 1.2 A summary of the five mainstream computing classes and their system characteristics. Sales in 2015 included about 1.6 billion PMDs (90% cell phones), 275 million desktop PCs, and 15 million servers. The total number of embedded processors sold was nearly 19 billion. In total, 14.8 billion ARM-technology-based chips were shipped in 2015. Note the wide range in system price for servers and embedded systems, which go from USB keys to network routers. For servers, this range arises from the need for very large-scale multiprocessor systems for high-end transaction processing.

Internet of Things/Embedded Computers

Embedded computers are found in everyday machines: microwaves, washing machines, most printers, networking switches, and all automobiles. The phrase Internet of Things (IoT) refers to embedded computers that are connected to the Internet, typically wirelessly. When augmented with sensors and actuators, IoT devices collect useful data and interact with the physical world, leading to a wide variety of “smart” applications, such as smart watches, smart thermostats, smart speakers, smart cars, smart homes, smart grids, and smart cities.

Embedded computers have the widest spread of processing power and cost. They include 8-bit to 32-bit processors that may cost one penny, and high-end 64-bit processors for cars and network switches that cost $100. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often to meet the performance need at a minimum price, rather than to achieve more performance at a higher price. The projections for the number of IoT devices in 2020 range from 20 to 50 billion.

Most of this book applies to the design, use, and performance of embedded processors, whether they are off-the-shelf microprocessors or microprocessor cores that will be assembled with other special-purpose hardware.

Unfortunately, the data that drive the quantitative design and evaluation of other classes of computers have not yet been extended successfully to embedded computing (see the challenges with EEMBC, for example, in Section 1.8). Hence we are left for now with qualitative descriptions, which do not fit well with the rest of the book. As a result, the embedded material is concentrated in Appendix E. We believe a separate appendix improves the flow of ideas in the text while allowing readers to see how the differing requirements affect embedded computing.

Personal Mobile Device

Personal mobile device (PMD) is the term we apply to a collection of wireless devices with multimedia user interfaces such as cell phones, tablet computers, and so on. Cost is a prime concern given the consumer price for the whole product is a few hundred dollars. Although the emphasis on energy efficiency is frequently driven by the use of batteries, the need to use less expensive packaging (plastic versus ceramic) and the absence of a fan for cooling also limit total power consumption. We examine the issue of energy and power in more detail in Section 1.5. Applications on PMDs are often web-based and media-oriented, like the previously mentioned Google Translate example. Energy and size requirements lead to use of Flash memory for storage (Chapter 2) instead of magnetic disks.

The processors in a PMD are often considered embedded computers, but we are keeping them as a separate category because PMDs are platforms that can run externally developed software, and they share many of the characteristics of desktop computers. Other embedded devices are more limited in hardware and software sophistication. We use the ability to run third-party software as the dividing line between nonembedded and embedded computers.

Responsiveness and predictability are key characteristics for media applications. A real-time performance requirement means a segment of the application has an absolute maximum execution time. For example, in playing a video on a
