There are 10 kinds of people in the world—those who understand binary and those who don’t.

—Anonymous

CHAPTER 2

Data Representation in Computer Systems

2.1 INTRODUCTION

The organization of any computer depends considerably on how it represents numbers, characters, and control information. The converse is also true: Standards and conventions established over the years have determined certain aspects of computer organization. This chapter describes the various ways in which computers can store and manipulate numbers and characters. The ideas presented in the following sections form the basis for understanding the organization and function of all types of digital systems.

The most basic unit of information in a digital computer is called a bit, which is a contraction of binary digit. In the concrete sense, a bit is nothing more than a state of “on” or “off” (or “high” and “low”) within a computer circuit. In 1964, the designers of the IBM System/360 mainframe computer established a convention of using groups of 8 bits as the basic unit of addressable computer storage. They called this collection of 8 bits a byte.

Computer words consist of two or more adjacent bytes that are sometimes addressed and almost always are manipulated collectively. The word size represents the data size that is handled most efficiently by a particular architecture. Words can be 16 bits, 32 bits, 64 bits, or any other size that makes sense in the context of a computer’s organization (including sizes that are not multiples of eight). An 8-bit byte can be divided into two 4-bit halves called nibbles (or nybbles). Because each bit of a byte has a value within a positional numbering system, the nibble containing the least-valued binary digit is called the low-order nibble, and the other half the high-order nibble.

2.2 POSITIONAL NUMBERING SYSTEMS

At some point during the middle of the sixteenth century, Europe embraced the decimal (or base 10) numbering system that the Arabs and Hindus had been using for nearly a millennium. Today, we take for granted that the number 243 means two hundreds, plus four tens, plus three units. Notwithstanding the fact that zero means “nothing,” virtually everyone knows that there is a substantial difference between having 1 of something and having 10 of something.

The general idea behind positional numbering systems is that a numeric value is represented through increasing powers of a radix (or base). This is often referred to as a weighted numbering system because each position is weighted by a power of the radix.

The set of valid numerals for a positional numbering system is equal in size to the radix of that system. For example, there are 10 digits in the decimal system, 0 through 9, and 3 digits for the ternary (base 3) system, 0, 1, and 2. The largest valid numeral in a positional system is one smaller than the radix, so 8 is not a valid numeral in any system whose radix is smaller than 9. To distinguish among numbers in different radices, we use the radix as a subscript, such as in 33₁₀ to represent the decimal number 33. (In this text, numbers written without a subscript should be assumed to be decimal.) Any decimal integer can be expressed exactly in any other integral base system (see Example 2.1).

EXAMPLE 2.1 Three numbers represented as powers of a radix.

243.51₁₀ = 2 × 10² + 4 × 10¹ + 3 × 10⁰ + 5 × 10⁻¹ + 1 × 10⁻²
212₃ = 2 × 3² + 1 × 3¹ + 2 × 3⁰ = 23₁₀
10110₂ = 1 × 2⁴ + 0 × 2³ + 1 × 2² + 1 × 2¹ + 0 × 2⁰ = 22₁₀

The two most important radices in computer science are binary (base 2) and hexadecimal (base 16). Another radix of interest is octal (base 8). The binary system uses only the digits 0 and 1; the octal system, 0 through 7. The hexadecimal system allows the digits 0 through 9, with A, B, C, D, E, and F being used to represent the numbers 10 through 15. Table 2.1 shows some of the radices.

2.3 CONVERTING BETWEEN BASES

Gottfried Leibniz (1646–1716) was the first to generalize the idea of the (positional) decimal system to other bases. Being a deeply spiritual person, Leibniz attributed divine qualities to the binary system. He correlated the fact that any integer could be represented by a series of ones and zeros with the idea that God (1) created the universe out of nothing (0). Until the first binary digital computers were built in the late 1940s, this system remained nothing more than a mathematical curiosity. Today, it lies at the heart of virtually every electronic device that relies on digital controls.

TABLE 2.1 Some Numbers to Remember

Because of its simplicity, the binary numbering system translates easily into electronic circuitry. It is also easy for humans to understand. Experienced computer professionals can recognize smaller binary numbers (such as those shown in Table 2.1) at a glance. Converting larger values and fractions, however, usually requires a calculator or pencil and paper. Fortunately, the conversion techniques are easy to master with a little practice. We show a few of the simpler techniques in the sections that follow.

2.3.1 Converting Unsigned Whole Numbers

We begin with the base conversion of unsigned numbers. Conversion of signed numbers (numbers that can be positive or negative) is more complex, and it is important that you first understand the basic technique for conversion before continuing with signed numbers.

Conversion between base systems can be done by using either repeated subtraction or a division-remainder method. The subtraction method is cumbersome and requires a familiarity with the powers of the radix being used. Because it is the more intuitive of the two methods, however, we will explain it first.

As an example, let’s say we want to convert 104₁₀ to base 3. We know that 3⁴ = 81 is the highest power of 3 that is less than 104, so our base 3 number will be 5 digits wide (one for each power of the radix: 0 through 4). We make note that 81 goes once into 104 and subtract, leaving a difference of 23. We know that the next power of 3, 3³ = 27, is too large to subtract, so we note the zero “placeholder” and look for how many times 3² = 9 divides 23. We see that it goes twice and subtract 18. We are left with 5, from which we subtract 3¹ = 3, leaving 2, which is 2 × 3⁰. These steps are shown in Example 2.2.

EXAMPLE 2.2 Convert 104₁₀ to base 3 using subtraction.

104 − 81 = 23  →  1 × 3⁴
27 is too large to subtract  →  0 × 3³
23 − 18 = 5  →  2 × 3²
5 − 3 = 2  →  1 × 3¹
2 − 2 = 0  →  2 × 3⁰

104₁₀ = 10212₃
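To make the repeated-subtraction method concrete, here is a minimal Python sketch of it (the function name is ours, not anything standard). It assumes a non-negative integer and a radix of at most 10, so that each digit count is a single numeral:

    def to_base_by_subtraction(n, base):
        if n == 0:
            return "0"
        # Find the highest power of the radix that fits into n.
        power = 0
        while base ** (power + 1) <= n:
            power += 1
        digits = []
        for p in range(power, -1, -1):
            count = 0
            while n >= base ** p:      # subtract this power as often as it fits
                n -= base ** p
                count += 1
            digits.append(str(count))  # assumes base <= 10
        return "".join(digits)

    print(to_base_by_subtraction(104, 3))  # prints 10212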

The division-remainder method is faster and easier than the repeated subtraction method. It employs the idea that successive divisions by the base are in fact successive subtractions by powers of the base. The remainders that we get when we sequentially divide by the base end up being the digits of the result, which are read from bottom to top. This method is illustrated in Example 2.3.

EXAMPLE 2.3 Convert 104₁₀ to base 3 using the division-remainder method.

104 ÷ 3 = 34, remainder 2
34 ÷ 3 = 11, remainder 1
11 ÷ 3 = 3, remainder 2
3 ÷ 3 = 1, remainder 0
1 ÷ 3 = 0, remainder 1

Reading the remainders from bottom to top, we have: 104₁₀ = 10212₃.

This method works with any base, and because of the simplicity of the calculations, it is particularly useful in converting from decimal to binary. Example 2.4 shows such a conversion.

EXAMPLE 2.4 Convert 147₁₀ to binary.

147 ÷ 2 = 73, remainder 1
73 ÷ 2 = 36, remainder 1
36 ÷ 2 = 18, remainder 0
18 ÷ 2 = 9, remainder 0
9 ÷ 2 = 4, remainder 1
4 ÷ 2 = 2, remainder 0
2 ÷ 2 = 1, remainder 0
1 ÷ 2 = 0, remainder 1

Reading the remainders from bottom to top, we have: 147₁₀ = 10010011₂.
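The division-remainder method translates directly into a short loop. The following is a minimal Python sketch (names ours); Python's built-in divmod returns the quotient and remainder in one step:

    def to_base_by_division(n, base):
        if n == 0:
            return "0"
        digits = []
        while n > 0:
            n, remainder = divmod(n, base)   # successive divisions by the base
            digits.append(str(remainder))    # assumes base <= 10
        return "".join(reversed(digits))     # read remainders bottom to top

    print(to_base_by_division(104, 3))  # prints 10212
    print(to_base_by_division(147, 2))  # prints 10010011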

A binary number with N bits can represent unsigned integers from 0 to 2^N − 1. For example, 4 bits can represent the decimal values 0 through 15, whereas 8 bits can represent the values 0 through 255. The range of values that can be represented by a given number of bits is extremely important when doing arithmetic operations on binary numbers. Consider a situation in which binary numbers are 4 bits in length, and we wish to add 1111₂ (15₁₀) to 1111₂. We know that 15 plus 15 is 30, but 30 cannot be represented using only 4 bits. This is an example of a condition known as overflow, which occurs in unsigned binary representation when the result of an arithmetic operation is outside the range of allowable precision for the given number of bits. We address overflow in more detail when discussing signed numbers in Section 2.4.

2.3.2 Converting Fractions

Fractions in any base system can be approximated in any other base system using negative powers of a radix. Radix points separate the integer part of a number from its fractional part. In the decimal system, the radix point is called a decimal point. Binary fractions have a binary point.

Fractions that contain repeating strings of digits to the right of the radix point in one base may not necessarily have a repeating sequence of digits in another base. For instance, ⅔ is a repeating decimal fraction, but in the ternary system, it terminates as 0.2₃ (2 × 3⁻¹ = 2 × ⅓).

We can convert fractions between different bases using methods analogous to the repeated subtraction and division-remainder methods for converting integers. Example 2.5 shows how we can use repeated subtraction to convert a number from decimal to base 5.

EXAMPLE 2.5 Convert 0.4304₁₀ to base 5.

0.4304 − 0.4000 = 0.0304  →  2 × 5⁻¹
0.04 is too large to subtract  →  0 × 5⁻²
0.0304 − 0.0240 = 0.0064  →  3 × 5⁻³
0.0064 − 0.0064 = 0  →  4 × 5⁻⁴

Reading from top to bottom, we have: 0.4304₁₀ = 0.2034₅.

Because the remainder method works with positive powers of the radix for conversion of integers, it stands to reason that we would use multiplication to convert fractions, because they are expressed in negative powers of the radix. However, instead of looking for remainders, as we did above, we use only the integer part of the product after multiplication by the radix. The answer is read from top to bottom instead of bottom to top. Example 2.6 illustrates the process.

EXAMPLE 2.6 Convert 0.4304₁₀ to base 5.

0.4304 × 5 = 2.1520  →  integer part 2
0.1520 × 5 = 0.7600  →  integer part 0
0.7600 × 5 = 3.8000  →  integer part 3
0.8000 × 5 = 4.0000  →  integer part 4

Reading from top to bottom, we have 0.4304₁₀ = 0.2034₅.

This example was contrived so that the process would stop after a few steps. Often things don’t work out quite so evenly, and we end up with repeating fractions. Most computer systems implement specialized rounding algorithms to provide a predictable degree of accuracy. For the sake of clarity, however, we will simply discard (or truncate) our answer when the desired accuracy has been achieved, as shown in Example 2.7.

EXAMPLE 2.7 Convert 0.34375₁₀ to binary with 4 bits to the right of the binary point.

0.34375 × 2 = 0.6875  →  0
0.6875 × 2 = 1.3750  →  1
0.3750 × 2 = 0.7500  →  0
0.7500 × 2 = 1.5000  →  1

Reading from top to bottom, 0.34375₁₀ = 0.0101₂ to four binary places.
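The multiplication method for fractions is just as mechanical. Here is a minimal Python sketch (names ours) that truncates after a fixed number of places, as Example 2.7 does; note that binary floating-point rounding in Python can perturb the last digit for some inputs:

    def fraction_to_base(frac, base, places):
        digits = []
        for _ in range(places):
            frac *= base
            digit = int(frac)          # the integer part is the next digit
            digits.append(str(digit))  # assumes base <= 10
            frac -= digit              # keep only the fractional part
        return "0." + "".join(digits)

    print(fraction_to_base(0.4304, 5, 4))   # prints 0.2034
    print(fraction_to_base(0.34375, 2, 4))  # prints 0.0101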

The methods just described can be used to directly convert any number in any base to any other base, say from base 4 to base 3 (as in Example 2.8). However, in most cases, it is faster and more accurate to first convert to base 10 and then to the desired base. One exception to this rule is when you are working between bases that are powers of two, as you’ll see in the next section.

EXAMPLE 2.8 Convert 3121₄ to base 3.

First, convert to decimal:

3121₄ = 3 × 4³ + 1 × 4² + 2 × 4¹ + 1 × 4⁰ = 192 + 16 + 8 + 1 = 217₁₀

Then convert to base 3:

217 ÷ 3 = 72, remainder 1
72 ÷ 3 = 24, remainder 0
24 ÷ 3 = 8, remainder 0
8 ÷ 3 = 2, remainder 2
2 ÷ 3 = 0, remainder 2

So 3121₄ = 217₁₀ = 22001₃.

2.3.3 Converting Between Power-of-Two Radices

Binary numbers are often expressed in hexadecimal—and sometimes octal—to improve their readability. Because 16 = 2⁴, a group of 4 bits (called a hextet) is easily recognized as a hexadecimal digit. Similarly, with 8 = 2³, a group of 3 bits is expressible as one octal digit. Using these relationships, we can therefore convert a number from binary to octal or hexadecimal by doing little more than looking at it.

EXAMPLE 2.9 Convert 110010011101₂ to octal and hexadecimal.

110 010 011 101 → 6235₈ (groups of 3 bits)
1100 1001 1101 → C9D₁₆ (groups of 4 bits)

If there are too few bits, leading zeros can be added.
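Grouping is easy to automate. The sketch below (names ours) pads the binary string on the left, carves it into fixed-size groups, and maps each group to a digit; group size 4 yields hexadecimal and group size 3 yields octal:

    DIGITS = "0123456789ABCDEF"

    def regroup(bits, group_size):
        pad = (-len(bits)) % group_size   # leading zeros, if needed
        bits = "0" * pad + bits
        groups = [bits[i:i + group_size] for i in range(0, len(bits), group_size)]
        return "".join(DIGITS[int(g, 2)] for g in groups)

    print(regroup("110010011101", 4))  # prints C9D
    print(regroup("110010011101", 3))  # prints 6235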

2.4 SIGNED INTEGER REPRESENTATION

We have seen how to convert an unsigned integer from one base to another. Signed numbers require that additional issues be addressed. When an integer variable is declared in a program, many programming languages automatically allocate a storage area that includes a sign as the first bit of the storage location. By convention, a “1” in the high-order bit indicates a negative number. The storage location can be as small as an 8-bit byte or as large as several words, depending on the programming language and the computer system. The remaining bits (after the sign bit) are used to represent the number itself.

How this number is represented depends on the method used. There are three commonly used approaches. The most intuitive method, signed magnitude, uses the remaining bits to represent the magnitude of the number. This method and the other two approaches, which both use the concept of complements, are introduced in the following sections.

2.4.1 Signed Magnitude

Up to this point, we have ignored the possibility of binary representations for negative numbers. The set of positive and negative integers is referred to as the set of signed integers. The problem with representing signed integers as binary values is the sign—how should we encode the actual sign of the number? Signed-magnitude representation is one method of solving this problem. As its name implies, a signed-magnitude number has a sign as its leftmost bit (also referred to as the high-order bit or the most significant bit), whereas the remaining bits represent the magnitude (or absolute value) of the numeric value. For example, in an 8-bit word, −1 would be represented as 10000001, and +1 as 00000001. In a computer system that uses signed-magnitude representation and 8 bits to store integers, 7 bits can be used for the actual representation of the magnitude of the number. This means that the largest integer an 8-bit word can represent is 2⁷ − 1, or 127 (a zero in the high-order bit, followed by 7 ones). The smallest integer is 8 ones, or −127. Therefore, N bits can represent −(2^(N−1) − 1) to 2^(N−1) − 1.
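As a quick illustration of the encoding itself, here is a minimal Python sketch (names ours) that builds an n-bit signed-magnitude pattern: one sign bit followed by the magnitude in n − 1 bits:

    def signed_magnitude(value, bits=8):
        limit = 2 ** (bits - 1) - 1
        if not -limit <= value <= limit:
            raise OverflowError("magnitude needs more than %d bits" % (bits - 1))
        sign = "1" if value < 0 else "0"
        return sign + format(abs(value), "0{}b".format(bits - 1))

    print(signed_magnitude(1))     # prints 00000001
    print(signed_magnitude(-1))    # prints 10000001
    print(signed_magnitude(-127))  # prints 11111111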

Computers must be able to perform arithmetic calculations on integers that are represented using this notation. Signed-magnitude arithmetic is carried out using essentially the same methods that humans use with pencil and paper, but it can get confusing very quickly. As an example, consider the rules for addition: (1) If the signs are the same, add the magnitudes and use that same sign for the result; (2) If the signs differ, you must determine which operand has the larger magnitude. The sign of the result is the same as the sign of the operand with the larger magnitude, and the magnitude must be obtained by subtracting (not adding) the smaller one from the larger one. If you consider these rules carefully, this is the method you use for signed arithmetic by hand.

We arrange the operands in a certain way based on their signs, perform the calculation without regard to the signs, and then supply the sign as appropriate when the calculation is complete. When modeling this idea in an 8-bit word, we must be careful to include only 7 bits in the magnitude of the answer, discarding any carries that take place over the high-order bit.

EXAMPLE 2.10 Add 01001111₂ to 00100011₂ using signed-magnitude arithmetic.

  0 1001111   (79)
+ 0 0100011   (35)
  0 1110010   (114)

The arithmetic proceeds just as in decimal addition, including the carries, until we get to the seventh bit from the right. If there is a carry here, we say that we have an overflow condition and the carry is discarded, resulting in an incorrect sum. There is no overflow in this example.

We find that 01001111₂ + 00100011₂ = 01110010₂ in signed-magnitude representation.

Sign bits are segregated because they are relevant only after the addition is complete. In this case, we have the sum of two positive numbers, which is positive. Overflow (and thus an erroneous result) in signed numbers occurs when the sign of the result is incorrect.

In signed magnitude, the sign bit is used only for the sign, so we can’t “carry into” it. If there is a carry out of the seventh bit, our result will be truncated as the seventh bit overflows, giving an incorrect sum. (Example 2.11 illustrates this overflow condition.) Prudent programmers avoid “million-dollar” mistakes by checking for overflow conditions whenever there is the slightest possibility they could occur. If we did not discard the overflow bit, it would carry into the sign, causing the more outrageous result of the sum of two positive numbers being negative. (Imagine what would happen if the next step in a program were to take the square root or log of that result!)

EXAMPLE 2.11 Add 01001111₂ to 01100011₂ using signed-magnitude arithmetic.

  0 1001111   (79)
+ 0 1100011   (99)
  0 0110010   (50)   the carry out of the seventh bit is discarded

We obtain the erroneous result of 79 + 99 = 50.

Dabbling on the Double

The fastest way to convert a binary number to decimal is a method called double-dabble (or double-dibble). This method builds on the idea that a subsequent power of two is double the previous power of two in a binary number. The calculation starts with the leftmost bit and works toward the rightmost bit. The first bit is doubled and added to the next bit. This sum is then doubled and added to the following bit. The process is repeated for each bit until the rightmost bit has been used.

EXAMPLE 1

Convert 10010011₂ to decimal.

Step 1: Write down the binary number, leaving space between the bits.

Step 2: Double the high-order bit and copy it under the next bit.

Step 3: Add the next bit and double the sum. Copy this result under the next bit.

Step 4: Repeat Step 3 until you run out of bits. For 10010011₂, the running totals are 1, 2, 4, 9, 18, 36, 73, and finally 147, so 10010011₂ = 147₁₀.
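Double-dabble is simply Horner's rule applied to base 2, which makes for a very short loop. A minimal Python sketch (name ours):

    def double_dabble(bits):
        total = 0
        for b in bits:
            total = total * 2 + int(b)  # double the total, then add the next bit
        return total

    print(double_dabble("10010011"))  # prints 147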

When we combine hextet grouping (in reverse) with the double-dabble method, we find that we can convert hexadecimal to decimal with ease.

EXAMPLE 2

Convert 02CA₁₆ to decimal.

First, convert the hex to binary by grouping into hextets:

02CA₁₆ = 0000 0010 1100 1010₂

Then apply the double-dabble method on the binary form, which yields 714₁₀.

As with addition, signed-magnitude subtraction is carried out in a manner similar to pencil-and-paper decimal arithmetic, where it is sometimes necessary to borrow from digits in the minuend.

EXAMPLE 2.12 Subtract 01001111₂ from 01100011₂ using signed-magnitude arithmetic.

  0 1100011   (99)
− 0 1001111   (79)
  0 0010100   (20)

We find that 01100011₂ − 01001111₂ = 00010100₂ in signed-magnitude representation.

EXAMPLE 2.13 Subtract 01100011₂ (99) from 01001111₂ (79) using signed-magnitude arithmetic. By inspection, we see that the subtrahend, 01100011, is larger than the minuend, 01001111. With the result obtained in Example 2.12, we know that the difference of these two numbers is 0010100₂. Because the subtrahend is larger than the minuend, all we need to do is change the sign of the difference. So we find that 01001111₂ − 01100011₂ = 10010100₂ in signed-magnitude representation.

We know that subtraction is the same as “adding the opposite,” which equates to negating the value we wish to subtract and then adding instead (which is often much easier than performing all the borrows necessary for subtraction, particularly in dealing with binary numbers). Therefore, we need to look at some examples involving both positive and negative numbers. Recall the rules for addition: (1) If the signs are the same, add the magnitudes and use that same sign for the result; (2) If the signs differ, you must determine which operand has the larger magnitude. The sign of the result is the same as the sign of the operand with the larger magnitude, and the magnitude must be obtained by subtracting (not adding) the smaller one from the larger one.

EXAMPLE 2.14 Add 10010011₂ (−19) to 00001101₂ (+13) using signed-magnitude arithmetic. The first number (the augend) is negative because its sign bit is set to 1. The second number (the addend) is positive. What we are asked to do is in fact a subtraction. First, we determine which of the two numbers is larger in magnitude and use that number for the augend. Its sign will be the sign of the result.

With the inclusion of the sign bit, we see that 10010011₂ + 00001101₂ = 10000110₂ in signed-magnitude representation.

EXAMPLE 2.15 Subtract 10011000₂ (−24) from 10101011₂ (−43) using signed-magnitude arithmetic. We can convert the subtraction to an addition by negating −24, which gives us 24, and then we can add this to −43, giving us a new problem of −43 + 24. However, we know from the addition rules above that because the signs now differ, we must actually subtract the smaller magnitude from the larger magnitude (or subtract 24 from 43) and make the result negative (because 43 is larger than 24).

Note that we are not concerned with the sign until we have performed the subtraction. We know the answer must be negative. So we end up with 10101011₂ − 10011000₂ = 10010011₂ in signed-magnitude representation.

While reading the preceding examples, you may have noticed how many questions we had to ask ourselves: Which number is larger? Am I subtracting a negative number? How many times do I have to borrow from the minuend? A computer engineered to perform arithmetic in this manner must make just as many decisions (though a whole lot faster). The logic (and circuitry) is further complicated by the fact that signed magnitude has two representations for zero, 10000000 and 00000000 (and mathematically speaking, this simply shouldn’t happen!). Simpler methods for representing signed numbers would allow simpler and less expensive circuits. These simpler methods are based on radix complement systems.

2.4.2 Complement Systems

Number theorists have known for hundreds of years that one decimal number can be subtracted from another by adding the difference of the subtrahend from all nines and adding back a carry. This is called taking the nine’s complement of the subtrahend or, more formally, finding the diminished radix complement of the subtrahend. Let’s say we wanted to find 167 − 52. Taking the difference of 52 from 999, we have 947. Thus, in nine’s complement arithmetic, we have 167 − 52 = 167 + 947 = 1114. The “carry” from the hundreds column is added back to the units place, giving us a correct 167 − 52 = 115. This method was commonly called “casting out 9s” and has been extended to binary operations to simplify computer arithmetic. The advantage that complement systems give us over signed magnitude is that there is no need to process sign bits separately, but we can still easily check the sign of a number by looking at its high-order bit.

Another way to envision complement systems is to imagine an odometer on a bicycle. Unlike cars, when you go backward on a bike, the odometer will go backward as well. Assuming an odometer with three digits, if we start at zero and end with 700, we can’t be sure whether the bike went forward 700 miles or backward 300 miles! The easiest solution to this dilemma is simply to cut the number space in half and use 001–500 for positive miles and 501–999 for negative miles. We have, effectively, cut down the distance our odometer can measure. But now if it reads 997, we know the bike has backed up 3 miles instead of riding forward 997 miles. The numbers 501–999 represent the radix complements (the second of the two methods introduced below) of the numbers 001–500 and are being used to represent negative distance.

One’s Complement

As illustrated above, the diminished radix complement of a number in base 10 is found by subtracting the subtrahend from the base minus one, which is 9 in decimal. More formally, given a number N in base r having d digits, the diminished radix complement of N is defined to be (r^d − 1) − N. For decimal numbers, r = 10, and the diminished radix is 10 − 1 = 9. For example, the nine’s complement of 2468 is 9999 − 2468 = 7531. For an equivalent operation in binary, we subtract from one less than the base (2), which is 1. For example, the one’s complement of 0101₂ is 1111₂ − 0101₂ = 1010₂. Although we could tediously borrow and subtract as discussed above, a few experiments will convince you that forming the one’s complement of a binary number amounts to nothing more than switching all of the 1s with 0s and vice versa. This sort of bit-flipping is very simple to implement in computer hardware.

It is important to note at this point that although we can find the nine’s complement of any decimal number or the one’s complement of any binary number, we are most interested in using complement notation to represent negative numbers. We know that performing a subtraction, such as 10 – 7, can also be thought of as “adding the opposite,” as in 10 + (–7). Complement notation allows us to simplify subtraction by turning it into addition, but it also gives us a method to represent negative numbers. Because we do not wish to use a special bit to represent the sign (as we did in signed-magnitude representation), we need to remember that if a number is negative, we should convert it to its complement. The result should have a 1 in the leftmost bit position to indicate that the number is negative.

Although the one’s complement of a number is technically the value obtained by subtracting that number from one less than a large power of two, we often refer to a computer using one’s complement for negative numbers as a one’s complement system, or a computer that uses one’s complement arithmetic. This can be somewhat misleading, as positive numbers do not need to be complemented; we only complement negative numbers so we can get them into a format the computer will understand. Example 2.16 illustrates these concepts.

EXAMPLE 2.16 Express 23₁₀ and −9₁₀ in 8-bit binary, assuming a computer is using one’s complement representation.

23₁₀ = 00010111₂
−9₁₀ = the complement of 00001001₂ = 11110110₂
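Because one's complement is pure bit-flipping, it is nearly a one-liner in most languages. The Python sketch below (names ours) reproduces Example 2.16:

    def ones_complement(bits):
        return "".join("1" if b == "0" else "0" for b in bits)  # flip every bit

    def encode_ones(value, bits=8):
        pattern = format(abs(value), "0{}b".format(bits))
        return ones_complement(pattern) if value < 0 else pattern

    print(encode_ones(23))  # prints 00010111
    print(encode_ones(-9))  # prints 11110110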

Unlike signed magnitude, in one’s complement addition there is no need to maintain the sign bit separate from the other bits. The sign takes care of itself. Compare Example 2.17 with Example 2.10.

EXAMPLE 2.17 Add 01001111₂ to 00100011₂ using one’s complement addition.

  01001111   (79)
+ 00100011   (35)
  01110010   (114)

Suppose we wish to subtract 9 from 23. To carry out a one’s complement subtraction, we first express the subtrahend (9) in one’s complement, then add it to the minuend (23); we are effectively now adding –9 to 23. The high-order bit will have a 1 or a 0 carry, which is added to the low-order bit of the sum. (This is called end carry-around and results from using the diminished radix complement.)

EXAMPLE 2.18 Add 23₁₀ to −9₁₀ using one’s complement arithmetic.

  00010111   (23)
+ 11110110   (−9)
  00001101 with a carry of 1; adding the carry back to the low-order bit gives 00001110₂ = 14₁₀

EXAMPLE 2.19 Add 9₁₀ to −23₁₀ using one’s complement arithmetic.

  00001001   (9)
+ 11101000   (−23)
  11110001   (−14, with no end carry)

How do we know that 11110001₂ is actually −14₁₀? We simply need to take the one’s complement of this binary number (remembering it must be negative because the leftmost bit is 1). The one’s complement of 11110001₂ is 00001110₂, which is 14.
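The end carry-around is easy to model on unsigned integers: if the sum does not fit in the word, fold the carry back into the low-order bit. A minimal Python sketch (names ours) reproducing Examples 2.18 and 2.19:

    def add_ones_complement(a, b, bits=8):
        total = int(a, 2) + int(b, 2)
        if total >= 2 ** bits:                 # carry out of the high-order bit...
            total = (total + 1) % (2 ** bits)  # ...is added back in (end carry-around)
        return format(total, "0{}b".format(bits))

    print(add_ones_complement("00010111", "11110110"))  # 23 + (-9): prints 00001110
    print(add_ones_complement("00001001", "11101000"))  # 9 + (-23): prints 11110001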

The primary disadvantage of one’s complement is that we still have two representations for zero: 00000000 and 11111111. For this and other reasons, computer engineers long ago stopped using one’s complement in favor of the more efficient two’s complement representation for binary numbers.

Two’s Complement

Two’s complement is an example of a radix complement. Given a number N in base r having d digits, the radix complement of N is defined as r^d − N for N ≠ 0 and 0 for N = 0. The radix complement is often more intuitive than the diminished radix complement. Using our odometer example, the ten’s complement of going forward 2 miles is 10³ − 2 = 998, which we have already agreed indicates a negative (backward) distance. Similarly, in binary, the two’s complement of the 4-bit number 0011₂ is 2⁴ − 0011₂ = 10000₂ − 0011₂ = 1101₂.

Upon closer examination, you will discover that two’s complement is nothing more than one’s complement incremented by 1. To find the two’s complement of a binary number, simply flip bits and add 1. This simplifies addition and subtraction as well. Because the subtrahend (the number we complement and add) is incremented at the outset, however, there is no end carry-around to worry about. We simply discard any carries involving the high-order bits. Just as with one’s complement, two’s complement refers to the complement of a number, whereas a computer using this notation to represent negative numbers is said to be a two’s complement system, or uses two’s complement arithmetic. As before, positive numbers can be left alone; we only need to complement negative numbers to get them into their two’s complement form. Example 2.20 illustrates these concepts.

EXAMPLE 2.20 Express 23₁₀, −23₁₀, and −9₁₀ in 8-bit binary, assuming a computer is using two’s complement representation.

23₁₀ = 00010111₂
−23₁₀: flip the bits of 00010111 to get 11101000, then add 1 → 11101001₂
−9₁₀: flip the bits of 00001001 to get 11110110, then add 1 → 11110111₂

Because the representation of positive numbers is the same in one’s complement and two’s complement (as well as signed-magnitude), the process of adding two positive binary numbers is the same. Compare Example 2.21 with Example 2.17 and Example 2.10.

EXAMPLE 2.21 Add 01001111₂ to 00100011₂ using two’s complement addition.

  01001111   (79)
+ 00100011   (35)
  01110010   (114)

Suppose we are given the binary representation for a number and want to know its decimal equivalent. Positive numbers are easy. For example, to convert the two’s complement value of 00010111₂ to decimal, we simply convert this binary number to a decimal number to get 23. However, converting two’s complement negative numbers requires a reverse procedure similar to the conversion from decimal to binary. Suppose we are given the two’s complement binary value of 11110111₂, and we want to know the decimal equivalent. We know this is a negative number but must remember it is represented using two’s complement. We first flip the bits and then add 1 (find the one’s complement and add 1). This results in the following: 00001000₂ + 1 = 00001001₂. This is equivalent to the decimal value 9. However, the original number we started with was negative, so we end up with −9 as the decimal equivalent to 11110111₂.
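In code, "flip the bits and add 1" is equivalent to reducing the value modulo 2 raised to the word size, which makes two's complement encoding and decoding very compact. A minimal Python sketch (names ours):

    def encode_twos(value, bits=8):
        return format(value % (2 ** bits), "0{}b".format(bits))  # same as flip-and-add-1

    def decode_twos(pattern):
        value = int(pattern, 2)
        if pattern[0] == "1":          # sign bit set: the value is negative
            value -= 2 ** len(pattern)
        return value

    print(encode_twos(-9))          # prints 11110111
    print(decode_twos("11110111"))  # prints -9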

The following two examples illustrate how to perform addition (and hence subtraction, because we subtract a number by adding its opposite) using two’s complement notation.

EXAMPLE 2.22 Add 9₁₀ to −23₁₀ using two’s complement arithmetic.

  00001001   (9)
+ 11101001   (−23)
  11110010   (−14)

It is left as an exercise for you to verify that 11110010₂ is actually −14₁₀ using two’s complement notation.

EXAMPLE 2.23 Find the sum of 23₁₀ and −9₁₀ in binary using two’s complement arithmetic.

  00010111   (23)
+ 11110111   (−9)
  00001110   (14), with the carry out of the leftmost bit discarded

In two’s complement, the addition of two negative numbers produces a negative number, as we might expect.

EXAMPLE 2.24 Find the sum of 11101001₂ (−23) and 11110111₂ (−9) using two’s complement addition.

  11101001   (−23)
+ 11110111   (−9)
  11100000   (−32), with the carry out of the leftmost bit discarded

Notice that the discarded carries in Examples 2.23 and 2.24 did not cause an erroneous result. An overflow occurs if two positive numbers are added and the result is negative, or if two negative numbers are added and the result is positive. It is not possible to have overflow when using two’s complement notation if a positive and a negative number are being added together.

INTEGER MULTIPLICATION AND DIVISION

Unless sophisticated algorithms are used, multiplication and division can consume a considerable number of computation cycles before a result is obtained. Here, we discuss only the most straightforward approach to these operations. In real systems, dedicated hardware is used to optimize throughput, sometimes carrying out portions of the calculation in parallel. Curious readers will want to investigate some of these advanced methods in the references cited at the end of this chapter.

The simplest multiplication algorithms used by computers are similar to traditional pencil-and-paper methods used by humans. The complete multiplication table for binary numbers couldn’t be simpler: zero times any number is zero, and one times any number is that number.

To illustrate simple computer multiplication, we begin by writing the multiplicand and the multiplier to two separate storage areas. We also need a third storage area for the product. Starting with the low-order bit, a pointer is set to each digit of the multiplier. For each digit in the multiplier, the multiplicand is “shifted” one bit to the left. When the multiplier is 1, the “shifted” multiplicand is added to a running sum of partial products. Because we shift the multiplicand by one bit for each bit in the multiplier, a product requires double the working space of either the multiplicand or the multiplier.

There are two simple approaches to binary division: We can either iteratively subtract the divisor from the dividend, or we can use the same trial-and-error method of long division that we were taught in grade school. As with multiplication, the most efficient methods used for binary division are beyond the scope of this text and can be found in the references at the end of this chapter.

Regardless of the relative efficiency of any algorithms that are used, division is an operation that can always cause a computer to crash. This is the case particularly when division by zero is attempted or when two numbers of enormously different magnitudes are used as operands. When the divisor is much smaller than the dividend, we get a condition known as divide underflow, which the computer sees as the equivalent of division by zero, which is impossible.

Computers make a distinction between integer division and floating-point division. With integer division, the answer comes in two parts: a quotient and a remainder. Floating-point division results in a number that is expressed as a binary fraction. These two types of division are sufficiently different from each other as to warrant giving each its own special circuitry. Floating-point calculations are carried out in dedicated circuits called floating-point units, or FPUs.

EXAMPLE Find the product of 00000110₂ (6) and 00001011₂ (11).

The multiplier 00001011 has 1s in positions 0, 1, and 3, so we add the multiplicand shifted left by each of those amounts:

00000110 + 00001100 + 00110000 = 01000010₂ = 66₁₀
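Here is a minimal Python sketch of this shift-and-add scheme (names ours): each 1 bit in the multiplier contributes the multiplicand shifted left by that bit's position:

    def multiply_shift_add(multiplicand, multiplier):
        product = 0
        shift = 0
        while multiplier:
            if multiplier & 1:                    # low-order multiplier bit is 1
                product += multiplicand << shift  # add the shifted multiplicand
            multiplier >>= 1                      # move to the next multiplier bit
            shift += 1
        return product

    print(multiply_shift_add(0b00000110, 0b00001011))  # prints 66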

Simple computer circuits can easily detect an overflow condition using a rule that is easy to remember. You’ll notice in both Examples 2.23 and 2.24 that the carry going into the sign bit (a 1 is carried from the previous bit position into the sign bit position) is the same as the carry going out of the sign bit (a 1 is carried out and discarded). When these carries are equal, no overflow occurs. When they differ, an overflow indicator is set in the arithmetic logic unit, indicating the result is incorrect.

A Simple Rule for Detecting an Overflow Condition in Signed Numbers: If the carry into the sign bit equals the carry out of the sign bit, no overflow has occurred. If the carry into the sign bit is different from the carry out of the sign bit, overflow (and thus an error) has occurred.
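This rule is also easy to check in software. The sketch below (names ours) adds two 8-bit two's complement patterns and reports the carry and overflow flags by comparing the carry into and out of the sign position; the test case is the 126 + 8 addition of Example 2.25:

    def add_with_flags(a, b, bits=8):
        x, y = int(a, 2), int(b, 2)
        total = x + y
        carry_out = (total >> bits) & 1
        # Carry into the sign bit: add everything below the sign position.
        low_mask = (1 << (bits - 1)) - 1
        carry_in = ((x & low_mask) + (y & low_mask)) >> (bits - 1)
        overflow = carry_in != carry_out
        result = format(total & ((1 << bits) - 1), "0{}b".format(bits))
        return result, bool(carry_out), overflow

    print(add_with_flags("01111110", "00001000"))  # ('10000110', False, True)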

The hard part is getting programmers (or compilers) to consistently check for the overflow condition. Example 2.25 indicates overflow because the carry into the sign bit (a 1 is carried in) is not equal to the carry out of the sign bit (a 0 is carried out).

EXAMPLE 2.25 Find the sum of 126₁₀ and 8₁₀ in binary using two’s complement arithmetic.

  01111110   (126)
+ 00001000   (8)
  10000110   (−122, an erroneous result)

A one is carried into the leftmost bit, but a zero is carried out. Because these carries are not equal, an overflow has occurred. (We can easily see that two positive numbers are being added but the result is negative.) We return to this topic in Section 2.4.6.

Two’s complement is the most popular choice for representing signed numbers. The algorithm for adding and subtracting is quite easy, has the best representation for 0 (all 0 bits), is self-inverting, and is easily extended to larger numbers of bits. The biggest drawback is in the asymmetry seen in the range of values that can be represented by N bits. With signed-magnitude numbers, for example, 4 bits allow us to represent the values −7 through +7. However, using two’s complement, we can represent the values −8 through +7, which is often confusing to anyone learning about complement representations. To see why +7 is the largest number we can represent using 4-bit two’s complement representation, we need only remember that the first bit must be 0. If the remaining bits are all 1s (giving us the largest magnitude possible), we have 0111₂, which is 7. An immediate reaction to this is that the smallest negative number should then be 1111₂, but we can see that 1111₂ is actually −1 (flip the bits, add one, and make the number negative). So how do we represent −8 in two’s complement notation using 4 bits? It is represented as 1000₂. We know this is a negative number. If we flip the bits (0111), add 1 (to get 1000, which is 8), and make it negative, we get −8.

2.4.3 Excess-M Representation for Signed Numbers

Recall the bicycle example that we discussed when introducing complement systems. We selected a particular value (500) as the cutoff for positive miles, and we assigned values from 501 to 999 to negative miles. We didn’t need signs because we used the range to determine whether the number was positive or negative. Excess-M representation (also called offset binary representation) does something very similar; unsigned binary values are used to represent signed integers. However, excess-M representation, unlike signed magnitude and the complement encodings, is more intuitive because the binary string with all 0s represents the smallest number, whereas the binary string with all 1s represents the largest value; in other words, ordering is preserved.

The unsigned binary representation for integer M (called the bias) represents the value 0, whereas a bit pattern of all zeros represents the integer −M. Essentially, a decimal integer is “mapped” (as in our bicycle example) to an unsigned binary integer, but interpreted as positive or negative depending on where it falls in the range. If we are using n bits for the binary representation, we need to select the bias in such a manner that we split the range equally. We typically do this by choosing a bias of 2^(n−1) − 1. For example, if we were using 4-bit representation, the bias should be 2⁴⁻¹ − 1 = 7. Just as with signed magnitude, one’s complement, and two’s complement, there is a specific range of values that can be expressed in n bits.

The unsigned binary value for a signed integer using excess-M representation is determined simply by adding M to that integer. For example, assuming that we are using excess-7 representation, the integer 0₁₀ would be represented as 0 + 7 = 7₁₀ = 0111₂; the integer 3₁₀ would be represented as 3 + 7 = 10₁₀ = 1010₂; and the integer −7 would be represented as −7 + 7 = 0₁₀ = 0000₂. Using excess-7 notation and given the binary number 1111₂, to find the decimal value it represents, we simply subtract 7: 1111₂ = 15₁₀, and 15 − 7 = 8; therefore, the value 1111₂, using excess-7 representation, is +8₁₀.
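Excess-M encoding and decoding amount to a single addition and subtraction. A minimal Python sketch using excess-7 on 4 bits (names ours):

    BIAS = 7  # 2**(4-1) - 1 for a 4-bit representation

    def encode_excess(value, bits=4, bias=BIAS):
        return format(value + bias, "0{}b".format(bits))  # add the bias

    def decode_excess(pattern, bias=BIAS):
        return int(pattern, 2) - bias                     # subtract the bias

    print(encode_excess(3))       # prints 1010
    print(encode_excess(-7))      # prints 0000
    print(decode_excess("1111"))  # prints 8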

Let’s compare the encoding schemes we have seen so far, assuming 8-bit numbers (the bias for 8-bit excess-M is 2⁷ − 1 = 127):

Integer | Signed Magnitude | One’s Complement | Two’s Complement | Excess-127
   2    |     00000010     |     00000010     |     00000010     |  10000001
  −2    |     10000010     |     11111101     |     11111110     |  01111101
 100    |     01100100     |     01100100     |     01100100     |  11100011
−100    |     11100100     |     10011011     |     10011100     |  00011011

Excess-M representation allows us to use unsigned binary values to represent signed integers; it is important to note, however, that two parameters must be specified: the number of bits being used in the representation and the bias value itself. In addition, a computer is unable to perform addition on excess-M values using hardware designed for unsigned numbers; special circuits are required. Excess-M representation is important because of its use in representing integer exponents in floating-point numbers, as we will see in Section 2.5.

2.4.4 Unsigned Versus Signed Numbers

We introduced our discussion of binary integers with unsigned numbers. Unsigned numbers are used to represent values that are guaranteed not to be negative. A good example of an unsigned number is a memory address. If the 4-bit binary value 1101 is unsigned, then it represents the decimal value 13, but as a signed two’s complement number, it represents −3. Signed numbers are used to represent data that can be either positive or negative.

A computer programmer must be able to manage both signed and unsigned numbers. To do so, the programmer must first identify numeric values as either signed or unsigned numbers. This is done by declaring the value as a specific type. For instance, the C programming language has int and unsigned int as possible types for integer variables, defining signed and unsigned integers, respectively. In addition to different type declarations, many languages have different arithmetic operations for use with signed and unsigned numbers. A language may have one subtraction instruction for signed numbers and a different subtraction instruction for unsigned numbers. In most assembly languages, programmers can choose from a signed comparison operator or an unsigned comparison operator.

It is interesting to compare what happens with unsigned and signed numbers when we try to store values that are too large for the specified number of bits. Unsigned numbers simply wrap around and start over at zero. For example, if we are using 4-bit unsigned binary numbers, and we add 1 to 1111, we get 0000. This “return to zero” wraparound is familiar—perhaps you have seen a high-mileage car in which the odometer has wrapped back around to zero. However, signed numbers devote half their space to positive numbers and the other half to negative numbers. If we add 1 to the largest positive 4-bit two’s complement number 0111 (+7), we get 1000 (−8). This wraparound with the unexpected change in sign has been problematic to inexperienced programmers, resulting in multiple hours of debugging time. Good programmers understand this condition and make appropriate plans to deal with the situation before it occurs.

2.4.5 Computers, Arithmetic, and Booth’s Algorithm

Computer arithmetic as introduced in this chapter may seem simple and straightforward, but it is a field of major study in computer architecture. The basic focus is on the implementation of arithmetic functions, which can be realized in software, firmware, or hardware. Researchers in this area are working toward designing superior central processing units (CPUs), developing high-performance arithmetic circuits, and contributing to the area of embedded systems application-specific circuits. They are working on algorithms and new hardware implementations for fast addition, subtraction, multiplication, and division, as well as fast floating-point operations. Researchers are looking for schemes that use nontraditional approaches, such as the fast carry look-ahead principle, residue arithmetic, and Booth’s algorithm. Booth’s algorithm is a good example of one such scheme and is introduced here in the context of signed two’s complement numbers to give you an idea of how a simple arithmetic operation can be enhanced by a clever algorithm.

Although Booth’s algorithm usually yields a performance increase when multiplying two’s complement numbers, there is another motivation for introducing this algorithm. In Section 2.4.2, we covered examples of two’s complement addition and saw that the numbers could be treated as unsigned values. We simply perform “regular” addition, as the following example illustrates:

  1101   (−3)
+ 0010   (+2)
  1111   (−1)

Treated as unsigned values, this is 13 + 2 = 15; the bit pattern 1111 is correct under either interpretation.

The same is true for two’s complement subtraction. However, now consider the standard pencil-and-paper method for multiplying the following two’s complement numbers:

  1101   (−3)
× 0101   (+5)

Treating the operands as unsigned values gives 13 × 5 = 65 = 01000001₂, not the 11110001₂ (−15) we want.

“Regular” multiplication clearly yields the incorrect result. There are a number of solutions to this problem, such as converting both values to positive numbers, performing conventional multiplication, and then remembering if one or both values were negative to determine whether the result should be positive or negative. Booth’s algorithm not only solves this dilemma, but also speeds up multiplication in the process.

The general idea of Booth’s algorithm is to increase the speed of a multiplication when there are consecutive zeros or ones in the multiplier. It is easy to see that consecutive zeros help performance. For example, if we use the tried and true pencil-and-paper method and find 978 × 1001, the multiplication is much easier than if we take 978 × 999. This is because of the two zeros found in 1001. However, if we rewrite the two problems as follows:

978 × 1001 = 978 × (1000 + 1) = 978 × 1000 + 978
978 × 999 = 978 × (1000 − 1) = 978 × 1000 − 978

we see that the problems are in fact equal in difficulty. Our goal is to use a string of ones in a binary number to our advantage in much the same way that we use a string of zeros to our advantage. We can use the rewriting idea from above. For example, the binary number 0110 can be rewritten 1000 − 0010 = −0010 + 1000. The two ones have been replaced by a “subtract” (determined by the rightmost 1 in the string) followed by an “add” (determined by moving one position left of the leftmost 1 in the string).

Consider the following standard multiplication example:

  0011   (3)
× 0110   (6)

Adding the shifted partial products 00110₂ and 001100₂ gives 0010010₂ = 18₁₀.

The idea of Booth’s algorithm is to replace the string of ones in the multiplier with an initial subtract when we see the rightmost 1 of the string (or subtract 0011) and then later add for the bit after the last 1 (or add 001100). In the middle of the string, we can now use simple shifting; in effect we compute 0011 × 0110 = 0011 × (1000₂ − 0010₂) = 0011000₂ − 00110₂ = 0010010₂ = 18₁₀, the same result as before.

In Booth’s algorithm, if the multiplicand and multiplier are n-bit two’s complement numbers, the result is a 2n-bit two’s complement value. Therefore, when we perform our intermediate steps, we must extend our n-bit numbers to 2n-bit numbers. If a number is negative and we extend it, we must extend the sign. For example, the value 1000 (–8) extended to 8 bits would be 11111000. We continue to work with bits in the multiplier, shifting each time we complete a step. However, we are interested in pairs of bits in the multiplier and proceed according to the following rules:

1. If the current multiplier bit is 1 and the preceding bit was 0, we are at the beginning of a string of ones, so subtract the multiplicand from the product (or add the opposite).

2. If the current multiplier bit is 0 and the preceding bit was 1, we are at the end of a string of ones, so add the multiplicand to the product.

3. If it is a 00 pair, or a 11 pair, do no arithmetic operation (we are in the middle of a string of zeros or a string of ones). Simply shift. The power of the algorithm is in this step: We can now treat a string of ones as a string of zeros and do nothing more than shift.

Note: The first time we pick a pair of bits in the multiplier, we should assume a mythical 0 as the “previous” bit. Then we simply move left one bit for our next pair.
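These rules translate into a compact loop. The following Python sketch (names ours) scans multiplier pairs right to left, subtracts or adds the appropriately shifted multiplicand, and masks the result to 2n bits; it is one of many equivalent formulations, not the register-level version used in hardware. The test multiplies −3 by 5, matching Example 2.26 below:

    def booth_multiply(multiplicand, multiplier, n=4):
        product = 0
        prev = 0                          # the mythical 0 before the first pair
        for i in range(n):
            bit = (multiplier >> i) & 1
            if bit == 1 and prev == 0:    # beginning of a string of 1s: subtract
                product -= multiplicand << i
            elif bit == 0 and prev == 1:  # end of a string of 1s: add
                product += multiplicand << i
            # 00 or 11 pair: no arithmetic, just move on (shift)
            prev = bit
        return product & ((1 << 2 * n) - 1)   # keep a 2n-bit result

    result = booth_multiply(-3, 0b0101, n=4)
    print(format(result, "08b"))  # prints 11110001, i.e., -15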

Example 2.26 illustrates the use of Booth’s algorithm to multiply –3 × 5 using signed 4-bit two’s complement numbers.

EXAMPLE 2.26 Negative 3 in 4-bit two’s complement is 1101. Extended to 8 bits, it is 11111101. Its complement is 00000011. When we see the rightmost 1 in the multiplier, it is the beginning of a string of 1s, so we treat it as if it were the string 10:

Ignore extended sign bits that go beyond the 2n-bit product.

EXAMPLE 2.27 Let’s look at the larger example of 53 × 126:

Note that we have not shown the extended sign bits that go beyond what we need and use only the 16 rightmost bits. The entire string of ones in the multiplier was replaced by a subtract (adding 11001011) followed by an add. Everything in the middle is simply shifting—something that is very easy for a computer to do (as we will see in Chapter 3). If the time required for a computer to do an add is sufficiently larger than that required to do a shift, Booth’s algorithm can provide a considerable increase in performance. This depends somewhat, of course, on the multiplier. If the multiplier has strings of zeros and/or ones, the algorithm works well. If the multiplier consists of an alternating string of zeros and ones (the worst case), using Booth’s algorithm might very well require more operations than the standard approach.

Computers perform Booth’s algorithm by adding and shifting values stored in registers. A special type of shift called an arithmetic shift is necessary to preserve the sign bit. Many books present Booth’s algorithm in terms of arithmetic shifts and add operations on registers only, and may appear quite different from the preceding method. We have presented Booth’s algorithm so that it more closely resembles the pencil-and-paper method with which we are all familiar, although it is equivalent to the computer algorithms presented elsewhere.

There have been many algorithms developed for fast multiplication, but many do not hold for signed multiplication. Booth’s algorithm not only allows multiplication to be performed faster in most cases, but it also has the added bonus that it works correctly on signed numbers.

2.4.6 Carry Versus Overflow

The wraparound referred to in the preceding section is really overflow. CPUs often have flags to indicate both carry and overflow. However, the overflow flag is used only with signed numbers and means nothing in the context of unsigned numbers, which use the carry flag instead. If carry (which means carry out of the leftmost bit) occurs in unsigned numbers, we know we have overflow (the new value is too large to be stored in the given number of bits), but the overflow bit is not set. Carry out can occur in signed numbers as well; however, its occurrence in signed numbers is neither sufficient nor necessary for overflow. We have already seen that overflow in signed numbers can be determined if the carry in to the leftmost bit and the carry out of the leftmost bit differ. However, carry out of the leftmost bit in unsigned operations always indicates overflow.

TABLE 2.2 Examples of Carry and Overflow in Signed Numbers

To illustrate these concepts, consider 4-bit unsigned and signed numbers. If we add the two unsigned values 0111 (7) and 0001 (1), we get 1000 (8). There is no carry (out), and thus no error. However, if we add the two unsigned values 0111 (7) and 1011 (11), we get 0010 with a carry, indicating that there is an error (indeed, 7 + 11 is not 2). This wraparound would cause the carry flag in the CPU to be set. Essentially, carry out in the context of unsigned numbers means an overflow has occurred, even though the overflow flag is not set.

We said carry (out) is neither sufficient nor necessary for overflow in signed numbers. Consider adding the two’s complement integers 0101 (+5) and 0011 (+3). The result is 1000 (–8), which is clearly incorrect. The problem is that we have a carry in to the sign bit, but no carry out, which indicates that we have an overflow (therefore, carry is not necessary for overflow). However, if we now add 0111 (+7) and 1011 (–5), we get the correct result: 0010 (+2). We have both a carry in to and a carry out of the leftmost bit, so there is no error (so carry is not sufficient for overflow). The carry flag would be set, but the overflow flag would not be set. Thus carry out does not necessarily indicate an error in signed numbers, nor does the lack of carry out indicate that the answer is correct.

To summarize, the rule of thumb used to determine when carry indicates an error depends on whether we are using signed or unsigned numbers. For unsigned numbers, a carry (out of the leftmost bit) indicates the total number of bits was not large enough to hold the resulting value, and overflow has occurred. For signed numbers, if the carry in to the sign bit and the carry (out of the sign bit) differ, then overflow has occurred. The overflow flag is set only when overflow occurs with signed numbers.

Carry and overflow clearly occur independently of each other. Examples using signed two’s complement representation are given in Table 2.2. Carry in to the sign bit is not indicated in the table.

2.4.7 Binary Multiplication and Division Using Shifting

Shifting a binary number simply means moving the bits left or right by a certain amount. For example, the binary value 00001111 shifted left one place results in 00011110 (if we fill with a zero on the right). The first number is equivalent to decimal value 15; the second is decimal 30, which is exactly double the first value. This is no coincidence!

When working with signed two’s complement numbers, we can use a special type of shift, called an arithmetic shift, to perform quick and easy multiplication and division by 2. Recall that the leftmost bit in a two’s complement number determines its sign, so we must be careful when shifting these values that we don’t change the sign bit, as multiplying or dividing by 2 should not change the sign of the number.

We can perform a left arithmetic shift (which multiplies a number by 2) or a right arithmetic shift (which divides a number by 2). Assuming that bits are numbered right to left beginning with zero, we have the following definitions for left and right arithmetic shifts.

A left arithmetic shift inserts a 0 for bit b₀ and shifts all other bits left one position, resulting in bit bₙ₋₁ being replaced by bit bₙ₋₂. Because bit bₙ₋₁ is the sign bit, if the value in this bit changes, the operation has caused overflow. Multiplication by 2 always results in a binary number with the rightmost bit equal to 0, which is an even number, and thus explains why we pad with a zero on the right. Consider the following examples:

EXAMPLE 2.28 Multiply the value 11 (expressed using 8-bit signed two’s complement representation) by 2. We start with the binary value for 11:

0 0 0 0 1 0 1 1

and we shift left one place, resulting in:

0 0 0 1 0 1 1 0

which is decimal 22 = 11 × 2. No overflow has occurred, so the value is correct.

EXAMPLE 2.29 Multiply the value 12 (expressed using 8-bit signed two’s complement representation) by 4. We start with the binary value for 12:

0 0 0 0 1 1 0 0

and we shift left two places (each shift multiplies by 2, so two shifts is equivalent to multiplying by 4), resulting in:

0 0 1 1 0 0 0 0

which is decimal 48 = 12 × 4. No overflow has occurred, so the value is correct.

EXAMPLE 2.30 Multiply the value 66 (expressed using 8-bit signed two’s complement representation) by 2. We start with the binary value for 66:

0 1 0 0 0 0 1 0

and we shift left one place, resulting in:

1 0 0 0 0 1 0 0

but the sign bit has changed, so overflow has occurred (66 × 2 = 132, which is too large to be expressed using 8 bits in signed two’s complement notation).

A right arithmetic shift moves all bits to the right, but carries (copies) the sign bit from bit bₙ₋₁ to bₙ₋₂. Because the sign bit is preserved in this way, overflow is not a problem. However, division by 2 may have a remainder of 1; division using this method is strictly integer division, so the remainder is not stored in any way. Consider the following examples:

EXAMPLE 2.31 Divide the value 12 (expressed using 8-bit signed two’s complement representation) by 2. We start with the binary value for 12:

0 0 0 0 1 1 0 0

and we shift right one place, copying the sign bit of 0, resulting in:

0 0 0 0 0 1 1 0

which is decimal 6 = 12 ÷ 2.

EXAMPLE 2.32 Divide the value 12 (expressed using 8-bit signed two’s complement representation) by 4. We start with the binary value for 12:

0 0 0 0 1 1 0 0

and we shift right two places, resulting in:

0 0 0 0 0 0 1 1

which is decimal 3 = 12 ÷ 4.

EXAMPLE 2.33 Divide the value –14 (expressed using 8-bit signed two’s complement representation) by 2. We start with the two’s complement representation for –14:

1 1 1 1 0 0 1 0

and we shift right one place (carrying across the sign bit), resulting in:

1 1 1 1 1 0 0 1

which is decimal –7 = –14 ÷ 2.

Note that if we had divided −15 by 2 (in Example 2.33), the result would be 11110001 shifted one to the right to yield 11111000, which is −8. Because we are doing integer division, −15 divided by 2 is indeed equal to −8.
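Both arithmetic shifts are easy to express on bit-pattern strings. A minimal Python sketch (names ours) that reproduces Examples 2.28, 2.30, and 2.33:

    def arithmetic_shift_left(pattern):
        shifted = pattern[1:] + "0"          # fill with 0 on the right
        overflow = shifted[0] != pattern[0]  # a sign change means overflow
        return shifted, overflow

    def arithmetic_shift_right(pattern):
        return pattern[0] + pattern[:-1]     # copy the sign bit across

    print(arithmetic_shift_left("00001011"))   # ('00010110', False): 11 * 2 = 22
    print(arithmetic_shift_left("01000010"))   # ('10000100', True): 66 * 2 overflows
    print(arithmetic_shift_right("11110010"))  # '11111001': -14 / 2 = -7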

2.5 FLOATING-POINT REPRESENTATION

If we wanted to build a real computer, we could use any of the integer representations that we just studied. We would pick one of them and proceed with our design tasks. Our next step would be to decide the word size of our system. If we want our system to be really inexpensive, we would pick a small word size, say, 16 bits. Allowing for the sign bit, the largest integer this system could store is 32,767. So now what do we do to accommodate a potential customer who wants to keep a tally of the number of spectators paying admission to professional sports events in a given year? Certainly, the number is larger than 32,767. No problem. Let’s just make the word size larger. Thirty-two bits ought to do it. Our word is now big enough for just about anything that anyone wants to count. But what if this customer also needs to know the amount of money each spectator spends per minute of playing time? This number is likely to be a decimal fraction. Now we’re really stuck.

The easiest and cheapest approach to this problem is to keep our 16-bit system and say, “Hey, we’re building a cheap system here. If you want to do fancy things with it, get yourself a good programmer.” Although this position sounds outrageously flippant in the context of today’s technology, it was a reality in the earliest days of each generation of computers. There simply was no such thing as a floating-point unit in many of the first mainframes or microcomputers. For many years, clever programming enabled these integer systems to act as if they were, in fact, floating-point systems.

If you are familiar with scientific notation, you may already be thinking of how you could handle floating-point operations—how you could provide floating-point emulation—in an integer system. In scientific notation, numbers are expressed in two parts: a fractional part and an exponential part that indicates the power of ten to which the fractional part should be raised to obtain the value we need. So to express 32,767 in scientific notation, we could write 3.2767 × 10⁴. Scientific notation simplifies pencil-and-paper calculations that involve very large or very small numbers. It is also the basis for floating-point computation in today’s digital computers.

2.5.1 A Simple Model

In digital computers, floating-point numbers consist of three parts: a sign bit, an exponent part (representing the exponent on a power of 2), and a fractional part (which has sparked considerable debate regarding appropriate terminology). The term mantissa is widely accepted when referring to this fractional part. However, many people take exception to this term because it also denotes the fractional part of a logarithm, which is not the same as the fractional part of a floating-point number. The Institute of Electrical and Electronics Engineers (IEEE) introduced the term significand to refer to the fractional part of a floating-point number combined with the implied binary point and implied 1 (which we discuss at the end of this section). Regrettably, the two terms mantissa and significand have become interchangeable when referring to the fractional part of a floating-point number, even though they are not technically equivalent. Throughout this text, we refer to the fractional part as the significand, regardless of whether it includes the implied 1 as intended by IEEE.

The number of bits used for the exponent and significand depends on whether we would like to optimize for range (more bits in the exponent) or precision (more bits in the significand). (We discuss range and precision in more detail in Section 2.5.5.) For the remainder of this section, we will use a 14-bit model with a 5-bit exponent, an 8-bit significand, and a sign bit (see Figure 2.1). More general forms are described in Section 2.5.4.

FIGURE 2.1 Simple Model Floating-Point Representation

Let’s say that we wish to store the decimal number 17 in our model. We know that 17 = 17.0 × 10⁰ = 1.7 × 10¹ = 0.17 × 10². Analogously, in binary, 17₁₀ = 10001₂ × 2⁰ = 1000.1₂ × 2¹ = 100.01₂ × 2² = 10.001₂ × 2³ = 1.0001₂ × 2⁴ = 0.10001₂ × 2⁵. If we use this last form, our fractional part will be 10001000 and our exponent will be 00101, as shown here:

Using this form, we can store numbers of much greater magnitude than we could using a fixed-point representation of 14 bits (which uses a total of 14 binary digits plus a binary, or radix, point). If we want to represent 65536 = 0.1₂ × 2¹⁷ in this model, we have:

One obvious problem with this model is that we haven’t provided for negative exponents. If we wanted to store 0.25, we would have no way of doing so because 0.25 is 2⁻² and the exponent −2 cannot be represented. We could fix the problem by adding a sign bit to the exponent, but it turns out that it is more efficient to use a biased exponent, because we can use simpler integer circuits designed specifically for unsigned numbers when comparing the values of two floating-point numbers.

Recall from Section 2.4.3 that the idea behind using a bias value is to convert every integer in the range into a nonnegative integer, which is then stored as a binary numeral. The integers in the desired range of exponents are first adjusted by adding this fixed bias value to each exponent. The bias value is a number near the middle of the range of possible values that we select to represent zero. In this case, we would select 15 because it is midway between 0 and 31 (our exponent has 5 bits, thus allowing for 2⁵ or 32 values). Any number larger than 15 in the exponent field represents a positive value. Values less than 15 indicate negative values. This is called an excess-15 representation because we have to subtract 15 to get the true value of the exponent. Note that exponents of all zeros or all ones are typically reserved for special numbers (such as zero or infinity). In our simple model, we allow exponents of all zeros and all ones.
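To make the excess-15 scheme concrete, here is a minimal C sketch (bias_encode is a hypothetical helper of our own, not part of any library) showing that once exponents are biased, a plain unsigned comparison orders them correctly:

#include <stdio.h>

/* Excess-15 encoding for the 5-bit exponent field of the simple model:
   true exponents -15..16 map to the unsigned values 0..31. */
static unsigned bias_encode(int true_exp) {
    return (unsigned)(true_exp + 15);
}

int main(void) {
    /* True exponents -4 and 5 are stored as 11 and 20; comparing the
       stored fields as unsigned integers now agrees with comparing
       the true (signed) exponents. */
    printf("stored(-4) = %u, stored(5) = %u\n", bias_encode(-4), bias_encode(5));
    printf("stored(-4) < stored(5): %d\n", bias_encode(-4) < bias_encode(5));
    return 0;
}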

Returning to our example of storing 17, we calculated 17₁₀ = 0.10001₂ × 2⁵. The biased exponent is now 15 + 5 = 20:

If we wanted to store 0.25 = 0.1₂ × 2⁻¹, we would have:

There is still one rather large problem with this system: We do not have a unique representation for each number. All of the following are equivalent:

Because synonymous forms such as these are not well suited for digital computers, floating-point numbers must be normalized: the leftmost bit of the significand must always be 1. This convention has the additional advantage that, if the 1 is implied, we effectively gain an extra bit of precision in the significand. Normalization works for every value except zero, which contains no nonzero bits. For that reason, any model used to represent floating-point numbers must treat zero as a special case. We will see in the next section that the IEEE-754 floating-point standard makes an exception to the rule of normalization.

EXAMPLE 2.34 Express 0.03125₁₀ in normalized floating-point form using the simple model with excess-15 bias.

0.03125₁₀ = 0.00001₂ × 2⁰ = 0.0001₂ × 2⁻¹ = 0.001₂ × 2⁻² = 0.01₂ × 2⁻³ = 0.1₂ × 2⁻⁴. Applying the bias, the exponent field is 15 − 4 = 11.

Note that in our simple model we have not expressed the number using the normalization notation that implies the 1, which is introduced in Section 2.5.4.
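A short C sketch can reproduce these encodings. The function simple_encode below is a hypothetical helper of our own (it performs no rounding and does not handle zero or out-of-range exponents); it normalizes a positive value so that the significand begins with 1, applies the excess-15 bias, and packs the three fields:

#include <stdio.h>

/* Pack a value into the 14-bit simple model: 1 sign bit, a 5-bit
   excess-15 exponent, and an 8-bit significand with the leading 1
   stored explicitly (no hidden bit). A sketch only: no rounding,
   and x must be nonzero. */
static unsigned simple_encode(double x) {
    unsigned sign = 0;
    int exp = 0;
    if (x < 0) { sign = 1; x = -x; }
    while (x >= 1.0) { x /= 2.0; exp++; }   /* move radix point left  */
    while (x < 0.5)  { x *= 2.0; exp--; }   /* move radix point right */
    /* x is now in [0.5, 1), so the significand begins with a 1 */
    unsigned sig = (unsigned)(x * 256.0);   /* keep 8 bits, truncating */
    return (sign << 13) | ((unsigned)(exp + 15) << 8) | sig;
}

int main(void) {
    printf("%04x\n", simple_encode(17.0));     /* 1488 = 0 10100 10001000 */
    printf("%04x\n", simple_encode(0.03125));  /* 0b80 = 0 01011 10000000 */
    return 0;
}

The two printed patterns match the encodings of 17 and of Example 2.34 derived above.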

2.5.2 Floating-Point Arithmetic

If we wanted to add two decimal numbers that are expressed in scientific notation, such as 1.5 × 10² + 3.5 × 10³, we would change one of the numbers so that both of them are expressed in the same power of the base. In our example, 1.5 × 10² + 3.5 × 10³ = 0.15 × 10³ + 3.5 × 10³ = 3.65 × 10³. Floating-point addition and subtraction work the same way, as illustrated below.

EXAMPLE 2.35 Add the following binary numbers as represented in a normalized 14-bit format, using the simple model with a bias of 15.

We see that the addend has an exponent of two and the augend an exponent of zero. Aligning these two operands on the binary point gives us:

Renormalizing, we retain the larger exponent and truncate the low-order bit. Thus, we have:
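Because the bit patterns for Example 2.35 appear only in a figure, the following C sketch illustrates the same procedure with two operands of our own choosing. It assumes positive operands already unpacked into an 8-bit significand (an integer n meaning n/256) and a true exponent; it aligns on the larger exponent, adds, and renormalizes, truncating low-order bits just as described:

#include <stdio.h>

/* Addition in the simple model, assuming positive operands already
   unpacked into an 8-bit significand and a true (unbiased) exponent. */
static void fp_add(unsigned sig_a, int exp_a,
                   unsigned sig_b, int exp_b,
                   unsigned *sig, int *exp) {
    /* Align on the larger exponent by shifting the other significand
       right; low-order bits fall off, just as in the text. */
    while (exp_a < exp_b) { sig_a >>= 1; exp_a++; }
    while (exp_b < exp_a) { sig_b >>= 1; exp_b++; }
    unsigned sum = sig_a + sig_b;
    if (sum > 0xFF) { sum >>= 1; exp_a++; }  /* renormalize, truncating */
    *sig = sum;
    *exp = exp_a;
}

int main(void) {
    unsigned sig; int exp;
    /* 0.11001000 x 2^2  +  0.10011010 x 2^0 */
    fp_add(0xC8, 2, 0x9A, 0, &sig, &exp);
    printf("0.%02X x 2^%d\n", sig, exp);   /* prints 0.EE x 2^2, ~3.72 */
    return 0;
}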

However, because our simple model requires a normalized significand, we have no way to represent zero. This is easily remedied by allowing the string of all zeros (a zero sign, a zero exponent, and a zero significand) to represent the value zero. In the next section, we will see that IEEE-754 also reserves special meaning for certain bit patterns.

Multiplication and division are carried out using the same rules of exponents applied to decimal arithmetic, such as 2⁻³ × 2⁴ = 2¹.

EXAMPLE 2.36 Assuming a bias of 15, multiply:

Multiplication of 0.11001000 by 0.10011010 yields a product of 0.0111100001010000, and then multiplying by 2³ × 2¹ = 2⁴ yields 111.10000101. Renormalizing and supplying the appropriate exponent, the floating-point product is:
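A companion sketch for Example 2.36, under the same assumptions (positive, normalized, already-unpacked operands), multiplies the significands as integers and adds the true exponents:

#include <stdio.h>

/* Multiplication in the simple model: multiply the 8-bit significands
   (giving a 16-bit product) and add the true exponents, then
   renormalize so the leading bit of the significand is 1. */
static void fp_mul(unsigned sig_a, int exp_a,
                   unsigned sig_b, int exp_b,
                   unsigned *sig, int *exp) {
    unsigned prod = sig_a * sig_b;                 /* represents prod / 65536 */
    int e = exp_a + exp_b;
    while (!(prod & 0x8000)) { prod <<= 1; e--; }  /* normalize */
    *sig = prod >> 8;                              /* keep top 8 bits, truncating */
    *exp = e;
}

int main(void) {
    unsigned sig; int exp;
    /* 0.11001000 x 2^3  *  0.10011010 x 2^1, as in Example 2.36 */
    fp_mul(0xC8, 3, 0x9A, 1, &sig, &exp);
    printf("0.%02X x 2^%d\n", sig, exp);           /* prints 0.F0 x 2^3 */
    return 0;
}

The result, 0.11110000₂ × 2³ = 7.5, is the truncated form of 111.10000101₂ = 7.51953125.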

2.5.3 Floating-Point Errors

When we use pencil and paper to solve a trigonometry problem or compute the interest on an investment, we intuitively understand that we are working in the system of real numbers. We know that this system is infinite, because given any pair of real numbers, we can always find another real number between them.

Unlike the mathematics in our imaginations, computers are finite systems, with finite storage. When we call upon our computers to carry out floating-point calculations, we are modeling the infinite system of real numbers in a finite system of integers. What we have, in truth, is an approximation of the real number system. The more bits we use, the better the approximation. However, there is always some element of error, no matter how many bits we use.

Floating-point errors can be blatant, subtle, or unnoticed. The blatant errors, such as numeric overflow or underflow, are the ones that cause programs to crash. Subtle errors can lead to wildly erroneous results that are often hard to detect before they cause real problems. For example, in our simple model, we can express normalized numbers in the range of −0.11111111₂ × 2¹⁶ through +0.11111111₂ × 2¹⁶. Obviously, we cannot store 2⁻¹⁹ or 2¹²⁸; they simply don’t fit. It is not quite so obvious that we cannot accurately store 128.5, which is well within our range. Converting 128.5 to binary, we have 10000000.1, which is 9 bits wide. Our significand can hold only eight. Typically, the low-order bit is dropped or rounded into the next bit. No matter how we handle it, however, we have introduced an error into our system.

We can compute the relative error in our representation by taking the ratio of the absolute value of the error to the true value of the number. Using our example of 128.5, truncation stores 128 and drops the 0.5, so we find: 0.5 ÷ 128.5 ≈ 0.39%.

If we are not careful, such errors can propagate through a lengthy calculation, causing substantial loss of precision. Table 2.3 illustrates the error propagation as we iteratively multiply 16.24 by 0.91 using our 14-bit simple model. Upon converting these numbers to 8-bit binary, we see that we have a substantial error from the outset.

As you can see, in six iterations, we have more than tripled the error in the product. Continued iterations will produce an error of 100% because the product eventually goes to zero. Although this 14-bit model is so small that it exaggerates the error, all floating-point systems behave the same way. There is always some degree of error involved when representing real numbers in a finite system, no matter how large we make that system. Even the smallest error can have catastrophic results, particularly when computers are used to control physical events such as in military and medical applications. The challenge to computer scientists is to find efficient algorithms for controlling such errors within the bounds of performance and economics.

TABLE 2.3 Error Propagation in a 14-Bit Floating-Point Number

2.5.4 The IEEE-754 Floating-Point Standard

The floating-point model we have been using in this section is designed for simplicity and conceptual understanding. We could extend this model to include whatever number of bits we wanted. Until the 1980s, these kinds of decisions were purely arbitrary, resulting in numerous incompatible representations across various manufacturers’ systems. In 1985, the IEEE published a floating-point standard for both single- and double-precision floating-point numbers. This standard is officially known as IEEE-754 (1985). The IEEE-754 standard not only defines binary floating-point representations, but also specifies basic operations, exception conditions, conversions, and arithmetic. Another standard, IEEE 854-1987, provides similar specifications for decimal arithmetic. In 2008, IEEE revised the 754 standard, and it became known as IEEE 754-2008. It carried over the single- and double-precision formats from 754 and added support for decimal arithmetic and formats, superseding both 754 and 854. We discuss only the single- and double-precision representations for floating-point numbers.

The IEEE-754 single-precision standard uses an 8-bit exponent with an excess-127 bias. The significand is 23 bits and assumes an implied 1 to the left of the radix point. This implied 1 is referred to as the hidden bit or hidden 1 and allows an actual significand of 23 + 1 = 24 bits. With the sign bit included, the total word size is 32 bits, as shown in Figure 2.2.

FIGURE 2.2 IEEE-754 Single-Precision Floating-Point Representation

We mentioned earlier that IEEE-754 makes an exception to the rule of normalization. Because this standard assumes an implied 1 to the left of the radix point, the leading stored bit in the significand can indeed be zero. For example, the number 5.5 = 101.1₂ = 0.1011₂ × 2³. IEEE-754 assumes an implied 1 to the left of the radix point and thus represents 5.5 as 1.011₂ × 2². Because the 1 is implied, the stored significand is 011 (followed by zeros) and does not begin with a 1.

Table 2.4 shows the single-precision representation of several floating-point numbers, including some special ones. One should note that zero is not directly representable in the given format, because of a required hidden bit in the significand. Therefore, zero is a special value denoted using an exponent of all zeros and a significand of all zeros. IEEE-754 does allow for both –0 and +0, although they are equal values. For this reason, programmers should use caution when comparing a floating-point value to zero.

Floating-Point Number Single-Precision Representation

1.0 0 01111111 00000000000000000000000

0.5 0 01111110 00000000000000000000000

19.5 0 10000011 00111000000000000000000

–3.75 1 10000000 11100000000000000000000

Zero 0 00000000 00000000000000000000000

± Infinity 0/1 11111111 00000000000000000000000

NaN 0/1 11111111 any nonzero significand

Denormalized Number 0/1 00000000 any nonzero significand

TABLE 2.4 Some Example IEEE-754 Single-Precision Floating-Point Numbers

When the exponent is 255, the quantity represented is ± infinity (which has a zero significand) or “not a number” (which has a nonzero significand). “Not a number,” or NaN, is used to represent a value that is not a real number (such as the square root of a negative number) or as an error indicator (such as in a “division by zero” error).
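The rows of Table 2.4 are easy to verify on any machine whose float type is IEEE-754 single precision (an assumption of the sketch below):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print the sign, biased exponent, and fraction fields of a float,
   assuming the host's float is an IEEE-754 single. */
static void show(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);        /* portable bit reinterpretation */
    printf("%10g -> sign=%u exponent=%3u significand=0x%06X\n",
           (double)f, u >> 31, (u >> 23) & 0xFFu, u & 0x7FFFFFu);
}

int main(void) {
    show(1.0f);    /* exponent 127 (01111111), significand 0 */
    show(0.5f);    /* exponent 126 (01111110), significand 0 */
    show(19.5f);   /* exponent 131 (10000011), significand 0x1C0000 */
    show(-3.75f);  /* sign 1, exponent 128, significand 0x700000 */
    show(0.0f);    /* the all-zeros special case */
    return 0;
}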

Under the IEEE-754 standard, most numeric values are normalized and have an implicit leading 1 in their significands (assumed to be to the left of the radix point). Another important convention applies when the exponent is all zeros but the significand is nonzero: this represents a denormalized number, in which no hidden bit is assumed.

FIGURE 2.3 Range of IEEE-754 Double-Precision Numbers

The largest magnitude value we can represent (ignoring the sign for the time being) with the single-precision floating-point format is 2¹²⁷ × 1.11111111111111111111111₂ (twenty-three 1s; let’s call this value MAX). We can’t use an exponent of all ones because that is reserved for special values such as NaN and infinity. The smallest magnitude number we can represent is 2⁻¹²⁷ × 0.00000000000000000000001₂ (let’s call this value MIN). Here we can use an exponent of all zeros (which means the number is denormalized) because the significand is nonzero (and represents 2⁻²³). Because of these special values and the limited number of bits, there are four numerical ranges that single-precision floating-point numbers cannot represent: negative numbers less than –MAX (negative overflow); negative numbers greater than –MIN (negative underflow); positive numbers less than +MIN (positive underflow); and positive numbers greater than +MAX (positive overflow).
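These limits can be observed directly from C’s float.h and math.h facilities. (One refinement: in the actual standard, single-precision denormalized numbers use a fixed exponent of −126 rather than −127, so the smallest positive single-precision value is 2⁻¹⁴⁹ ≈ 1.4 × 10⁻⁴⁵.) A minimal sketch:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    float max = FLT_MAX;                    /* the text's MAX, ~3.4e38 */
    float min = nextafterf(0.0f, 1.0f);     /* smallest positive denormal */
    printf("MAX = %g, MIN = %g\n", max, min);
    printf("positive overflow:  %g\n", max * 2.0f);  /* prints inf */
    printf("positive underflow: %g\n", min / 2.0f);  /* rounds to 0 */
    return 0;
}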

Double-precision numbers use a signed 64-bit word consisting of an 11-bit exponent and a 52-bit significand. The bias is 1023. The range of numbers that can be represented in the IEEE double-precision model is shown in Figure 2.3. NaN is indicated when the exponent is 2047. Representations for zero and infinity correspond to the single-precision model.

At a slight cost in performance, most FPUs use only the 64-bit model so that only one set of specialized circuits needs to be designed and implemented.

Virtually every recently designed computer system has adopted the IEEE-754 floating-point model. Unfortunately, by the time this standard came along, many mainframe computer systems had established their own floating-point systems. Changing to the newer system has taken decades for well-established architectures such as IBM mainframes, which now support both their traditional floating-point system and IEEE-754. Before 1998, however, IBM systems had been using the same architecture for floating-point arithmetic that the original System/360 used in 1964. One would expect that both systems will continue to be supported, owing to the substantial amount of older software that is running on these systems.

2.5.5 Range, Precision, and Accuracy

When discussing floating-point numbers it is important to understand the terms range, precision, and accuracy. Range is straightforward: it is the interval from the smallest value in a given format to the largest value in that same format. For example, the range of 16-bit two’s complement integers is –32,768 to +32,767. The range of IEEE-754 double-precision floating-point numbers is given in Figure 2.3. Even with this large range, we know there are infinitely many numbers that do not exist within it. The reason floating-point numbers work at all is that there will always be a representable number in this range that is close to the number you want.

People have no problem understanding range, but accuracy and precision are often confused with each other. Accuracy refers to how close a number is to its true value; for example, we can’t represent 0.1 in floating point, but we can find a number in the range that is relatively close, or reasonably accurate, to 0.1. Precision, on the other hand, deals with how much information we have about a value and the amount of information used to represent the value. 1.666 is a number with four decimal digits of precision; 1.6660 is the same exact number with five decimal digits of precision. The second number is not more accurate than the first.

Accuracy must be put into context—to know how accurate a value is, one must know how close it is to its intended target or “true value.” We can’t look at two numbers and immediately declare that the first is more accurate than the second simply because the first has more digits of precision.

Although they are separate, accuracy and precision are related. Higher precision often allows a value to be more accurate, but that is not always the case. For example, we can represent the value 1 as an integer, a single-precision floating point, or a double-precision floating point, but each is equally (exactly) accurate. As another example, consider 3.13333 as an estimate for pi. It has six digits of precision, yet is accurate to only two digits. Adding more precision will do nothing to increase the accuracy.

On the other hand, when multiplying 0.4 × 0.3, our accuracy depends on our precision. If we allow only one decimal place for precision, our result is 0.1 (which is close to, but not exactly, the product). If we allow two decimal places of precision, we get 0.12, which accurately reflects the answer.

2.5.6 Additional Problems with Floating-Point Numbers

We have seen that floating-point numbers can overflow and underflow. In addition, we know that a floating-point number may not exactly represent the value we wish, as is the case with the rounding error that occurs with the binary floating-point representation for the decimal number 0.1. As we have seen, these rounding errors can propagate, resulting in substantial problems.

Although rounding is undesirable, it is understandable. In addition to this rounding problem, however, floating-point arithmetic differs from real number arithmetic in two relatively disturbing, and not necessarily intuitive, ways. First, floating-point arithmetic is not always associative; that is, there exist floating-point numbers a, b, and c for which

(a + b) + c ≠ a + (b + c)

The same holds true for associativity under multiplication. Although in many cases the left-hand side will equal the right-hand side, there is no guarantee. Floating-point arithmetic is also not distributive:

a × (b + c) ≠ (a × b) + (a × c)

Although results can vary depending on the compiler (we used GNU C), declaring the doubles a = 0.1, b = 0.2, and c = 0.3 illustrates the above inequalities nicely, as the sketch below shows. We encourage you to find three additional floating-point numbers that demonstrate floating-point arithmetic is neither associative nor distributive.
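Here is a minimal C demonstration (the exact digits printed can vary with compiler and floating-point settings, but on a typical IEEE-754 platform the two sides of each identity disagree):

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2, c = 0.3;

    /* Associativity fails: the two groupings round differently. */
    printf("(a + b) + c = %.17g\n", (a + b) + c);
    printf("a + (b + c) = %.17g\n", a + (b + c));

    /* Distributivity fails for the same reason. */
    printf("a * (b + c) = %.17g\n", a * (b + c));
    printf("a*b + a*c   = %.17g\n", a * b + a * c);
    return 0;
}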

What does this all mean to you as a programmer? Programmers should use extra care when using the equality operator on floating-point numbers, and such equality tests should be avoided in controlling looping structures such as do…while and for loops. It is good practice to declare a “nearness to x” epsilon (e.g., epsilon = 1.0 × 10⁻²⁰) and then test an absolute value. For example, instead of using:

if x = 2 then…

it is better to use:

if |x – 2| < epsilon then… (assuming we have chosen a suitably small epsilon)
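In C, the two tests look like this; EPSILON here is a hypothetical tolerance whose appropriate value depends on the scale of the data:

#include <math.h>
#include <stdio.h>

#define EPSILON 1.0e-9   /* hypothetical tolerance; scale it to the data */

int main(void) {
    double x = 0.0;
    for (int i = 0; i < 20; i++)
        x += 0.1;                  /* mathematically exactly 2.0 */
    if (x == 2.0)
        printf("equal\n");         /* unreliable: may never print */
    if (fabs(x - 2.0) < EPSILON)
        printf("close enough\n");  /* the robust test */
    return 0;
}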

Floating-Point Ops or Oops?

In this chapter, we have introduced floating-point numbers and the means by which computers represent them. We have touched upon floating-point rounding errors (studies in numerical analysis will provide further depth on this topic) and the fact that floating-point numbers don’t obey the standard associative and distributive laws. But just how serious are these issues? To answer this question, we introduce three major floating-point blunders.

In 1994, when Intel introduced the Pentium microprocessor, number crunchers around the world noticed that something weird was happening. Calculations involving double-precision divisions and certain bit patterns were producing incorrect results. Although the flawed chip was only slightly inaccurate for some pairs of numbers, other instances were more extreme. For example, if x = 4,195,835 and y = 3,145,727, finding z = x – (x/y) × y should produce a z of 0. The Intel 286, 386, and 486 chips gave exactly that result. Even taking into account the possibility of floating-point round-off error, the value of z should have been about 9.3 × 10⁻¹⁰. But on the new Pentium, z was equal to 256!
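You can try the infamous computation yourself; a minimal sketch:

#include <stdio.h>

int main(void) {
    double x = 4195835.0, y = 3145727.0;
    double z = x - (x / y) * y;
    /* Correct FPUs print 0 (or a value within round-off of 0);
       a flawed 1994 Pentium famously produced 256. */
    printf("z = %g\n", z);
    return 0;
}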

Once Intel was informed of the problem, research and testing revealed the flaw to be an omission in the chip’s design. The Pentium was using the radix-4 SRT algorithm for speedy division, which required a 1066-element lookup table. Once implemented in silicon, five of those table entries were 0 when they should have been +2.

Although the Pentium bug was a public relations debacle for Intel, it was not a catastrophe for those using the chip. In fact, it was a minor thing compared to the programming mistakes with floating-point numbers that have resulted in disasters in areas from off-shore oil drilling, to stock markets, to missile defense. The list of actual disasters that resulted from floating-point errors is very long. The following two instances are among the worst of them.

During the Persian Gulf War of 1991, the United States relied on Patriot missiles to track and intercept cruise missiles and Scud missiles. One of these missiles failed to track an incoming Scud missile, allowing the Scud to hit an American army barracks, killing 28 people and injuring many more. After an investigation, it was determined that the failure of the Patriot missile was due to using too little precision to allow the missile to accurately determine the incoming Scud velocity.

The Patriot missile uses radar to determine the location of an object. If the internal weapons control computer identifies the object as something that should be intercepted, calculations are performed to predict the air space in which the object should be located at a specific time. This prediction is based on the object’s known velocity and time of last detection.

The problem was in the clock, which measured time in tenths of seconds. The time since boot was stored as an integer count of tenths of seconds, and converting this count to seconds required multiplying it by 1/10. For predicting where an object would be at a specific time, the time and velocity needed to be real numbers. It was no problem to convert the integer to a real number; however, using 24-bit registers for its calculations, the Patriot was limited in the precision of this operation. The potential problem is easily seen when one realizes that 1/10 in binary is:

0.0001100110011001100110011001100 …

When the elapsed time was small, this “chopping error” was insignificant and caused no problems. The Patriot was designed to be on for only a few minutes at a time, so this limit of 24-bit precision would be of no consequence. The problem was that during the Gulf War, the missiles were on for days. The longer a missile was on, the larger the error became, and the more probable that the inaccuracy of the prediction calculation would cause an unsuccessful interception. And this is precisely what happened on February 25, 1991, when a failed interception, caused by this loss of precision (precision required for accuracy) in floating-point numbers, resulted in 28 deaths. It is estimated that the Patriot missile had been operational for about 100 hours, introducing a rounding error in the time conversion of about 0.34 seconds, which translates to approximately half a kilometer of travel for a Scud missile. The sketch below reproduces this arithmetic.
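The arithmetic behind these figures is easy to check. The sketch below assumes the commonly cited values from the post-incident analysis: the constant 1/10 chopped to the 23 fraction bits 0.00011001100110011001100₂, and a Scud closing speed of roughly 1,676 m/s (both are assumptions of ours, not given in the text above):

#include <stdio.h>

int main(void) {
    /* 1/10 chopped to 0.00011001100110011001100 binary = 838860 / 2^23 */
    double chopped = 838860.0 / 8388608.0;
    double err_per_tick = 0.1 - chopped;          /* ~9.5e-8 s per 0.1 s tick */
    double ticks = 100.0 * 3600.0 * 10.0;         /* 100 hours of 0.1 s ticks */
    double clock_err = err_per_tick * ticks;      /* ~0.34 s */
    printf("error per tick:          %.3g s\n", err_per_tick);
    printf("clock error after 100 h: %.3g s\n", clock_err);
    /* Assumed Scud speed of ~1676 m/s, a commonly cited figure. */
    printf("Scud travel in that time: %.0f m\n", clock_err * 1676.0);
    return 0;
}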

Designers were aware of the conversion problem well before the incident occurred. However, deploying new software under wartime conditions is anything but trivial. Although the new software would have fixed the bug, field personnel could have simply rebooted the systems at specific intervals to keep the clock value small enough so that 24-bit precision would have been sufficient.
