# The Function Changelocation is Also Called a _____

## Organization of Figurer Systems: § 5: Pipelining

#### Instructor: Thousand.S. Schmalz

This department is organized as follows:

5.one. Overview of Pipelining

five.two. Pipeline Datapath Design and Implementation

5.3. Pipeline Control and Hazards

5.4. Pipeline Operation Assay

Data contained herein was compiled from a variety of text- and Spider web-based sources, is intended as a teaching aid merely (to exist used in conjunction with the required text, and is not to be used for whatever commercial purpose. Particular thanks is given to Dr. Enrique Mafla for his permission to use selected illustrations from his grade notes in these Web pages.

### 5.1. Overview of Pipelining

Think that, in Section iv, we designed a multicycle datapath based on (a) building blocks such every bit multiplexers for selecting an operation to produce ALU output, (b) ALUs to compute addresses and arithmetic or logical operations, and (c) components such as memories or registers for long- or short-term storage of operands or results. We also showed that the multicycle datapath is, in practise, more than efficient than the single-cycle datapath.

In this section, we continue our quest for efficient computation by discovering that we can overlay single-wheel datapaths in time to produce a type of computational architecture called
pipelining. We bear witness that pipelined architectures, when they work properly and are relatively gratuitous from errors and hazards such as dependencies, stalls, or exceptions tin outperform a simple multicycle datapath. Also, we discuss problems associated with pipelining that limits its usefulness in various types of computations.

#### 5.1.1. Concept of Pipelining

Suppose you wanted to make an automobile from scratch. You might gather up the raw materials, class the metal into recognizeable shapes, cast some of the metal into an engine block, connect up fuel lines, wires, etc., to somewhen (one would hope) brand a workable motorcar. To do this, you would need many skills – all the skills of the artisans that make autos, and direction skills in addition to being an electrician and a metallurgist. This would not be an efficient fashion to make a car, merely would definitely provide many challenges.

That is the way a multicycle datapath works – it is designed to practice everything – input, output, and ciphering (recollect the fetch-decode-execute sequence). We need to ask ourselves if this is actually the best way to compute efficiently, particularly when we consider the complication of control for large (CISC) systems or even smaller RISC processors.

Fortunately, our analogy with car-making is not then far-fetched, and can actually aid us arrive at a more than efficient processor pattern. Consider the modern way of making cars – on an
assembly line. Here, there is an orderly flow of parts down a conveyor belt, and the parts are processed by different stations (also called
segments
of the assembly line). Each segment does one thing, over and over. The segments are coordinated to exploit the
sequentiality
inherent in the automobile assembly procedure. The work gets washed more smoothly (because of the orderly menses of input parts and output results), more efficiently (because each assembler at each segment of the pipeline does his or her task at what one hopes is maximum efficiency), and more reliably considering there is greater consistency in i task being done repetitively (provided the assembly line is designed correctly).

A similar analogy exists for computers. Instead of a multicycle datapath with its complex control system that walks, talks, cries, and computes – let u.s.a. suppose that nosotros could build an
assembly line for computing. Such objects actually be, and they are called
pipeline processors. They have sequentially-arranged stages or segments, each of which perform a specific task in a stock-still amount of time. Information flows through these pipelines similar cars through an associates line.

#### 5.ane.2. Definitions and Practical Observations virtually Pipelines

We next consider several terms and some applied bug associated with pipeline processors.

5.1.2.one. Definition.
A
pipeline processor
is comprised of a sequential, linear list of
segments, where each segment performs one computational task or group of tasks.

5.1.2.2. Observation.
A pipeline processor tin exist represented in ii dimensions, as shown in Effigy 5.1. Here, the pipeline segments (Seg #1 through Seg #3) are bundled vertically, and so the information tin flow from the input at the tiptop left downwards to the output of the pipeline (afterwards Segment 3). The progress of an instruction is charted in blueish typeface, and the side by side didactics is shown in red typeface.

There are three things that one must observe nigh the pipeline. First, the work (in a figurer, the ISA) is divided up into pieces that more or less fit into the segments alloted for them. 2nd, this implies that in guild for the pipeline to work efficiently and smoothly, the piece of work partitions must each take well-nigh the aforementioned fourth dimension to consummate. Otherwise, the longest division requiring time T would hold up the pipeline, and every segment would have to take time T to consummate its work. For fast segments, this would mean much idle time. Third, in order for the pipeline to work smoothly, there must be few (if whatever) exceptions or hazards that cause errors or delays within the pipeline. Otherwise, the instruction will have to be reloaded and the pipeline restarted with the same instruction that causes the exception. There are additional bug we demand to discuss near pipeline processors, which we will consider shortly.

Figure five.1.
Notional diagram of a pipeline processor. The segments are bundled vertically, and time moves forth the horizontal axis.

v.1.2.iii. Reponse Time.
It is easily verified, through inspection of Figure five.1., that the response fourth dimension for any instruction that takes three segments must be three times the response time for whatever segment,
provided that the pipeline was full when the didactics was loaded into the pipeline. As we shall meet afterward in this section, if an N-segment pipeline is empty before an instruction starts, and then N + (Northward-one) cycles or segments of the pipeline are required to execute the instruction, because it takes Due north cycles to make full the piping.

Note that we just used the term “bike” and “segment” synonomously. In the type of pipelines that we volition study in this course (which includes the vast majority of pipeline processors), each segment takes one wheel to complete its piece of work. Thus, an North-segment pipeline takes a minimum fourth dimension of N cycles to execute one teaching. This brings to listen the performance bug discussed in Department 5.one.1.5.

v.1.ii.4. Piece of work Partitioning.
In the previous section, we designed a multicycle datapath based on the assumption that computational work associated with the execution of an instruction could be partitioned into a five-pace process, equally follows:

five.ane.2.5. Performance.
Because there are N segments agile in the pipeline at any 1 time (when it is full), it is thus possible to execute Due north segments concurrently in any cycle of the pipeline. In dissimilarity, a purely sequential implementation of the fetch-decode-execute wheel would crave N cycles for the longest education. Thus, information technology can be said that we have
O(Due north) speedup. Every bit we shall see when nosotros clarify pipeline performance, an exact N-fold speedup does not always occur in exercise. However it is sufficient to say that the speedup is of society N.

### five.2. Pipeline Datapath Design and Implementation

Equally shown in Section 5.ane.2.4, the piece of work involved in an instruction tin can be partitioned into steps labelled IF (Educational activity Fetch), ID (Instruction Decode and data fetch), EX (ALU operations or R-format execution), MEM (Retentiveness operations), and WB (Write-Back to annals file). We next hash out how this sequence of steps can be implemented in terms of MIPS instructions.

#### 5.2.one. MIPS Instructions and Pipelining

In order to implement MIPS instructions effectively on a pipeline processor, nosotros must ensure that the instructions are the same length (simplicity favors regularity) for easy IF and ID, similar to the multicycle datapath. We also demand to have few simply consistent instruction formats, to avert deciphering variable formats during IF and ID, which would prohibitively increase pipeline segment complication for those tasks. Thus, the register indices should be in the same identify in each pedagogy. In exercise, this means that the
rd,
rs, and
rt
fields of the MIPS teaching must not change location beyond all MIPS pipeline instructions.

Additionally, we want to have education decoding and reading of the register contents occur at the same time, which is supported by the datapath architecture that we accept designed thus far. Discover that we have retentivity address computation in the
`lw`
and
`sw`
instructions only, and that these are the simply instructions in our five-educational activity MIPS subset that perform retention operations. As before, we assume that operands are aligned in memory, for straightforward admission.

#### v.two.2. Datapath Partition for Pipelining

Recall the single-wheel datapath, which tin can be partitioned (subdivided) into functional units equally shown in Effigy 5.2. Because the single-cycle datapath contains carve up Instruction Memory and Information Retention units, this allows us to directly implement in hardware the IF-ID-EX-MEM-WB representation of the MIPS instruction sequence. Find that several command lines take been added, for example, to route data from the ALU output (or retention output) to the annals file for writing. Likewise, there are once more three ALUs, one for ALUop, another for JTA computation, and a 3rd for adding PC+4 to compute the accost of the next instruction.

Figure 5.2.
Division of the MIPS single-wheel datapath developed previously, to form a pipeline processor. The segments are bundled horizontally, and data flows from left to right [Maf01,MK98].

We can represent this pipeline structure using a infinite-time diagram like to Effigy 5.1, as shown in Effigy 5.3. Here four load instructions are executed sequentially, which are chosen because the
`lw`
instruction is the just i in our MIPS subset that consistently utilizes all v pipeline segments. Find also that the right half of the register file is shaded to represent a read operation, while the left half is shaded to represent write.

Figure v.3.
Partitioning of the MIPS single-cycle datapath developed previously, with replication in infinite, to form a pipeline processor that computes four
`lw`
instructions. The segments are arranged horizontally, and information flows from left to right, synchronously with the clock cycles (CC1 through CC7) [Maf01,MK98].

In club to ensure that the single-bike datapath conforms to the pipeline blueprint constraint of one cycle per segment, we need to add together buffers and control betwixt stages, similar to the way we added buffers in the multicycle datapath. These buffers and control circuitry are shown in Figure 5.iv as red rectangles, and store the results of the i-th phase and so that the (i+1)-th stage can use these results in the next clock cycle.

Effigy five.four.
Improver of control and buffer circuits to Figure 5.3 produces the MIPS pipelined datapath [Maf01,MK98].

In summary, pipelining improves efficiency past first regularizing the instruction format, for simplicity. We then split up the instructions into a fixed number of steps, and implement each pace as a pipeline segment. During the pipeline design phase, we ensure that each segment takes almost the same amount of fourth dimension to execute equally other segments in the pipeline. Likewise, we desire to continue the pipeline full wherever possible, in order to maximize utilization and throughput, while minimizing set up-up fourth dimension.

Popular:   Can Picture the Teacher in Their Mind

In the next department, we volition run into that pipeline processing has some hard problems, which are called
hazards, and the pipeline is besides susceptible to exceptions.

### 5.3. Pipeline Control and Hazards

The control of pipeline processors has like issues to the control of multicycle datapaths. Pipelining leaves the meaning of the ix command lines unchanged, that is, those lines which controlled the multicycle datapath. In pipelining, we set up command lines (to defined values) in each stage for each pedagogy. This is done in hardware by extending pipeline registers to include control information and circuitry.

#### 5.three.ane. Pipeline Command Bug and Hardware

Detect that there is nothing to command during education fetch and decode (IF and ID). Thus, nosotros tin begin our control activities (initialization of control signals) during ID, since control will just be exerted during EX, MEM, and WB stages of the pipeline. Recalling that the various stages of control and buffer circuitry between the pipeline stages are labelled IF/ID, ID/EX, EX/MEM, and MEM/WB, we have the propagation of command shown in Figure 5.five.

Figure v.5.
Propagation of control through the EX, MEM, and WB states of the MIPS pipelined datapath [Maf01,MK98].

Here, the following stages perform work equally specified:

• IF/ID: Initializes control by passing the
rs,
rd, and
rt
fields of the education, together with the opcode and
funct
fields, to the control circuitry.

• ID/EX: Buffers control for the EX, MEM, and WB stages, while executing control for the EX stage. Control decides what operands volition be input to the ALU, what ALU performance will be performed, and whether or not a co-operative is to exist taken based on the ALU Zero output.

• EX/MEM: Buffers control for the MEM and WB stages, while executing control for the MEM stage. The command lines are set for memory read or write, too equally for data option for retention write. This stage of control also contains the co-operative control logic.

• MEM/WB: Buffers and executes control for the WB phase, and selects the value to be written into the register file.

Effigy 5.half dozen shows how the control lines (red) are arranged on a per-stage basis, and how the stage-specific command signals are buffered and passed along to the adjacent applicable stage.

Figure 5.six.
Propagation of control through the EX, MEM, and WB states of the MIPS pipelined datapath [Maf01,MK98].

Reading Assigment: Study the propagation of control signals for the example programme given on p. 471 of the textbook, which is illustrated stepwise on pp. 472-476 of the textbook.

#### five.three.2. Overview of Hazards

Pipeline processors have several issues associated with controlling smooth, efficient execution of instructions on the pipeline. These bug are generally called
hazards, and include the post-obit iii types:

• Structural Hazards
occur when dissimilar instructions collide while trying to admission the same piece of hardware in the same segment of a pipeline. This type of hazard tin be alleviated by having redundant hardware for the segments wherein the standoff occurs. Occasionally, information technology is possible to insert stalls or reorder instructions to omit this type of gamble.

• Information Hazards
occur when an instruction depends on the result of a previous instruction still in the pipeline, which upshot has non yet been computed. The simplest remedy inserts stalls in the execution sequence, which reduces the pipeline’southward efficiency. The solution to data dependencies is twofold. Showtime, i tin can
the ALU outcome to the writeback or data fetch stages. 2nd, in selected instances, information technology is possible to restructure the code to eliminate some data dependencies. Forwarding paths are shown as sparse blue or red lines in Figure 5.4.

• Control Hazards
tin can result from branch instructions. Here, the branch target address might not be gear up in time for the co-operative to be taken, which results in
stalls
(dead segments) in the pipeline that have to be inserted as local wait events, until processing can resume after the branch target is executed. Control hazards can exist mitigated through accurate branch prediction (which is difficult), and past
delayed co-operative
strategies.

Nosotros next examine hazards in detail, and talk over several techniques for eliminating or relieving hazards.

#### v.3.3. Data Hazards

Definition.
A
information hazard
occurs when the current instruction requires the result of a preceding didactics, only there are insufficient segments in the pipeline to compute the issue and write it back to the register file in time for the electric current instruction to read that result from the register file.

We typically remedy this trouble in one of iii ways:

• Forwarding: In order to resolve a dependency, i adds special circuitry to the pipeline that is comprised of wires and switches with which one forwards or transmits the desired value to the pipeline segment that needs that value for computation. Although this adds hardware and control circuitry, the method works because it takes far less time for the required value(s) to travel through a wire than it does for a pipeline segment to compute its effect.

• Code Re-Ordering: Here, the compiler reorders statements in the source lawmaking, or the assembler reorders object lawmaking, to place 1 or more statements between the current instruction and the instruction in which the required operand was computed as a result. This requires an “intelligent” compiler or assembler, which must have detailed information about the construction and timing of the pipeline on which the data hazard would occur. We phone call this blazon of software a
hardware-dependent compiler.

• Stall Insertion: It is possible to insert one or more stalls (no-op instructions) into the pipeline, which delays the execution of the current instruction until the required operand is written to the register file. This decreases pipeline efficiency and throughput, which is reverse to the goals of pipeline processor design. Stalls are an expedient method of last resort that can be used when compiler action or forwarding fails or might not be supported in hardware or software design.

The post-obit case is illustrative.

Example.
Suppose we have the following sequence of instructions:

```              sub   \$2,   \$1,   \$3   # Register ii is the output of sub      and   \$8,   \$2,   \$5   # Operand #one depends on Register ii information      or    \$9,   \$6,   \$2   # Operand #2 depends on Annals 2 data      add   \$seven,   \$2,   \$2   # Add together consequence depends on Register 2 information      sw    \$6,20(\$2)        # Shop (memory write) depends on Register two
```

whose pipeline scheduling diagram is shown in Figure 5.vii.

Figure 5.7.
Example of data hazards in a sequence of MIPS instructions, where the red (blue) arrows indicate dependencies that are problematic (not problematic) [Pat98,MK98].

Trouble: The outset instruction (`sub`), starting on clock bicycle one (CC1) completes on CC5, when the consequence in Register 2 is written to the register file. If nosotros did nothing to resolve data dependencies, then no pedagogy that read Register two from the register file could read the “new” value computed by the sub instruction until CC5. The dependencies in the other instructions are illustrated by solid lines with arrowheads. If register read and write cannot occur within the same clock cycle (we volition run across how this could happen in Section five.3.4), then simply the fifth instruction (sw) can admission the contents of annals 2 in the mode indicated by the catamenia of sequential execution in the MIPS lawmaking fragment shown previously.

Solution #ane – Forwarding: The consequence generated past the
`sub`
pedagogy tin be
forwarded
to the other stages of the pipeline using special control circuitry (data omnibus switchable to whatsoever other segment, which can exist implemented via a decoder or batten switch). This is indicated notionally in Figure 5.7 past solid cerise lines with arrowheads. If the register file can read in the outset half of a bike and write in the second half of a cycle, then the forwarding in CC5 is non problematic. Otherwise, nosotros would have to delay the execution of the
`add`
teaching past one clock cycle (run across Effigy v.ix for insertion of a
stall).

Solution #2 – Code Re-Ordering: Since all Instructions two through 5 in the MIPS lawmaking fragment crave Register 2 as an operand, we do not have instructions
in that item lawmaking fragment
to put between Teaching 1 and Instruction 2. However, permit usa assume that we have other instructions that (a) do not depend on the results of Instructions one-5, and (b) themselves induce no dependencies in Instructions 1-5 (e.thousand., by writing to register 1, 2, iii, 5, or six). In that case, we could insert two instructions between Instructions ane and ii, if register read and write could occur meantime. Otherwise, we would have to insert three such instructions. The latter case is illustrated in the following figure, where the inserted instructions and their pipeline actions are colored dark green.

Effigy 5.8.
Case of code reordering to solve data hazards in a sequence of MIPS instructions [Pat98,MK98].

Solution #3 – Stalls: Suppose that we had no instructions to insert between Instructions 1 and 2. For instance, there might be data dependencies arising from the inserted instructions that would themselves have to exist repaired. Alternatively, the plan execution gild (functional dependencies) might non let the reordering of code. In such cases, nosotros take to insert stalls, also called
bubbling, which are no-op instructions that merely filibuster the pipeline execution until the dependencies are no longer problematic with respect to pipeline timing. This is illustrated in Figure 5.9 by inserting three stalls between Instructions 1 and 2.

Effigy 5.9.
Example of stall insertion to solve data hazards in a sequence of MIPS instructions [Pat98,MK98].

As mentioned previously, the insertion of stalls is the least desireable technique because information technology delays the execution of an instruction without accomplishing whatsoever useful work (in contrast to code re-ordering).

#### 5.3.iv. Structural Hazards

Definition.
A
structural hazard
occurs when at that place is insufficient hardware to support a computation in a given pipeline segment.

For case, consider the data dependency between the first and quaternary instructions (`sub`
and
`add`) of the case in Section v.iii.three. Hither, a register file write and a register file read are scheduled in CC5. This tin can exist resolved by (a) duplicating hardware, or (b) modifying the existing hardware to support concurrent operations. If nosotros duplicated the register file, then nosotros could perform concurrent read and write operations, but there would be a
consistency trouble. That is, at a given clock cycle, registers in i register file could have different values than the corresponding registers in the other register file. This inconsistency is clearly unacceptable if authentic computation is to be maintained.

Instead, we can modify the annals file so that it (1) performs annals write on the first half of the clock cycle and (2) performs annals read on the second half of the clock bicycle. In earlier hardware, designers sometimes inserted a filibuster between write and read that was very small in relation to the clock wheel time, in order to ensure convergence of the annals file write.

Other structural hazards could occur during the branch instruction, if there were non two ALUs in the EX segment of the pipeline. That is, with only one ALU, we would have to simultaneously compute the BTA and determine (via subtraction) whether or not the branch condition was fulfilled. This would not exist possible without two concurrent adders in the ALU, which is what we currently have in our MIPS pipeline pattern shown in Effigy 5.4.

A further structural hazard could occur if we only used i memory for both instructions and data. For example, in Figure five.seven, suppose the
`sub`
`sw`
teaching. Then, we would exist writing to data retentiveness in CC4 for Pedagogy #ane and reading from instruction memory in CC4 for Instruction #iv. Clearly, if at that place was simply 1 memory there would be a conflict.

Like to the problem with the concurrent reads and writes on the register file, in that location are two ways to solve this dilemma. First, we can design a special-purpose memory module that permits writing (reading) on the kickoff (resp. second) half of the clock bike, equally nosotros said could exist done with the annals file. Yet, this requires special (expensive) hardware. Second, nosotros tin can utilize 2 fast caches, ane for instructions, and one for data, that access a large, slower main memory in which instructions and data are both stored. The latter method is used in exercise because caches and main retentivity already be, and the retentiveness management hardware for these types of components besides exists. Thus, nosotros can use off-the-shelf hardware to solve a trouble that would otherwise require special-purpose development of expensive hardware. Although this might not exist as much fun equally developing new hardware, it is more than cost-effective, which matters when one is designing and producing computers for profit.

#### 5.iii.5. Command (Branch) Hazards

Command hazards are the near difficult types of hazards arising from normal functioning of a program. In the adjacent section, we will see that exceptions (e.k., overflow) tin play peculiarly interesting types of havoc with smooth pipeline execution.

The about common type of control hazard is the branch instruction, which has two alternative results: (ane) jump to the co-operative target address if the instruction succeeds, or (2) execute the educational activity later on the branch (at PC+iv of instruction memory) if the branch fails.

The problem with the branch teaching is that we ordinarily practise non know which result will occur (i.eastward., whether or not the branch will exist taken) until the co-operative condition is computed. Frequently, the branch status depends on the result of the preceding pedagogy, so we cannot precompute the co-operative status to find out whether or non the branch will be taken.

The following iv strategies are employed in resolving command dependencies due to branch instructions.

5.3.5.i. Assume Co-operative Not Taken.
As we saw previously, we tin can insert stalls until nosotros find out whether or not the branch is taken. However, this slows pipeline execution unacceptably. A common alternative to stalling is to continue execution of the didactics stream as though the branch was non taken. The intervening instructions betwixt the branch and its target are and so executed. If the co-operative is not taken, this is not a harmful or disruptive technique. Nonetheless, if the branch is taken, then we must discard the results of the instructions executed later the branch statement. This is done past flushing the IF, ID, and EX stages of the pipeline for the discarded instructions. Execution continues uninterrupted afterwards the co-operative target.

The cost of this technique is approximately equal to the toll of discarding instructions. For instance, if branches are not taken 50 percent of the time, and the price of discarding results is negligible, then this technique reduces by 50 percent the cost of control hazards.

five.3.5.2. Reducing Branch Delay.
In the MIPS pipeline compages shown schematically in Figure 5.4, we currently assume that the branch condition is evaluated in Stage three of the pipeline (EX). If we movement the branch evaluation upward one stage, and put special circuitry in the ID (Decode, Stage #2), then we can evaluate the branch condition for the
`beq`
instruction. This would let u.s.a. to take the co-operative in EX instead of MEM, since nosotros would accept prepare for Stage three (EX) the
Zero
output of the comparator that would normally exist computed in EX. The hardware needed to compute equality is merely a series of parallel
xnor
gates
and-ed together, then inverted.

Exercise: Determine how this type of circuitry could be configured.

The advantage of this technique is that merely one instruction needs to exist discarded rather than two, as in the previous section. This reduces hardware cost and time required to affluent the pipeline, since just the IF and ID stages would need to exist flushed, instead of the three stages in the preceding example.

5.iii.5.3. Dynamic Co-operative Prediction.
Information technology would exist useful to be able to predict whether or not a majority of the branches are taken or not taken. This can be washed in software, using intelligent compilers, and can also be done at runtime. We concentrate on the software-intensive techniques starting time, since they are less expensive to implement (being closer to the compiler, which is easier to alter than the hardware).

The most advantageous situation is 1 where the branch status does non depend on instructions immemdiately preceding it, as shown in the post-obit lawmaking fragment:

```            add together \$5, \$v, \$6     # One of the registers for beq comparison is modified      sub \$4, \$3, \$6     # Null important to the co-operative here      and \$7, \$8, \$vi     # Zero important to the co-operative hither      and \$9, \$6, \$6     # Nothing important to the branch here

beq \$5, \$six,
target
```

Here, the branch compares Registers v and 6, which are last modified in the
`add`
instruction. We tin therefore precompute the co-operative condition as
`sub r \$5, \$half-dozen`, where
r
denotes a destination annals. If r = 0, then nosotros know the branch will be taken, and the runtime module (pipeline loader) tin can schedule the jump to the branch target address with full confidence that the branch volition exist taken.

Some other approach is to go along a history of branch statements, and to record what addresses these statements branch. Since the vast majority of branches are used every bit tests of loop indices, then nosotros know that the branch will almost e’er jump to the loopback signal. If the branch fails, then we know the loop is finished, and this happens simply one time per loop. Since most loops are designed to have many iterations, branch failure occurs less ofttimes in loops than does taking the branch.

Thus, information technology makes good sense to assume that a branch volition bound to the place that it jumped to before. Yet, in dense conclusion structures (e.thousand., nested or cascaded
if
statements), this situation does non always occur. In such cases, one might non exist able to tell from the preceding branch whether or not the branching behavior volition be repeated. It is then reasonable to apply a multi-branch lookahead.

Reading Assigment: Written report the discussion of multi-bit branch prediction schemes given on pp. 501-502 of the textbook.

Another clever technique of making branches more than efficient is the
branch delay slot. Nosotros previously discussed delayed branches when we overviewed the multicycle datapath implementation of the
`beq`
educational activity. In summary, the concept of efficient branching has two parts. First, the branch target address is computed in the ID stage of the pipeline, to determine as early every bit possible the instruction to fetch if the co-operative succeeds. Since this is done in the second stage of the pipeline, there is an instruction I following this (in the beginning or IF stage). Later on I moves to the ID phase, and so the branch target (pointed to by either PC+4 or the BTA) is loaded into the IF stage.

It is this pedagogy (I) that is called the
co-operative delay slot
(BDS). In the BDS can be safely placed an teaching that does not take data dependencies with respect to (a) the branch condition, (b) the instruction following the branch condition, or (c) the co-operative target. This ensures that, when the instruction J is executed (J is the instruction to which control is transferred later the co-operative condition is evaluated, whether J is pointed to past PC+4 or BTA), so the teaching I will accept been executed previously, and the pipe volition not have a stall where I would have been. Equally a result, the pipage will remain full throughout the branch evaluation and execution, unless an exception occurs.

#### 5.3.6. Exceptions as Hazards

Hardware and software must work together in any architecture, especially in a pipeline processor. Here, the ISA and processor control must be designed then that the following steps occur when an exception is detected:

1. Hardware detects an exception (eastward.g., overflow in the ALU) and stops the offending educational activity at the EX stage.

2. Pipeline loader and scheduler let all prior instructions (e.thousand., those already in the pipeline in MEM and WB) to consummate.

3. All instructions that are present in the pipeline after the exception is detected are flushed from the pipeline.

4. The address of the offending teaching (usually the address in primary memory) is saved in the EPC register, and a code describing the exception is saved in the Cause register.

5. Hardware command branches to the exception treatment routine (part of the operating system).

6. The exception handler performs ane of 3 actions: (1) notify the user of the exception (due east.g., split up-by-zero or arithmetics-overflow) then end the program; (two) try to correct or mitigate the exception so restart the offending instruction; or (3) if the exception is a beneficial interrupt (e.m., an I/O request), then salvage the plan/pipeline state, service the interrupt request, then restart the program at the instruction pointed to past EPC + 4.

In any instance, the pipeline is flushed as described.

In general, we tin can say that, if a pipeline has N segments, and the EX phase is at segment i < i < N, and so two observations are key to the prediction of pipeline functioning:

• Flushing
negates the processing of the (i-1) instructions following the offending instruction. These must be reloaded into the pipage, at the cost of i cycles (one cycle to flush, i-i cycles to reload the i-ane instructions
after
the exception is processed).

• Completing
the N-i instructions that were loaded into the pipeline prior to the offending instruction takes North-i clock cycles, which are executed (a) prior to, or (b) concurrently with, the reloading of the instructions i-1 that followed the i-th pedagogy (in the EX phase).

It is readily seen that the total number of wasted cycles equals (i-1) + (N-i) = Due north – 1, which is precisely the number of cycles that it takes to fix or reload the pipeline.

The proliferation of unproductive cycles can be mitigated by the following technique:

1. Freeze the pipeline state equally soon as an exception is detected.

2. Process the exception via the exception handler, and decide whether or not to halt or restart the pipeline.

3. If the pipeline is restarted, reload the (i-one) instructions following the offending instruction, meantime with completing execution of the (N-i) instructions that were being processed prior to the offending instruction.

If Step three tin exist performed equally stated, then the best-case penalization is just one wheel, plus the time incurred by executing the exception handler. If the entire pipeline needs to be flushed and restarted, then the worst-case penalty is N cycles incurred by flushing the pipe, then reloading the pipeline
after
the instructions preceding the offending educational activity have been executed. If the offending instruction must exist restarted, and then a maximum of
i
cycles are lost (one bike for affluent, plus (i-1) cycles to restart the instructions in the pipe following the offending instruction).

In the next section, we collect the concepts almost pipeline performance that we have been discussing, and prove how to compute the CPI for a pipeline processor under constraint of stalls, structural hazards, branch penalties, and exception penalties.

### 5.4. Pipeline Performance Analysis

As we said early on in this grade, we are trying to teach the technique of
functioning analysis, which helps i to intelligently determine whether or not a given processor is suitable computationally for a specific application. In this section, we develop performance equations for a pipeline processor, and exercise so in a stepwise way, and so yous tin run across how the various hazards and penalties affect performance.

#### v.4.1. CPI of a Pipeline Processor

Suppose an N-segment pipeline processes M instructions without stalls or penalties. Nosotros know that information technology takes North-one cycles to load (setup) the pipeline, and K cycles to consummate the instructions. Thus, the number of cycles is given by:

Ncyc
= Due north + M – 1 .

The cycles per pedagogy are hands computed:

CPI = Northcyc/M = 1 + (N – ane)/M .

Thus, CPI for a finite programme will always be greater than i. This stands in sharp contradiction to the outset fallacy of pipeline processing, which says:

Fallacy #1:
CPI of a pipeline processor is always equal to ane.0, since one instruction is candy per bike.

This statement is fallacious because information technology ignores the overhead that we take just discussed. The fallacy is similar to claiming that you simply spend eight hours at the part each day, so you lot must have 16 hours per mean solar day of “fun time”. However, you accept to take time to commute to/from the office, buy groceries, and do all the other homely tasks of life, many of which are in no mode related to “fun fourth dimension”. In do, such tasks are drudgery that is a blazon of overhead.

#### five.4.2. Event of Stalls

At present permit us add together some stalls to the pipeline processing scheme. Suppose that we have a N-segment pipeline processing M instructions, and we must insert K stalls to resolve data dependencies. This means that the pipeline now has a setup penalty of North-ane cycles, as earlier, a stall penalty of K cycles, and a processing cost (equally before) of M cycles to process the M instructions. Thus, our governing equations become:

Northcyc
= Northward + M + K – i .

and

CPI = Ncyc/M = 1 + (N + K – 1)/K .

In practice, what does this tell united states? Namely, that the stall penalty (and all the other penalties that nosotros will examine) adversely touch on CPI. Here is an example to bear witness how nosotros would analyze the trouble of stalls in a pipelined program where the percentage of instructions that incur stalls versus non-stalls are specified.

Example.
Suppose that an N-segment pipeline executes M instructions, and that a fraction fstall
of the instructions require the insertion of M stalls per teaching to resolve data dependencies. The total number of stalls is given by fstall
· M · K (fraction of instructions that are stalls, times the total number of instructions, times the average number of stalls per instruction). By substitution, our preceding equations for pipeline performance become:

Ncyc
= N + Chiliad + (fstall
· M · Yard) – 1 .

and

CPI = Ncyc/Chiliad = ane + (fstall
· K) + (Due north – 1)/M .

So, the CPI penalisation due to the combined furnishings of setup cost and stalls now increases to fK + (N – 1)/Thou. If fstall
= 0.1, G = three, N = 5, and K = 100, then CPI = 1 + 0.3 + 4/100 = 1.34, which is 34 pct larger than the fallacious assumption of CPI = 1.

This leads to the next fallacy of pipeline processing:

Fallacy #2:
Stalls are non a large trouble with pipelines – you just accept to worry about the number of stalls, not the percentage of instructions that induce stalls.

This fallacy is particularly unsafe. Information technology is analogous to maxim that the only thing that matters is the number of dwelling house burglaries each year, not the burglary rate per capita. If you lot move to a new neighborhood, you want to know
both
the number and per-capita incidence of crimes in that neighborhood, not simply the robbery count. And so yous can determine, from the population, whether or non information technology is safe to live there.

Similarly, with a pipeline processor, you want to determine whether or not the instruction mix or ordering of instructions causes information dependencies, and what is the incidence of such dependencies. For instance, ane,000 pedagogy program with 20 stalls will run more efficiently than a 1,000 education program with xx percent of the instructions requiring 1 stall each to resolve dependencies.

#### five.iv.3. Effect of Exceptions

For purposes of give-and-take, assume that nosotros have Thousand instructions executing on an N-segment pipeline with no stalls, but that a fraction fex
of the instructions raise an exception in the EX stage. Further presume that each exception requires that (a) the pipeline segments earlier the EX stage be flushed, (b) that the exception exist handled, requiring an average of H cycles per exception, then that (c) the pedagogy causing the exception and its following instructions be reloaded into the pipeline.

Thus, fex
· Chiliad instructions will cause exceptions. In the MIPS pipeline, each of these instructions causes three instructions to be flushed out of the pipe (IF, ID, and EX stages), which incurs a penalisation of four cycles (ane bicycle to flush, and 3 to reload) plus H cycles to handle the exception. Thus, the pipeline functioning equations get:

Ncyc
= N – 1 + (1 – fex) · M + (fex
· Thousand · (H + 4)) ,

which we tin can rewrite as

Ncyc
= Thou + [North – 1 – K + (i – fex) · M + (fex
· Thou · (H + 4))] .

Rearranging terms, the equation for CPI tin be expressed as

CPI = Ncyc/Yard = 1 + [1 – fex
+ (fex
· (H+iv)) – 1 + (Northward – 1)/K] .

Afterward combining terms, this becomes:

CPI = Ncyc/Chiliad = ane + [(fex
· (H+three)) + (N – 1)/M] .

Nosotros tin can see by test of this equation and the expression for CPI due to stalls that exceptions accept a more than detrimental effect, for two reasons. Showtime, the overhead for stalls (One thousand stalls per affected education) is K < 4 cycles in the MIPS pipeline (since the pipeline has only five segments). Second, the cost of each exception is H+three cycles per affected instruction. Since H > 0 for a nontrivial exception handler, the toll of an exception in the MIPS pipeline (under the preceding assumptions) volition exceed the cost of remedying a risk using stalls. The skillful news, however, is that in that location are usually fewer exceptions in programs than data or structural dependencies, with the exception of I/O-intensive programs (many I/O interrupts) and arithmetic-intensive programs (possible overflow or underflow exceptions).

#### 5.4.four. Outcome of Branches

Branches nowadays a more complex picture show in pipeline performance analysis. Recollect that there are three means of dealing with a branch: (1) Presume the branch is non taken, and if the branch is taken, affluent the instructions in the pipe after the branch, and so insert the instruction pointed to by the BTA; (2) the antipodal of i); and (three) use a delayed branch with a branch delay slot and re-ordering of code (assuming that this can be washed).

The first ii cases are symmetric. Presume that an fault in branch prediction (i.east., taking the branch when you expected not to, and conversely) requires L instruction to be flushed from the pipeline (1 bike for flushing plus L-i “dead” cycles, since the branch target can be inserted in the IF stage). Thus, the toll of each branch prediction error is Fifty cycles. Further assume that a fraction fbr
of the instructions are branches and fexist
of these instructions result in branch prediction errors.

The penalty in cycles for branch prediction errors is thus given by

branch_penalty
= fbr
· fbe
· M instructions · L cycles per education .

The pipeline performance equations then become:

Ncyc
= North – ane + (1 – fbr
· fexist) · Grand + (fbr
· fbe
· Grand · L) ,

which we can rewrite as

Northwardcyc
= M + [Due north – one – Chiliad + (i – fbr
· fexist) · Thousand + (fbr
· fbe
· Yard · L) ,

Rearranging terms, the equation for CPI can be expressed every bit

CPI = Ncyc/M = 1 + [(1 – fbr
· fbe) + (fbr
· fexist
· 50) – 1 + (N – 1)/M] .

After combining terms, this becomes:

CPI = Ncyc/M = 1 + [(fbr
· fexist
· (L-i)) + (Northward – ane)/1000] .

In the case of the co-operative delay slot, we assume that the branch target accost is computed and the co-operative condition is evaluated at the ID stage. Thus, if the branch prediction is right, there is no penalisation. Depending on the method by which the pipeline evaluates the branch and fetches (or pre-fetches) the branch target, a maximum of ii cycles penalty (one bike for flushing, i bike for fetching and inserting the co-operative target) is incurred for insertion of a stall in the instance of a branch prediction error. In this example, the pipeline functioning equations become:

Ncyc
= N – 1 + (ane – fbr
· fexist) · M + (fbr
· fbe
· 2M) ,

which implies the following equation for CPI as a part of branches and branch prediction errors:

CPI = Northwardcyc/M = 1 + [fbr
· fexist
+ (N – one)/One thousand] .

Since fbr
<< 1 is usual, and fbe
is, on average, assumed to exist no worse than 0.five, the product fbr
· fbe, which represents the additional co-operative penalisation for CPI in the presence of delayed branch and BDS, is generally small.

_______________//_______________

This concludes our discussion of pipelining. We next concentrate on the discussion and assay of supporting technologies, such every bit memories and buses.

Popular:   Pelvic Bones in Whales Are an Example of Structures

### References

[Maf01] Mafla, E.
Course Notes, CDA3101, at URL
`http://www.cise.ufl.edu/~emafla/`
(equally-of 11 April 2001).