# CSCI-UA 0201-007

R14: Assessment 13 & ALU & RegFile & Pipeline

#### Today's Topics

- Assessment 13
  - Review pipelined CPU

## Assessment 13

Q1 Single-cycle CPU

#### Q1 FSM

In the lecture example on "electronic eyes" (see slide 28 of <u>https://nyu-cso.github.io/notes/arch-seq.pdf</u>). The desired pattern of lights to be lit up is: left, middle, right, middle, left, middle, right ...

What is the **minimum** number of distinct state values needed for a FSM to implement this electronic eyes device?





#### Q2 single-cycle CPU

• Q2.1 Data path



- Suppose the RISC-V instruction being executed is add x6, x7, x8, where x6 is the destination register, and x7/x8 are the 1st/2nd source register operand, respectively.
- What are the values corresponding to Instruction[11-7] that are fed into the "write register" pins of the RegisterFile?
- 00110

- Write register  $\rightarrow$  register code
- The code of register xi is i.
- For example, the code of x5 is (00101)<sub>2</sub>,
- x6: (00110)<sub>2</sub>
- Note: use 5 bits since we have 32 registers

- There are 3 Mux in the Figure whose selectors are to be set by the control logic, located at the top-right, bottom-middle, and bottom-right.
- Which of the following instructions' execution cause the top-right Mux's selector to be set to 1?
- A. add x6, x7, x8 //x6 = x7+x8
- B. beq x6, x7, 100, if x6 and x7 have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. Id x5, 40(x6) //load a doubleword (8-byte) from Memory[x6+40]Memory[x6+40] to register x5
- E. addi x6, x7, 200 //x6 = x7+200
- F. sd x5, 40(x6) //store a doubleword (8-byte) from register x5 to Memory[x6+40]*Memory*[x6+40]

beq x5, x6, 100

If (x5==x6) goto PC+2\*100

#### Control which how to compute next PC (SB-type instruction)



- There are 3 Mux in the Figure whose selectors are to be set by the control logic, located at the top-right, bottom-middle, and bottom-right.
- Which of the following instructions' execution cause the top-right Mux's selector to be set to 1?
- A. add x6, x7, x8 //x6 = x7+x8
- B. beq x6, x7, 100, if x6 and  $x^{7}$  have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. Id x5, 40(x6) //load a doubleword (8-byte) from Memory[x6+40]Memory[x6+40] to register x5
- E. addi x6, x7, 200 //x6 = x7+200
- F. sd x5, 40(x6) //store a doubleword (8-byte) from register x5 to Memory[x6+40]*Memory*[x6+40]

- Continuing from Q2.2, which of the following instructions cause the value for the bottom-middle Mux's selector (aka ALUSrc) to be set to 1?
- A. add x6, x7, x8
- B. beq x6, x7, 100, if x6 and x7 have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. ld x5, 40(x6)
- E. addi x6, x7, 200
- F. sd x5, 40(x6)



- Control ALU input
- 1 -> select Imm
- 0 -> select read data 2 (i.e. register)

- Continuing from Q2.2, which of the following instructions cause the value for the bottom-middle Mux's selector (aka ALUSrc) to be set to 1?
- A. add x6, x7, x8
- B. beq x6, x7, 100, if x6 and x7 have the same value.

x6 + 40 (imm)

C. beq  $x_{6+40 \text{ (imm)}}$  regardless of whether x6 and x7 have the same D. ld x5, 40(x6) (x7 + 200 (imm)) E. addi x6, x7, 200 F. sd x5, 40(x6)



sd x5, 40(x6)

- Which of the following instructions cause the value of the RegWrite input to the RegisterFile to be set?
- A. add x6, x7, x8
- B. beq x6, x7, 100, if x6 and x7 have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. ld x5, 40(x6)
- E. addi x6, x7, 200
- F. sd x5, 40(x6)



 Set when we want to store a value into a register

• Which of the following instructions cause the value of the RegWrite input to the RegisterFile to be set?

funct7

rs2

rs1

funct3

A. add x6, x7, x8

- B. beq x6, x7, 100, if x6 and x7 have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. Id x5, 40(x6)
   x5= mer
   immediate
   rs1
   funct3
   rd
   opcode

   E. addi x6, x7, 200
   [11:7] rd

   F. sd x5, 40(x6)
   write in memory: memory[x6+40]=x5

add x5, x6, x7

[11:7] rd

opcode

rd



- Set when we want to store a value into a register
- With what data?



- Set when we want to store a value into a register
- With what data?



- Control what to write back to the register
- 1 -> select read data
  - Id x5, 40(x6)
  - x5=Mem[x6+40]
- 0 -> select ALU result
  - add x6, x7, x8
  - x6=x7+x8

- Continuing from Q2.3, which of the following instructions cause the value for the bottom-right Mux's selector (aka MemToReg) to be set to 1?
- A. add x6, x7, x8
- B. beq x6, x7, 100, if x6 and x7 have the same value.
- C. beq x6, x7, 100, regardless of whether x6 and x7 have the same value.
- D. ld x5, 40(x6)
  - E. addi x6, x7, 200
  - F. sd x5, 40(x6)

## Pipeline

Design & Hazards

#### **RISC-V** Pipeline

- Pipeline increases throughput by overlapping execution of multiple instructions
- We split the instruction memory from the data memory
  - Otherwise reading data would delay reading an instruction
- There are 5 stages in the RISC-V pipeline

#### **Pipelined Datapath**



#### Pipeline latency and throughput

- Latency=max(stage time) \* (#stages 1) + last\_stage\_time
  - e.g., stage times is
    - IF: 200ps; 1<sup>st</sup> Reg: 100ps; ALU: 200ps; Data access: 200ps; 2nd Reg: 100ps;
  - max(stage time)=200ps=clock cycle
  - latency=200ps\*4+100=900ps
- Throughput=1/max(stage time)=clock rate
  - e.g., 1/200ps



## Assessment 13

Q2&3 Pipelined CPU

#### Q3 Pipelining performance

#### clock rate=throughtput=1/max stage time

- old: 1/200
- new: 1/400
- old 2x faster

|     | IF  | ID  | EX  | MEM | WB  | max |
|-----|-----|-----|-----|-----|-----|-----|
| old | 200 | 100 | 200 | 200 | 100 | 200 |
| new |     |     | 400 |     |     | 400 |

#### Q3 Pipelining performance

- Suppose the 5 stage pipeline has the following latency for each pipeline stage, 200ps (Instruction fetch aka IF), 100ps (Register read aka ID), 200ps (ALU operation aka EX), 200ps (Data access aka MEM), 100ps (Register write aka WB).
- Suppose we build a new CPU by adding the multiplication function to ALU, which causes the ALU latency to increase from 200ps to 400ps. Which of the following statements are true?
- A. The new CPU has twice the instruction throughput of the original one.
- B. The old CPU has twice the instruction throughput as fast as the new one.
- C. The ALU latency increase would cause the new CPU to run at a slower clock rate than the old CPU.
- D. The ALU latency increase would cause the new CPU to run at a faster clock rate than the old CPU.

#### Q4 Pipelining performance

 Suppose we change the RISC-V ISA to restrict load/store instructions to use a base register only (without an immediate offset/displacement). Thus, load/store instructions no longer need to use the ALU to compute addresses. As a result, we can change the CPU to overlap the data access (aka MEM) and ALU operation (aka EX) into one stage, resulting in a 4-stage pipeline. Note that in the merged MEM-EX stage, an instruction either performs data access or ALU operation, but not both. Hence the merged stage still takes 200ps, same as the latency of either MEM or EX in Q2.



### Q4.1 Clock speed

|     | IF  | ID  | EX  | MEM | WB  | max |
|-----|-----|-----|-----|-----|-----|-----|
| old | 200 | 100 | 200 | 200 | 100 | 200 |
| new |     |     | 20  | 00  |     | 200 |

clock cycle (=max) remains unchanged

- How does the new 4-stage design affect the clock speed?
- A. Clock for 4-stage pipelined CPU must run faster than that in the 5stage pipelined CPU.
- B. Clock for 4-stage pipelined CPU must run slower than that in the 5stage pipelined CPU.
- C. Clock for 4-stage pipelined CPU can run at the same speed as that of the 5-stage pipelined CPU.

#### Q4.2 Instruction latency

- How does the new 4-stage design affect the instruction latency?
- A. instruction latency (e.g. for load) for 4-stage pipelined CPU is lower than that in the 5-stage pipelined CPU.
- B. instruction latency (e.g. for load) for 4-stage pipelined CPU is higher than that in the 5-stage pipelined CPU.
- C. instruction latency (e.g. for load) for 4-stage pipelined CPU is the same as that in the 5-stage pipelined CPU.
  - Instruction latency:
  - 5-stage: 400ps\*4+100
  - 4-stage:
    - 200ps\*3+100

No dependency => parallelis Pipeline: parallelly execute the

However, not always true => hazard.

- For a specific instruction, age aepenas on the previous stage.
- But we can utilize the independency across instructions
  - We can execute the next instruction without waiting for the finish of the previous instruction => pipeline











#### • Structure Hazard

- Caused by limited hardware resources.
- Solution: add resources (e.g., separating inst. and data mem)
- Data Hazard
- Control Hazard

Caused by the dependency between instructions: The execution of i2 depends on some output of i1. => wouldn't happen in sequential model, because i2 is executed after finishing i1 and thus i2 can always see the output of i1.

- Structure Hazard
  - Caused by limited hardware resources.
  - Solution: add resources (e.g., separating inst
- Data Hazard
- Control Hazard

Caused by the dependency between

The execution of i2 depends on some output of i1.

=> wouldn't happen in sequential model, because i2 is executed after finishing i1 and thus i2 can always see the output of i1.

i2 must wait for the finish of i1? Can we get the output sooner and thus doesn't affect the execution of i2?







- The input of some stage s2 in i2 depends on the output of some stage s1 in i1:
  - E.g. i2's input of EX stage depends on the i1's output of EX stage
  - Trying to forward i1's output of s1 to i2's s2.
  - But this doesn't work all the time.
    - Need to delay i2's s2 until i1 has the output.
    - How? By adding bubble (nop instruction)

Sometimes i1 cannot have the output at the time when i2 needs it (e.g., i2's input of EX depends on i1's output of MEM)

x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) i2: add x4, x5, x6 (x4 = x5 + x6) x4 should be 2



x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) However, when i2 starts EX stage, i2: add x4, x5, x6 (x4 = x5 + x6) i1 just starts its MEM stage! x4 should be 2 => We do not have the correct input yet! i1: ld x5, x6(100) IF ΕX ID WB i2: add x4, x5, x6 IF ID MEM WB input of i2's EX stage depends on the output of i1's MEM stage

x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) However, when i2 starts EX stage, i2: add x4, x5, x6 (x4 = x5 + x6) i1 just starts its MEM stage! x4 should be 2 => We do not have the correct input yet! i1: ld x5, x6(100) IF EX ID We need to delay i2's EX stage, so that i1 can prepare the required output. How many bubbles? i2: add x4, x5, x6 IF ID input of i2's EX stage depends on the output of i1's MEM stage

x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) i2: add x4, x5, x6 (x4 = x5 + x6) x4 should be 2

How many bubbles? How many cycles it needs for i1 to prepare the output before i2 uses it.



x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) How many bubbles? i2: add x4, x5, x6 (x4 = x5 + x6) How many cycles it needs for i1 to prepare x4 should be 2 the output before i2 uses it. i1 has just finished MEM i1: ld x5, x6(100) IF MEM EΧ ID out = 1 bubble EΧ IF ID WB i2: add x4, x5, x6 IF ID MEM WB x6 = 1 When i2 starts EX,

x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) i2: add x4, x5, x6 (x4 = x5 + x6) x4 should be 2

How many bubbles? How many cycles it needs for i1 to prepare the output before i2 uses it.



x5 = 0, x6 = 1, mem[101] = 1 i1: ld x5, x6(100) (x5 = mem[x6 + 100]) i2: add x4, x5, x6 (x4 = x5 + x6) x4 should be 2

How many bubbles? How many cycles it needs for i1 to prepare the output before i2 uses it.



- How does the new 4-stage design affect the instruction throughput?
- A. instruction throughput for 4-stage pipelined CPU is lower than that in the 5stage pipelined CPU, under ideal (no hazard) scenarios.
- B. instruction throughput for 4-stage pipelined CPU is higher than that in the 5stage pipelined CPU, under ideal (no hazard) scenarios.
- C. instruction throughput for 4-stage pipelined CPU is the same as that in the 5stage pipelined CPU, under ideal (no hazard) scenarios. throughput is still 1/200ps
- D. 4-stage pipelined CPU tends to have fewer hazards than 5-stage pipelined CPU.
  - E. 4-stage pipelined CPU tends to have more hazards than 5-stage pipelined CPU.
  - F. 4-stage pipelined CPU has the same amount of hazards as 5-stage pipelined CPU.

- More stages, tend to have more hazards
  - E.g. sequential model: 1 stage, no hazards
- Why?
  - Suppose i2 depends on the output of i1
  - Intuitively, with more stages, there are more cases that i1 is still in process when i2 needs the input, and causes a hazard.





assume x5 = 0 x6 = 0 beq x5, x6, 100 add x1, x2, x3

Should jump to here: add x7, x8, x9 How many bubbles? Note: jump instruction can decide the address of the next instruction in MEM stage



















#### Common question for lab5

- Incomplete problem: Some circuits will be tested together
  - try to finish all circuits in one exercise, and check again
- When implementing
  - Don't change the contents above the dash line, and
  - use Tunnels (not pins) as inputs and outputs
- Building sub-circuits/sub-components and use Tunnels will help a lot
  - especially for complex implementation, i.e. bonus exercise: Logical Shift Right

#### Congrats to all

- Great job!
- Thanks for attending and supporting!
- Good luck to all your final works~

