The clock cycle time (CCT) is the duration of one clock period, usually of the processor clock, which runs at a constant rate that is normally published in the computer's documentation.
- note: Clock cycle time has traditionally been fixed, but today’s processors can vary their clock rates to save energy or temporarily boost performance, so we would need to use the average clock rate for a program.
The clock rate (CR) is the inverse of the clock cycle time
- $CR = \frac{1}{CCT}$
- (usually measured in $Hz$ or its multiples)
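
The reciprocal relationship between $CR$ and $CCT$ can be sketched in a few lines of Python. The function names and the sample values (a 4 GHz clock, a 250 ps cycle) are illustrative assumptions, not part of the notes above.

```python
def clock_cycle_time(clock_rate_hz: float) -> float:
    """Return the clock cycle time in seconds for a clock rate in Hz (CCT = 1 / CR)."""
    return 1.0 / clock_rate_hz


def clock_rate(cycle_time_s: float) -> float:
    """Return the clock rate in Hz for a clock cycle time in seconds (CR = 1 / CCT)."""
    return 1.0 / cycle_time_s


# Hypothetical values: a 4 GHz clock has a 250 ps cycle, and vice versa.
cct = clock_cycle_time(4e9)   # about 2.5e-10 s (250 ps)
cr = clock_rate(250e-12)      # about 4e9 Hz (4 GHz)
```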

Single-Cycle

The response time (or execution time) is the total time required for the computer to complete a task (including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, etc.)
- The latency is the time it takes to complete an individual instruction
  - (note: latency can also refer to (i) the number of stages in a pipeline, or (ii) the number of stages between two instructions during execution)
  - todo latency = execution time ?
  - $Latency = CPI \times CCT$
The performance is the reciprocal of response time: $Performance_{X} = \frac{1}{Response time _{X}}$
The CPU (execution) time (of task) is the actual time the CPU spends computing for a specific task (excluding other activities)
The throughput (or bandwidth) is the number of tasks (instructions) completed per unit time
$CPU Clock Cycles = IC \times CPI$
- The instruction count (IC) is the number of instructions executed by the program
- The clock cycles per instruction (CPI) is the average number of clock cycles per instruction for a program or program fragment
- The CPU clock cycles (or total clock cycles) is the total number of clock cycles consumed by the program
The CPU (execution) time (of program) is
- $Execution Time = IC \times CPI \times CCT = \frac{CPU clock cycles}{Clock rate}$
$CPU Clock Cycles = \sum_{i=1}^{n} IC_{i} \times CPI_{i}$
- $IC_{i}$ is the number of instructions of type $i$
- $CPI_{i}$ is the average number of clock cycles per instruction of type $i$
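
As a minimal sketch of the per-type formula, the Python below sums $IC_{i} \times CPI_{i}$ over an instruction mix and derives the execution time and the average CPI from it. The function names, the three-class mix, and the 2 GHz clock rate are made-up values for illustration only.

```python
def cpu_clock_cycles(mix: list[tuple[int, float]]) -> float:
    """Total cycles for a list of (IC_i, CPI_i) pairs: sum of IC_i * CPI_i."""
    return sum(ic * cpi for ic, cpi in mix)


def execution_time(mix: list[tuple[int, float]], clock_rate_hz: float) -> float:
    """Execution time in seconds: CPU clock cycles / clock rate."""
    return cpu_clock_cycles(mix) / clock_rate_hz


def average_cpi(mix: list[tuple[int, float]]) -> float:
    """Average CPI of the whole program: CPU clock cycles / total IC."""
    total_ic = sum(ic for ic, _ in mix)
    return cpu_clock_cycles(mix) / total_ic


# Hypothetical mix: 500 instructions at CPI 1, 300 at CPI 2, 200 at CPI 3.
mix = [(500, 1), (300, 2), (200, 3)]
cycles = cpu_clock_cycles(mix)        # 500 + 600 + 600 = 1700 cycles
cpi = average_cpi(mix)                # 1700 / 1000 = 1.7
seconds = execution_time(mix, 2e9)    # 1700 cycles at an assumed 2 GHz
```

The average CPI computed this way is the value that goes into $Execution Time = IC \times CPI \times CCT$ for the program as a whole.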

The number of instructions in the program (IC) is determined by the efficiency of the algorithm implementation, the compiler, and the processor’s instruction set architecture (ISA). The implementation of the processor determines both the clock cycle time and the CPI.

Instruction Replacement

EXAMPLE

Given

$CR = 3 GHz$

$CPI_{A} = 9 cc/ins$

$CPI_{B} = 2 cc/ins$

Some of the $A$ instructions in the program are replaced by $B$ instructions, after which the program's execution time is shorter by $3 sec$

Every two $A$ instructions are replaced by five $B$ instructions.

Question: How many $A$ instructions were replaced?

Answer:

$CPI_{A} = 9 cc/ins ⟹$ 2 $A$ instructions take $18 cc$

$CPI_{B} = 2 cc/ins ⟹$ 5 $B$ instructions take $10 cc$

Therefore, replacing 2 $A$ instructions with 5 $B$ instructions saves $8 cc$, so each replaced $A$ instruction saves $4 cc$.

The cycles saved = $3 sec \times 3 GHz = 9 \times 10^{9} cc$

The number of $A$ instructions replaced = $\frac{9 \times 10^{9}}{4} = 2.25 \times 10^{9}$

$N_{A} = \frac{a \cdot Δ T \cdot CR}{a \cdot CPI _{A} - b \cdot CPI _{B}}$

Every $a$ instructions of type $A$ are replaced by $b$ instructions of type $B$
$Δ T$ is the time saved (in $sec$ )
$CR$ is the clock rate of the CPU (in $Hz$ )
$CPI_{A}$ and $CPI_{B}$ are the cycles per instruction for $A$ and $B$ (resp.)
$N_{A}$ is the total number of $A$ instructions replaced
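
The general formula can be checked against the worked example with a short Python sketch. The function name and parameter names are illustrative choices; the guard against a non-positive saving is an added assumption, since the formula only makes sense when the replacement actually saves cycles.

```python
def replaced_a_instructions(a: int, b: int, cpi_a: float, cpi_b: float,
                            delta_t_s: float, clock_rate_hz: float) -> float:
    """Number of A instructions replaced: N_A = a * dT * CR / (a*CPI_A - b*CPI_B)."""
    cycles_saved_per_group = a * cpi_a - b * cpi_b
    if cycles_saved_per_group <= 0:
        raise ValueError("replacing a A-instructions with b B-instructions saves no cycles")
    total_cycles_saved = delta_t_s * clock_rate_hz
    return a * total_cycles_saved / cycles_saved_per_group


# Values from the example: 2 A -> 5 B, CPI_A = 9, CPI_B = 2, 3 s saved at 3 GHz.
n_a = replaced_a_instructions(a=2, b=5, cpi_a=9, cpi_b=2,
                              delta_t_s=3, clock_rate_hz=3e9)
# n_a is 2.25e9, matching the step-by-step answer.
```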

Instruction Count

EXAMPLE

Given

CPU A with $CPI = 5.5$ , and $Clock rate = 2 GHz$

CPU B with $CPI = 2.2$ , and $Clock rate = 3 GHz$

A program that executes $200{,}000$ instructions at run time (after being assembled for CPU A)

How many instructions (at run time) would CPU B have to execute to match the running time of CPU A?

$ET_{A} = ET_{B}$ (ET=Execution Time)

$\frac{IC _{A} \times CPI _{A}}{Clock rate _{A}} = \frac{IC _{B} \times CPI _{B}}{Clock rate _{B}}$

$\frac{200{,}000 \times 5.5}{2} = \frac{IC_{B} \times 2.2}{3}$

$IC_{B} = \frac{200{,}000 \times 5.5 \times 3}{2 \times 2.2} = 750{,}000$

$ET_{A} = ET_{B} ⟹ \frac{IC _{A} \times CPI _{A}}{CR _{A}} = \frac{IC _{B} \times CPI _{B}}{CR _{B}} ⟹ IC_{B} = \frac{IC _{A} \times CPI _{A} \times CR _{B}}{CPI _{B} \times CR _{A}}$
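
A small Python sketch of the closed form for $IC_{B}$, evaluated with the numbers from the example. The function and parameter names are illustrative assumptions.

```python
def matching_instruction_count(ic_a: float, cpi_a: float, cr_a_hz: float,
                               cpi_b: float, cr_b_hz: float) -> float:
    """IC_B such that ET_A == ET_B: IC_A * CPI_A * CR_B / (CPI_B * CR_A)."""
    return ic_a * cpi_a * cr_b_hz / (cpi_b * cr_a_hz)


# CPU A: CPI 5.5 at 2 GHz, 200,000 instructions. CPU B: CPI 2.2 at 3 GHz.
ic_b = matching_instruction_count(200_000, 5.5, 2e9, 2.2, 3e9)
# ic_b is approximately 750,000 (up to floating-point rounding).
```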

Pipelining

In this section, $CPI = 1$ , therefore, $Latency = CCT$

The pipeline depth is the number of stages ( $= 5$ ) in the pipeline
$\{IF, ID, EX, MEM, WB\}$ are the stage delays in the pipeline

$Latency_{pipelined} = CCT_{pipelined} \times depth$

- $\displaystyle\mathrm{Latency_{pipelined}}=\frac{\text{Execution-time}} {\text{Number-of-instructions}}$  #todo is it correct

$CCT_{single} = IF + ID + EX + MEM + WB$
$CCT_{pipelined} = \max(IF, ID, EX, MEM, WB)$
$Throughput = \frac{1}{CCT}$
$ET (n) = CCT \times (n + depth - 1)$

Although the latency is worse in the pipelined processor, the throughput is significantly improved

$Speedup = \frac{Latency_{single}}{Latency_{pipelined}} = \frac{CPI_{single} \times CCT_{single}}{CPI_{pipelined} \times CCT_{pipelined}}$
$Speedup = \frac{Pipeline\ depth}{1 + Pipeline\ stall\ CPI} \times \frac{CCT_{single}}{CCT_{pipelined}}$
- $\lim_{N \to \infty} \frac{(N \cdot Latency_{single}) + Overhead_{single}}{(N \cdot Latency_{pipelined}) + Overhead_{pipelined}}$ is the speedup of the pipelined processor over the single-cycle processor, where:
  - $Overhead_{single}$ and $Overhead_{pipelined}$ are the time taken to execute some given number of instructions, for the single-cycle and pipelined processors, respectively.
When the stages are perfectly balanced, then:
- $CCT_{pipelined} = \frac{CCT_{single}}{depth}$, thus $Speedup = depth$ (under ideal conditions and with a large number of instructions)
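
The timing formulas above can be sketched in Python to show how the speedup approaches the pipeline depth as the number of instructions grows. This assumes no stalls and compares $n \times CCT_{single}$ against $ET(n) = CCT \times (n + depth - 1)$; the function names and the balanced 200 ps stage delays are hypothetical values chosen for illustration.

```python
def cct_single(stages: list[float]) -> float:
    """Single-cycle clock cycle time: the sum of all stage delays."""
    return sum(stages)


def cct_pipelined(stages: list[float]) -> float:
    """Pipelined clock cycle time: the slowest stage sets the clock."""
    return max(stages)


def et_single(stages: list[float], n: int) -> float:
    """Execution time of n instructions on the single-cycle processor."""
    return n * cct_single(stages)


def et_pipelined(stages: list[float], n: int) -> float:
    """Execution time of n instructions, no stalls: CCT * (n + depth - 1)."""
    return cct_pipelined(stages) * (n + len(stages) - 1)


def speedup(stages: list[float], n: int) -> float:
    """Speedup of the pipelined over the single-cycle processor for n instructions."""
    return et_single(stages, n) / et_pipelined(stages, n)


# Hypothetical, perfectly balanced 5-stage pipeline (delays in ps).
balanced = [200, 200, 200, 200, 200]
one = speedup(balanced, 1)           # 1.0: a single instruction gains nothing
many = speedup(balanced, 1_000_000)  # just under 5, the pipeline depth
```

With one instruction the pipeline offers no gain, because the instruction still passes through every stage; the benefit only appears once many instructions overlap.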

Exercise

Given the following times for each one of 5 stages of the pipeline: (assume $CPI = 1$ )

$IF = 300 ps$

$ID = 400 ps$

$EX = 350 ps$

$MEM = 500 ps$

$WB = 100 ps$

A. What is the clock cycle time (single-cycle / pipelined)?

B. What is the latency of lw instruction (single-cycle / pipelined)?

C. For a large number of instructions, what is the speedup of the pipelined processor over the single-cycle processor?

D. Suppose it is possible to split one stage into two stages, each taking half the time of the original stage.

What is the best choice of stage to split?

What would be the new clock cycle time of the pipelined processor?

What would be the new latency of the lw instruction?

How does the change affect the throughput?

Answer

$CCT_{single} = IF + ID + EX + MEM + WB = 1650 ps$

$CCT_{pipelined} = \max(IF, ID, EX, MEM, WB) = 500 ps$

$Latency_{single} = CCT_{single} = 1650 ps$

$Latency_{pipelined} = CCT_{pipelined} \times depth = 500 ps \times 5 = 2500 ps$

$Speedup = \frac{Latency_{single}}{CCT_{pipelined}} = \frac{1650 ps}{500 ps} = 3.3$ (given that there are no stalls)

The best stage to split is the longest one, the MEM stage with $500 ps$, which becomes two stages of $250 ps$ each. The longest stage is then the ID stage with $400 ps$, so the new clock cycle time is $400 ps$ and the new latency is $400 ps \times 6 = 2400 ps$. The throughput improves because the clock cycle time is reduced.

The speedup of the pipelined processor (with the split stage) over the single-cycle processor is $\frac{1650 ps}{400 ps} = 4.125$ , and over the pipelined processor (without the split stage) is $\frac{500 ps}{400 ps} = 1.25$
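
The answers to the exercise can be reproduced with a short Python sketch. The helper `split_longest` is a hypothetical name introduced here; it assumes, as the exercise does, that the chosen stage splits into two equal halves.

```python
def split_longest(stages: list[float]) -> list[float]:
    """Return a new stage list in which the longest stage is split into two equal halves."""
    i = stages.index(max(stages))
    half = stages[i] / 2
    return stages[:i] + [half, half] + stages[i + 1:]


# Stage delays from the exercise, in ps: IF, ID, EX, MEM, WB.
stages = [300, 400, 350, 500, 100]

cct_single = sum(stages)                         # 1650 ps
cct_pipelined = max(stages)                      # 500 ps
latency_pipelined = cct_pipelined * len(stages)  # 2500 ps
speedup = cct_single / cct_pipelined             # 3.3

split = split_longest(stages)                    # MEM becomes two 250 ps stages
cct_split = max(split)                           # 400 ps (ID is now the longest)
latency_split = cct_split * len(split)           # 2400 ps over 6 stages
speedup_vs_single = cct_single / cct_split       # 4.125
speedup_vs_pipelined = cct_pipelined / cct_split # 1.25
```

Splitting any stage other than the longest one would leave the clock cycle time at $500 ps$, which is why MEM is the best choice.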

Speedup

The speedup is the improvement in execution speed of a task executed on two similar architectures with different resources
Speedup can be defined for two different types of quantities: latency and throughput
- $S_{L} = \frac{Latency _{old}}{Latency _{new}}$
- $S_{T} = \frac{Throughput _{new}}{Throughput _{old}}$

Amdahl’s Law

$Speedup (N) = \frac{1}{( 1 - P ) + \frac{P}{N}} = \frac{ET _{old}}{ET _{new}}$

$N$ is the number of processors
$P$ is the fraction of the program that can be parallelized
$1 - P$ is the fraction of the program that must be executed sequentially
$Speedup (N)$ is the speedup of the program when executed on $N$ processors
$ET_{old}$ and $ET_{new}$ are the execution times of the program before and after the improvement (resp.)
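
A minimal Python sketch of Amdahl's Law as stated above. The function name, the input checks, and the sample values ($P = 0.9$, $N = 10$) are illustrative assumptions.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup on n processors when a fraction p of the program is parallelizable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Hypothetical program: 90% parallelizable, run on 10 processors.
s = amdahl_speedup(0.9, 10)   # 1 / (0.1 + 0.09), roughly 5.26

# The sequential fraction bounds the speedup: as n grows, it approaches 1 / (1 - p).
limit = 1.0 / (1.0 - 0.9)     # roughly 10, however many processors are added
```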
