#### Fundamentals of Computer Systems: Caches

Stephen A. Edwards

**Columbia University** 

#### Summer 2021

Illustrations Copyright © 2007 Elsevier

#### **Computer Systems**

Performance depends on which is slower: the processor or the memory system



#### Memory Speeds Haven't Kept Up



Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.

#### Your Choice of Memories

| Technology | Cost (\$/Gb) | Access Time (ns) | Density (Gb/cm²) |
|------------|--------------|------------------|------------------|
| SRAM       | 30 000       | 0.5              | 0.00025          |
| DRAM       | 10           | 100              | 1 – 16           |
| Flash      | 2            | 300*             | 8 – 32           |
| Hard Disk  | 0.1          | 10 000 000       | 500 – 2000       |

\*Read speed; writing much, much slower

#### **Memory Hierarchy**

Fundamental trick to making a big memory appear fast

#### A Modern Memory Hierarchy

A desktop machine: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

| Level           | Size   | Tech.    |
|-----------------|--------|----------|
| L1 Instruction* | 64 K   | SRAM     |
| L1 Data*        | 64 K   | SRAM     |
| L2*             | 512 K  | SRAM     |
| L3              | 2 MB   | SRAM     |
| Memory          | 4 GB   | DRAM     |
| Disk            | 500 GB | Magnetic |

\*per core

#### **Temporal Locality**

#### FIRST BOOK

#### DEFINITIONS.

1. A point is that which is without parts.

2. A line is length without breadth.

3. The extremities of a line are points.

4. A right line, is that which lies evenly between its extremities.

5. A superficies, is that which has only length and breadth.

6. The boundings of a superficies are lines.

7. A plane superficies, is that which lies evenly between its extreme right lines.

8. A rectilineal angle, is the inclination of two right lines to each other, which touch, but do not form one straight line.

An angle is designated either by one letter at the vertex; or three, of which the middle one is at the vertex, the remaining two any place on the legs.

9. The legs of an angle, are the lines which make the angle.

10. The vertex of an angle is the point in which the legs mutually touch each other.

What path do your eyes take when you read this?

Did you look at the drawings more than once?

Euclid's Elements

#### **Spatial Locality**



If you need something, you may also need something nearby

#### **Memory Performance**

Hit: Data is found in this level of the memory hierarchy

Miss: Data not found; will look in next level

Hit Rate =  $\frac{\text{Number of hits}}{\text{Number of accesses}}$ 

Miss Rate =  $\frac{\text{Number of misses}}{\text{Number of accesses}}$ 

Hit Rate + Miss Rate = 1



The expected access time  $E_L$  for a memory level L with latency  $t_L$  and miss rate  $M_L$ :

 $E_L = t_L + M_L \cdot E_{L+1}$ 
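Unrolling the recurrence for a three-level hierarchy may make it more concrete. This is only a sketch, and it assumes the last level always hits (so $E_2 = t_2$), which the formula above leaves implicit:

 $E_0 = t_0 + M_0 \cdot E_1 = t_0 + M_0 \cdot (t_1 + M_1 \cdot E_2) = t_0 + M_0 \cdot t_1 + M_0 \cdot M_1 \cdot t_2$

Each slower level contributes its latency weighted by the probability that every faster level missed.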

#### Memory Performance Example

Two-level hierarchy: cache and main memory

Program executes 1000 loads and stores

750 of these are found in the cache

What's the cache hit and miss rate?

> Hit Rate =  $\frac{750}{1000}$  = 75%
>
> Miss Rate = 1 – 0.75 = 25%

If the cache takes 1 cycle and the main memory 100, what's the expected access time?

Expected access time of main memory:  $E_1 = 100$  cycles

Access time for the cache:  $t_0 = 1$  cycle

Cache miss rate:  $M_0 = 0.25$ 

 $E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26$  cycles
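The same calculation can be sketched in Python. The function name `expected_access_time` and its list-based interface are illustrative choices, not part of the slides, and the last level is assumed to always hit:

```python
def expected_access_time(latencies, miss_rates):
    """Expected access time E_0 for a hierarchy listed fastest to slowest.

    latencies[i] is t_i in cycles; miss_rates[i] is M_i for every level
    except the last, which is assumed to always hit (E_last = t_last).
    """
    if len(miss_rates) != len(latencies) - 1:
        raise ValueError("need one miss rate per level except the last")
    expected = latencies[-1]              # E of the last level
    # Work backwards through the levels: E_L = t_L + M_L * E_{L+1}
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        expected = t + m * expected
    return expected


hits, accesses = 750, 1000
miss_rate = 1 - hits / accesses           # 0.25
print(expected_access_time([1, 100], [miss_rate]))   # 26.0
```

Adding a level only means adding one more latency and one more miss rate to the lists.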

#### Cache

Highest levels of the memory hierarchy

Fast: level 1 typically 1 cycle access time

With luck, supplies most data

Cache design questions:

What data does it hold? Recently accessed

How is data found? Simple address hash

What data is replaced? Often the oldest
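One common reading of "simple address hash" is to slice the address into fields: the low bits pick a byte within a block, the next bits pick a set, and the remaining bits form the tag. A minimal sketch, assuming power-of-two sizes; the name `split_address` and its parameters are illustrative, not from the slides:

```python
def split_address(addr, num_sets, block_bytes):
    """Return (tag, set_index, byte_offset) for a byte address.

    Assumes num_sets and block_bytes are powers of two, so the
    divisions below amount to picking bit fields out of the address.
    """
    byte_offset = addr % block_bytes
    block_number = addr // block_bytes
    set_index = block_number % num_sets
    tag = block_number // num_sets
    return tag, set_index, byte_offset


# A cache with 8 sets, each holding one 4-byte word
print(split_address(0x4, 8, 4))    # (0, 1, 0)
print(split_address(0x24, 8, 4))   # (1, 1, 0): same set as 0x4, different tag
```

Two addresses that yield the same set index but different tags compete for the same place in the cache.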

#### What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.

Real cache: never that smart

#### **Caches Exploit**

#### **Temporal Locality**

Copy newly accessed data into cache, replacing oldest if necessary

#### **Spatial Locality**

Copy nearby data into the cache at the same time

Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.





#### **Direct-Mapped Cache Behavior**

A dumb loop: repeat 5 times

load from 0x4; load from 0xC; load from 0x8.

          li    $t0, 5
    l1:   beq   $t0, $0, done
          lw    $t1, 0x4($0)
          lw    $t2, 0xC($0)
          lw    $t3, 0x8($0)
          addiu $t0, $t0, -1
          j     l1
    done:

Memory address 0x4: tag 00...00, set 001, byte offset 00

Cache when reading 0x4 the last time:

| Set     | V | Tag     | Data           |
|---------|---|---------|----------------|
| 7 (111) | 0 |         |                |
| 6 (110) | 0 |         |                |
| 5 (101) | 0 |         |                |
| 4 (100) | 0 |         |                |
| 3 (011) | 1 | 00...00 | mem[0x00...0C] |
| 2 (010) | 1 | 00...00 | mem[0x00...08] |
| 1 (001) | 1 | 00...00 | mem[0x00...04] |
| 0 (000) | 0 |         |                |

Assuming the cache starts empty, what's the miss rate?

> Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
>
> Results: M M M H H H H H H H H H H H H
>
> Miss rate = 3/15 = 0.2 = 20%
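The miss count can be checked with a small simulation. This is a sketch under the slide's assumptions (8 sets, one 4-byte word per block, cache initially empty); the function name `direct_mapped_misses` is illustrative:

```python
def direct_mapped_misses(addresses, num_sets=8, block_bytes=4):
    """Count misses in a direct-mapped cache that starts empty."""
    tags = [None] * num_sets             # None plays the role of V = 0
    misses = 0
    for addr in addresses:
        block_number = addr // block_bytes
        set_index = block_number % num_sets
        tag = block_number // num_sets
        if tags[set_index] != tag:       # invalid entry or wrong tag: miss
            misses += 1
            tags[set_index] = tag        # fetch the block, evicting the old one
    return misses


loop = [0x4, 0xC, 0x8] * 5               # the three loads, repeated 5 times
misses = direct_mapped_misses(loop)
print(misses, len(loop), misses / len(loop))   # 3 15 0.2
```

Only the first pass through the loop misses; each address lands in its own set, so nothing is ever evicted.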

#### Direct-Mapped Cache: Conflict










#### 2-Way Set Associative Behavior

          li    $t0, 5
    l1:   beq   $t0, $0, done
          lw    $t1, 0x4($0)
          lw    $t2, 0x24($0)
          addiu $t0, $t0, -1
          j     l1
    done:

Assuming the cache starts empty, what's the miss rate?

> Accesses: 4 24 4 24 4 24 4 24 4 24
>
> Results: M M H H H H H H H H
>
> Miss rate = 2/10 = 0.2 = 20%

Associativity reduces conflict misses

| Set | Way 1: V | Way 1: Tag | Way 1: Data    | Way 0: V | Way 0: Tag | Way 0: Data    |
|-----|----------|------------|----------------|----------|------------|----------------|
| 3   | 0        |            |                | 0        |            |                |
| 2   | 0        |            |                | 0        |            |                |
| 1   | 1        | 00...10    | mem[0x00...24] | 1        | 00...00    | mem[0x00...04] |
| 0   | 0        |            |                | 0        |            |                |
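To see the effect of associativity on this loop, the sketch below runs the same accesses through a direct-mapped and a 2-way cache. It assumes one-word (4-byte) blocks and least-recently-used replacement, which is one way to read "often the oldest"; the function name `set_associative_misses` is illustrative:

```python
from collections import OrderedDict


def set_associative_misses(addresses, num_sets, ways, block_bytes=4):
    """Count misses in an initially empty set-associative cache (LRU assumed)."""
    # One OrderedDict per set: keys are tags, ordered oldest to newest use.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block_number = addr // block_bytes
        lines = sets[block_number % num_sets]
        tag = block_number // num_sets
        if tag in lines:
            lines.move_to_end(tag)           # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)    # set is full: evict the least recently used
            lines[tag] = True
    return misses


loop = [0x4, 0x24] * 5
print(set_associative_misses(loop, num_sets=8, ways=1))   # 10: direct-mapped, both map to set 1
print(set_associative_misses(loop, num_sets=4, ways=2))   # 2: both blocks fit in set 1
```

With `ways=1` the function behaves as a direct-mapped cache, and with `num_sets=1` as a fully associative one.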

#### An Eight-way Fully Associative Cache

| Set | Way 7      | Way 6      | Way 5      | Way 4      | Way 3      | Way 2      | Way 1      | Way 0      |
|-----|------------|------------|------------|------------|------------|------------|------------|------------|
| 0   | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data |

No conflict misses: only compulsory or capacity misses (capacity: the cache runs out of space)

Either very expensive or slow because of all the associativity



#### Direct-Mapped Cache Behavior w/ 4-word block

          li    $t0, 5
    l1:   beq   $t0, $0, done
          lw    $t1, 0x4($0)
          lw    $t2, 0xC($0)
          lw    $t3, 0x8($0)
          addiu $t0, $t0, -1
          j     l1
    done:

Assuming the cache starts empty, what's the miss rate?

> Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
>
> Results: M H H H H H H H H H H H H H H
>
> Miss rate = 1/15 ≈ 0.067 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality
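A compulsory miss happens the first time a block is touched, so counting the distinct blocks in an access stream gives the number of compulsory misses. A minimal sketch, assuming byte addresses; it deliberately ignores conflict and capacity misses, and the name `compulsory_misses` is illustrative:

```python
def compulsory_misses(addresses, block_bytes):
    """Number of distinct blocks touched: one first-time miss per block."""
    return len({addr // block_bytes for addr in addresses})


loop = [0x4, 0xC, 0x8] * 5
print(compulsory_misses(loop, 4))     # 3: one-word blocks
print(compulsory_misses(loop, 16))    # 1: four-word blocks, all three words share a block
```

With 16-byte blocks, the miss on 0x4 also brings in 0x8 and 0xC, which is why only the very first access misses.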

#### The Desktop Machine Revisited

Cache size = block size × ways × sets

On-chip caches: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

| Cache | Size  | Sets | Ways   | Block   |
|-------|-------|------|--------|---------|
| L1I*  | 64 K  | 512  | 2-way  | 64-byte |
| L1D*  | 64 K  | 512  | 2-way  | 64-byte |
| L2*   | 512 K | 512  | 16-way | 64-byte |
| L3    | 2 MB  | 1024 | 32-way | 64-byte |

\*per core
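The sizes in the table are consistent with capacity = sets × ways × block size. A quick check, assuming K means 1024 bytes and MB means 2^20 bytes; the dictionary layout is just for illustration:

```python
# (sets, ways, block bytes, stated size in bytes), copied from the table
caches = {
    "L1I": (512, 2, 64, 64 * 1024),
    "L1D": (512, 2, 64, 64 * 1024),
    "L2": (512, 16, 64, 512 * 1024),
    "L3": (1024, 32, 64, 2 * 1024 * 1024),
}

for name, (sets, ways, block_bytes, stated) in caches.items():
    computed = sets * ways * block_bytes
    print(name, computed, computed == stated)   # the last column is True for every row
```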

#### Intel On-Chip Caches

| Chip        | Year | Freq. (MHz) | L1 Data      | L1 Instr           | L2                    |
|-------------|------|-------------|--------------|--------------------|-----------------------|
| 80386       | 1985 | 16–25       | off-chip     | off-chip           | none                  |
| 80486       | 1989 | 25–100      | 8K unified   | (unified)          | off-chip              |
| Pentium     | 1993 | 60–300      | 8K           | 8K                 | off-chip              |
| Pentium Pro | 1995 | 150–200     | 8K           | 8K                 | 256K–1M (MCM)         |
| Pentium II  | 1997 | 233–450     | 16K          | 16K                | 256K–512K (Cartridge) |
| Pentium III | 1999 | 450–1400    | 16K          | 16K                | 256K–512K             |
| Pentium 4   | 2001 | 1400–3730   |              | 12k op trace cache | 256K–2M               |
| Pentium M   | 2003 | 900–2130    | 32K          | 32K                | 1M–2M                 |
| Core 2 Duo  | 2005 | 1500–3000   | 32K per core | 32K per core       | 2M–6M                 |