**Computer Science 703 Advanced Computer Architecture**

2006 Semester 2

Lecture Notes, 25May06

**VLIW & EPIC Architectures**

James Goodman

Department of Computer Science

5/26/2006 CS703

Sources for this lecture:

- Intel Itanium Architecture Software Developer's Manual
- Mark Smotherman, "Understanding EPIC architectures and implementations", from ACM Southeast Conference, 2002

# Background

**VLIW, 1982**

- Josh Fisher, Yale University, Multiflow
- Want to group multiple instructions into a single "long instruction" that is executed on different functional units in parallel.
- Effect is very similar to a very long pipeline: branches (and cache misses) are deadly
- Ignored cache misses by ignoring caches
- Greatest innovation: *Trace scheduling*: capturing larger blocks of parallelism by predicting the most likely path through multiple basic blocks, then adding fixup code where the wrong branch was predicted.

**Itanium Architecture (IPF: Itanium Processor Family)**

- HP: Explicitly Parallel Instruction Computing (EPIC)
- History: came through HP: 2 separate histories
  - Bob Rau/Mike Schlansker: Cydrome
  - Josh Fisher: Multiflow

# Tasks for ILP Execution

Three steps for capturing ILP

1. Check dependencies between instructions to determine which instructions can be grouped together for parallel execution
2. Assign instructions to the functional units on the hardware
3. Determine when instruction begins execution

## Four Architectural Models

Each of these tasks can be performed at least partially at compile time

1. Compiler indicates which instructions can be executed concurrently (or hardware infers it from the order).
2. Compiler designates a functional unit for each instruction (or the hardware dynamically assigns a free one).
3. Compiler indicates exactly which instructions should be initiated in each cycle (or hardware assures that resources are/will be free and issues when ready).

# Four Classes of Architecture

- VLIW: Compiler determines which instructions are assigned to which FU (a very long instruction word)
  - Highly restricted; implementation is architecture (# of functional units determines code!)
- Dynamic VLIW: Compiler does grouping, FU assignment; hardware determines execution time
  - Can respond to events that cannot be anticipated by compiler (like data cache misses)
- EPIC: Compiler does grouping; FU assignment, initiation determined by hardware
  - Functional units dynamically scheduled, so architecture not tied to implementation
  - Still major benefit of compiling to specific implementation.
- Superscalar processors: all three done in hardware

# Four Levels of Compiler Contribution

|              | Grouping | Functional unit assignment | Initiation |
|--------------|----------|----------------------------|------------|
| Superscalar  | Hardware | Hardware                   | Hardware   |
| EPIC         | Compiler | Hardware                   | Hardware   |
| Dynamic VLIW | Compiler | Compiler                   | Hardware   |
| VLIW         | Compiler | Compiler                   | Compiler   |

Table 1. Four Major Categories of ILP Architectures.

Mark Smotherman, "Understanding EPIC architectures and implementations," Southeast ACM Conference 2002
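Table 1 can be restated as a small lookup, which makes the pattern visible: moving from superscalar to VLIW, the compiler takes over one more task at each step. This is only an illustrative Python sketch; the names `TASKS`, `ARCHITECTURES`, `compiler_tasks` and `classify` are hypothetical, and the data is copied from the table.

```python
# The three ILP tasks, in the order used by Table 1.
TASKS = ("grouping", "fu_assignment", "initiation")

# Who performs each task, copied row by row from Table 1.
ARCHITECTURES = {
    "Superscalar":  ("hardware", "hardware", "hardware"),
    "EPIC":         ("compiler", "hardware", "hardware"),
    "Dynamic VLIW": ("compiler", "compiler", "hardware"),
    "VLIW":         ("compiler", "compiler", "compiler"),
}


def compiler_tasks(arch):
    """List the tasks that the compiler performs for the given class."""
    return [t for t, who in zip(TASKS, ARCHITECTURES[arch]) if who == "compiler"]


def classify(grouping, fu_assignment, initiation):
    """Return the class whose row matches, or None if no row of Table 1 matches."""
    key = (grouping, fu_assignment, initiation)
    for name, row in ARCHITECTURES.items():
        if row == key:
            return name
    return None


for name in ARCHITECTURES:
    print(name, "->", compiler_tasks(name) or "nothing at compile time")
```

A combination such as hardware grouping with compiler FU assignment returns `None`, since it is not one of the four categories in the table.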


- Example sequence: C = A + B

  1. Load R1, A
  2. Load R2, B
  3. Add R3, R1, R2
  4. Store C, R3

- Instructions 1 & 2 can be executed concurrently; 3 depends on 1 & 2; 4 depends on 3
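The dependency check for this example can be sketched in a few lines of Python. This is a minimal illustration, not a compiler algorithm from the lecture: the `Instr` record and the functions `dependencies` and `group_parallel` are hypothetical names, and only read-after-write dependencies are tracked.

```python
from collections import namedtuple

# Hypothetical instruction record: the location it writes and the ones it reads.
Instr = namedtuple("Instr", ["text", "writes", "reads"])

program = [
    Instr("Load R1, A", writes="R1", reads=("A",)),
    Instr("Load R2, B", writes="R2", reads=("B",)),
    Instr("Add R3, R1, R2", writes="R3", reads=("R1", "R2")),
    Instr("Store C, R3", writes="C", reads=("R3",)),
]


def dependencies(program):
    """Map each instruction index to the earlier indices whose results it reads."""
    deps = {}
    last_writer = {}
    for i, ins in enumerate(program):
        deps[i] = {last_writer[r] for r in ins.reads if r in last_writer}
        last_writer[ins.writes] = i
    return deps


def group_parallel(program):
    """Put each instruction in the earliest group that follows all its producers."""
    if not program:
        return []
    deps = dependencies(program)
    level = {}
    for i in range(len(program)):
        level[i] = 1 + max((level[d] for d in deps[i]), default=-1)
    groups = [[] for _ in range(max(level.values()) + 1)]
    for i, lv in level.items():
        groups[lv].append(i + 1)  # 1-based numbering, as in the slide
    return groups


print(group_parallel(program))  # [[1, 2], [3], [4]]
```

The result agrees with the slide: instructions 1 and 2 form one group, 3 must wait for both, and 4 must wait for 3.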


# VLIW Bundles

| Ld/St unit 0 | Ld/St unit 1 | integer ALU    | branch unit |
|--------------|--------------|----------------|-------------|
| Load R1, A   | Load R2, B   | nop            | nop         |
| nop          | nop          | nop            | nop         |
| nop          | nop          | Add R3, R1, R2 | nop         |
| Store C, R3  | nop          | nop            | nop         |

- Multiflow reduced instruction size by compressing instructions to save space
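The bundle table above can be reproduced with a small greedy scheduler. This is a sketch under assumptions that are not stated on the slide: a load result is usable two cycles after issue (inferred from the all-nop second bundle), every other operation takes one cycle, and only read-after-write dependencies are tracked. The names `SLOTS`, `UNITS`, `LATENCY` and `schedule` are hypothetical.

```python
SLOTS = ["Ld/St unit 0", "Ld/St unit 1", "integer ALU", "branch unit"]

# (text, kind, destination, sources), taken from the example sequence.
PROGRAM = [
    ("Load R1, A", "load", "R1", []),
    ("Load R2, B", "load", "R2", []),
    ("Add R3, R1, R2", "alu", "R3", ["R1", "R2"]),
    ("Store C, R3", "store", "C", ["R3"]),
]

# Assumed latencies in cycles (not given on the slide).
LATENCY = {"load": 2, "alu": 1, "store": 1}

# Slot indices that can execute each kind of operation.
UNITS = {"load": [0, 1], "store": [0, 1], "alu": [2]}


def schedule(program):
    """Place each operation in the earliest bundle with ready operands and a free slot."""
    ready_at = {}  # register -> first cycle in which its value can be read
    bundles = []   # each bundle is a list with one entry per slot
    for text, kind, dest, srcs in program:
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        while True:
            while len(bundles) <= cycle:
                bundles.append(["nop"] * len(SLOTS))  # pad with empty bundles
            free = [u for u in UNITS[kind] if bundles[cycle][u] == "nop"]
            if free:
                bundles[cycle][free[0]] = text
                ready_at[dest] = cycle + LATENCY[kind]
                break
            cycle += 1  # no suitable unit free in this bundle, try the next
    return bundles


for bundle in schedule(PROGRAM):
    print(" | ".join(bundle))
```

Four operations occupy four bundles of four slots each, so twelve of the sixteen slots are nops. That overhead is what instruction compression targets.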


1. Predicated execution
2. Unbundled branches
3. Compiler control of the memory hierarchy
4. Control speculation
5. Data speculation

#### **EPIC Specification of Bundles**

# **IPF** Format

# Compiler hints to memory hierarchy

- Compiler can predict temporal locality quite well
- Provides hints:
  - Indicate temporal locality at L1
  - Indicate no temporal locality at L1
  - No temporal locality at L2
  - No temporal locality at all levels

# **Control speculation**

- Hoist loads ahead of branches: if you didn't need it, not much lost
- Problem: what if load causes an exception?
- Solution: explicitly speculative load
  - Load causing exception returns tagged result (NaT: Not a Thing, or NaTVal: Not a Thing Value for FP)
  - Speculation check instruction raises exception if NaT still around
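The deferred-exception idea can be simulated in Python. This is a toy model, not Itanium semantics in detail: `MEMORY`, `speculative_load`, `add`, `speculation_check` and `DeferredFault` are hypothetical names, a missing dictionary key stands in for a faulting address, and the propagation of NaT through the add is modelled explicitly.

```python
class DeferredFault(Exception):
    """Raised when a deferred (NaT) result turns out to be needed."""


class _NaT:
    """Marker standing in for the 'Not a Thing' tag on a register."""

    def __repr__(self):
        return "NaT"


NAT = _NaT()

# Hypothetical memory: any address not listed here "faults".
MEMORY = {0x100: 42}


def speculative_load(addr):
    """Speculative load: defer a fault by returning NaT instead of raising."""
    if addr not in MEMORY:
        return NAT
    return MEMORY[addr]


def add(a, b):
    """Arithmetic on a NaT operand produces NaT, so the tag stays around."""
    if a is NAT or b is NAT:
        return NAT
    return a + b


def speculation_check(value):
    """Speculation check: raise only if the value is still NaT when needed."""
    if value is NAT:
        raise DeferredFault("speculative load faulted and its result is needed")
    return value


def compute(addr, need_value):
    x = speculative_load(addr)  # hoisted above the branch
    y = add(x, 1)               # dependent work, also done early
    if need_value:              # the original branch
        return speculation_check(y)
    return 0                    # value never needed: the fault is never seen


print(compute(0x100, True))    # 43
print(compute(0xDEAD, False))  # 0, no exception although the load "faulted"
```

Calling `compute(0xDEAD, True)` raises `DeferredFault`, which corresponds to the check finding a NaT on the path that actually uses the value.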


# **Data speculation**

- Hoist load instructions earlier
- Problem: aliasing: compiler often can't disambiguate pointers, so how can a load safely pass a store?
- Solution: explicitly speculative load
  - Advanced Load Address Table (ALAT) records the addresses of speculative loads
  - Followed by data-verifying load instruction
  - If a store to that address has occurred, data-verifying load re-executes load instruction
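The interaction between the speculative load, an intervening store and the data-verifying load can be sketched as a toy simulation in Python. Everything here is an assumption made for illustration: the class and method names are hypothetical, entries are keyed by destination register, and addresses are compared for exact equality, whereas real hardware compares address ranges and has a limited number of entries.

```python
class ALAT:
    """Toy Advanced Load Address Table: register -> address it was loaded from."""

    def __init__(self):
        self.entries = {}

    def record(self, reg, addr):
        self.entries[reg] = addr

    def snoop_store(self, addr):
        # A store to a tracked address removes every matching entry.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def is_valid(self, reg):
        return reg in self.entries


class Machine:
    def __init__(self, memory):
        self.mem = dict(memory)
        self.regs = {}
        self.alat = ALAT()

    def advanced_load(self, reg, addr):
        """Speculative load hoisted above possibly aliasing stores."""
        self.regs[reg] = self.mem[addr]
        self.alat.record(reg, addr)

    def store(self, addr, value):
        self.mem[addr] = value
        self.alat.snoop_store(addr)

    def check_load(self, reg, addr):
        """Data-verifying load: reload only if the entry was removed by a store."""
        if self.alat.is_valid(reg):
            return False                 # speculation succeeded, nothing to do
        self.regs[reg] = self.mem[addr]  # re-execute the load
        return True


m = Machine({0x10: 1, 0x20: 2})

m.advanced_load("r1", 0x10)
m.store(0x20, 99)                        # different address: no conflict
print(m.check_load("r1", 0x10), m.regs["r1"])  # False 1

m.advanced_load("r2", 0x20)
m.store(0x20, 7)                         # same address: entry removed
print(m.check_load("r2", 0x20), m.regs["r2"])  # True 7
```

In the first case the hoisted load was safe and the check costs almost nothing. In the second the store aliased the load, so the check repeats the load and picks up the new value.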


## **Two Variations of Check**

- Case 1: Check occurs before loaded value is used (ld.c)
  - If unsuccessful, the load is repeated and execution continues
- Case 2: Check occurs after loaded value has been used to generate other values (chk.a)
  - If unsuccessful, chk.a branches to compiler-generated recovery code.
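Case 2 differs from Case 1 because repeating the load alone is not enough: every value derived from it must be recomputed as well. The Python sketch below models that with a hypothetical function `run`. A set of tracked addresses stands in for the ALAT, and the branch to recovery code is an ordinary `if`. It illustrates the idea only and does not model the actual instruction.

```python
def run(memory, p, q, store_value):
    """Compute 2 * memory[q] after the store memory[p] = store_value,
    with the load of memory[q] hoisted above the store."""
    mem = dict(memory)
    tracked = set()  # stands in for the ALAT

    # Advanced load, hoisted above the store; its address is tracked.
    x = mem[q]
    tracked.add(q)
    y = x * 2        # the loaded value is already used to produce another value

    # The ambiguous store; a matching address removes the tracked entry.
    mem[p] = store_value
    tracked.discard(p)

    # Check in the style of chk.a: on failure, go to recovery code.
    recovered = q not in tracked
    if recovered:
        x = mem[q]   # recovery code: redo the load...
        y = x * 2    # ...and everything that depended on it
    return y, recovered


print(run({0: 5, 1: 6}, p=1, q=0, store_value=9))  # (10, False)
print(run({0: 5, 1: 6}, p=0, q=0, store_value=9))  # (18, True)
```

When `p` and `q` differ, the speculative result stands. When they alias, the recovery path recomputes `y` from the stored value, which is what the original program order would have produced.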

#### ALAT

#### 4.4.5.1 Data Speculation Concepts

An ambiguous memory dependency is said to exist between a store (or any operation that may update memory state) and a load when it cannot be statically determined whether the load and store might access overlapping regions of memory. For convenience, a store that cannot be statically disambiguated relative to a particular load is said to be ambiguous relative to that load. In such cases, the compiler cannot change the order in which the load and store instructions were originally specified in the program. To overcome this scheduling limitation, a special kind of load instruction called an advanced load can be scheduled to execute earlier than one or more stores that are ambiguous relative to that load.
