

|  | <b>Interaction of</b> <ul> <li>Virtually-addressed cache</li> <li>Multi-level cache</li> <li>Cache coherence</li> <li>Non-blocking cache</li> </ul> | <ul> <li><b>Scalable Memory Systems (2)</b></li> <li>Directory-based protocols vs. snooping <ul> <li>Snooping has serious limitations of scale</li> <li>Directory-based is always slower, but scalable</li> <li>Basic protocol is simpler (3 states), but requires more serial events</li> </ul> </li> <li>Maintaining a sharing list in the directory</li> <li>Distributed writes (why they are slow)</li> <li>Dealing with races</li> </ul> |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  | <ul> <li><b>Better Programming Models</b></li> <li>Critical sections</li> <li>Atomic RMW operations <ul> <li>T&amp;S, T&amp;T&amp;S, Atomic Swap</li> <li>Compare &amp; Swap</li> <li>LL/SC</li> </ul> </li> <li>Notion of transactional memory <ul> <li>Atomic insertion of a transaction (linearizability)</li> <li>Hardware support (SLE)</li> <li>Implementing Transactional Memory <ul> <li>Hardware, software</li> </ul> </li> </ul></li></ul> | <ul> <li><b>Since the Test</b></li> <li>Evaluating Performance</li> <li>ISAs (Wulf) <ul> <li>Importance of Regularity, Orthogonality, &amp; Composability</li> <li>Primitives, not solutions</li> <li>Run-time vs. Compile-time trade-offs</li> </ul> </li> </ul> |
| <ul> <li><b>Instruction-Level Parallelism</b></li> <li>How to capture ILP <ul> <li>Discover inter-instruction dependences</li> <li>Assign instructions to functional units</li> <li>Determine when instructions execute</li> </ul> </li> <li>Branching <ul> <li>Cost of branching</li> <li>Prediction: costs and problems</li> <li>Speculation</li> </ul> </li> </ul> | <ul> <li><b>ILP: OoO Execution</b></li> <li>Hazards: RAW, WAR, WAW</li> <li>Busy Bits for synchronization</li> <li>Tomasulo Algorithm</li> <li>Register Renaming</li> <li>The "Imprecise interrupt"</li> </ul> | <ul> <li><b>EPIC Architectures and Itanium</b></li> <li>How much to do at compile time?</li> <li>Four architectural models <ul> <li>VLIW</li> <li>Dynamic VLIW</li> <li>EPIC</li> <li>Superscalar</li> </ul> </li> <li><b>Predicated execution</b></li> <li>Compiler hints for memory hierarchy</li> <li>Control speculation <ul> <li>Dealing with exceptions</li> <li>Data speculation: the ALAT</li> </ul> </li> </ul> |
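
As a reminder of what the three hazard names in the OoO slide mean, here is a minimal sketch that classifies the hazards between an earlier and a later instruction. The `Instr` type and the `hazards` helper are hypothetical names invented for this illustration, and the model assumes each instruction writes exactly one register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical toy instruction: one destination, several sources."""
    dest: str      # register written
    srcs: tuple    # registers read


def hazards(first, second):
    """Return the hazards `second` has with the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads a value that first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites a register first still reads
    if first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


add = Instr(dest="r1", srcs=("r2", "r3"))   # r1 = r2 + r3
sub = Instr(dest="r4", srcs=("r1", "r5"))   # r4 = r1 - r5
mul = Instr(dest="r2", srcs=("r6", "r7"))   # r2 = r6 * r7

print(hazards(add, sub))   # {'RAW'}
print(hazards(add, mul))   # {'WAR'}
```

WAR and WAW are name dependences rather than true data dependences, which is why register renaming can remove them while RAW must still be honored.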

## Cray Architectures

- Multiple register sets
- Vector instructions
  - vector registers
  - loads/stores with a stride
  - chaining

PRESENTATION 2006

- very high memory bandwidth
- Memory hierarchy (no cache)
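
The strided loads and chaining listed above can be sketched in a few lines. This is a simplified model, not a description of any specific Cray machine: the function names are hypothetical, the startup latencies are made-up numbers, and each functional unit is assumed to produce one element per cycle once it has started.

```python
def strided_load(memory, base, stride, length):
    """Gather `length` elements starting at `base`, stepping by `stride`."""
    return [memory[base + i * stride] for i in range(length)]


def unchained_cycles(n, startup_a, startup_b):
    """Without chaining, the second op waits for the entire first result."""
    return (startup_a + n) + (startup_b + n)


def chained_cycles(n, startup_a, startup_b):
    """With chaining, each element is forwarded as soon as it is produced."""
    return startup_a + startup_b + n


# Column 1 of a 4x4 row-major matrix: base address 1, stride 4.
memory = list(range(16))
print(strided_load(memory, base=1, stride=4, length=4))   # [1, 5, 9, 13]

# 64-element vectors with assumed startup latencies of 6 and 7 cycles.
print(unchained_cycles(64, 6, 7), chained_cycles(64, 6, 7))   # 141 77
```

Under these assumptions chaining removes one full pass over the vector, so the saving grows with vector length.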