Processor parallelism considerations and memory latency reduction in shared memory multiprocessors
Lilja, David John
Permalink
https://hdl.handle.net/2142/22386
Description
Issue Date
1991
Doctoral Committee Chair(s)
Yew, Pen-Chung
Department of Study
Electrical and Computer Engineering
Discipline
Electrical and Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Engineering, Electronics and Electrical
Computer Science
Language
eng
Abstract
A wide variety of computer architectures have been proposed to exploit parallelism at different granularities. These architectures have significant differences in instruction scheduling constraints, memory latencies, and synchronization overhead, making it difficult to determine which architecture can achieve the best performance on a given program. Trace-driven simulations and analytic models are used to compare the instruction-level parallelism of a superscalar processor and a pipelined processor with the loop-level parallelism of a shared memory multiprocessor. It is shown that the maximum speedup for a loop with a cyclic dependence graph is limited by its critical dependence ratio, independent of the number of iterations in the loop. The fine-grained processors are better suited for executing these loops with cyclic dependence graphs, while the multiprocessor has better performance on the very parallel loops with acyclic dependence graphs. When executing programs with a variety of loops and sequential code, the best performance is obtained using a multiprocessor architecture in which each individual processor has a fine-grained parallelism of two to four.
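To make the distinction concrete, the C fragment below (an illustrative sketch only, not code from the dissertation) contrasts a loop whose dependence graph is cyclic with one whose graph is acyclic. On this reading, the critical dependence ratio can be thought of as the work in one iteration divided by the work lying on the dependence cycle, so neither more iterations nor more processors can raise the speedup of the first loop past that ratio, while the second loop scales with the number of processors.

/* Illustrative only -- not code from the dissertation. */

#define N 1024

double a[N], b[N], c[N];

/* Cyclic dependence graph: a[i] depends on a[i-1], so the iterations
 * form a chain.  Extra processors cannot shorten the chain; only
 * finer-grained overlap of the work inside each iteration helps. */
void recurrence(void)
{
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] * b[i] + c[i];
}

/* Acyclic dependence graph: every iteration is independent, so the
 * loop is well suited to loop-level (multiprocessor) parallelism. */
void independent(void)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
}

int main(void)
{
    recurrence();
    independent();
    return 0;
}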
A major problem with this type of shared memory multiprocessor architecture is the long latency in fetching operands from the shared memory. Private data caches are an effective means of reducing this latency, but they introduce the complexity of a cache coherence mechanism. Both hardware and software schemes have been proposed for maintaining coherence in these systems. Unfortunately, hardware schemes have very high memory requirements, and software schemes rely on imprecise compile-time memory disambiguation. A new compiler-assisted directory coherence mechanism is proposed that combines the best aspects of the hardware and software approaches while eliminating many of their disadvantages. The pointer cache directory significantly reduces the size of a hardware directory by dynamically binding pointers to cache blocks only when the blocks are actually referenced. Compiler optimizations can further reduce the size of the directory by signaling the hardware to allocate pointers only when they are needed. Detailed trace-driven simulations show that the performance of this new approach is comparable to other coherence schemes, but with significantly lower memory requirements.
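The C sketch below is an assumption of how such a pointer-based directory might look, not the dissertation's actual design; the names, sizes, and structure are hypothetical. It shows the basic idea of binding a directory pointer to a cache block only when the block is first referenced, so directory storage scales with the number of blocks actually cached rather than with the size of shared memory, whereas a full-map hardware directory must reserve presence bits for every memory block whether or not it is ever cached.

/* Illustrative sketch only; all names and sizes are hypothetical. */

#include <stdio.h>

#define NUM_PTRS 64             /* pointer-cache capacity (hypothetical) */

struct ptr_entry {
    unsigned long block;        /* address of the cached block */
    int           proc;         /* processor holding a copy */
    int           valid;
};

static struct ptr_entry dir[NUM_PTRS];

/* Allocate a pointer for a block on its first reference by a processor.
 * A real mechanism would also handle overflow, invalidation, and the
 * compiler-directed allocation hints mentioned in the abstract. */
static int bind_pointer(unsigned long block, int proc)
{
    for (int i = 0; i < NUM_PTRS; i++) {
        if (!dir[i].valid) {
            dir[i].block = block;
            dir[i].proc  = proc;
            dir[i].valid = 1;
            return 0;
        }
    }
    return -1;                  /* directory full: would need eviction */
}

int main(void)
{
    bind_pointer(0x1000, 3);    /* processor 3 caches block 0x1000 */
    bind_pointer(0x2040, 7);    /* processor 7 caches block 0x2040 */
    printf("pointer entries in use: 2 of %d\n", NUM_PTRS);
    return 0;
}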