EETIMES
September 1, 1997
Issue: 969
Section: News
Pursuit of superscalar put on back burner -- Hot Chips trailblazes a path to parallelism
Ron Wilson

Palo Alto, Calif. - Much of the ninth annual Hot Chips Symposium here last week was given over to the search for parallelism. From intriguing speculations about the nature of Intel Corp.'s coming Merced CPU to details of a massive multiprocessing system, designers sought every possible opportunity to execute operations simultaneously, whether at the task, thread, instruction or even sub-operation level. The results of their inquiries will begin appearing in a few years in new CPU architectures and a whole new approach to multiprocessing.

The pursuit of parallelism appears to have replaced the quest for wider superscalar chips and higher clock speeds as the road to application performance. Papers presented here suggest that as the quest unfolds, CPUs will increasingly resemble highly parallel microcode engines. At the same time, hardware support for instantaneous switching between threads will appear in even relatively conventional CPUs, and the network interface will disappear from the I/O system and reappear in the system core logic, right next to the memory controller.

Traditionally, CPU architects have sought to exploit instruction-level parallelism. In superscalar designs, they have attempted to find neighboring instructions that could execute simultaneously, and to dispatch these in a single clock cycle to separate execution pipelines.

But some architects have argued that this approach is too fine-grained. There can be more opportunities to dispatch operations in parallel, these architects claim, if one looks at the sub-operations (the adds, shifts, loads and so on) that make up a basic machine instruction. This has proved particularly true in executing the complex X86 instruction set.

The speculation appeared in an intriguing form in a talk by Bruce Lightner, vice president of development at Metaflow Technologies Inc. (La Jolla, Calif.). In an evening panel session titled "If I Were Defining Merced . . . ," Lightner suggested a return to the days of microcoded machines, in which hardware decomposed machine instructions into sub-operations and then executed them.

Because a single set of sub-operations could be used to emulate virtually any modern instruction set, including the X86 and PA-RISC instructions that Merced must execute, Lightner suggested that Merced could make use of such an approach. By increasing the number of sub-operation execution units, a designer could theoretically increase the number of machine instructions decomposed and dispatched in each clock cycle.
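
As a rough illustration of the idea, consider how a memory-destination add, the sort of complex X86 instruction in question, breaks into separate load, add and store sub-operations that independent execution units could handle. The C sketch below is purely illustrative; the operation names and encoding are assumptions made for the example, not any vendor's actual micro-operation format.

/* Illustrative only: a hypothetical decomposition of a complex
 * instruction such as "add [mem], reg" into simpler sub-operations
 * that independent execution units could handle.  The names and
 * encodings are invented for this example, not a real micro-op format. */
#include <stdio.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    int dst;   /* destination register (or temporary) */
    int src1;  /* first source register / address register */
    int src2;  /* second source register, if any */
} uop;

/* Decompose "add [addr_reg], src_reg" into three sub-operations,
 * using register 90 as an invented internal temporary. */
static int decompose_add_mem(int addr_reg, int src_reg, uop out[3])
{
    out[0] = (uop){ UOP_LOAD,  90, addr_reg, 0 };        /* tmp <- [addr]      */
    out[1] = (uop){ UOP_ADD,   90, 90, src_reg };        /* tmp <- tmp + src   */
    out[2] = (uop){ UOP_STORE, 0,  addr_reg, 90 };       /* [addr] <- tmp      */
    return 3;  /* number of sub-operations produced */
}

int main(void)
{
    uop ops[3];
    int n = decompose_add_mem(5, 2, ops);
    printf("decomposed into %d sub-operations\n", n);
    return 0;
}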

Superscalar questioned

This approach may prove more rewarding than superscalar operation, which came under significant criticism at the symposium. In a paper on performance analysis, a team from Digital Equipment Corp.'s Systems Research Center warned that, despite all the design sophistication that has gone into superscalar CPUs, much of the hardware is wasted.

The paper showed that most of the time only half the execution units in a superscalar CPU are busy, and that in some applications 15 of every 16 opportunities to issue an instruction are lost. The result is not only idle hardware, but execution speeds far below what the user expected.

Such figures have led many architects to look beyond instruction-level parallelism to the thread level. If a program is organized into multiple independent threads (sequences of instructions that do not exchange data with one another during execution), then a CPU could theoretically keep several threads loaded into its decode buffer. When one thread stalled, waiting on memory or blocked by resource contention, the CPU could switch without hesitation to a different copy of the register file and dispatch instructions from a different thread. Thus, many of the lost issue opportunities would be recovered.
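
The switching policy itself is simple to model. The C sketch below assumes a small fixed set of thread contexts, each with its own register file, and picks the next context that is not stalled; the structures are illustrative only, not the design of any chip presented at the symposium.

/* Illustrative sketch of switch-on-stall multithreading: each thread
 * context keeps its own register file, and when the current thread
 * stalls, the "hardware" simply dispatches from the next ready one.
 * A model of the policy described above, not a real CPU's design. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_REGS    32

typedef struct {
    long regs[NUM_REGS];   /* private register file for this thread */
    bool stalled;          /* waiting on memory or a busy resource  */
} thread_ctx;

/* Pick the next thread that can issue; return -1 if all are stalled. */
static int select_thread(const thread_ctx t[], int current)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int candidate = (current + i) % NUM_THREADS;
        if (!t[candidate].stalled)
            return candidate;
    }
    return -1;
}

int main(void)
{
    thread_ctx threads[NUM_THREADS] = {0};
    int current = 0;

    threads[0].stalled = true;          /* thread 0 misses in the cache */
    int next = select_thread(threads, current);
    printf("thread %d stalled, issuing from thread %d\n", current, next);
    return 0;
}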

Lightner suggested such a model for Merced. But the idea has already been put to practical use in a research chip designed jointly by the Concurrent VLSI Architecture Group at the Massachusetts Institute of Technology and Cadence Design Systems Inc. (San Jose, Calif.).

The MIT Multi-ALU Processor (MAP) chip is designed to exploit three levels of concurrency at once: instruction, thread and task. Each chip includes six integer execution units, one floating-point unit, four register files, cache structures and a network port. The chips can be wired together into a network of processors able to share register data.

The MAP arrangement allows an optimizing compiler to schedule instructions across different clusters of processors to extract all the available instruction-level parallelism. But it also lets each processor cluster work on multiple threads, so if a cluster finds itself waiting for an instruction that has been held up elsewhere, it can move on to a different thread.

As the number of processors grows, the MAP architecture also lets different tasks share the network, exploiting yet another level of parallelism. But this raises issues of memory protection, which MAP solves with a system of guarded pointers. In effect, each pointer to memory contains permission information, which determines whether the task can access the data.
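
The mechanism can be pictured as a pointer that carries its own bounds and permission bits, so an access can be checked without a separate protection-table lookup. The C sketch below models that check; the field layout is an invented example, not the MAP chip's actual encoding.

/* Illustrative model of a guarded pointer: the pointer carries its
 * own bounds and permission bits, so a task's access can be checked
 * without consulting a separate protection table.  The field layout
 * is invented for the example, not the MAP chip's real encoding. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { PERM_READ = 1u << 0, PERM_WRITE = 1u << 1, PERM_EXEC = 1u << 2 };

typedef struct {
    uint64_t address;      /* base of the segment the pointer may touch */
    uint64_t length;       /* extent of that segment in bytes           */
    uint32_t permissions;  /* PERM_* bits granted to the holder         */
} guarded_ptr;

/* An access is legal only if it stays inside the segment and the
 * pointer grants the requested permission. */
static bool access_ok(const guarded_ptr *p, uint64_t addr,
                      uint64_t size, uint32_t needed)
{
    return addr >= p->address &&
           addr + size <= p->address + p->length &&
           (p->permissions & needed) == needed;
}

int main(void)
{
    guarded_ptr gp = { 0x1000, 256, PERM_READ };
    printf("read ok:  %d\n", access_ok(&gp, 0x1010, 8, PERM_READ));
    printf("write ok: %d\n", access_ok(&gp, 0x1010, 8, PERM_WRITE));
    return 0;
}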

The problems of communication among tasks came up at Hot Chips, just as they did at the Hot Interconnects conference the previous week. Fundamentally, the ability to make a machine run faster by adding processors depends on the efficiency of intertask communication. If there are few intertask messages, or if the messages can be handled efficiently, it is possible to break a program into smaller tasks and dispatch them to more CPUs. If communication is onerous, the communications overhead will erase any gains from adding CPUs.
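
The arithmetic behind that trade-off is straightforward. The C sketch below assumes a fixed amount of work and an illustrative per-processor communication cost (the numbers are invented, not taken from any paper at either conference) and shows how speedup flattens and then reverses as CPUs are added.

/* Back-of-the-envelope model of the point made above: parallel
 * speedup is limited by the cost of intertask communication.
 * The per-processor communication cost is an assumed, illustrative
 * figure, not a measurement reported at the conferences. */
#include <stdio.h>

int main(void)
{
    const double compute_time = 100.0;  /* total work, arbitrary units */
    const double comm_per_cpu = 2.0;    /* assumed cost added per CPU  */

    for (int cpus = 1; cpus <= 32; cpus *= 2) {
        double time = compute_time / cpus + comm_per_cpu * (cpus - 1);
        printf("%2d CPUs: time %6.2f  speedup %5.2f\n",
               cpus, time, compute_time / time);
    }
    return 0;
}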

The MAP architecture attempts to limit communications overhead by resolving memory-protection problems with its guarded pointers. Other architectures discussed at the two symposia attempted to chip away at other bottlenecks, such as address translation and memory coherency.

One approach is to absorb these functions into the interconnect itself. This is done in the IEEE Scalable Coherent Interconnect (SCI) spec. In another presentation at the Merced panel, Pete Wilson, architecture specialist at Motorola Inc., pointed out that Sequent Computer Systems Inc. (Beaverton, Ore.) has already used SCI to connect Intel Quad boards containing Pentium Pro CPUs. Given the historical trend for Intel designers to absorb each spin of Sequent's interconnect circuitry into Intel's next-generation CPU, Wilson suggested that Intel might modify SCI, as it has modified GTL and Rambus DRAM, to create a new, semi-proprietary interconnect network for Merced. This would give networks of Merced CPUs the ability to work with a flat, protected address space while shoveling address-translation and coherency problems onto the network fabric itself.

Similar thinking is emerging from designers experimenting with clusters of workstations. At Hot Interconnects, several teams suggested that workstation clusters could approach the parallel-tasking efficiency of CPU clusters by integrating networking hardware more tightly with the CPU.

A team from the University of California at Santa Barbara examined SCI as a means of implementing message passing and a flat address space in a cluster of UltraSPARC workstations. They concluded that the network offered very low latencies even for single-word loads and stores and would, with hardware evolution, offer very high bandwidth.

Another paper, from Shubhendu Mukherjee and Mark Hill of the University of Wisconsin at Madison, argued strongly for including network-adapter hardware in the inner circle of CPU connections, on a level with the main memory interface. The pair pointed out that the increasing distribution of tasks among workstations was severely straining network latency. Much of this latency was due to the fact that issues such as virtual-memory address translation, protection and coherency, all dealt with by hardware in a memory reference, were handled by the operating system in a network reference.

Latencies cut

Thus, by moving the network interface to the VM hardware on the CPU, the authors said, these functions could move out of the operating system, drastically reducing latencies. A paper from Matt Welsh, Anindya Basu and Thorsten von Eicken of Cornell University supported essentially the same conclusion, as did a paper from Princeton University.

In another paper, Dave Dunning and Greg Regnier of Intel's Server Architecture Lab reported work on a standard user-task interface to such a networking system. In the Virtual Interface Architecture now being defined at Intel, user tasks would be able to make direct, simple accesses to other tasks on the other side of the network, through a virtual address space. The Intel work specifies the software interface but is not directed at a particular hardware implementation, according to the authors.
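
What such an interface might look like from the application's side can only be sketched hypothetically here; the queue, descriptor and doorbell names in the C sketch below are invented for illustration and are not the Virtual Interface Architecture's actual definitions. The point of the model is that a task posts a descriptor naming its own virtual addresses and notifies the adapter directly, with no operating-system call in the data path.

/* Hypothetical illustration of a user-level network interface: the
 * application posts a descriptor naming buffers by virtual address and
 * notifies the adapter directly, with no system call per message.
 * All names and structures are invented for the example; they are not
 * the Virtual Interface Architecture's actual definitions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_DEPTH 16

typedef struct {
    void    *buffer;     /* virtual address of the data to send */
    uint32_t length;     /* bytes to transfer                   */
    uint32_t dest_task;  /* identifier of the receiving task    */
} send_desc;

typedef struct {
    send_desc ring[QUEUE_DEPTH];  /* descriptor ring shared with adapter */
    volatile uint32_t head;       /* written by the user task            */
    volatile uint32_t doorbell;   /* location the adapter watches        */
} send_queue;

/* Post a message: fill in a descriptor and ring the doorbell.  In
 * hardware, the adapter would translate the virtual address and check
 * protection itself, rather than trapping to the operating system. */
static int post_send(send_queue *q, void *buf, uint32_t len, uint32_t dest)
{
    uint32_t slot = q->head % QUEUE_DEPTH;
    q->ring[slot] = (send_desc){ buf, len, dest };
    q->head++;
    q->doorbell = q->head;  /* notify the adapter directly */
    return 0;
}

int main(void)
{
    static send_queue q;
    char msg[] = "hello";
    post_send(&q, msg, (uint32_t)strlen(msg), 7);
    printf("posted %u descriptor(s)\n", (unsigned)q.head);
    return 0;
}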

Copyright (c) 1997 CMP Media Inc.