STEVEN M. NOWICK: Some Research Highlights and Links to Publications

1. CAD Tools and Algorithms for Asynchronous Controllers

One of my main research areas is developing CAD (computer-aided design) tools and algorithms for the synthesis
and optimization of asynchronous controllers. My students and I have developed a software package, called
MINIMALIST (click to download tool) (Fuhrer, Nowick, Theobald, et al.). For an introduction to the tool,
see our technical report ( minimalist.ps ,minimalist.pdf) (click to download report).

The MINIMALIST package is a collection of many of our recent synthesis, optimization and testability CAD algorithms
and tools for the automated design of asynchronous controllers, such as:

-state minimization
-optimal state assignment
-2-level hazard-free logic minimization (exact and heuristic)
-synthesis-for-testability
-basic technology mapping

We are planning to incorporate several other tools in the near future: multi-level optimization, performance-oriented
technology mapping and timing analysis. The package has a number of useful features, such as an interactive shell,
online help and a simple graphical user interface. It is also extensible, much like the synchronous Berkeley SIS package:
it is designed to easily plug in new tools. MINIMALIST has approximately 45,000 lines of C++ code. As of
November 1999, it has been downloaded to over 40 sites in 18 countries.

Here are some relevant papers and reports, which you can download:

An Introduction: Why Asynchronous Circuits?

C.H. van Berkel, M.B. Josephs and S.M. Nowick, "Scanning the Technology: Applications of Asynchronous Circuits",

pieee99.ps

pieee99.pdf

A. Davis and S.M. Nowick, "An Introduction to Asynchronous Circuit Design",

tr-intro-97.ps

tr-intro-97.pdf

(Note: Section 6 of this tech report ontains a 23-page history of asynchronous design.)

My Research: Asynchronous CAD Tools and Algorithms

Introduction to the MINIMALIST CAD Package:

minimalist-tr.ps

minimalist-tr.pdf

Burst-mode asynchronous controllers (my PhD thesis):

S.M. Nowick, "Automatic Synthesis of Burst-Mode Asynchronous Controllers", PhD Thesis (March 1993).

(Appears in revised form as Stanford University Computer Systems Laboratory technical report, CSL-TR-95-686,

December 1995).

(

thesis.ps

thesis.pdf

2-level hazard-free logic minimization: basic algorithm::

S.M. Nowick and D.L. Dill, "Exact Two-Level Minimization of Hazard-Free Logic with Multiple-Input Changes",

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14:8, pp. 986-997 (Aug. 1995).

(

tcad95.ps

tcad95.pdf

2-level hazard-free logic minimization: advanced algorithms (exact + heuristic):

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17:11, pp. 1130-1147 (Nov. 1998).

(

tcad98.ps

tcad98.pdf

optimal state assignment:

IEEE/ACM International Conference on Computer-Aided Design (ICCAD),

iccad95.ps

iccad95.pdf

optimal state minimization:

IEEE/ACM International Conference on Computer-Aided Design (ICCAD),

iccad99.ps

iccad99.pdf

synthesis-for-testability:

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16:12, pp. 1514-1521 (Dec. 1997).

(

tcad97.ps

tcad97.pdf

The synthesis-for-testability algorithm produces 100% testable hazard-free multi-level logic, assuming full scan, for both
stuck-at and robust path delay fault models.

The optimal state assignment algorithm produces a critcal race-free state assignment which guarantees optimal hazard-free logic.
The algorithm is an exact solution under an 'input encoding model' (successfully used by the synchronous "KISS" algorithm by
De Micheli, and also in "NOVA" by Villa et al.). As an important special case, our algorithm is truly exact when it is targeted to
output logic (i.e., not just with respect to an input encoding model).

In asynchronous machines, output logic is especially important: it usually determines the machine's latency (state changes are
often less critical, and can occur in background mode, since they do not need to fit into a fixed clock cycle). Our algorithm produces
a critical race-free state assignment which yields minimum-cost 2-level hazard-free output logic, both in terms of (a) # of products,
and (b) # of critical literals (i.e. input literals contributing to output functions), over the entire design space.

Most recently, our optimal state minimization algorithm, called "OPTIMISTA", extends the above results to an even larger global
minimization problem: it effectively merges 3 distinct synthesis steps -- state minimization, state assignment, logic minimization --,
and finds the best solution over the entire design space, targeted to optimum output logic. That is, it selects the state minimization
and critical race-free state assignment which yield the best possible output logic implementation, over the space of possible solutions.
This is the first general solution to this combined global optimization problem.

Several of our individual CAD tools have been transferred to Intel, IBM Research, and HP Labs, as well as to a number of university
sites. One of our individual tools for 2-level hazard-free logic minimization, called hfmin (Fuhrer, Nowick), is widely used. It is part
of the asynchronous CAD tool suite at Intel, where it was used in the design of an experimental high-performanceasynchronous
instruction-length decoder ("RAPPID" Project). At HP Labs, it was used in the design an experimental low-power asynchronous
infrared communications chip ("Stetson" Project). hfmin has also been incorporated into several university CAD packages: 3D (UC
San Diego), ACK and ATACS(University of Utah), and the UCLOCK and locally-clocked packages (Stanford and Columbia University).

2. High-Performance Asynchronous Datapaths

I am also interested in the design of high-speed asynchronous datapath components. These include some of the basic building blocks
for digital systems: adders, multipliers, pipelines, etc. We have three ongoing projects: (a) speculative completion, (b) high-speed pipelines,
and (c) low-latency FIFOs.

I introduced a new method, called speculative completion, for designing fast asynchronous datapath components (e.g. adders) which
have variable-speed operation. Typically, the components are faster on common-case inputs and slower for rarer inputs. (In contrast,
synchronous components usually operate at a fixed worst-case speed.) The method allows the use of synchronous-style function blocks,
along with a multi-slotted matched delay which can signal early completion, and some simple detection logic. We have simulated high-speed
Brent-Kung adders using this approach; for some applications, our asynchronous adders are 25% faster (on average) than a comparable
synchronous design: 1.65 nanoseconds (asynchronous) vs. 2.04 nanoseconds (synchronous) for 32-bit adders in 0.6 micron technology.
(For a complete description, see our Async-97 Symposium paper: async97.ps ,async97.pdf . Note that results in this paper are a bit
out-of-date: not quite as good as our new results reported above!)

My student, Montek Singh, and I are also working on new design styles for high-speed gate-level asynchronous pipelines, especially
targeted to dynamic logic. The idea is to obtain maximum performance by building ultra fine-grain pipelines: where each stage consists
of only a single gate. Such fine-grain pipelining becomes extremely awkward in synchronous designs, which at this granularity would
require multi-GigaHertz clocks (in modern technologies).Our designs use simple local timing optimizations to improve throughput.
The best of our designs, simulated in old technology (0.6 micron), have an operating speed of over 1.2 GigaOperations/sec. We
expect these results to scale to multi-GigaOperation/second in more modern technology (0.2 micron).

Finally, my student Tiberiu Chelcea and I are currently working on low-latency asynchronous FIFO's. The designs use token rings,
and recent simulations on a 16-place FIFO show latencies of only 2.0 nanoseconds and throughput of around 350 MegaOperations/sec.,
using old technology (0.6 micron).

3. Asynchronous Applications: "Compressed-Code" Embedded Processors

I am also interested in various applications of asynchronous design, especially to embedded systems. Recently, Martin Benes, Andy Wolfe
and I designed two high-speed asynchronous Huffman decoders for compressed-code embedded processors. Embedded processors are
widely used in many commercial applications, such as consumer electronics, automobiles, airplanes, modems, etc. A "compressed-code
embedded processor" is one in which the program is stored in compressed form in instruction memory, and is decompressed "on-the-fly"
during instruction cache refill. The goal is to reduce overall chip size, by compressing the instruction memory. We use Huffman encoding
to compress each instruction block in memory, and have designed asynchronous Huffman decoders to decode the instructions, on each cache
miss, as they are fetched from memory into cache.

Our best Huffman decoder has an output decode rate of 163 MBytes/second, and occupies less than 1 mm² of chip area, in old technology
(0.8 micron). This performance is as good, or better, than the best existing synchronous Huffman decoders reported in the literature (after
normalizing for technology and voltage), yet the designs have 5-10 times smaller area than most existing designs. Also, interestingly,
unlike many of the high-speed synchronous Huffman decoders, ours are non-pipelined. See our two papers for details:
(a) our initial design + chip: 1997 Conference on Advanced Research in VLSI (arvlsi97.ps ,arvlsi97.pdf ); (b) our improved
design (83% faster): 1998 Async Symposium (IEEE Symposium on Advanced Research in Asynchronous Circuits and Systems)
(async98.ps ,async98.pdf ).

Unlike synchronous designs, our asynchronous decoders have both variable-speed input and output operation: they adapt their performance
to the actual data (since they are not bound by a fixed clock rate). Our designs exploit the high variability of Huffman code lengths: they are
highly-optimized for common shorter code lengths, and hence have very good average-case performance.