1. CAD Tools and Algorithms for Asynchronous Controllers
One of my main research areas is developing CAD
(computer-aided design) tools and algorithms for the synthesis
and optimization of asynchronous controllers.
My students and I have developed a software package, called
MINIMALIST
(click to download tool) (Fuhrer, Nowick,
Theobald, et al.). For an introduction to the tool,
see our technical
report ( minimalist.ps
,minimalist.pdf)
(click
to download report).
The MINIMALIST package is a collection
of many of our recent synthesis, optimization and testability CAD algorithms
and tools for the automated design
of asynchronous controllers, such as:
-state minimizationWe are planning to incorporate several other tools in the near future: multi-level optimization, performance-oriented
-optimal state assignment
-2-level hazard-free logic minimization (exact and heuristic)
-synthesis-for-testability
-basic technology mapping
Here are some relevant papers and reports, which you can download:
An Introduction: Why Asynchronous Circuits?
The synthesis-for-testability
algorithm produces 100% testable hazard-free multi-level logic,
assuming full scan, for both
stuck-at and
robust path
delay fault models.
The optimal state assignment
algorithm
produces a critcal race-free state assignment which guarantees optimal
hazard-free logic.
The algorithm is an exact solution
under an 'input encoding model' (successfully used by the synchronous "KISS"
algorithm by
De Micheli, and also in "NOVA" by
Villa et al.). As an important special case, our algorithm
is truly exact when it is targeted to
output logic (i.e.,
not just with respect to an input encoding model).
In asynchronous machines, output
logic is especially important: it usually determines the machine's
latency (state changes are
often less critical, and can occur
in background mode, since they do not need to fit into a fixed clock cycle).
Our algorithm produces
a critical race-free state assignment
which yields minimum-cost 2-level hazard-free output logic, both in terms
of (a) # of products,
and (b) # of critical literals (i.e.
input literals contributing to output functions), over the entire design
space.
Most recently, our optimal
state minimization algorithm, called "OPTIMISTA", extends the above
results to an even larger global
minimization problem: it
effectively merges 3 distinct synthesis steps -- state minimization, state
assignment, logic minimization --,
and finds the best solution over
the entire design space, targeted to optimum output logic. That is,
it selects the state minimization
and critical race-free state assignment
which yield the best possible output logic implementation, over the
space of possible solutions.
This is the first general solution
to this combined global optimization problem.
Several of our individual CAD tools
have been transferred to Intel, IBM Research, and HP Labs,
as well as to a number of university
sites. One of our individual
tools for 2-level hazard-free logic minimization, called hfmin
(Fuhrer,
Nowick), is widely used. It is part
of the asynchronous CAD tool suite
at
Intel, where it was used in the design of an experimental
high-performanceasynchronous
instruction-length decoder
("RAPPID" Project). At HP Labs,
it was used in the design an experimental low-power asynchronous
infrared communications chip
("Stetson" Project). hfmin
has also
been incorporated into several university CAD packages:
3D
(UC
San Diego),
ACK and ATACS(University
of Utah), and the UCLOCK
and locally-clocked packages (Stanford
and Columbia University).
2. High-Performance Asynchronous Datapaths
I am also interested in the design
of high-speed asynchronous datapath components. These include some
of the basic building blocks
for digital systems: adders,
multipliers, pipelines, etc. We have three ongoing projects:
(a) speculative completion, (b) high-speed pipelines,
and (c) low-latency FIFOs.
I introduced
a
new method, called
speculative completion, for designing
fast asynchronous datapath components (e.g. adders) which
have variable-speed operation.
Typically, the components are faster on common-case inputs and slower for
rarer inputs. (In contrast,
synchronous components usually operate
at a fixed worst-case speed.) The method allows the use of synchronous-style
function blocks,
along with a multi-slotted matched
delay which can signal early completion, and some simple detection
logic. We have simulated high-speed
Brent-Kung adders using this approach;
for some applications, our asynchronous adders are 25% faster
(on
average) than a comparable
synchronous design:
1.65 nanoseconds (asynchronous) vs. 2.04 nanoseconds (synchronous) for
32-bit adders in 0.6 micron technology.
(For a complete description, see
our Async-97 Symposium paper: async97.ps
,async97.pdf
. Note that results in this paper are a bit
out-of-date: not quite as
good as our new results reported above!)
My student, Montek Singh, and I are also working on new design
styles for
high-speed gate-level asynchronous pipelines, especially
targeted to dynamic logic. The idea is to obtain maximum performance
by building ultra fine-grain pipelines: where each stage consists
of only a single gate. Such fine-grain pipelining becomes extremely
awkward in synchronous designs, which at this granularity would
require multi-GigaHertz clocks (in modern technologies).Our designs
use simple local timing optimizations to improve throughput.
The best of our designs, simulated in old technology (0.6 micron),
have an operating speed of over 1.2 GigaOperations/sec. We
expect these results to scale to multi-GigaOperation/second in more
modern technology (0.2 micron).
Finally, my student Tiberiu Chelcea and I are currently working on low-latency
asynchronous FIFO's. The designs use token rings,
and recent simulations on a 16-place FIFO show latencies of only 2.0
nanoseconds and throughput of around 350 MegaOperations/sec.,
using old technology (0.6 micron).
3. Asynchronous Applications: "Compressed-Code" Embedded Processors
I am also interested in various applications of asynchronous design,
especially to
embedded systems. Recently,
Martin Benes, Andy Wolfe
and I designed two
high-speed asynchronous Huffman decoders
for compressed-code embedded processors. Embedded processors
are
widely used in many commercial applications, such as consumer electronics,
automobiles, airplanes, modems, etc. A "compressed-code
embedded processor" is one in which the program is stored in
compressed form in instruction memory, and is decompressed "on-the-fly"
during instruction cache refill. The goal is to reduce overall chip
size, by compressing the instruction memory. We use Huffman encoding
to compress each instruction block in memory, and have designed asynchronous
Huffman decoders to decode the instructions, on each cache
miss, as they are fetched from memory into cache.
Our best Huffman decoder has an output decode rate of 163 MBytes/second,
and occupies less than 1 mm2 of chip area, in old technology
(0.8 micron). This performance is as good, or better, than the
best existing synchronous Huffman decoders reported in the literature (after
normalizing for technology and voltage), yet the designs have 5-10
times smaller area than most existing designs. Also, interestingly,
unlike many of the high-speed synchronous Huffman decoders, ours are
non-pipelined. See our two papers for details:
(a) our initial design + chip:
1997 Conference on Advanced Research
in VLSI (arvlsi97.ps
,arvlsi97.pdf
);
(b)
our improved
design (83% faster): 1998 Async Symposium (IEEE
Symposium on Advanced Research in Asynchronous Circuits and Systems)
(async98.ps
,async98.pdf
).
Unlike synchronous designs, our asynchronous decoders have both
variable-speed
input and output operation: they adapt their performance
to the actual data (since they are not bound by a fixed clock rate).
Our designs exploit the high variability of Huffman code lengths: they
are
highly-optimized for common shorter code lengths, and hence have very
good average-case performance.