\documentclass[twocolumn]{article}
\usepackage{times}
\usepackage{alltt}

\title{Porting a Network Cryptographic Service to the RMC2000: A Case
Study in Embedded Software Development}

\makeatletter
\def\section{%
  \@startsection{section}{1}{\z@}{-0.47\baselineskip}%
  {0.47\baselineskip}{\bfseries}}
\def\subsection{%
  \@startsection{subsection}{2}{\z@}{-0.5\baselineskip plus .5ex minus .2ex}%
  {0.5\baselineskip plus .2ex}{\itshape}}
\makeatother

\author{
\begin{tabular}{ccc}
Stephen Jan & Paolo de Dios & Stephen A. Edwards \\
\multicolumn{3}{c}{Dept. of Computer Science} \\
\multicolumn{3}{c}{Columbia University, New York, NY} \\
\multicolumn{3}{c}{\{sj178, pd119\}@columbia.edu\ \ sedwards@cs.columbia.edu} \\
\end{tabular}}

% For blind review
\author{}

\date{}

\textwidth 7in
\oddsidemargin -0.25in
\evensidemargin -0.25in
\topmargin -1in
\textheight 9.5in
\parindent 1em

\frenchspacing

\begin{document}

\maketitle

\begin{abstract}
This paper describes our experience porting a transport-layer
cryptography service to an embedded microcontroller.  We describe some
key development issues and techniques involved in porting networked
software to a connected, limited-resource device such as the Rabbit
RMC2000 we chose for this case study. We evaluate the effectiveness of
a few proposed porting strategies by examining important program and
run-time characteristics.
\end{abstract}

\section{Introduction}

Embedded systems present a distinct software engineering
problem because their hardware and software are tightly
integrated. The limited nature of an embedded
system's operating environment requires a different approach to
developing and porting software. In this paper, we discuss the key
issues in developing and porting a Unix system-level transport-layer
security (TLS) service to an embedded microcontroller.  We describe our
design decisions and experience porting this service using Dynamic~C,
a C variant, on the RMC2000 microcontroller from Rabbit
Semiconductor\footnote{www.rabbitsemiconductor.com}.  The main
challenges came when APIs for operating-system services such as
networking were either substantially different or simply absent.

Porting software across platforms is such a common and varied software
engineering exercise that much commercial and academic research has
been dedicated to identifying pitfalls, techniques, and component
analogues for it.  Porting software has been addressed by high-level
languages~\cite{brown1972levels,kernighan1988c}, modular
programming~\cite{hanson1997c}, and component-based abstraction,
analysis, and design techniques~\cite{vinoski1997corba}. Despite the
popularity of these techniques, they are of limited use when dealing
with the scarce and rather raw resources of a typical embedded
system.  In fact, these abstraction mechanisms tend to consume
more resources, especially memory, making them impractical for
microcontrollers. Though some have tried to migrate some of these
abstractions to the world of embedded
systems~\cite{gokhale1999techniques}, porting applications in a
resource-constrained system still requires much reengineering.

This paper presents our experiences porting a small networking
service to an embedded microcontroller with an eye toward illustrating
what the main problems actually are.  Section~\ref{sec:crypto}
introduces the network cryptographic service we ported.
Section~\ref{sec:related} describes some relevant related work, and
Section~\ref{sec:rmc2k} describes the target of our porting efforts,
the RMC2000 development board.

Section~\ref{sec:port} describes the issues we encountered while
porting the cryptographic network service to the development board,
Section~\ref{sec:experiments} describes the performance experiments we
conducted, and finally Section~\ref{sec:conclusions} summarizes our findings.

\section{Network Cryptographic Services}
\label{sec:crypto}

For our case study, we ported
``iSSL,''\footnote{http://sourceforge.net/projects/issl} a
public-domain implementation of the Secure Sockets Layer (SSL)
protocol~\cite{freier1996ssl}, the basis of the Transport Layer Security (TLS)
standard proposed by the IETF~\cite{dierks1997tls}.  SSL is a protocol
that layers on top of TCP/IP to provide secure communications, e.g.,
to encrypt web pages with sensitive information.

Security, unfortunately, is not cheap.  Establishing and maintaining a
secure connection is a computationally intensive task; negotiating
an SSL session can significantly degrade server performance,
potentially limiting the number of simultaneous sessions.  Goldberg et
al.~\cite{goldberg1998secure} observed it reducing throughput by an
order of magnitude.

iSSL is a cryptographic library that layers on top of the Unix sockets
layer to provide secure point-to-point communications.  After a normal
unencrypted socket is created, the iSSL API allows a user to bind to
the socket and then do secure read/writes on it.

To gain experience using the library, we first implemented a simple
Unix service that used the iSSL library to establish a secure
redirector.  Later, we ported this service to the RMC2000.

Because SSL forms a transparent layer on top of TCP, it can be easily
factored out of the server and placed in separate hardware.  Many
commercial systems use coprocessor cards that perform SSL functions to
offload work from the server's main processor.  Our case study
implements a similar service.

The iSSL package uses the RSA and AES cipher algorithms and can
generate session keys and exchange public keys.  Because the RSA
algorithm uses a difficult-to-port ``bignum'' package, we only ported
the AES cipher, which uses the Rijndael
algorithm~\cite{daemen1998block}.  By default, iSSL supports
key lengths of 128, 192, or 256 bits and block lengths of 128, 192,
and 256 bits, but to keep our implementation simple, we only implemented
128-bit keys and blocks.  During porting, we also referred to the AESCrypt
implementation developed by Eric Green and Randy
Kaelber\footnote{http://aescrypt.sourceforge.net}.

\section{Related Work}
\label{sec:related}

Cryptographic services for transport layer security (TLS) have long
been available as operating system and application server
services~\cite{engelschall2000mod}. Embedded TLS
services and custom ASICs for stream ciphering are commercially available
as SSL/TLS accelerator products from vendors such as Sun Microsystems
and Cisco. They operate as black boxes, and the development issues
involved in making these services available on embedded devices have
rarely been discussed. Though the performance of various cryptographic
algorithms such as AES and DES has been examined on many
systems~\cite{schneier1999performance}, including embedded
devices~\cite{yang2001performance}, the challenges of
porting complete services to such a device have not received similar
treatment.

The scope of embedded systems development has been covered in a number
of books and
articles~\cite{gassle1992art,gassle1997dumb}. Optimization techniques
at the hardware design level and at the pre-processor and compiler
level are well-researched and benchmarked
topics~\cite{gassle1992art,leupers2000code,zivojnovic1995dspstone}. Embedded
programming guidelines for optimization, style and robustness are
outlined for specific languages such as
ANSI~C~\cite{barr1999programming}. Design patterns have even been
proposed to increase portability and leverage reuse among device
configurations for embedded software~\cite{champlain1996patterns}.

Overall, we found that the issues involved in porting software to the
embedded world have not been written about extensively and are
largely considered ``just engineering,'' doomed to be periodically
reinvented.  Our hope is that this paper will help engineers be better
prepared in the future.

\section{The RMC2000 Environment}
\label{sec:rmc2k}

Typical for a small embedded system, the RMC2000 TCP/IP Development
Kit includes 512K of flash memory and 128K of SRAM, and runs a 30~MHz,
8-bit Z80-based microcontroller (a Rabbit~2000).  While the Rabbit~2000,
like the Z80, manipulates 16-bit addresses, it can access up to 1~MB
through bank switching.

It includes a 10Base-T network interface and comes with software
implementing TCP/IP, UDP and ICMP.  The development environment
includes compilers and diagnostic tools, and the board has a 10-pin
programming port to interface with the development environment.

\subsection{Dynamic C}

The Dynamic~C language, developed in concert with Rabbit
microcontrollers, is an ANSI~C variant with many extensions to support
the Rabbit~2000 in embedded system applications.  For example, the
language directly supports cooperative and preemptive multitasking,
battery-backed variables, and atomicity guarantees for shared
multibyte variables.

Unlike ANSI~C, Dynamic~C's function-local variables are \texttt{static}
by default.  While this can be overridden with a directive, ignoring it
can dramatically change program behavior.

Dynamic~C does not support the \texttt{\#include} directive, using
instead \texttt{\#use}, which gathers precompiled function prototypes
from libraries.  It took some effort to decide which \texttt{\#use}
directives should replace the many \texttt{\#include} directives in
the source files.

Dynamic~C omits and modifies some ANSI~C behavior. Bit fields and
enumerated types are not supported.  There are also minor differences
in the \texttt{extern} and \texttt{register} keywords. As mentioned
earlier, the default storage class for variables is \texttt{static},
not \texttt{auto}, which can dramatically change the behavior of
recursively called functions. Variables initialized in a declaration
are stored in flash memory and cannot be changed.
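
The effect on recursion can be seen in the following standard~C
sketch, which emulates Dynamic~C's default by marking one local
variable \texttt{static} explicitly.  The function names
\texttt{sum\_auto} and \texttt{sum\_static} are ours, chosen only for
illustration, and do not appear in iSSL.

\begingroup\small
\begin{verbatim}
/* Sum 1..n with an ordinary (auto) local:
   every call gets its own copy of mine. */
int sum_auto(int n)
{
    int mine, rest;
    if (n <= 0) return 0;
    mine = n;
    rest = sum_auto(n - 1);
    return rest + mine;
}

/* Same logic, but mine is static, which is
   what Dynamic C would assume by default:
   all calls share a single copy. */
int sum_static(int n)
{
    static int mine;
    int rest;
    if (n <= 0) return 0;
    mine = n;
    rest = sum_static(n - 1); /* clobbers mine */
    return rest + mine;
}
\end{verbatim}
\endgroup

Under these assumptions \texttt{sum\_auto(3)} returns 6, whereas
\texttt{sum\_static(3)} returns 3, because each inner call overwrites
the shared variable before the outer call reads it.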

Dynamic~C's support for inline assembly is more comprehensive than
that of most~C implementations, and it can also embed~C statements in
assembly code, as in the following:

\begin{alltt}\small
#asm nodebug
InitValues::
   ld hl,0xa0;
c  start_time = 0;   \texttt{// Inline C}
c  counter = 256;    \texttt{// Inline C}
   ret
#endasm
\end{alltt}

We used the inline assembly feature in the error handling routines
that caught exceptions thrown by the hardware or libraries, such as
divide-by-zero.  We could not rely on an operating system to handle
these errors, so instead we specified an error handler using the
\texttt{defineErrorHandler(void *errfcn)} system call.  Whenever the
system encounters an error, the hardware passes information about the
source and type of error on the stack and calls this user-defined
error handler.  In our implementation, we used (simple) inline
assembly statements to retrieve this information.  Because our
application was not designed for high reliability, we simply ignored
most errors.


\subsection{Multitasking in Dynamic C}

Dynamic~C provides both cooperative multitasking, through costatements
and cofunctions, and preemptive multitasking through either the
\texttt{slice} statement or a port of Labrosse's $\mu$C/OS-II
real-time operating system~\cite{labrosse1998microc-os-ii}.

Dynamic~C's costatements provide multiple threads of control
through independent program counters that may be switched among
explicitly, such as in this example:

\begin{alltt}\small
for (;;) \{
  costate \{
    waitfor( tcp_packet_port_21() );
    \textrm{// handle FTP connection}
    yield; \textrm{// Force context switch}
  \}
  costate \{
    waitfor( tcp_packet_port_23() );
    \textrm{// handle telnet connection}
  \}
\}
\end{alltt}

The \texttt{yield} statement immediately passes control to another
costatement.  When control returns to the costatement that has
yielded, it resumes at the statement following the \texttt{yield}.
The statement \texttt{waitfor(\emph{expr})}, which provides a
convenient mechanism for waiting for a condition to hold, is
equivalent to \texttt{while (!\emph{expr}) yield;}.
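
For readers without access to Dynamic~C, the idea of a private,
resumable program counter can be approximated in standard~C with a
well-known \texttt{switch}-based idiom (often associated with
protothreads).  The sketch below is only an analogy: the
\texttt{TASK\_*} macros, \texttt{struct task}, and \texttt{serve} are
hypothetical names of ours, and nothing here describes how the
Dynamic~C compiler actually implements costatements.

\begingroup\small
\begin{verbatim}
/* A task's resume point plays the role of a
   costatement's private program counter. */
struct task {
    int resume;  /* where to continue next time */
    int ready;   /* stand-in: "packet arrived"  */
    int served;  /* requests handled so far     */
};

#define TASK_BEGIN(t) \
    switch ((t)->resume) { case 0:
#define TASK_YIELD(t) \
    do { (t)->resume = __LINE__; return 0; \
         case __LINE__:; } while (0)
#define TASK_WAITFOR(t, cond) \
    while (!(cond)) TASK_YIELD(t)
#define TASK_END(t) \
    } (t)->resume = 0; return 1

/* Returns 0 if the task yielded, 1 if it
   completed one pass through its body. */
int serve(struct task *t)
{
    TASK_BEGIN(t);
    TASK_WAITFOR(t, t->ready);
    t->served++;   /* "handle the connection" */
    t->ready = 0;
    TASK_END(t);
}
\end{verbatim}
\endgroup

A main loop that calls \texttt{serve} on each task in turn then
behaves like the \texttt{for} loop above: a task whose condition does
not hold simply returns, and resumes at its wait the next time around.
As with the idiom in general, automatic variables do not survive a
yield, and each \texttt{TASK\_WAITFOR} must sit on its own source line.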

Cofunctions are similar, but also take arguments and may return a
result.

In our port, we used costatements to handle multiple connections with
multiple processes.  We did not use $\mu$C/OS-II.

\subsection{Storage Class Specifiers}

To avoid certain race conditions, Dynamic~C generates code that
disables interrupts while multibyte variables marked \texttt{shared}
are being changed, guaranteeing atomic updates.

For variables marked \texttt{protected}, Dynamic~C generates extra
code that copies their value to battery-backed RAM before every
modification.  Backup values are copied to main memory when the
system is restarted or when \texttt{\_sysIsSoftReset()} is called.  We
did not need this feature in this port.

The Rabbit 2000 microcontroller has a 64K address space but uses
bank-switching to access 1M of total memory.  The lower 49.5K is
fixed, ``root'' memory, and the top 8K is bank-switched access to the
remaining memory.  A user can explicitly request a function to be
located in either root or extended memory using the storage class
specifiers \texttt{root} and \texttt{xmem} (Figure~\ref{fig:specifiers}).

We had to explicitly request certain functions, such as the error
handler, to be located in root memory, but we let the compiler place all
the other functions.

\begin{figure}
\begin{alltt}\small
\textrm{// Interrupts disabled during changes to a, b, and c}
\textrm{// Updates guaranteed atomic} 
shared float a, b, c;

main() \{
   protected int state1; \textrm{// Battery-backed}
   ...
   \textrm{// restore protected variables}
   _sysIsSoftReset();
\}

\textrm{// place func1 in root memory}
root int func1() \{ ... \}

\textrm{// Place following assembly code in root memory}
#memmap root
#asm root
 ...
#endasm

\textrm{// place func2 in extended memory}
xmem int func2() \{ ... \}
\end{alltt}
\caption{Fragment illustrating various Dynamic-C-specific storage
  class specifiers.}
\label{fig:specifiers}
\end{figure}

\subsection{Function Chaining}

Dynamic C provides function chaining, which allows segments of code to
be embedded within one or more functions.  Invoking a named function
chain causes all the segments belonging to that chain to execute.
Such chains enable initialization, data recovery, or other kinds of
tasks on request.  Our port did not use this feature.

\begin{alltt}\small
\textrm{// Create a chain named ``recover'' and add three functions}
#makechain recover
#funcchain recover free_memory
#funcchain recover declare_memory
#funcchain recover initialize

\textrm{// Invoke all three functions in the chain in some sequence}
recover();
\end{alltt}
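
In a C environment without \texttt{\#makechain}, a comparable effect
can be approximated with a table of function pointers.  The following
is a minimal sketch under that assumption; \texttt{struct chain},
\texttt{chain\_add}, and \texttt{chain\_run} are hypothetical names,
and the three segment functions are stubs that merely record that they
ran.

\begingroup\small
\begin{verbatim}
#include <stddef.h>

#define CHAIN_MAX 8

typedef void (*segment_fn)(void);

struct chain {
    segment_fn seg[CHAIN_MAX];
    size_t     count;
};

/* Add a segment; 0 on success, -1 if full. */
int chain_add(struct chain *c, segment_fn fn)
{
    if (c->count >= CHAIN_MAX) return -1;
    c->seg[c->count++] = fn;
    return 0;
}

/* Invoke every segment.  This sketch runs them
   in registration order; the text above only
   promises "some sequence". */
void chain_run(const struct chain *c)
{
    size_t i;
    for (i = 0; i < c->count; i++)
        c->seg[i]();
}

/* Stub segments that log their execution. */
int trace[CHAIN_MAX];
int steps;
void free_memory(void)    { trace[steps++] = 1; }
void declare_memory(void) { trace[steps++] = 2; }
void initialize(void)     { trace[steps++] = 3; }
\end{verbatim}
\endgroup

Unlike a true function chain, which is assembled at compile time, this
table is filled at run time, so the registration calls themselves must
execute before the chain is invoked.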

% Nice tutorial at http://www.ecst.csuchico.edu/~beej/guide/net/html/

\begin{figure*}
\begin{tabular}{@{}c@{}c@{}}
\begin{minipage}[t]{0.5\textwidth}
\scriptsize
\begin{verbatim}
int echo_server() {
  int sock, newsock, len;
  struct sockaddr_in addr;
  char buf[LEN];

  if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0)
    return -1;

  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(MYPORT);
  if ( bind(sock, (struct sockaddr *) &addr,
            sizeof(struct sockaddr_in)) < 0 ) return -1;
  if ( listen(sock, LISTENQ) < 0 ) return -1;
  for (;;) {
    if ((newsock = accept(sock, NULL, NULL) ) < 0 )
      return -1;
    if ((len = recv(newsock, buf, LEN, 0)) < 0)
      return -1;
    if (send(newsock, buf, len, 0) < 0) return -1;
    close(newsock);
  }
}
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{0.5\textwidth}
\scriptsize
\begin{verbatim}
int echo_server()
{
  tcp_Socket sock;
  int status;
  char buf[LEN];

  sock_init();
  for (;;) {
    tcp_listen(&sock, PORT, 0, 0, NULL, 0);
    sock_wait_established(&sock, 0, NULL, &status);
    sock_mode(&sock, TCP_MODE_ASCII);                
    while (tcp_tick(&sock)) {
      sock_wait_input(&sock, 0, NULL, &status);
      if (sock_gets(&sock, buf, LEN))
        sock_puts(&sock, buf);
    }
  }
}
\end{verbatim}
\end{minipage}
\\
(a) & (b)\\
\end{tabular}
\caption{A comparison of (a) traditional BSD sockets-based code and
  (b) equivalent code in the Dynamic C environment illustrating the
  significant differences in API.}
\label{fig:sockets}
\end{figure*}

\section{Porting and Development Issues}
\label{sec:port}

A program rarely runs unchanged on a dramatically different platform;
something always has to change.  The fundamental question is, then,
how much must be changed or rewritten, and how difficult these
rewrites will be.

We encountered three broad classes of porting problems that demanded
code rewrites.  The first, and most common, was the absence of certain
libraries and operating system facilities.  This ranged from fairly
simple (e.g., Dynamic~C does not provide the standard \texttt{random}
function), to fairly difficult (e.g., the protocols include timeouts,
but Dynamic~C does not have a timer), to virtually impossible (e.g.,
the iSSL library makes some use of a filesystem, something not
provided by the RMC2000 environment).  Our solutions to these ranged
from creating a new implementation of the library function (e.g.,
writing a \texttt{random} function) to working around the problem
(e.g., changing the program logic so it no longer read a hash value
from a file) to abandoning functionality altogether (e.g., our final
port did not implement the RSA cipher because it relied on a fairly
complex ``bignum'' library that we considered too complicated to
rework).
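
As an illustration of the simplest case, a replacement for
\texttt{random} can be as small as a linear congruential generator.
The sketch below uses the constants from the sample generator in the
ISO~C standard; we do not claim it is the generator used in the port,
the names \texttt{port\_srandom} and \texttt{port\_random} are
hypothetical, and a generator of this kind is not suitable for
producing cryptographic keys.

\begingroup\small
\begin{verbatim}
static unsigned long rng_state = 1;

/* Seed the generator (hypothetical name). */
void port_srandom(unsigned long seed)
{
    rng_state = seed;
}

/* Return a pseudo-random value in 0..32767.
   Not cryptographically secure. */
int port_random(void)
{
    rng_state = rng_state * 1103515245UL
              + 12345UL;
    return (int)((rng_state / 65536UL)
                 % 32768UL);
}
\end{verbatim}
\endgroup

The state is kept in an \texttt{unsigned long} so that the arithmetic
is at least 32 bits wide even on a target whose \texttt{int} is 16 bits.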

A second class of problem stemmed from differing APIs with similar
functionality.  For example, the protocol for accessing the RMC2000's
TCP/IP stack differs quite a bit from the BSD sockets used within
iSSL.  Figure~\ref{fig:sockets} illustrates some of these differences.
While solving such problems is generally much easier than, say,
porting a whole library, reworking the code is tedious.

A third class of problem required the most thought.  Often,
fundamental assumptions made in code designed to run on workstations
or servers, such as the existence of a filesystem with nearly
unlimited capacity (e.g., for keeping a log), are impractical in an
embedded system.  Logging and somewhat sloppy memory management that
assumes the program will be restarted occasionally to cure memory
leaks are examples of this.  The solutions to such problems are either
to remove the offending functionality at the expense of features
(e.g., remove logging altogether) or to rework the code substantially
(e.g., to make logging write to a circular buffer rather than a file).
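
The circular-buffer alternative can be sketched in a few lines of
standard~C.  This is an illustration of the technique rather than code
from the port; \texttt{struct ring\_log} and the \texttt{log\_*}
functions are hypothetical names, and the buffer is deliberately tiny
so that the wrap-around is easy to follow.

\begingroup\small
\begin{verbatim}
#include <stddef.h>
#include <string.h>

#define LOG_SIZE 16  /* bytes kept; tiny on purpose */

struct ring_log {
    char   data[LOG_SIZE];
    size_t head;  /* next index to write       */
    size_t used;  /* valid bytes, <= LOG_SIZE  */
};

void log_init(struct ring_log *rl)
{
    memset(rl, 0, sizeof *rl);
}

/* Append msg, overwriting the oldest bytes
   once the buffer is full. */
void log_write(struct ring_log *rl,
               const char *msg)
{
    while (*msg != '\0') {
        rl->data[rl->head] = *msg++;
        rl->head = (rl->head + 1) % LOG_SIZE;
        if (rl->used < LOG_SIZE) rl->used++;
    }
}

/* Copy retained bytes, oldest first, into out
   (LOG_SIZE + 1 bytes) and NUL-terminate.
   Returns the number of bytes copied. */
size_t log_read(const struct ring_log *rl,
                char *out)
{
    size_t start = (rl->head + LOG_SIZE
                    - rl->used) % LOG_SIZE;
    size_t i;
    for (i = 0; i < rl->used; i++)
        out[i] = rl->data[(start + i) % LOG_SIZE];
    out[rl->used] = '\0';
    return rl->used;
}
\end{verbatim}
\endgroup

Memory use is fixed at compile time, so the log can never exhaust the
device; the price is that only the most recent \texttt{LOG\_SIZE}
bytes survive.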

\subsection{Interrupts}

We used the serial port on the RMC2000 board for debugging.  We
configured the serial interface to interrupt the processor when a
character arrived and then either prompt the system to send a status
message back out the port, reset the application, or reset the
application while maintaining program state.

A Unix operating environment provides a high-level mechanism for
handling software interrupts:

\begingroup\small
\begin{alltt}
#include <signal.h>
void sigproc(int sig) \{ \textrm{/* Handle the signal */} \}
int main(void) \{
  signal(SIGINT, sigproc); \textrm{// Register signal handler}
  \textrm{/* Main program */}
  return 0;
\}
\end{alltt}
\endgroup

By contrast, in Dynamic~C we had to handle all the details ourselves.
For example, to set up the interrupt from the serial port, we had to
enable the serial port to generate interrupts, register the interrupt
routine in the vector table, and finally enable the interrupt
receiver.

\begin{alltt}\small
main() \{
  \textrm{// Set serial port A as input interrupt}
  \textrm{// SADRShadow is pointer to value at register}
  WrPortI(SADR, &SADRShadow, 0x00);
  \textrm{// Register interrupt service routine}
  SetVectExtern2000(1, my_isr);
  \textrm{// Enable external INT0 on SA4, rising edge}
  WrPortI(I0CR, NULL, 0x2B);
  \textrm{/* Main program */}
  \textrm{// Disable interrupt 0}
  WrPortI(I0CR, NULL, 0x00);
\}

nodebug root interrupt void my_isr() \{
 \textrm{/* handle interrupt */}
\}
\end{alltt}

We could have avoided the use of interrupts if we designed the
application to open another ``debugging'' network connection, but this
would have made it impossible to debug a system that had a
networking-related bug.

\subsection{Memory}

A significant difference between general platform development and
embedded system development is memory. Most embedded devices have very
little memory in comparison to a full-fledged computer. Expecting to
run into memory issues, we used a well-defined
taxonomy~\cite{zurell2000c} to plan out memory requirements.  This
turned out to be unnecessary, however, because our application had
very modest memory requirements.

Dynamic~C does not support the standard library functions
\texttt{malloc} and \texttt{free}.  Instead, it provides the
\texttt{xalloc} function that allocates extended memory only
(arithmetic, therefore, cannot be performed on the returned pointer).
More seriously, there is no analogue to \texttt{free}; allocated
memory cannot be returned to a pool.

Instead of implementing our own memory management system (which would
have been awkward given the Rabbit's bank-switched memory map), we
chose to remove all references to \texttt{malloc} and statically
allocate all variables.  This prompted us to drop support for the
variable key and block sizes in the iSSL library.
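
One way to picture the result is a fixed table of per-connection
state whose buffers are sized for the single supported key and block
length.  The sketch below is an assumption about how such code might
look, not an excerpt from the port: \texttt{struct session},
\texttt{session\_acquire}, and \texttt{session\_release} are
hypothetical, and the reuse of slots goes slightly beyond the purely
static allocation described above.

\begingroup\small
\begin{verbatim}
#include <stddef.h>
#include <string.h>

#define AES_KEY_BYTES   16  /* 128-bit key   */
#define AES_BLOCK_BYTES 16  /* 128-bit block */
#define MAX_SESSIONS     3  /* compile-time  */

struct session {
    unsigned char key[AES_KEY_BYTES];
    unsigned char block[AES_BLOCK_BYTES];
    int           in_use;
};

/* All storage is reserved statically. */
static struct session pool[MAX_SESSIONS];

/* Claim a free slot, or NULL if none is left. */
struct session *
session_acquire(const unsigned char *key)
{
    size_t i;
    for (i = 0; i < MAX_SESSIONS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            memcpy(pool[i].key, key,
                   AES_KEY_BYTES);
            memset(pool[i].block, 0,
                   AES_BLOCK_BYTES);
            return &pool[i];
        }
    }
    return NULL;
}

/* Free the slot; this also wipes the key. */
void session_release(struct session *s)
{
    memset(s, 0, sizeof *s);
}
\end{verbatim}
\endgroup

Because every buffer has a fixed size, supporting 192- or 256-bit keys
would mean enlarging every slot to the worst case, which is one reason
a single key length is attractive on a small device.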

\subsection{Program Structure}

As we often found during the porting process, the original
implementation made use of high-level operating system functions such
as \texttt{fork} that were not provided by the rudimentary environment
on the RMC2000.  This forced us to restructure the program
significantly.

The original TLS implementation handles an arbitrary number of
connections using the typical BSD sockets approach shown below.  It
first calls \texttt{listen} to begin listening for incoming
connections, then calls \texttt{accept} to wait for a new incoming
connection.  Each request returns a new file descriptor passed to a
newly-\texttt{fork}ed process that handles the request.  Meanwhile,
the main loop immediately calls \texttt{accept} to get the next
request.

\begin{alltt}\small
listen(listen_fd);
for (;;) \{
  accept_fd = accept(listen_fd);
  if ((childpid = fork()) == 0) \{
    \textrm{// process request on accept\_fd}
    exit(0); \textrm{// terminate process}
  \}
\}
\end{alltt}


Unfortunately, the Dynamic~C environment provides neither the standard
Unix \texttt{fork} nor an equivalent of \texttt{accept}.  In the
RMC2000's TCP implementation, the socket bound to the port also
handles the request, so each connection is required to have a
corresponding call to \texttt{tcp\_listen}.  Furthermore, although
Dynamic~C provides concurrent processes in the form of costatements,
their number is fixed at program compile-time.

Thus, to handle multiple connections and processes, we split the
application into four processes: three processes to handle requests
(allowing a maximum of three connections), and one to drive the TCP
stack (Figure~\ref{fig:requests}).  We could easily increase the
number of processes (and hence simultaneous connections) by adding
more costatements, but the program would have to be re-compiled.

\begin{figure}
\begin{alltt}\scriptsize
for (;;) \{
  costate \{
    tcp_listen(socket1,TLS_PORT, ...);
    while (sock_established(socket1) == 0) yield;
    \textrm{// handle request}
  \} 
  costate \{
    tcp_listen(socket2,TLS_PORT, ...);
    while (sock_established(socket2) == 0) yield;
    \textrm{// handle request}
  \}
  costate \{
    tcp_listen(socket3,TLS_PORT, ...);
    while (sock_established(socket3) == 0) yield;
    \textrm{// handle request}
  \}
  costate \{
    \textrm{// drive TCP stack}
    tcp_tick(NULL);
  \}
\}
\end{alltt}
\caption{The structure of the main loop of the TLS server, which can
  handle at most three requests because it is limited to four
  processes.}
\label{fig:requests}
\end{figure}


\section{Experimental Results}
\label{sec:experiments}

To estimate which optimization techniques were worthwhile, we compared
the~C implementation of the AES algorithm (Rijndael) included with the
iSSL library with a hand-coded assembly version supplied by Rabbit
Semiconductor.  A testbench that pumped keys through the two
implementations of the AES cipher showed that the handcrafted assembly
implementation ran faster than the~C port by a factor of 15--20.

We tried a variety of optimizations on the~C code (e.g., moved
data to root memory, unrolled some loops, turned off debugging,
enabled compiler optimization), but this only improved its run time by
perhaps 20\%.

Code size appeared uncorrelated to execution speed.  The assembly
implementation was 9\% smaller than the~C, but ran more than an order
of magnitude faster.
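
A testbench of this kind can be sketched in standard~C as follows.
This is a host-side illustration only: \texttt{toy\_cipher} is a
stand-in that merely XORs the key into the block, \texttt{time\_cipher}
and \texttt{speedup} are hypothetical names, and on the board itself a
hardware timer would presumably have to replace \texttt{clock()}.

\begingroup\small
\begin{verbatim}
#include <time.h>

#define BLOCK_BYTES 16

typedef void (*cipher_fn)(unsigned char *block,
                          const unsigned char *key);

/* Stand-in for a real cipher implementation. */
void toy_cipher(unsigned char *block,
                const unsigned char *key)
{
    int i;
    for (i = 0; i < BLOCK_BYTES; i++)
        block[i] ^= key[i];
}

/* Run the cipher rounds times with a changing
   key; return elapsed CPU time in seconds. */
double time_cipher(cipher_fn encrypt,
                   unsigned long rounds)
{
    unsigned char key[BLOCK_BYTES] = {0};
    unsigned char block[BLOCK_BYTES] = {0};
    unsigned long r;
    clock_t start = clock();

    for (r = 0; r < rounds; r++) {
        key[r % BLOCK_BYTES]++; /* next key */
        encrypt(block, key);
    }
    return (double)(clock() - start)
           / CLOCKS_PER_SEC;
}

/* How many times faster the candidate is. */
double speedup(double baseline_s,
               double candidate_s)
{
    return candidate_s > 0.0
         ? baseline_s / candidate_s : 0.0;
}
\end{verbatim}
\endgroup

Passing each implementation to \texttt{time\_cipher} with the same
number of rounds and dividing the two times gives a ratio of the kind
reported above.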

Debugging and testing consumed the majority of the development time.
Many of these problems came from our lack of experience with Dynamic~C
and the RMC2000 platform, but unexpected, undocumented, or simply
contradictory behavior of the hardware or software and its
specifications also presented challenges.

\section{Conclusions}
\label{sec:conclusions}

We described our experiences porting a library and server for a
transport-level security protocol (specifically iSSL) onto a small
embedded development board (the RMC2000, based on the Z80-inspired
Rabbit~2000 microcontroller).  While the Dynamic~C development
environment supplied with the board gave useful, necessary support for
some hardware idiosyncrasies (e.g., its bank-switched memory
architecture), support for concurrent programming (through
language-level support for cooperative multitasking in the form of
costatements and cofunctions), and a TCP/IP stack, porting the
service, originally designed for a Unix-like environment, was not
trivial.

The biggest challenges revolved around different or missing APIs, such
as the substantial difference between BSD-like sockets and the provided
TCP/IP implementation or the simple absence of a filesystem.  Our
solutions to these problems involved either writing substantial amounts of
additional code to implement the missing library functions or
reworking the original code to use (or simply avoid) the API.

Finally, we compared the speed of our direct port of a C
implementation of the AES (Rijndael) cipher with a hand-optimized
assembly version and found a disturbing factor of 15--20 in
performance in favor of the assembly.

From all of this, we conclude that there must be a better way.
Understanding and dealing with differences in operating environment
(effectively, the API) is a tedious, error-prone task that should be
automated, yet we know of no work beyond high-level language compilers
that confronts this problem directly.

% Applies to the next page
\advance\textheight -5.25in

\bibliographystyle{plain}
\bibliography{date2003}

% Applies to the current page
\advance\footskip 5.25in

\end{document}