sipvxml - SIP based VoiceXML browser


sipvxml [options]


Binary evaluation "alpha" version is available. The code runs on Linux (tested on RedHat 7.1).


Sipvxml-1.20, "alpha code" released Jan, 2002.


Session Initiation Protocol (SIP) is a signaling protocol used for establishing and terminating Internet telephony calls. VoiceXML is a language designed to create audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, and recording of audio for telephony applications. Sipvxml is a SIP-based VoiceXML browser. Users can connect to the browser using SIP and take part in application-defined interactive voice response systems. Through a SIP-PSTN gateway, it also brings the advantages of VoiceXML technology to regular telephone users. Our motivation in developing sipvxml is to allow telephone users to interact actively with our IP telephony test-bed, in particular with the conferencing service and the voice mail service.


  1. Uses SIP for signaling and RTP for media transport
  2. Supports RFC2833 for signaling of DTMF digits (audio/telephone-event)
  3. Based on VoiceXML 1.0 specification
  4. Interworks with Cisco IP phone and Cisco SIP/PSTN gateway (TBD).
  5. Supports service specification in SIP URI
Planned features include:
  1. Support for more tags and attributes in VoiceXML
  2. Support call transfer
  3. Support RFC 2198 for multiple digits per packet
  4. Support audio/tone type
  5. Support speech recognition
  6. Enhance the DTMF grammar


Print the version information and exit.
-d category
Run the application in debug mode, with all program trace output sent to stdout. The software prints debugging information for the given category. Currently supported categories are all, sql, net, sdp, misc. The option can be repeated to debug multiple categories.
-o tracefile
Put all the program trace information in the tracefile. This option works in conjunction with -v option.
-p port
Use the specified port number for listening to incoming SIP calls. Default is 5060.
Use dotted decimal IP address for self address instead of host name. This option is strongly discouraged, and is provided only for testing your application against primitive devices that do not support DNS resolution for the SIP "Via" header.
-u url_or_path
Default initial URL for the VoiceXML scripts. sipvxml can extract the appropriate initial script URL from the SIP request URI. url_or_path can be a "http" URL or can simply be a file path accessible to the application.
-t url
This option is used for testing the VoiceXML interpreter from the command line. The url specifies the initial URL for testing.


VoiceXML: The Voice Extensible Markup Language is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, and recording of audio for telephony applications. It brings the advantages of web technologies to telephone users by providing an interactive voice response (IVR) mechanism.

Session Initiation Protocol: SIP is a popular Internet telephony protocol. It uses Session Description Protocol (SDP) for describing multimedia sessions and Real-time Transport Protocol (RTP) for carrying real time multimedia traffic including audio packets.

SIP-based VoiceXML browser: A SIP-VoiceXML browser is a software application to which an Internet telephone user can connect and with which the user can interact. The browser is in some ways similar to a web browser. A web browser fetches content from the web server and displays it to the user. Similarly, the VoiceXML browser fetches the vxml pages from the web server and presents the interactive dialog to the telephone user. The following diagram shows an example scenario where the browser can be accessed by a SIP phone user as well as a regular telephone user.

The browser typically uses the web server for fetching the documents, VoiceXML pages or the media files. The VoiceXML pages can be either statically stored on the web server or can be dynamically generated based on some server side programming logic like HTTP-Common Gateway Interface, Java Servlet or Java Server Pages. The media files can either be stored on the web server or can be streamed in real-time from an RTSP media server. The advantage of using the media server is that it can directly stream the RTP packets to the caller's SIP phone.

Overall design: This section presents an overview of our implementation. The following diagram shows the various components in the browser implementation.

The implementation consists of the following modules:

SIP interface
We use our SIP C/C++ library for implementing a SIP user agent interface to the browser that can receive Internet telephony calls.
RTP interface
We use an RTP Library to handle the RTP and RTCP implementation.
DTMF detection
A very simple DTMF detection implementation can detect touch tones in the incoming audio stream. We also detect telephone-events based on RFC 2833.
XML parser
We use the Apache XML parser with DOM interface.
HTTP fetcher
The XML parser has a built-in HTTP fetcher. We also use a simple HTTP fetcher for getting non-XML documents, e.g., audio files.
Text-to-speech convertor
We use IBM ViaVoice TTS SDK for speech synthesis.
Speech recognition
Current implementation does not use any speech recognition engine. All user input is through touch-tone.
VoiceXML interpreter
This is the main component of our design. It runs the interpreter on the fetched VoiceXML pages and invokes various other services.
Grammar matching rules
We have implemented a very basic DTMF grammar matching rule. This module works in conjunction with the VoiceXML interpreter.
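The text does not say how the "very simple" DTMF detector works. As one common possibility, the sketch below uses the Goertzel algorithm to measure the energy at the eight DTMF frequencies in a block of decoded audio. This is a stand-in technique for illustration, not the sipvxml detector; the function names, the 205-sample block size, the float sample format, and the threshold are all assumptions.

```python
import math

ROWS = (697, 770, 852, 941)        # DTMF row frequencies in Hz
COLS = (1209, 1336, 1477, 1633)    # DTMF column frequencies in Hz
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, rate=8000):
    """Return the signal power near `freq` using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples, rate=8000, threshold=1e3):
    """Return the DTMF key in a block of linear samples, or None.

    The threshold is an assumed value suited to unit-amplitude float
    samples; a real detector would calibrate it and add further checks.
    """
    row_p = [goertzel_power(samples, f, rate) for f in ROWS]
    col_p = [goertzel_power(samples, f, rate) for f in COLS]
    r = max(range(4), key=row_p.__getitem__)
    c = max(range(4), key=col_p.__getitem__)
    if row_p[r] < threshold or col_p[c] < threshold:
        return None
    return KEYS[r][c]


def make_tone(f1, f2, count=205, rate=8000):
    """Synthesize a two-frequency test tone (helper for experiments)."""
    return [math.sin(2 * math.pi * f1 * n / rate) +
            math.sin(2 * math.pi * f2 * n / rate) for n in range(count)]
```

For example, detect_key(make_tone(697, 1209)) identifies key 1, while a block of silence yields None.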
An example control flow is shown in the above diagram and explained below:
  1. When the browser receives a new incoming SIP call, it creates three threads: an RTP receive thread, an RTP send thread, and the VoiceXML interpreter thread. The RTP receive thread receives RTP/RTCP packets from the caller and invokes the DTMF detection module. The RTP send thread sends RTP/RTCP packets to the caller. Using a separate send thread helps maintain the required bandwidth (e.g., 64 kb/s for G.711 audio) for outgoing packets and decouples packet transmission from the speed of the speech synthesizer. The interpreter thread executes the main logic of the system. The incoming SIP request can carry the service URL, or a default service URL can be pre-configured in the browser using the -u command line option. For example, if the request-uri in the incoming SIP INVITE begins with sip:dialog.vxml.http%3a//, then the browser invokes the interpreter thread with the initial VoiceXML URL carried in that request-uri. If the request-uri has some other value, the interpreter is invoked with the default initial VoiceXML URL passed in the -u command line parameter.
  2. The interpreter thread calls the XML parser with the initial URL. We use the DOM API in C++.
  3. The XML parser fetches the page from the web server or a local file system (based on the initial URL).
  4. It presents the returned XML document as a tree data structure.
  5. The interpreter thread does some initialization and then invokes the Form Interpretation Algorithm on the selected form from the document.
  6. The Form Interpretation Algorithm (FIA) is specified in the VoiceXML specification. It can invoke various other modules based on the content of the VoiceXML document. For example, it can invoke the text-to-speech SDK to synthesize any prompts.
  7. The FIA can also invoke the HTTP fetcher to fetch any additional document, for example, an external grammar file or a media file for audio prompt.
  8. The HTTP fetcher implements a simple HTTP GET method to retrieve a document over HTTP.
  9. The media file retrieved from the web server using the HTTP fetcher is fragmented and enqueued for sending to the caller. Fragmentation here means splitting the file into small blocks, each carrying 20 ms of audio.
  10. The speech synthesizer output is also fragmented and enqueued for sending out to the caller.
  11. The VoiceXML document specifies the active grammar matching rules in various scopes. The FIA can set the active grammar for the matching engine based on the current scope.
  12. The RTP receive thread receives the RTP packets and invokes the DTMF detector on the audio data.
  13. If the DTMF detector detects a touch tone, it feeds the detected input to the grammar matching engine.
  14. The receive thread can also handle the special RTP packets for touch tones based on RFC 2833.
  15. The grammar matching engine receives the user input (touch tones) from the DTMF detector and tries to match it against the active grammar. If a match is found, it informs the FIA to use the input and take further action.
  16. The RTP send thread periodically sends the media packets to the caller. It does not send any packet during silence.
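Step 1 above chooses between a URL carried in the request-uri and the -u default. A minimal sketch of that choice follows. The example request-uri in the text is truncated, so the exact encoding is not known; the code assumes the user part is the prefix dialog.vxml. followed by a percent-encoded URL. The function name, that assumption, and the example.com addresses are hypothetical.

```python
from urllib.parse import unquote

# Assumed marker, taken from the truncated example in the text.
SERVICE_PREFIX = "dialog.vxml."


def initial_url(request_uri, default_url):
    """Pick the initial VoiceXML URL for an incoming call.

    Returns the URL embedded in the request-uri when it follows the
    assumed "sip:dialog.vxml.<encoded-url>@host" layout, otherwise the
    default URL (the value that would be given with -u).
    """
    if not request_uri.startswith("sip:"):
        return default_url
    user = request_uri[len("sip:"):].split("@", 1)[0]
    if user.startswith(SERVICE_PREFIX):
        return unquote(user[len(SERVICE_PREFIX):])
    return default_url
```

With this layout, a call to sip:dialog.vxml.http%3a//www.example.com/start.vxml@host.example.com would start at http://www.example.com/start.vxml, and a call to any other address would start at the default URL.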

DTMF tones: There are a number of ways in which DTMF can be transported in a call. The most common way is not to distinguish it from the spoken voice: the DTMF tones are encoded using the audio codec currently in use and sent to the remote party without any distinction between DTMF and regular speech. A second way is to define a special RTP packet format (RFC 2833) to carry the DTMF digit. Such a special packet contains the digit(s) instead of encoded audio. In the first case the receiver has to do the DTMF detection, whereas in the second case the sender or the gateway has to do it. We have implemented both methods. (A third method, transporting DTMF along with the SIP signaling messages, is not considered in this project.)
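To make the second method concrete, here is a small sketch that decodes the 4-byte telephone-event payload defined in RFC 2833: an 8-bit event code, an end bit, a reserved bit, a 6-bit volume, and a 16-bit duration in timestamp units. The function and dictionary key names are illustrative and not taken from sipvxml.

```python
import struct

# RFC 2833 event codes 0-15 map to these DTMF keys.
EVENT_KEYS = "0123456789*#ABCD"


def parse_telephone_event(payload):
    """Decode one RFC 2833 telephone-event block from an RTP payload."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload needs at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "key": EVENT_KEYS[event] if event < len(EVENT_KEYS) else None,
        "end": bool(flags & 0x80),      # E bit: last packet of this event
        "volume": flags & 0x3F,         # power level in -dBm0
        "duration": duration,           # in RTP timestamp units
    }
```

A receiver would typically act on a digit once, for example when the first packet of an event arrives or when the end bit is seen, since senders repeat the event packet for reliability.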

Text-to-speech: After accepting an incoming call, the browser fetches an initial VoiceXML page from a web server using HTTP. Once the page is available, it starts its interpreter and presents any dialog specified in the page. For instance, the vxml page may ask the user to enter a four-digit PIN to authenticate. Such a dialog is written as text in the vxml page and needs to be converted to speech. The browser invokes a text-to-speech convertor to convert each prompt and presents it to the telephone user.

We use the callback mechanism of ViaVoice for text-to-speech conversion, i.e., you give a piece of text to the library and it calls a callback function when the conversion is done. You can use this callback function to packetize the audio and send it to the remote party. Each packet should carry 20 ms of audio, i.e., if you are using G.711 mu-law, the payload is 160 bytes per packet. If the converted audio is longer than this, you need to fragment it and send one 160-byte packet every 20 ms.
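The 160-byte figure follows from the codec parameters: G.711 carries 8000 samples per second at one byte per sample, so 20 ms corresponds to 8000 × 0.020 = 160 bytes. The sketch below shows that arithmetic and one possible way to fragment a buffer of synthesized audio. The function name and the choice to pad the last frame with mu-law silence are assumptions, not the sipvxml implementation.

```python
SAMPLE_RATE = 8000      # G.711 samples per second
BYTES_PER_SAMPLE = 1    # G.711 uses 8 bits per sample
FRAME_MS = 20           # packetization interval

# 8000 samples/s * 20 ms * 1 byte = 160 bytes per packet
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE

MULAW_SILENCE = b"\xff"  # mu-law code for a zero-amplitude sample


def fragment(audio, size=FRAME_BYTES, pad=MULAW_SILENCE):
    """Split encoded audio into fixed-size frames, padding the last one."""
    frames = [audio[i:i + size] for i in range(0, len(audio), size)]
    if frames and len(frames[-1]) < size:
        frames[-1] += pad * (size - len(frames[-1]))
    return frames
```

Each returned frame would then be queued for the send thread, which transmits one frame every 20 ms.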

DTMF Grammar: We have implemented a very simple DTMF grammar. A typical dtmf tag in a vxml page may look like:

 <dtmf type="application/x-dtmf">
   1 | 2 | 3 | 4 | *
 </dtmf>
The MIME type for this grammar is "application/x-dtmf". The matching rule relies on the terminating digit '#' in the current implementation. Every input must be terminated by a '#'; for instance, to enter '2' in the above grammar the user has to press the '2' key followed by the '#' key. An implicit timeout is also implemented, so that the input is automatically accepted if the user does not press the '#' key for some time. The default timeout is approximately 5 seconds. If no grammar is specified in the document, the interpreter accepts any input. Users can press '*' '*' '#' at any time to signal the help event.
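The matching rule described above can be sketched as follows: split the grammar text on '|', wait for the terminating '#', and compare what was typed against the alternatives. The function names and the returned event labels are illustrative, and the implicit timeout is not modeled here.

```python
HELP_SEQUENCE = "**"   # '*' '*' followed by '#' signals the help event


def parse_grammar(text):
    """Turn the body of a dtmf tag, e.g. "1 | 2 | 3", into alternatives."""
    return [alt.strip() for alt in text.split("|") if alt.strip()]


def match_input(keys, grammar=None):
    """Classify the keys pressed so far.

    `keys` is the raw key sequence, e.g. "2#". Returns a pair
    (event, value) where event is one of "incomplete", "help",
    "filled" or "nomatch" (labels chosen for this sketch).
    """
    if not keys.endswith("#"):
        return ("incomplete", None)     # still waiting for '#'
    value = keys[:-1]
    if value == HELP_SEQUENCE:
        return ("help", None)
    if grammar is None or value in grammar:
        return ("filled", value)        # no grammar: accept any input
    return ("nomatch", value)
```

With the grammar above, the keys 2 # fill the field with 2, the keys 9 # produce a nomatch, and * * # raises help.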

Supported tags: The VoiceXML specification lists many tags. Although a complete VoiceXML browser implementation should support all of them, we implement a subset. In particular, we support the following tags:

assign, audio, block, catch, clear, disconnect, dtmf, error, exit, field, filled, form, goto, help, noinput, nomatch, prompt, submit, value, var, vxml.
We do not support JavaScript or any other scripting language in the browser. There are certain restrictions on using the above tags.

Example VoiceXML files

We have tested the limited functionality of the system with some example test pages. Some of the pages are shown in this section.

A simple example with goto tag is shown below:

<?xml version="1.0"?>
<vxml version="1.0">
  <meta name="author" content="John Doe"/>
  <meta name="maintainer" content="hello-support@hi.example"/>
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      <goto next="#say_goodbye"/>
    </block>
  </form>
  <form id="say_goodbye">
    <block>
      Goodbye!
    </block>
  </form>
</vxml>

The following example shows a dtmf grammar, user input, and a jump to the next page.

<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <dtmf type="application/x-dtmf">
      1 | 2 | 3
    </dtmf>
    <field name="ans">
      <prompt>Please enter your choice 1, 2 or 3!</prompt>
      <filled>
        <goto next="nextfile.vxml"/>
      </filled>
    </field>
    <catch event="noinput">
      Sorry I did not hear anything.
      Please enter the choice as one of 1, 2 or 3.
    </catch>
  </form>
</vxml>

The following example uses an external CGI script to speak the name of a person by looking it up in the yp password database. You need to enter the user identifier of the person using the telephone keypad. The script can also be accessed here.

<?xml version="1.0"?>
<vxml version="1.0">
<form>
  <field name="userid">
    <prompt>
      Hello and welcome to the Columbia VoiceXML engine.
      You can press star star pound any time for help.
      Enter the userid of the person you want the name for,
      using the buttons on your telephone.
      For example for h g s, press 4 4 7 followed by the pound key.
    </prompt>
  </field>

  <catch event="noinput help error">
    <prompt>
      Enter the userid of the person you want the name for,
      using your telephone buttons.
      For example for h g s, press 4 4 7 followed by the pound key.
      Here letters h and g appear on the same button as digit 4.
      And letter s appears on the same button as digit 7 of your
      telephone key pad.
    </prompt>
  </catch>

  <filled>
    <submit next="" namelist="userid"/>
  </filled>
</form>
</vxml>

The corresponding cgi script in Tcl is shown below. Here the script d2uid.tcl converts the number sequence to unix user id, e.g., 447 to hgs. The script uses the cgi-tcl library for handling cgi form inputs.
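#!/usr/bin/env tclsh

lappend auto_path /home/kns10/lib
package require cgi

# Emit the HTTP header followed by a blank line.
puts "Content-Type: text/plain\n"

# Import the "userid" form variable; leave it empty if it is missing.
set userid ""
catch {cgi_import userid}

# Output the text.
if {$userid == ""} {
  puts "<?xml version=\"1.0\"?>
<vxml version=\"1.0\">
  <form>
    <block>
      You entered an invalid user identifier. Try calling again!
    </block>
  </form>
</vxml>"
} else {
  set name "Unknown name"

  # Map the digits to a unix user id and look up the full name.
  # Any failure leaves the default name in place.
  catch {
    set unixids [split [exec /home/kns10/bin/d2uid.tcl $userid] \n]
    set unixid [lindex $unixids 0]
    set name [exec ypcat passwd | grep "^$unixid:" | cut -d: -f5]
    # puts "$unixids, $unixid, $name"
  }

  puts "<?xml version=\"1.0\"?>
<vxml version=\"1.0\">
  <form>
    <block>
      The name you are looking for is $name.
    </block>
  </form>
</vxml>"
}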
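The d2uid.tcl helper itself is not shown. As a rough illustration of the digit-to-userid idea (447 to hgs), the sketch below encodes each known user id with the standard telephone keypad letters and returns the ids whose encoding equals the dialed digits. The function names and the list of known ids are hypothetical; the real script may work differently.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def uid_to_digits(uid):
    """Encode a user id as keypad digits; digits in the id stay as they are."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in uid.lower())


def digits_to_uids(digits, known_uids):
    """Return every known user id that the dialed digits could stand for."""
    return [uid for uid in known_uids if uid_to_digits(uid) == digits]
```

Because several letters share a key, one digit sequence can map to more than one user id; the CGI script above simply takes the first candidate returned.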


Currently the compilation instructions for sipvxml are slightly different from those for the rest of the CINEMA components. This is because the IBM ViaVoice SDK is compiled on Linux using the egcs (older gcc) compiler, and mixing code compiled with egcs and gcc does not work. The current implementation therefore works only on Linux with the egcs compiler. Follow these steps to compile the system.
  1. Download the XML parser Xerces-C version 1.5.1 or higher.
  2. Change the file src/util/NetAccessors/Socket/UnixHTTPURLInputStream.cpp to support the query string in the HTTP GET request. This is done by the following additions.
    line 135: after the definition of portNumber
        const XMLCh*        query = urlSource.getQuery();
        char*               queryAsCharStar = XMLString::transcode(query);
        ArrayJanitor<char>  janBuf4(queryAsCharStar);
    line 189: before strcat(fBuffer, " HTTP/1.0\r\n");
        if (queryAsCharStar != 0)
        {
          strcat(fBuffer, "?");
          strcat(fBuffer, queryAsCharStar);
        }
  3. For the source distribution, compile the library using the egcs and egcs++ compilers on Linux. Version 1.5.1 needs this patch to compile the XML parser with a non-standard gcc compiler (thanks to Jonathan Lennox). Later versions may already have this patch incorporated in the code.
    $ cd /your/home/dir
    $ gunzip -c xerces-c-src1_5_1.tar.gz | tar xvf -
    $ cd xerces-c-src1_5_1/src
    $ ./runConfigure -plinux -cegcs -xegcs++ -minmem -tnative
    $ make
    This will create a shared library in the lib directory.
  4. Download and install the IBM ViaVoice TTS SDK for Linux. The standard installation puts the shared library in /usr/lib.
  5. Uncompress the sipvxml source distribution. Compile the sources as follows:
    $ cd /your/home/dir
    $ gunzip -c sipvxml-1.20.tar.gz | tar xvf -
    $ cd sipvxml-1.20
    $ ./configure --with-xerces=/your/home/dir/xerces-c-src1_5_1
    $ make -s sipvxml
    This will create the executable sipvxml/sipvxml. Set the LD_LIBRARY_PATH environment variable to include the directory containing the XML parser's library and run the server using -h option to see the usage.
    $ export LD_LIBRARY_PATH=/your/home/dir/xerces-c-src1_5_1/lib:$LD_LIBRARY_PATH
    $ ./sipvxml/sipvxml -h


You need a SIP user agent to test the application. You can use our test user agent (sipua) to do the initial testing of the system. Alternatively you can use any SIP user agent or phone that can support DTMF touch-tones either in the encoded audio stream or as RFC 2833 RTP payload for MIME type audio/telephone-event.

We will set up a test environment on port 5076 for others to test our implementation; it can be reached at (TBD). A phone number (+1-212-9397137) will also be set up for testing purposes.

The basic procedure for testing is simple. Just start the sipvxml server and dial in to it from a SIP phone (or from a PSTN phone through a SIP/PSTN gateway).

If you just want to test your VoiceXML pages or the interpreter, you can use sipvxml with the -t option.

Possible extensions

This section describes some of the possible extensions to the existing project. It also gives project ideas for students willing to work in the Internet Real-time Lab for course credit. Contact Prof. Schulzrinne if you are interested in doing some part of the project. The numbers in parentheses indicate the points. A 3-credit project should accumulate 100 points or more.

VoiceXML version 2.0 (10)

The current implementation is based on the earlier version 1.0 of the specification. Look into the form interpretation algorithm implementation and see if we need to modify it for the new version 2.0 specification or not. Check the consistency for other tags also. Version 2.0 specification is available.

Menu and choices (50)

menu is a special form of a form that allows specifying dialogs more easily in some cases. Modify the interpreter to understand the menu and choice tags. This will need extensive testing. We can assume only dtmf grammar in choice elements, or an explicit dtmf attribute for the choice tag.

Code cleanup (30)

There are many TODOs marked by XXX in the interpreter code. Try to resolve them. This might be very difficult for those who are not already familiar with the code.

RTP enhancements and DTMF detection (70)

RFC 2198 allows sending multiple digits in a single packet. Modify the interpreter code to also accept this payload format. Also change the code to accept audio/tone MIME type for telephony tones/DTMF digits. The application should correctly handle redundant coding format, i.e., the same digit is specified in both telephone-event and tone sub-types. Consider moving the RFC 2198 code to the RTP library (librtp/ directory in the source distribution).

RTP buffering (30)

Current implementation does not buffer the incoming packet, nor does it look into the time stamp field of RTP header. This means the implementation will break if the packets arrive out of order. Modify the receive thread implementation to buffer the received packets before doing DTMF detection or RFC 2833 detection. An example buffering mechanism can be found on Advanced Internet Services class web page (See homework 5 and 7 description).
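One simple way to tolerate out-of-order arrival is a small reordering buffer keyed on the RTP sequence number: hold the last few packets and always release the one with the lowest sequence number. The sketch below illustrates the idea only. The class name and depth are assumptions, and it ignores 16-bit sequence number wraparound, timestamps, and duplicate packets, all of which a real receive thread would need to handle.

```python
import heapq


class ReorderBuffer:
    """Hold up to `depth` packets and release them in sequence order."""

    def __init__(self, depth=3):
        self.depth = depth
        self.heap = []          # min-heap of (sequence number, payload)

    def push(self, seq, payload):
        """Add a packet; return the packets that are now ready to process."""
        heapq.heappush(self.heap, (seq, payload))
        ready = []
        while len(self.heap) > self.depth:
            ready.append(heapq.heappop(self.heap))
        return ready

    def flush(self):
        """Release everything that is still buffered, in order."""
        return [heapq.heappop(self.heap) for _ in range(len(self.heap))]
```

The receive thread would pass each released payload to DTMF detection or RFC 2833 handling. A larger depth tolerates more reordering at the cost of added delay before a digit is recognized.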

DTMF grammar enhancement (50)

The current DTMF grammar requires a terminating '#' after every input. Enhance the grammar matching code, possibly using our gmatch utility, to remove this requirement. The new grammar matching rule should also allow specifying a timeout as part of the matching rule. Consider the following example.
 <dtmf type="application/x-dtmf">
   1 | 2 | 3 | 4 | \* | #
 </dtmf>
A special keyword T is used to indicate a timeout. So the following grammar can be used to enter a phone number.
 <dtmf type="application/x-dtmf">
   7??? | [34]????  | 1?????????? | 011*T | ???????
 </dtmf>
In this example, the phone numbers are either an internal 4-digit number (starting with 7), a 5-digit number (starting with 3 or 4), a local number (7 digits), a US long-distance number (starting with 1), or an international number (starting with 011). The value of T is 1 second. If you expect more delay, use multiple T's, e.g., 011*TTT waits for 3 seconds before assuming the current set of dialed digits to be the international number. The matching rules are ordered from left to right, so if a rule 3? appears before 34? then 34? is ignored. However, the matching rules 3?T and 34? can co-exist. The matching rule is applied from left to right when multiple rules are specified using the binary OR ("|") operator.

? is used to match a single character, while * can match any sequence of characters including none. Square brackets [ ] are used to match one digit from a sequence, e.g., [345] matches either 3, 4, or 5.

Complex matching of the form "011*#" is also done. This particular example expects a terminating dtmf tone, "#", at the end of the international number. You can assume that any * in the matching rule will be followed by a T or a #. It is always a good practice to include a terminating character for multi-digit inputs. Note, however, that "*" can appear only once in the matching sequence. A literal '*' DTMF tone is specified using \* in the grammar.

Most scenarios should need only a simple grammar similar to the first example. However, some scenarios may require more complex grammars. A formal definition of this grammar is left for further study.
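As a starting point for this extension, the proposed notation can be translated into regular expressions. The sketch below is one possible reading, not the gmatch utility: it assumes the input is a complete string of key presses in which each second of silence has been recorded as the letter T, and it tries the rules from left to right. An incremental matcher that decides while digits are still arriving would need more work. All names are hypothetical.

```python
import re

ANY_KEY = "[0-9A-D*#]"   # any single DTMF key (never the timeout marker T)


def compile_rule(rule):
    """Translate one rule of the proposed notation into a regex."""
    out = []
    i = 0
    while i < len(rule):
        ch = rule[i]
        if ch == "\\" and i + 1 < len(rule):
            out.append(re.escape(rule[i + 1]))   # \* is a literal key
            i += 2
            continue
        if ch == "?":
            out.append(ANY_KEY)                  # exactly one key
        elif ch == "*":
            out.append(ANY_KEY + "*?")           # any key sequence, or none
        elif ch == "[":
            end = rule.index("]", i)
            out.append("[" + re.escape(rule[i + 1:end]) + "]")
            i = end
        else:
            out.append(re.escape(ch))            # digit, '#', or timeout T
        i += 1
    return re.compile("".join(out))


def match_rule(grammar, keys):
    """Return the first rule (left to right) matching `keys`, or None."""
    for rule in (r.strip() for r in grammar.split("|")):
        if rule and compile_rule(rule).fullmatch(keys):
            return rule
    return None
```

With the phone number grammar above, the input 7123 selects the rule 7???, and 0114420T (digits followed by one second of silence) selects 011*T.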

Speech recognition (100)

Use the ViaVoice speech recognition engine to integrate spoken input into the system. This also means that we need to use the speech grammar specified by the Voice activity group of W3C. The grammar can return the matched variables or tokens to the interpreter, which should handle them correctly when processing the user input.

Call transfer (100)

Implement the transfer tag to allow call transfer. The SIP library implementation should first be extended to support the REFER method to do call transfer. Then extend the interpreter to handle the transfer tag. You will need to use a client that can support call transfer. We can explore the cases when the browser acts as a back-to-back SIP user agent and does more advanced call control by changing the session description in re-INVITE to the caller, for example. See the internet-draft for more information on SIP interface to VoiceXML dialog server. Also see W3C Call Control requirements for voice browsers.

Better prompts (30)

Implement the count attribute to support a counter in the prompt tag. This enhances the dialog specification: the VoiceXML document can specify different prompts for different iterations. For instance, give a short prompt on the first iteration and more detailed help on the second if the user did not respond to the first prompt.
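The selection rule in the VoiceXML specification is to play the prompts whose count is the highest value not exceeding the current prompt counter. A minimal sketch of that rule, using an illustrative list of (count, text) pairs rather than the interpreter's real data structures:

```python
def select_prompts(prompts, counter):
    """Pick the prompts to play for the current prompt counter.

    `prompts` is a list of (count, text) pairs in document order. The
    prompts with the highest count that is <= counter are returned.
    """
    eligible = [count for count, _ in prompts if count <= counter]
    if not eligible:
        return []
    best = max(eligible)
    return [text for count, text in prompts if count == best]
```

For example, with one prompt at count 1 ("Enter your PIN.") and another at count 2 with longer instructions, the first visit plays the short prompt and every later visit plays the longer one.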

Bargein (30)

Implement the bargein attribute for prompts. (Lower priority)

Initial dialog (30)

Implement the initial tag. (Lower priority)

Sub-dialogs and multiple documents (100)

Implement sub-dialogs and multi-document support in the interpreter. (Lower priority)

Use HTTP POST for submission (50)

Currently we use HTTP GET for submit tag. This is not secure given that the form variables get stored in the web server's log file and can be carried to next link. Modify the code to use HTTP POST. Most of the modifications will be limited to XML parser's enhancement and changes in HTTP fetcher.
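The difference between the two methods is where the namelist variables travel: in the URL for GET, in the request body for POST. The sketch below builds both kinds of request with Python's standard library to show the contrast; it does not send anything. The function name and the example.com URL are placeholders, and the real change would be made in the C++ fetcher.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_submit(next_url, variables, method="GET"):
    """Build the HTTP request for a submit tag.

    With GET the variables are appended to the URL, where they can end
    up in server logs; with POST they are carried in the request body.
    """
    query = urlencode(variables)
    if method == "POST":
        return Request(
            next_url,
            data=query.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
    separator = "&" if "?" in next_url else "?"
    return Request(next_url + separator + query)
```

For a form variable userid=447, the GET request URL ends in ?userid=447, while the POST request keeps the URL unchanged and carries userid=447 as its body.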

Intermixed audio and prompt (50)

The current implementation cannot handle prompts intermixed with audio. Implement this so that something like the following will work.
  Your new message is <audio src="msg625.wav"/>. 
  Press 1 to delete.
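One way to approach this is to flatten the prompt into an ordered list of segments, each either text for the synthesizer or an audio file to fetch, and queue them in that order. The sketch below does this with Python's ElementTree for illustration; the actual interpreter uses the Xerces DOM in C++, and the segment labels, function name, and enclosing prompt element are assumptions.

```python
import xml.etree.ElementTree as ET


def prompt_segments(prompt_xml):
    """Split a prompt element into ordered ("tts" | "audio", value) pairs."""
    root = ET.fromstring(prompt_xml)
    segments = []

    def add_text(text):
        # Collapse whitespace; skip pieces that are empty after that.
        text = " ".join((text or "").split())
        if text:
            segments.append(("tts", text))

    add_text(root.text)                 # text before the first child
    for child in root:
        if child.tag == "audio":
            segments.append(("audio", child.get("src")))
        add_text(child.tail)            # text following this child
    return segments
```

For the example above, wrapped in a prompt element, this yields the text "Your new message is", then the audio file msg625.wav, then the remaining text.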

Speech enhancements (100)

VoiceXML specifies various tags for enhancing synthesized speech. See the Speech Synthesis Markup Language. Use the IBM SDK to implement these new tags.

Control logic in VoiceXML (100)

The current implementation uses variables without any fine-grained control. You can just assign and use a variable. You cannot update it, for example to increment a counter variable. This causes various programming restrictions.

Enhance the code to support simple arithmetic and string operations on variables. Also make it refer to variables using the dotted representation specified in the VoiceXML specification (e.g., mainform.done represents the variable done in the object mainform). Implement the cond attribute for simple conditional operators.

Enhance the implementation to support the if, else and elseif tags.

Error Logging (50)

Implement the log tag in VoiceXML. The logs should be stored in the log file used by the developer for debugging purpose.

Enhance the logging to post the log messages to an HTTP web server so that others can use our VoiceXML browser for testing and debugging their VoiceXML pages.

Recording (100)

Implement the record tag to allow recording of the audio stream. It should allow recording on either the web server (use the PUT method to upload the locally recorded file) or the RTSP media server (using our RTSP client library and RTSP media server). Details have to be worked out for using the record and submit tags with real-time recording.

Throwing exceptions (50)

Implement throw tag to allow throwing exceptions.

Different file formats (30)

Implement support for wav and rtpdump formats for prompt files. The code for wav and rtpdump will be provided from an ongoing project.

Prefetching (100)

Implement the resource fetching as specified by the VoiceXML specification. This should be incorporated in all applicable tags. This involves caching also.

Properties (80)

Implement common properties, e.g., generic DTMF recognizer properties, and prompt properties. Consider adding these properties into the specific tags also.

Scripting (100)

Implement the script tag for simple scripts (e.g., Javascript). Since the implementation is in C and C++ we can consider defining another simple script with library functions. The idea is to allow simple calculation and error checking as done by Javascript for HTML pages.


Authors can be reached at


Thanks to Sean and Visda for help in implementation and testing.

Sipvxml uses XML parser from Apache (Xerces-C), text-to-speech convertor from IBM (ViaVoice) and RTP Library from Lucent Elemedia.


Copyright 2001-2002 by Columbia University; all rights reserved
Sipvxml is subject to licensing.

Last updated by Kundan Singh