This page provides a brief description of the rhoScript language and compiler.
It is currently incomplete, and will, in the future, contain a more detailed
specification. At the moment, it describes the syntax of rhoScript, which includes
briefly how the arithmetic encoder works, and an overview of the compiler.
The rhoScript High-Level Language Syntax
The high-level language has a very simple syntax. A function is surrounded by
parentheses, and contains zero or more tokens separated by spaces. A token is
either a function, or matches the regex ([*a-zA-Z0-9]+|"([^"]|\\")*").
The entire program is a sequence of tokens.
There are three basic "types" of tokens. The first type of tokens are builtin
functions. They take argument(s) off of a global stack, operate on the data, and
then push the result(s) back on to the stack. The second type of tokens are functions,
integers, and strings. These have no effect other than that they are pushed on to the
Finally, there are modifiers.
Modifiers modify (gasp) the behavior of a function. All modifiers must appear before
any other tokens in a function. The actual meaning of these tokens are described below.
The rhoScript Low-Level Language Syntax
The low-level language has a significantly more complicated syntax. It is entirely
based on the type of the elements on the stack.
As a first-order approximation, we can imagine the low-level language as a sequence
of variable-length tokens, each of which represents a single builtin function being
called, or indicating the beginning of a function being placed on the stack.
Determining the decoding of a token requires knowledge of the stack.
Take as an example a stack already containing two integers, with the next token being
26. If there are 35 different
possible builtins that could be called on two integers (they need not require
exactly two integers, only be valid with two integers on the stack -- the following
are all valid functions of two integers: duplicate the top element, increment
the top element, add the top two elements, swap the top two elements), then we can
think of the decoding of this token as the 26th possible function (in some consistent
ordering). Note that this means the fewer number of possible functions we could be
referencing, the smaller the size of the token.
Functions are represented by a specific token indicating the start of a token,
followed by any modifiers,
followed by the length of the function (in bits), followed by the data in the function.
The rhoScript Common Lisp Compiler
We implemented rhoScript in Common Lisp for several reasons. Least importantly,
common lisp is fast: we use SBCL
, which compiles
lisp to x86 efficiently. Next, lisp has macros, which makes defining the
builtin functions much nicer and reduces much of the boilerplate code needed.
The last of the unimportant reasons is that we had never written a serious project
in common lisp and wanted to get better at it.
We now get to the important reason we picked common lisp: it is very easy to
generate lisp code programatically. Since lisp is just the AST, we don't need
to do anything fancy in the compiler outputting code. While this explains why we
chose lisp as the target language, it doesn't explain why we use lisp as the
compiler language. We discuss that below.
The compiler is divided in to two pieces, which we will refer to as the compilation
component and the runtime component. The compilation component turns the high-level
rhoScript in to low-level code; it invokes the runtime component to help with this.
The runtime component takes the low-level code, uncompresses it to the actual commands,
compiles it to lisp, and then runs that lisp code.
The compilation component works in several steps. First, it takes the input program
and associates the named commands to the builtin functions, and splits the functions
in to different lisp defuns. Before each command is executed for the first time, it
inspects the stack and uses this information to determine how to compress itself.
It is for this reason that the compilation requires the runtime environment: it is
not possible to, statically, determine the types of the values on the stack for
all sequences of commands. As such, the compiler begins by running type inference
as far as possible, and when it gets stuck (i.e., it has encountered a function whose
output it can not predict statically) it invokes the runtime component to run up until
that point. The compiler then resumes computing the mapping to compiled commands.
Given the mapping of commands to compressed versions, the compiler component then
runs an arithmetic encoder over each function. It prefixes the function with the
length, and then recurses. The result is a single arithmetic-encoded piece of data,
which is now the low-level code.
The runtime component has the inverse role. It begins with the arithmetic-encoded
data. It begins by determining the initial contents of the stack and decoding as much
of the code as possible with the same static analysis. The entire state of the
runtime code-generation component is then placed in to a continuation and is placed
at the end of the decompressed function calls, the point at which
the static analysis stopped. The runtime component then runs the generated code
until we reach the point of the continuation. The compiler is then
resumed, doing more static analysis, and generating more code. This process repeats
until all of the code is decoded.
It is now apparent why it helps that the target language is the same as the language
the compiler is implemented in: since it is not possible to generate all of the
compiled code ahead of time, we need to be able to call compiled code from the compiler,
and be able to call the compiler from the compiled code. This is easy when the two
languages are the same.
If the workings of the compiler are at all interesting to you, check it out on