Overview

This page provides a brief description of the rhoScript language and compiler. It is currently incomplete, and will, in the future, contain a more detailed specification. At the moment, it describes the syntax of rhoScript, which includes briefly how the arithmetic encoder works, and an overview of the compiler.


The rhoScript High-Level Language Syntax

The high-level language has a very simple syntax. A function is surrounded by parentheses, and contains zero or more tokens separated by spaces. A token is either a function, or matches the regex ([*a-zA-Z0-9]+|"([^"]|\\")*"). The entire program is a sequence of tokens.

There are three basic "types" of tokens. The first type of tokens are builtin functions. They take argument(s) off of a global stack, operate on the data, and then push the result(s) back on to the stack. The second type of tokens are functions, integers, and strings. These have no effect other than that they are pushed on to the stack. Finally, there are modifiers. Modifiers modify (gasp) the behavior of a function. All modifiers must appear before any other tokens in a function. The actual meaning of these tokens are described below.


The rhoScript Low-Level Language Syntax

The low-level language has a significantly more complicated syntax. It is entirely arithmetic encoded based on the type of the elements on the stack.

As a first-order approximation, we can imagine the low-level language as a sequence of variable-length tokens, each of which represents a single builtin function being called, or indicating the beginning of a function being placed on the stack. Determining the decoding of a token requires knowledge of the stack.

Take as an example a stack already containing two integers, with the next token being 26. If there are 35 different possible builtins that could be called on two integers (they need not require exactly two integers, only be valid with two integers on the stack -- the following are all valid functions of two integers: duplicate the top element, increment the top element, add the top two elements, swap the top two elements), then we can think of the decoding of this token as the 26th possible function (in some consistent ordering). Note that this means the fewer number of possible functions we could be referencing, the smaller the size of the token.

Functions are represented by a specific token indicating the start of a token, followed by any modifiers, followed by the length of the function (in bits), followed by the data in the function.


The rhoScript Common Lisp Compiler

We implemented rhoScript in Common Lisp for several reasons. Least importantly, common lisp is fast: we use SBCL, which compiles lisp to x86 efficiently. Next, lisp has macros, which makes defining the builtin functions much nicer and reduces much of the boilerplate code needed. The last of the unimportant reasons is that we had never written a serious project in common lisp and wanted to get better at it.

We now get to the important reason we picked common lisp: it is very easy to generate lisp code programatically. Since lisp is just the AST, we don't need to do anything fancy in the compiler outputting code. While this explains why we chose lisp as the target language, it doesn't explain why we use lisp as the compiler language. We discuss that below.

The compiler is divided in to two pieces, which we will refer to as the compilation component and the runtime component. The compilation component turns the high-level rhoScript in to low-level code; it invokes the runtime component to help with this. The runtime component takes the low-level code, uncompresses it to the actual commands, compiles it to lisp, and then runs that lisp code.

The compilation component works in several steps. First, it takes the input program and associates the named commands to the builtin functions, and splits the functions in to different lisp defuns. Before each command is executed for the first time, it inspects the stack and uses this information to determine how to compress itself. It is for this reason that the compilation requires the runtime environment: it is not possible to, statically, determine the types of the values on the stack for all sequences of commands. As such, the compiler begins by running type inference as far as possible, and when it gets stuck (i.e., it has encountered a function whose output it can not predict statically) it invokes the runtime component to run up until that point. The compiler then resumes computing the mapping to compiled commands.

Given the mapping of commands to compressed versions, the compiler component then runs an arithmetic encoder over each function. It prefixes the function with the length, and then recurses. The result is a single arithmetic-encoded piece of data, which is now the low-level code.

The runtime component has the inverse role. It begins with the arithmetic-encoded data. It begins by determining the initial contents of the stack and decoding as much of the code as possible with the same static analysis. The entire state of the runtime code-generation component is then placed in to a continuation and is placed at the end of the decompressed function calls, the point at which the static analysis stopped. The runtime component then runs the generated code until we reach the point of the continuation. The compiler is then resumed, doing more static analysis, and generating more code. This process repeats until all of the code is decoded. It is now apparent why it helps that the target language is the same as the language the compiler is implemented in: since it is not possible to generate all of the compiled code ahead of time, we need to be able to call compiled code from the compiler, and be able to call the compiler from the compiled code. This is easy when the two languages are the same.

If the workings of the compiler are at all interesting to you, check it out on GitHub.