Date: Thu, 27 Aug 1992 11:37:12 +0200
From: Martin Schoenert <martin.schoenert@math.rwth-aachen.de>
Subject: Re: Load time for libraries...

Yesterday I wrote about the loading of the libraries. In this e-mail I
will discuss the interpretation of GAP programs.

When a GAP function is read into GAP, the reader builds the syntax tree
for the function. Such a syntax tree consists of a number of nodes, one
for each operator or syntactic construct. Each node has the operands as
descendants. For example, one could picture the syntax tree for the
statement

if a < 0  then a := b + 2 * a; fi;

as follows

    if
   /  \
  <    \
 / \    \
a   0    :=
        /  \
       a    +
           / \
          b   *
             / \
            2   a

This statement is executed by evaluating this syntax tree in a *recursive
descent* way. Let us look a little bit closer at this. First 'EVAL' is
called and the entire tree is passed as argument. 'EVAL' sees that the
type of the root node is 'if' (actually 'T_IF', which is 49), and calls
the function 'EvIf', again passing the tree as argument. 'EvIf' first
calls 'EVAL' recursively, this time passing the subtree for 'a < 0' as
argument. 'EVAL' will call 'EvLt' (less than) and so on.
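
To make this a little more concrete, here is a very small sketch in C of
how such a dispatcher could look. Only 'EVAL' and 'EvIf' are names that
actually appear above; 'Node', 'HdTrue', 'HdVoid' and the struct layout
are invented for this sketch and only stand in for the corresponding
machinery in the GAP kernel.

    typedef struct Node {
        int           type;         /* T_IF, T_LT, ..., one per construct  */
        struct Node  *operand[3];   /* the operands as descendants         */
    } Node;

    enum { T_IF, T_LT, T_ASS };     /* node types (real GAP values differ) */

    extern Node  *HdTrue, *HdVoid;  /* the boolean 'true' and 'no value'   */

    Node * EVAL ( Node * hd );      /* forward declaration                 */

    Node * EvIf ( Node * hd )       /* handler for an 'if' statement       */
    {
        if ( EVAL( hd->operand[0] ) == HdTrue )  /* evaluate 'a < 0' first */
            EVAL( hd->operand[1] );              /* then the 'then' branch */
        return HdVoid;
    }

    Node * EVAL ( Node * hd )       /* dispatch on the type of the node    */
    {
        switch ( hd->type ) {
        case T_IF:  return EvIf( hd );
        /* case T_LT: return EvLt( hd );  and so on for '<', ':=', ...     */
        default:    return HdVoid;
        }
    }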

Now if GAP had a bytecode interpreter, things would work a little bit
differently. Then a translator would produce code for an abstract
bytecode machine from the syntax tree. For example for the above
syntax tree it might produce something like

    push-var a          ; push the value of the variable a
    push-const 0        ; push the constant 0
    less-than           ; stack a 0 -> stack <result of a < 0>
    branch-if-false l0  ; jump to label l0 if the result is false
    push-var b
    push-const 2
    push-var a
    mult
    add
    store-var a         ; store the result in the variable a again
l0: <code for next statement>

The abstract machines used in such schemes are usually stack machines.
(It is interesting to speculate why stack machines, which are no longer
used in hardware processor designs, are still quite popular for bytecode.
I think the reason is that it is not possible to do several things in
parallel in software; in other words, there is no software equivalent of
the *<n>-ported register file*.) Such machines usually have 256 (or
fewer) instructions, so each instruction can be stored in one byte,
hence the name.
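
Purely as an illustration (the opcode names below are made up, and a real
instruction set would have to be chosen much more carefully, see below),
the instructions from the listing above might be declared in C like this,
each opcode fitting into a single byte:

    /* Hypothetical opcodes for the listing above; since there are far     */
    /* fewer than 256 of them, each one fits into a single byte.           */
    enum {
        PUSH_VAR,           /* push the value of a variable                */
        PUSH_CONST,         /* push a constant                             */
        LESS_THAN,          /* compare the two topmost stack values        */
        BRANCH_IF_FALSE,    /* conditional jump to a label                 */
        ADD, MULT,          /* arithmetic on the two topmost stack values  */
        STORE_VAR           /* pop the top of the stack into a variable    */
    };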

This bytecode is then interpreted by the abstract machine, which is
usually implemented as one large function that basically looks as
follows

typedef long object;        /* stands in for a handle to a GAP object      */

static unsigned char  *pc;  /* program counter into the bytecode           */
static object         *sp;  /* top of the operand stack                    */

void interpret ()
{
    while ( 1 ) {
        switch ( *pc++ ) {
        case PUSH_VAR:  *sp++ = **(object**)pc;    /* push value of a var. */
                        pc += sizeof(object*);
                        break;
        case ADD:       sp[-2] = sp[-2] + sp[-1];  /* add the two topmost  */
                        sp--;                      /* values on the stack  */
                        break;
        /* ... one case for each instruction of the abstract machine ...   */
        }
    }
}

Except of course the true code is more complex, because it must test the
types of the operands of the 'add', test whether an overflow occurred,
etc. There are slightly different methods to implement the abstract
machine, but the choice usually does not make much difference.
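
For example, a realistic 'add' would, very roughly, have to do something
like the following: test that both operands are small integers, use a
fast path when the sum still fits, and fall back to a generic addition
(long integers and so on) otherwise. All of the names in this sketch
('IsSmallInt', 'IntOf', 'ObjOf', 'SumObjects', and the bound) are
invented; they only stand in for whatever the kernel really provides.

    extern int     IsSmallInt ( object hd );       /* is this a small int? */
    extern long    IntOf      ( object hd );       /* its value as a long  */
    extern object  ObjOf      ( long i );          /* and back again       */
    extern object  SumObjects ( object l, object r ); /* generic addition  */

    object AddObjects ( object l, object r )
    {
        if ( IsSmallInt(l) && IsSmallInt(r) ) {
            long sum = IntOf(l) + IntOf(r);
            if ( -(1L<<28) <= sum && sum < (1L<<28) )  /* no overflow?     */
                return ObjOf( sum );                   /* fast path        */
        }
        return SumObjects( l, r );     /* overflow or non-integer operands */
    }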

One can argue that interpreting the syntax tree is not that much
different from having an abstract machine interpret bytecode. Namely the
type of each node is the bytecode, i.e., an instruction of an abstract
machine, and the children of this node are the operands of this
instruction. So the main difference is that the operand expressions can
become arbitrarily complex.

Now we think that a true bytecode implementation can be faster by about
a factor of 4 than the recursive descent evaluator currently in GAP.
But to get this factor of 4, the following optimizations must be made:

* The instructions of the abstract machine must be carefully selected
to ensure that common operations can be executed with only a few
instructions. For example, there should be an instruction to
increment the value on the stack top by one, because statements
such as 'i := i + 1;' are very common (a sketch follows after this
list).

* The C compiler must do a reasonable job of optimizing the function
'interpret'. For example, 'pc' and 'sp' (which need to be global
variables) should be allocated registers and only spilled to memory
when necessary, the 'switch' statement should be compiled to code
that uses a jump table, etc. (Those optimizations were not
available in the compilers we used when we started with GAP.)

* The translator, which translates the syntax tree into the bytecode,
must perform some optimizations. For example in a statement such as
'if S.orbit[1]^g = S.orbit[1] then ...' the value of 'S.orbit[1]'
should be evaluated only once.
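
To illustrate the first point: an 'incr' instruction (the name is made
up) that increments the value on the stack top by one would be one more
case in the 'interpret' function above, and 'i := i + 1;' could then
compile to 'push-var i; incr; store-var i' instead of four instructions.

    /* additional case in the 'switch' of 'interpret' above, assuming an   */
    /* 'INCR' opcode was added to the instruction set; type and overflow   */
    /* checks are again omitted for the sketch                             */
    case INCR:      sp[-1] = sp[-1] + 1;
                    break;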

When we started GAP we decided that we had enough work to do and that we
simply did not have enough manpower (and know-how about bytecode
interpreters) to spend much time on the implementation of the programming
language. So we decided to use the simpler recursive descent
implementation.

Now one might want to reevaluate this decision, since the situation is
not the same as it was 6 years ago. We now have data about real GAP
programs. This data (and several thought experiments) gives me a fairly
good idea of how a bytecode instruction set architecture for GAP should
look. The C compilers are much better than they were 6 years ago. A lot
more is known about optimization, so that it is quite clear how to
perform common subexpression elimination, etc.

So writing a bytecode interpreter for GAP might now be a reasonable
thing to do, except there is an even better option, which we are
currently investigating. This works as follows. You take the GAP
code and compile it to C code. Then you call the native C compiler to
compile this to machine code and link this to the GAP kernel. This idea
is used in implementations of LISP (KCL) and Scheme (Scheme->C) with very
good results. It has several advantages. First, it is quite portable.
Second, the result is machine code, which is executed without the
overhead of an interpreter (such as the one needed for a bytecode
system). Third, the C compiler will take care of many optimizations,
e.g., most C compilers nowadays perform common subexpression elimination.
So one can concentrate on other optimizations, such as determining the
types of objects at compile time instead of runtime.
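
As a rough illustration of the idea (the compiler does not exist yet, so
everything below is invented: 'IntObj', 'LtObjects', 'SumObjects' and
'ProdObjects' only stand in for whatever kernel entry points the real
compiler would call), the statement from the beginning of this e-mail,
'if a < 0 then a := b + 2 * a; fi;', might be translated into C roughly
like this:

    typedef long object;       /* stands in for a handle to a GAP object   */

    extern object  IntObj     ( long i );          /* a GAP integer object */
    extern int     LtObjects  ( object l, object r );  /* the '<' operator */
    extern object  SumObjects ( object l, object r );  /* the '+' operator */
    extern object  ProdObjects( object l, object r );  /* the '*' operator */

    object example ( object a, object b )
    {
        if ( LtObjects( a, IntObj(0L) ) )                 /* a < 0         */
            a = SumObjects( b, ProdObjects( IntObj(2L), a ) );  /* b + 2*a */
        return a;
    }

    /* If the compiler could determine at compile time that 'a' and 'b'    */
    /* hold small integers, it could instead emit plain C arithmetic:      */
    /*     if ( a < 0 )  a = b + 2 * a;                                    */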

As I said we are investigating this possibility. We hope that we will be
able to achieve a performance an order of magnitude better than the
current implementation. Of course this estimate is only valid for
functions that spend most of their time on simple operations such as
addition of integers, indexing into lists, etc. Functions such as the
Schreier-Sims algorithm, which spends most of its time multiplying
permutations, will not benefit that much. Note that I cannot make any
statements about when such a compiler might be available (except that it
will certainly not be this year ;-).

Martin.

--
Martin Sch"onert, Martin.Schoenert@Math.RWTH-Aachen.DE, +49 241 804551
Lehrstuhl D f"ur Mathematik, Templergraben 64, RWTH, D 51 Aachen, Germany

