Forth and Optimization

Forth is a strange old language. It's among a relatively small number of concatenative languages that achieved any measure of success. Anyway, this isn't something we're going to go into much here because we're thinking about something nearly tangential to the language itself.

How you can make it fast.

Stacks are somewhat slow compared to registers. Using the cache has a cost, even when your program is using optimal access patterns, especially on load store architectures(which seem to be coming back into vogue again).

Anyway, on register machines, forth isn't much in terms of hot shit. Sure there are compilers that manage to do great with it, but that doesn't change the fact that as an execution model it kinda leaves a lot to be desired these days. The other issue is that it's not entirely clear how one is to write a compiler that does that kind of transformation on the execution, or it isn't to us, though that's probably more a gap in our knowledge than any sort of extreme difficulty.

Okay, so forth is simple enough, it has a stack of words of some size, a return stack, and no particular datatypes handled seamlessly other than integers of the word size. Each function(in forth such things are called 'words' because they, at least in the past, were looked up in a 'dictionary), each function can consume any number of items from the stack and result in any number of items being placed onto the stack. There's some stuff about immediate mode execution (compile time stuff), but it uses the same kind of model.

This is hard to map onto registers because a function can return any number of things. But there's a possibility for analysis here that might help. Let's make a simple function that rotates the top three items on the stack twice, so that the topmost item becomes the bottom-most.

: rot2 ( a b c -- b a c ) 
   rot ( 3 in, 3 out)
   rot ( 3 in, 3 out) ;

The first line, the definition doesn't really enforce anything here. The ( a b c -- c b a ) is just a comment to help meagre creatures that need reminders of such things (as opposed to the gods of forth or whoever). But what we can see is that by the end of the function there it remains clearly balanced, it will consume and emit 3 items no matter what.

Let's assume we can assemble rot into something sensible, we're going to use some generic bastardized assembly here because it's the best we can do.

ROT:
  move S0, T0
  move S1, S0
  move S2, S1
  move T0, S2
  ret

Of course, in general you'd want to use something like Single Static Assignment as an intermediate here, but we are even less familiar with that than assembly.

To call this without any analysis we think you would need to load the information from the stack into the registers and then call into the function, but since we know how many registers we need, among other things, we can do something like compile rot2 into something like this.

rot2:
  pop S2
  pop S1
  pop S0
  call ROT
  call ROT
  push S0
  push S1
  push S2
  ret

But then that's not really realistic for our desired solution. Ideally it would generate native code through a transpiler into C, then calling whatever C compiler the user selects.

void rot(word *a, word *b, word *c){
    word tmp=*a;
    *a=*b;
    *b=*c;
    *c=tmp;
}
: quadratic ( a b c x -- n )
 >R swap r@ * +
 swap R> dup * * + ;
: rot2 ( a b c -- c b a )
  rot rot ;
: quadratic ( a b c x -- n )
  >R ( R+1, S-1 )
  swap ( R,S )
  R@ ( R, S+1)
  * ( R, S-2,S+1 )
  + ( R, S-2,S+1 ) 
  R> ( R-1, S )
  dup ( R, S+1)
  * ( R, S-1)
  R@ ( R,S+1) 
  * (R, S-1) + ;

With some luck, the C compiler has a large attention span and optimization space. If it doesn't, then it may not be able to optimize the threaded code at all. And more to the point, the compiler still needs to try to be smart about combining operations prior to that, because some idioms in forth may not have obvious optimization opportunities to a C compiler.

This means that at some point, there is a need to accept the call, there's a need to write a compiler that understands the semantics as they apply to the underlying machine that you are targeting.

Of course, when you're Forth and everything is just words to you, maybe C compilers can peer deeply enough to handle most arithmetic optimizations you might want.

But part of what makes Forth attractive is that it's simple to implement. That it's easy to write something that'll give you lots of program space. It is a low level language, it doesn't bother itself tracking types, it trusts the user, for better for worse, and that's what makes forth so fast. It doesn't pause for garbage collection just as it doesn't tag its types.

Ultimately this sort of environment is more like DOS than it is our memory protected machines these days, and that means that this is likely less than suitable for handling untrusted input(not that you can't write 'provably' correct code in forth).

Leave a Reply

Your email address will not be published. Required fields are marked *