Twenty-Seventh, the Language

I have had an irresistible idea lately. What if you had Forth, but with 256 stacks? We can't start there, though. What if it were based on the venerable MUF (Multi-User Forth) language, which has some relatively simple typing and generally provides a higher-level interface? That would ameliorate one of the biggest issues I personally have in learning Forth proper: the untagged nature of the memory space.

Of course, this comes at a performance cost: 256 stacks is a lot of context to lug around, and every instruction needs room for N bytes of operand addresses, which means the decoding will be more complicated.

There are a handful of types that MUF implements, and so too shall this:

  • Integer
  • Floating point (double precision)
  • Strings
  • Dynamic growable arrays
  • Associative ordered dictionaries

Given that I'm not writing a MUCK, I don't need to worry about property or database manipulation directly, and time-slicing becomes an entirely optional component rather than an essential way to partition a single core across a hundred processor-hungry players.

MUF implements call by name, using a string-addressed call. For the sake of consistency, and because I can't imagine how awful it would be to specify the entire list of argument stacks in a variable-length instruction, the CALL instruction shall be expected to draw its word name from stack 0. This instruction will not normally be invoked by the user directly, but it may be if the user wants to permit a dynamic choice of which word is invoked.

As I am basing it off of MUF, things like sockets, file handles, etc. are understood to be integers.

The language design should permit an unhinged degree of indirection, deception, and flow while writing, as well as an unsettling mixture of register- and stack-based paradigms.

This is a project which I intend to use for an iRODS rule engine, but my intention is to make this flexible enough that you could, if you wanted, write an actual muck with it.

My current thought on how this should look for the primitive instructions is like this:

pop[0] (pops from the 0th stack)
123[1] (pushes 123 to the 1st stack)
swap[1][1] (swaps the top two items in stack 1)
swap[1][2] (swaps the top item from stack 1, and the top item from stack 2)
rot[1][1][1] (rotate the top 3 items of stack 1)
rot[1][1][2] (rotate the top 2 items of stack 1 and the top item of stack 2)
rot[1][2][1] (rotate the top item of stack 1, the top item of stack 2, and the second item of stack 1)
rot[1][2][3] (rotate the top item of stacks 1, 2, and 3)
... etc

In any case, I'm quite excited; this should be a language that allows all sorts of silly hacks.

Forth and Optimization

Forth is a strange old language. It's among a relatively small number of concatenative languages that have achieved any measure of success. Anyway, this isn't something we're going to go into much here, because we're thinking about something nearly tangential to the language itself.

How you can make it fast.

Stacks are somewhat slow compared to registers. Using the cache has a cost, even when your program is using optimal access patterns, especially on load-store architectures (which seem to be coming back into vogue).

Anyway, on register machines, Forth isn't much in terms of hot shit. Sure, there are compilers that manage to do great with it, but that doesn't change the fact that as an execution model it leaves a lot to be desired these days. The other issue is that it's not entirely clear (to us, at least) how one writes a compiler that does that kind of transformation on the execution, though that's probably more a gap in our knowledge than any sort of extreme difficulty.

Okay, so Forth is simple enough: it has a stack of words of some size, a return stack, and no particular datatypes handled seamlessly other than integers of the word size. Each function (in Forth such things are called 'words' because, at least in the past, they were looked up in a 'dictionary') can consume any number of items from the stack and leave any number of items on it. There's some stuff about immediate-mode execution (compile-time stuff), but it uses the same kind of model.

This is hard to map onto registers because a function can return any number of things. But there's a possibility for analysis here that might help. Let's make a simple function that rotates the top three items on the stack twice, so that the topmost item becomes the bottom-most.

: rot2 ( a b c -- c a b )
   rot ( 3 in, 3 out )
   rot ( 3 in, 3 out ) ;

The first line, the definition, doesn't really enforce anything here. The ( a b c -- c a b ) is just a comment to help meagre creatures that need reminders of such things (as opposed to the gods of Forth, or whoever). But what we can see is that the word is clearly balanced: it will consume and emit three items no matter what.

Let's assume we can assemble rot into something sensible; we're going to use some generic bastardized assembly here because it's the best we can do.

  move S0, T0
  move S1, S0
  move S2, S1
  move T0, S2

Of course, in general you'd want to use something like Static Single Assignment as an intermediate here, but we are even less familiar with that than with assembly.

To call this without any analysis, we think you would need to load the information from the stack into registers and then call into the function; but since we know how many registers we need, among other things, we can compile rot2 into something like this:

  pop S2
  pop S1
  pop S0
  call ROT
  call ROT
  push S0
  push S1
  push S2

But that's not really realistic for our desired solution. Ideally it would generate native code by transpiling to C, then calling whatever C compiler the user selects.

void rot(word *a, word *b, word *c){
    word tmp = *a;
    *a = *b;
    *b = *c;
    *c = tmp;
}

A word that juggles the return stack, like quadratic below, gives a taste of the bookkeeping involved; the comments track the change to the return stack (R) and data stack (S) at each word:

: quadratic ( a b c x -- n )
  >R   ( R+1, S-1 )
  swap ( R, S )
  R@   ( R, S+1 )
  *    ( R, S-2,S+1 )
  +    ( R, S-2,S+1 )
  swap ( R, S )
  R>   ( R-1, S+1 )
  dup  ( R, S+1 )
  *    ( R, S-2,S+1 )
  *    ( R, S-2,S+1 )
  +    ( R, S-2,S+1 ) ;

With some luck, the C compiler has a large enough attention span and optimization space. If it doesn't, it may not be able to optimize the threaded code at all. And more to the point, the compiler still needs to be smart about combining operations before handing the code over, because some Forth idioms may not present obvious optimization opportunities to a C compiler.

This means that at some point you have to bite the bullet: someone needs to write a compiler that understands the semantics as they apply to the underlying machine being targeted.

Of course, when you're Forth and everything is just words to you, maybe C compilers can peer deeply enough to handle most arithmetic optimizations you might want.

But part of what makes Forth attractive is that it's simple to implement; it's easy to write something that'll give you lots of program space. It is a low-level language: it doesn't bother tracking types, and it trusts the user, for better or worse. That's what makes Forth so fast. It doesn't pause for garbage collection, just as it doesn't tag its types.

Ultimately this sort of environment is more like DOS than our memory-protected machines these days, and that means it is likely less than suitable for handling untrusted input (not that you can't write 'provably' correct code in Forth).