Asteroids Static Binary Translation


I have successfully (IMO) completed two SBTs of Asteroids, both targeted at ARM based platforms. The first time around I was targeting the GameBoy Advance, and thought translating to asm was a good idea. Later, Graham Toal, pulled me into the staticrecompilers group over at Yahoo, and I realized that translating into generic C was a much better idea since the new code is plaform independent and the compiler can aid (greatly) in optimizing the program. So the last three attempts which I guess is really one attempt restarted three times is where I am today. And my desired target is a different ARM platform, the iPod MP3 player.

Graham has a good web site on the subject of SBT: http://www.gtoal.com/sbt. I have taken a different approach, more elementary, but I feel easier to follow, and possibly easier to debug. This webpage is dedicated to outlining how I did my translation, as well as a place to distribute my code for public consumption. When I ported Asteroids to the GBA I wrote this web page.

First off you really must have a working emulator, MAME, is clearly the first place to look because you at least know those games are being emulated well enough to play. You should be able to be at least as good as MAME (just execute a zillion times faster, or run on a underpowered platform).

I went looking into the MAME code originally to see what I could borrow, and was disappointed. Now that two or three years have passed I can get by reading MAME code, but at the time it was all black magic. So I did a search on the internet and the name that kept coming up was Neil Bradley and the EMU emulator from way back when that ran Asteroids on a 486. At first I couldnt find the EMU source but I found an emulator for the MAC by Steve Green, the email address still worked and he emailed me his source code! Later I found code written by Neil Bradley, possibly part of EMU, perhaps not, but he claimed it at least ran Asteroids. Steve Green basically used the same code in his emulator. Neil's code was not written in a way to run fast, but it was certainly quite readable, so readable in fact that it was not a difficult task to make a program (skel.c) that would read Neils code (6502.c and 6502.h in the sources below), and generate a big switch structure, to aid in fat fingering in the translation. The skeleton looked like this:
        //      ticks[0x06]=5; instruction[0x06]=asl6502; adrmode[0x06]=zp6502;
        case 0x06: //asl6502
            //      ticks[0x06]=5; instruction[0x06]=asl6502; adrmode[0x06]=zp6502;
            printf("case 0x%04X: //%02X %02X %02X\n",opadd,opcode,rom[opadd+1],rom[opadd+2]);
            printf("    clockticks+=5;\n");
            //void zp6502()
            //{
            //      savepc=gameImage[PC++];
            //}
            temp=rom[opadd+1];
            //void asl6502()
            //{
            //      adrmode[opcode]();
            //      value = gameImage[savepc];
            //      P= (P & 0xfe) | ((value >>7) & 0x01);
            //      value = value << 1;
            //      put6502memory(savepc,value);
            //      if (value) P &= 0xfd; else P |= 0x02;
            //      if (value & 0x80) P |= 0x80; else P &= 0x7f;
            //}
            checkme(opadd,opadd+2);
            break;
One case statement for each opcode (fortunately the 6502 has less than 256 instructions, and no opcodes are "multiplexed")
From there I manually entered the code I wanted in the translation:
        //      ticks[0x06]=5; instruction[0x06]=asl6502; adrmode[0x06]=zp6502;
        case 0x06: //asl6502
            //      ticks[0x06]=5; instruction[0x06]=asl6502; adrmode[0x06]=zp6502;
            printf("case 0x%04X: //%02X %02X %02X\n",opadd,opcode,rom[opadd+1],rom[opadd+2]);
            printf("    clockticks+=5;\n");
            //void zp6502()
            //{
            //      savepc=gameImage[PC++];
            //}
            temp=rom[opadd+1];
            //void asl6502()
            //{
            //      adrmode[opcode]();
            //      value = gameImage[savepc];
            //      P= (P & 0xfe) | ((value >>7) & 0x01);
            //      value = value << 1;
            //      put6502memory(savepc,value);
            //      if (value) P &= 0xfd; else P |= 0x02;
            //      if (value & 0x80) P |= 0x80; else P &= 0x7f;
            //}
            printf("    value = ReadMemory(0x%02X);\n",temp);
            printf("    P= (P & 0xfe) | ((value >>7) & 0x01);\n");
            printf("    value = value << 1;\n");
            printf("    WriteMemory(0x%02X,value);\n",temp);
            printf("    if (value) P &= 0xfd; else P |= 0x02; ");
            printf("    if (value & 0x80) P |= 0x80; else P &= 0x7f;\n");
            printf("    showme(0x%04X,0x%02X);\n",opadd,opcode);
            checkme(opadd,opadd+2);
            break;
At some point you get tired of typing in instructions and want to try to actually run something. From Steve's emulator and later MAME you can see where the rom data fits in memory for a specific game/machine. Again from the emulator (BTW you have to have the source and be able to re-compile the emulator you are going to write your SBT against) will show you the entry point in the rom to start running, etc. Sure you could find a 6502 data sheet, but you know I really didnt need one.

Unfortunately the 6502 (as with all old micros) is a CISC machine (I am a total RISC convert BTW), so you cant tell by looking at any particular byte in memory if it is the opcode byte or if it supports the opcode which is one or two bytes ahead of it. There has been some recent discussion in the staticrecompilers group on this very topic. One method for figuring out which bytes in the rom are opcodes and which are not is to modify the emulator to tag addresses where instruction (opcode) fetches occur. I call this a hitlist. The other method is a "tree walk" of the rom, we know where the entry point is, look at that opcode, you have to know how many bytes each opcode uses, you need to know where the conditional branches are, and the unconditional branch (the ultimate end of the line instruction). You basically try to recursively follow the possible paths of execution the code can take, building a hitlist as you go. Each branch destination or subroutine call generates more limbs or trees to follow. I actually argue that it is good to use both methods. Start with the tree walk, although it takes a lot of prep time and coding time (since the program has to know so much about the rom's native processor), but once you are ready building the list takes seconds or minutes. It only takes seconds or minutes to modify the emulator to capture a hitlist, but it could take hours or days of playtesting (have to get the high score have to get the extra ship have to die this way and that way and the other way). So either method takes hours or days to complete. Fortunately Asteroids had no funny stuff in it (like jump tables) so the tree walk method proved to be quite successful. Unfortunately I had not prepared for this when I started, I started with a crude playtest method and for a night or two fell into the painful "run the translation until it finds an address that has not been translated, manually add it to the list and recompile and re-run" method. To make a long story short, if you look at the code above you see the checkme() function. I manually went through and added that function to each opcode. I can now start with an empty hitlist, run the translation. The entry point address gets checked against the hitlist and added. The checkme() function as listed above, sends the current address (to get to this point in the translator that address is already in the hitlist) as well as the next address (This is a two byte instruction so the next address to fetch would be at opadd+2), instructions that are not terminal (a unconditional branch is terminal since you dont ever have to fetch and execute the byte following) get a checkme() call. checkme() looks to see if the non-hitlist addresses between opadd and opadd+2 in this case (opadd+1) do not have a hit, if so there is a synchronization problem that must be addressed manually. checkme automatically adds addresses to the hitlist and at the end of the translation the translation program writes the hitlist back to the disk. The next time you run the translation the hitlist gets longer and more accurate. I have another function I use to add branch destinations to the hitlist, with these two functions, and the fact that programs execute in sequential order most of the time (if opadd is in the hitlist when the above case is hit, but opadd+2 was not, by the time we go to look at opadd+2, it will be there and it will get translated on this pass of the translation program, complete unbroken strings will get resolved in a single pass of the translator) Asteroids only takes a few iterations of the translator to completely resolve the hitlist. I dont think playtesting found any addresses, likewise with Asteroids there were no sync problems (I would argue a good commercial CISC compiler should create instruction streams that are intentially difficult to disassemble, basically intentionally creating jumps that land in the middle of instructions, that sort of thing).

So that was long winded, basically now you have your hitlist, you have your rom read in to an array simulating the memory of the source machine. Your translation program is basically:

   for(opadd=ROMBASE;opadd<=ROMEND;opadd++)
   {
       if(hitlist[opadd]) translate(opadd);
   }
...

void translate ( unsigned short opadd )
{
   unsigned char opcode;
   
   opcode=rom[opadd];
   switch(opcode)
   {
       case 0x00:
        printf("  ...
        break;
       case 0x01: //illegal opcode
        printf("  crash(0x%04X);\n",opadd);
        break;
       ...
       
        
   
   }
}

And your resulting translated program is basically:

void execute ( unsigned short opadd )
{
    switch(opadd)
    {
default:
    printf("address %04X not defined\n",opadd);
    abort();

...

case 0x1234: //example unconditional branch
    PCSTART=0x5432;
    return;    

...

case 0x6C7C: //06 0B 2A
    clockticks+=5;
    value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    WriteMemory(0x0B,value);
    if (value) P &= 0xfd; else P |= 0x02;     
    if (value & 0x80) P |= 0x80; else P &= 0x7f;
    showme(0x6C7C,0x06);
case 0x6C7E: //2A 06 0B
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    if (A) P &= 0xfd; else P |= 0x02;     
    if (A & 0x80) P |= 0x80; else P &= 0x7f;
    showme(0x6C7E,0x2A);

...

    }

}

int main ( void )
{
    PCSTART=0x4534; //determined by the translator or known from the processor type:
    while(1)
    {
        execute(PCSTART);
    }

}

This switch statement type of thing is really a slick idea, You use the original roms addresses as entry points you dont have any breaks in code, you simply run through until you need to branch, to branch you set PCSTART to the new PC address, return (or break) out of the switch statement and out of the execute function and then re-enter at the new address. Now with Asteroids and probably many other games you will find this isnt enough. Asteroids had a non maskable interrupt, which basically occured at the screen refresh rate, when you get this interrupt I think it would stop what it was doing and build the data for the vector graphics processor, then return to the main program. I tried to solve this a couple of different ways, ultimately the showme() function which was originally placed there to dump the registers after each instruction for debugging purposes, was a good place to check the clockticks to see if enough had gone by to simulate an NMI. I would save the current PC (showme passes its own opadd), put the stuff on the stack just like the 6502 would, and had a second while(1) execute(PCSTART); loop, except this one had an RTI flag if the RTI flag was set (meaning the return from interrupt instruction had been hit) then I exited out of the loop, the PCSTART variable was already set correctly since I had setup the stack right, and the showme function would return to normal program flow (basically one level of recursion to accomplish this). Again Asteroids is an easy one to translate, no other interrupts or weird things like that to make the translation difficult (the 6502 is easy too).

After a lot of typing (perhaps the skeleton program could have done a lot more, and should have done a lot more, unfortunately by the time you figure out what the skeleton program should have done you have fat fingered in enough stuff to not want to re-do the skeleton and have to retype much of work to date) you end up with something that will run for a while without hitting an un-implemented opcode. You may not even wait that long. Since I started with the mac emulator and since it was easy to rip out the mac specific stuff for video, etc, and leave a command line program, I had it dump the opadd, opcode, and registers after each instruction. And I setup the translation to do the same. Not the most efficient method but it worked. I ran tens of thousands then later hundreds of millions of instruction cycles, and hade a special made program to compare the files and stop at the first difference. Generally it was a improperly implemented instruction. Sometimes it was harder to find, 2000 instructions prior a byte in memory had been set wrong and you would have to figure that out. So what I did was in parallel each call to WriteMemory and ReadMemory was logged, you could then search through the emulator version of the memory accesses and the translated memory accesses and find what instruction should have set what byte when, and fix it. And as expected there were some truly difficult problems to figure out.

Eventually after many iterations of running the emulator against the translator, (mind you no game controls have been implemented at this time, this is simply the boot up and demo mode of the game) you have implemented quite a few of the source processors instructions, and worked out many of the bugs and tricky translation problems. I think I made it to one billion instruction cycles before the files were just so huge (hitting that magic 2gb or 4gb boundary that the generic fopen/fread/fwrite functions dislike) it was time to do something else. And yes, once you get to some big number like 10million cycles, you can setup both programs to only dump from instruction 10million to 20million then quit, etc. Anyway, you are probably ready to work the graphics, then work the keyboard, then let the program start slamming into other unimplemented instructions. My skeleton was built in such a way that instructions that were not implemented yet would place the function crash(0x1234,0x45); in the translated code, where 0x1234 would have been the opadd and 0x45 the opcode. A grep of game.c after each translation would yeild a list of opcodes that needed to be implemented, if the hitlist was right you would or could eventually hit those so it made no sense to playtest waiting for that time, just keep implementing instructions until a grep of crash in game.c was clear. By this time, you are as finished as you can be with the hitlist you are using. You can chose to simply start playtesting or you could maybe do some other tricks like take a working emulator and capture keyboard/button inputs based on instruction cycle and then go into the game and simulate those buttons being pressed (if(instcount==121345) WriteMemory(0x2407,0x80);) at the same time and again compare cycle for cycle with the emulator. Eventually you have to just dump the emulator and move forward. This is where I think my method at least for this Asteroids port, really shined. The code is not optimized, so you can look at the big case statement and see each register being used, the flags, etc. Just like the emulator.

You really have to implement at least the graphics at this point on your target system. I didnt realize until it was too late that the Atari Vector Graphics processor was really that, a processor. It had its own assembly language that had to be parsed and emulated just like any other processor, and in this case the vector graphics was quite time consuming, so much so that the GBA version was painfully slow. Either way you have to get the graphics running on your target (or any target for that matter, start with the PC and go onto your console later) you want to "See" the thing boot and go into the demo, maybe implement a key to add a coin or press the start button.

once you get graphics, then get the buttons working, and start playing the game and think you have a good translation or at least something as good as the emulator. Now it is time to speed it up. If for no other reason to keep the compiler abuse down. The asteroids port was a huge file, compiling for Linux (X86) was pretty fast on my Atlon 2100 but cross compiling for arm using gcc 3.4.x took 10 minutes! Optimization is really where you need to read Grahams HOWTO linked above. He has really made an art form of it (as have others in the staticrecompilers group). I gave him a "working" translator, no optimizations, and he made a work of art out of it. The output code was a fraction of the size of what I was building (we are not sure as of this writing if the output actually runs right...yet). To digress, lets look at what we can do with my code, and some of these are ideas stolen directly from Graham.

First, clockticks and the nmi, there is no real reason once the translation is "working" to call showme() every instruction. What if we only called showme on branch instructions, they happen often enough to get reasonably accurate timing on the nmi that you wouldnt notice it. So you have to figure out which instructions are branches, go through and comment out all the showme calls except for the branches. With my method what falls out of this is that only the instructions or addresses that call showme can be interrupted, knowing this and using the checkme like function for branch destinations I was able to make a list if addresses (another hitlist) that would be the only ones to need a case entry in the execute(PCSTART) switch structure. So you can eliminate or basically let the translator comment out the case statements that wont be used:
case 0x6C7C: //06 0B 2A
    clockticks+=5;
    value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    WriteMemory(0x0B,value);
    if (value) P &= 0xfd; else P |= 0x02;     
    if (value & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7C,0x06);
//case 0x6C7E: //2A 06 0B
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    if (A) P &= 0xfd; else P |= 0x02;     
    if (A & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7E,0x2A);
Ideally you would want to take execution strings and replace the clockticks increments with only one larger one per string instead of all the tiny ones all along (I didnt do this, didnt do the proper tree-walk up front). Looking at the code above you can see since 0x6C7E is not an entrypoint of any kind, meaning these two instructions will always execute back to back, the flags set by the first instruction will always be destroyed by the second instruction, so we can eliminate this code. I think the staticrecompilers guys call this dead code elimination:
case 0x6C7C: //06 0B 2A
    clockticks+=5;
    value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    WriteMemory(0x0B,value);
//    if (value) P &= 0xfd; else P |= 0x02;     
//    if (value & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7C,0x06);
//case 0x6C7E: //2A 06 0B
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    if (A) P &= 0xfd; else P |= 0x02;     
    if (A & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7E,0x2A);
Another neat trick, which should have been more obvious, I used WriteMemory() and ReadMemory() everywhere memory was accessed, this was good for debugging and getting to a working state, bad for performance. To solve that you could for example replace with ram[0x0b]=value; or I just made a #define to do that so that I could undo the define later and debug. Big performance boost. Another one that Graham knew about based on compiler experience was instead of known branches setting PCSTART, leaving the switch statement and re-entering, use a goto:
case 0x6C7C: //06 0B 2A
L_6C7C:
    clockticks+=5;
    value=ram[0x0b]; //value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    ram[0x0b]=value; //WriteMemory(0x0B,value);
//    if (value) P &= 0xfd; else P |= 0x02;     
//    if (value & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7C,0x06);
//case 0x6C7E: //2A 06 0B
//L_6C7E:
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    if (A) P &= 0xfd; else P |= 0x02;     
    if (A & 0x80) P |= 0x80; else P &= 0x7f;
    //showme(0x6C7E,0x2A);

...
    //PCSTART=0X6C7C;
    //return;
    goto L_6C7C;


Another big performance boost.

Those were easy ones. Had I used Steve's version of Neil's code this time around I might have had an easier time of the next neat trick. Many of the 6502 instructions set the N (negative) and Z (zero) flags, you can see this above, bits 0x02 and 0x80 of the P register (the flag register). Steve had gone through Neil's code and replace the stuff above with macros:

case 0x6C7C: //06 0B 2A
L_6C7C:
    clockticks+=5;
    value=ram[0x0b]; //value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    ram[0x0b]=value; //WriteMemory(0x0B,value);
//    DO_NZ(value);     
    //showme(0x6C7C,0x06);
//case 0x6C7E: //2A 06 0B
//L_6C7E:
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    DO_NZ(A)
    //showme(0x6C7E,0x2A);

Then Graham used another neat trick, this one I have started to implement in the roids209 version but I have basically broken the thing, so I have left it incomplete. The neat trick is to simply do this:

case 0x6C7C: //06 0B 2A
L_6C7C:
    clockticks+=5;
    value=ram[0x0b]; //value = ReadMemory(0x0B);
    P= (P & 0xfe) | ((value >>7) & 0x01);
    value = value << 1;
    ram[0x0b]=value; //WriteMemory(0x0B,value);
//    NZ=value;     
    //showme(0x6C7C,0x06);
//case 0x6C7E: //2A 06 0B
//L_6C7E:
    clockticks+=2;
    saveflags=(P & 0x01);
    P= (P & 0xfe) | ((A >>7) & 0x01);
    A = A << 1;
    A |= saveflags;
    NZ=A;
    //showme(0x6C7E,0x2A);

Then only when you need to evaluate the zero or negative flags do you do the if(A) or if(A&0x80), that has to save lots of execution time. This took me a lot more typing and figuring out. Basically it was more than just the NZ flags, I used C=A; or C=value; for the carry flag V=A; for the V flag, etc. When you hit an NMI interrupt you need to (to be faithful) combine the flags back into one register and push that on the stack (in theory just make a copy of all of them since Asteroids does not interrupt the interrupt and you can modify rti to not pop the P reg). Then when you pop it off re-build NZ, C, V, etc. NZ is tricky because the real flag for zero is set on zero, but you are taking the whole byte value NZ and later replacing the if(P&0X02) with if(NZ). I had a hard time with this then realized I needed to reverse the flag, before building a new P to push on the stack I had to say if(NZ) P|=0x02; or basically if(NZ) NZ|=0x02; then P=(NZ&0X82)|(C&0x01)... Then later when you san NZ=P&0x82; to pop it off of the stack if(NZ) still works for the zero test. You also have to "fix" the set Z and clear Z functions to match. Arguably this trick should greatly improve performance but it is a big code change which adds risk IMO. Unfortunately I didnt pull it off right away (and may have given up).

The last obvious trick for the 6502 was again something I stole from Graham, and something that boosted performance some I am sure. The stack pointer is an 8 bit register, the stack is fixed between 0x0100 and 0x0200 in memory, so my code was using a 16 bit variable (you could do this with type casting, in theory the compiler should smooth it all out):
    temp=S++; temp+=0x100; 
    value=ReadMemory(temp);
Graham (and I) replaced this with:
    value=stack[S++];
Just make a 256 byte array! and use it. Forget about adding 0x100 all the time and going from an 8 bit address to 16 bit.

So basically I have working code that has some performance improvements, and I have some broken code that has the neat flag tricks, but because of those tricks I cant go back and compare instruction by instruction against an emulator (easily) so I dont know what to do (this is just a hobby, not a job)...

roids207.tar.gz is the source that will build game.c, game.c compiles into the binary that becomes both version 004 and 005 of Asteroids for the iPod (005 added the hold switch which will pause the game!). roids209.tar.gz is the broken one, enjoy!

I have chosen NOT to distrubute Atari's rom, you will have to find the rom files yourself, then compile and run dorom (dorom.c) this will create rom.h which you will need for both trans.c and game.c. Once you build rom.h you can compile trans.c, running
 trans >game.c 
builds game.c then you combine game.c, vsm.c, and lcd.c (etc) to make your iPod asteroids binary. Then you have to use a util like make_fw to combine your apple firmware, asteroids, and the ipodlinux bootloader, to create a final binary to install on your iPod. To compile game.c for ARM (using gcc) I had to use gcc 3.4.x. I had problems with 2.95.3, YMMV.
Sorry for all of the steps but this is the only way to protect Atari's copyright.

Asteroids Deluxe works the same way, unzip the rom files into the same directory as the source below, grab the ipodlinux loader files, make_fw and loader.bin, create your own apple_sw.bin.

Downloads:
deluxe.003.src.tar.gz (Asteroids Deluxe)
roids207.tar.gz (Asteroids)
roids209.tar.gz (Asteroids not working)

I assume no responsibility for what you do with this code. I hope that you will respect the various authors of this code (Neil Bradley, Steve Green, David Welch, Graham Toal, B.J. Leach, and others).

Neil Bradley: 6502.c and 6502.h, and possibly vsm.c (he certainly has written a DVG processor emulator at least once in the past
Steve Green: gave me his emulator source, from which I learned quite a bit even if none of the code I actually used was written by hime (not really sure).
Graham Toal: watched what he did with my code and learned from it and borrowed many ideas/solutions
Bernard Leach: Instrumental player in the ipodlinux.sf.net group, I assume most of the iPod specific code (lcd.c, etc) is his
David Welch: I wrote skel.c to build trans.c along with gtop.c and gbot.c to build game.c. Modified lcd.c to suit my needs for this program. Adjustments to vsm.c and other files.
Others: At one point I went into MAME to check the registers per instruction, I also figured out some good default settings for the dip switches from MAME.
I dont actually know the authors of all of the files I am using/distributing.

ipodtadwelchtodcom