The Hardest Bug

07 Nov 2009
Tags: rlvm

This article is long. It covers a bug that took weeks to understand what the problem was and is easily the hardest bug I’ve ever had to track down.

Background: The stage

rlvm is my open source reimplementation of RealLive, a commercial Japanese game engine and scripting system which is used for some fairly niche games. RealLive is a fast Turing-complete bytecode interpreter with a set of standard functions for text, graphics, sound, music and animation. So a programmer would write the game in some unknown scripting language and the game would be compiled into a SEEN.TXT, a binary archive format that contains compiled bytecode.

Because several popular Japanese anime series have been written based off of said games, there’s been interest in the West about playing them. This has lead to a man named named Haeleth documenting the inner workings of RealLive and he released an assembler/disassembler for the bytecode format called rldev, which rlvm would have been impossible without. He’s also made a translation tool named rlBabel which I’ll get to in a bit.

CLANNAD is one such game. rlvm could play the Japanese version of CLANNAD, but now it was being translated in a wiki. Scripts were disassembled with rldev, which separated the code into a disassembled bytecode file and a string resource file. The string resource files are uploaded to the wiki and translated. The translated string resource files are fed back to rldev to compile a version translated into English.

These games are heavily text based (the genre is called visual novels for a reason) and the bytecode format reflects this. Any quoted freeform text is allowed and is simply printed to screen. There’s an opcode to print out a piece of memory (strout() and intout()) for the uncommon case where a piece of text was built up programmatically in a buffer, but this is rarely used in real games. The bytecode therefore needs to be tokenized into text, commands, debugging symbols, and the raw ASTs which are how math expressions are stored.

The problem

I announced being able to play translated CLANNAD files in version 0.06 of rlvm and everything seemed to work. The translation project then issued a new file with debugging symbols stripped, which shaved several megabytes off the SEEN.TXT file. The problem: It didn’t work in rlvm. The game appeared to work fine as usual, except that it never displayed any text. From the same source code, a SEEN.TXT file compiled with debugging symbols worked, and the stripped SEEN.TXT file didn’t. The game appeared to work, but no text was displayed on screen.

No text in the Text box

rlBabel: How these games are translated

Haeleth is responsible for pretty much all the tools to translate a game from Japanese into some other language. Japanese text is laid out monospace, contains no spaces between words and, with some exceptions, can be line-broken anywhere, including in the middle of a word. To get around this, Haeleth wrote an additional library: rlBabel, meant to add Western linebreaking rules to a Japanese game engine and added support for rlBabel to his compiler suite.

When you compile a game with rldev and rlBabel turned on, instead of inserting:

"Text to output"

which goes through the raw, static text interface, it does the equivalent of the following:

strout("")   {- Manually call the text system to display a window -}
strcpy(buffer, "Text to output")
Call a DLL with buffer which appends it to an internal buffer
gosub displayText

...

displayText:   {- One copy per translation unit -}      
  loop forever:
    retVal = Call a DLL asking about what the bytecode should do.
    if retVal is New Line
      <cause a new line to be printed>
    else if retVal is Print Character
      <prints the current character>
      <manually moves the text insertion point to where the DLL wants it>
    else if retVal is End of String
      gosub_return

So half of rlBabel is some headers that get compiled into your SEEN.TXT, and the other half is a big DLL which does font metrics, text tokenization and feeds the data back to the bytecode side. rlvm has the DLL code compiled into it, slightly modified. A previously translated game can be played start to finish in rlvm.

If you think this is ridiculously over-engineered, attempts to do it all in bytecode are worse, not to mention less accurate. Also, it’s almost unusably slow.

Debugging symbols obscure the error

After weeks of working on debugging this, with quite a large number of false starts and dead ends, I figured several things out:

  • My in memory representation of the bytecode was equivalent between the two input files, sans line markers, which only exist to say “Line XXX in the source starts here.” Except for line markers, the files were being parsed identically.
  • My implementation of rlBabel wasn’t to blame. The DLL interface was being called properly, but was mysteriously always passed the empty string as its input buffer. The bytecode would ask the DLL what to do, and my code would sensibly return getcEndOfString.
  • I then printed out each opcode as it was executed. strcpy() was never being called in the SEEN.TXT file that had debugging symbols stripped.

The behaviour in game made sense. A text window would be displayed by the manual printing of an empty string, and nothing else was printed because rlBabel’s input buffer was empty. The rest of the game appeared to work because the only thing broken was something to do with getting data into rlBabel’s input buffer.

In my design, the implementation of each bytecode was responsible for incrementing the instruction pointer. The bug: The instruction pointer was being advanced by the function that caused text to be printed to the screen; it was not incremented after a textout element was finished. This appeared to work because the textout implementation just called that function.

But strout() also called this same low level function. Then strout() incremented the instruction pointer again because it was a function element. The instruction pointer was advanced twice and skipped the call to strcpy().

Wait, but why did things work with debug symbols? Let’s take a look at the sequence of bytecode elements that one text line and one pause command compile to:

# Source With Debug Without Debug
1 "Line of Text" Line marker
2 Kidoku marker Kidoku marker
3 strout strout
4 Line marker
3 strcpy strcpy
4 Line marker
5 Goto Goto
6 Goto Goto
7 Line marker
8 Goto Goto
9 Goto Goto
10 pause Line marker
11 pause pause

Ignore the gotos; they are an implementation detail of rlBabel. In the debug version, when strout() finishes, the instruction pointer is incremented by two and is placed on strcpy(). The skipped bytecode is a line marker which doesn’t matter. In the version stripped of line markers, when strout() finishes, it jumps over the strcpy() that loads text into rlBabel’s buffer and jumps straight into rlBabel’s implementation.

The fix to this was a two line patch.

Fixed!