Developer Update: Data Fixup War Stories

A technical look at a few of the data-related bugs we encountered during Torchlight III’s game development.

This month's Developer Update comes from Guy Somberg, Lead Programmer at Echtra. Sometimes the best possible way to look to the future of a project is to understand its past. While the next Developer Update will take a deeper dive into upcoming updates, ongoing issues, and other community feedback, this blog is for those who enjoy a good technical read and want a closer look at some of our internal development ‘war stories’.

Introduction


In game development, not every problem that you solve involves performance, features, or functionality that the players get to see.  Sometimes, something messy needs to be fixed, and you just have to dive in and fix it so that work can get done.

There is a phrase that we use to talk about this sort of work: “yak shaving”.  Originally a reference to the TV program “Ren & Stimpy”, it now refers to work that appears to be completely unrelated to the end goal, but which you have to accomplish in order to reach it.  For example - “I am trying to build a stone bridge over this creek.  I am shaving this yak so that I can trade the fur to a yarn maker, who in exchange will let me borrow a cart so that I can take it to the quarry to pick up some stones.”  Shaving the yak isn’t necessarily an important part of building that bridge, but you won’t be able to make any progress while the yak still has its coat!

This is a collection of just some of the times that we’ve had yak-shaving problems that needed to be “just fixed”.

0D0D0A


Source control systems are one of the fundamental tools that game developers (and, in fact, pretty much all developers) use.  It is a database containing the entire history of every file that makes up our game - source code, assets, sounds, you name it.

About three years ago, we switched our source control from one system to another.  It doesn’t matter from what to what.  The system we were using was breaking down under our load, so we needed a new one.  We did our homework, examined the alternatives, and made a call.

Now, when you transition source control systems, there are broadly speaking two ways to go about it.  The simpler way is to lock everybody out of source control, take a snapshot of the latest stuff, import it into the new system, tweak it to conform to the new system’s idea of how the universe should work, and then turn the new system on for people.  This has the advantage that it “just works”, but it loses all of the source history from before the changeover.  In these environments there is often a single moment in source history that says “Imported everything.  If you want history before this, go look in the other source control system.”  That’s fine, so long as the other system stays around or you otherwise have access to it, but often the commits from earlier are lost forever.

The more complex way of doing this is to actually import the history from the old system into the new one.  Most source control systems allow you to do this, but it is time-consuming, error-prone, and still requires some manual intervention to conform to the idiosyncrasies of the new system.  Although it is more work up-front, it is invaluable in the end to have your entire source history available.

We opted for the history import, which - at least on the surface - seemed to go just fine.  We saw the history, we saw the files, and we were able to poke around and verify that everything looked right.  Some files had weird spacing issues, which didn’t seem like that big a deal.

But then we tried to compile, and it all came crumbling down.  The Visual Studio compiler complained about “Mac line endings” and refused to compile anything.

What?!  Why would that be?

A bit of background here: when a computer wants to represent a character, it has to select an encoding.  The most common encoding in use today is called UTF-8, which can encode any English-language character, common punctuation mark, or control code into a single byte of data.  (Using multiple bytes, you can encode text in just about any language, but that’s another discussion.)

Two of these control codes are the Carriage Return (CR) and the Line Feed (LF) characters, which hearken back to the days when computers were hooked up to automated typewriters rather than screens.  In those days, you would tell the printer carriage to go back to its home column by sending it a CR code, and you would have the paper roll to the next line by sending it an LF code.  Thus, if you wanted to start typing at the beginning of a new line, you would send the sequence CR LF.

When the switch to fancy graphical displays occurred, this CR LF convention remained for backward compatibility.  However, developers of new systems - like the sparkly new Apple Macintosh computer and the Unix system at Bell Labs - weren’t hindered by backward compatibility and were free to make different choices.

It turns out that the three most common systems in the world today all made different choices: DOS used CR LF, Macintosh used CR, and Unix used LF.  Windows inherited its line endings from DOS, and MacOS now uses LF (the same as Unix).
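
In escape-sequence and hexadecimal terms, the three conventions look like this:

```
DOS / Windows:  CR LF  ("\r\n")  ->  0D 0A
Classic Mac:    CR     ("\r")    ->  0D
Unix / macOS:   LF     ("\n")    ->  0A
```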

Over time, the differences sorted themselves out.  Software is generally able to operate in “text mode” and provide the user with whatever line endings they need for their system to render it correctly.  The details of these differences leak through every so often, but usually aren’t a big deal.

All of this background flashed through our minds when we saw the error about Mac line endings.  What was it talking about?  We develop on Windows, so all of the line endings should have been Windows (CR LF) line endings - or, at the very least, a combination of Windows and Unix (LF) line endings.  Where were these lone CR characters coming from?

And then we remembered the weird spacing issues - all of our source code appeared to be double-spaced.  Where did those extra blank lines come from?

This is where somebody had the idea to look at the file in a hex editor - a tool which allows us to see the binary representation of the text files by displaying each byte’s value in hexadecimal.  Ordinarily, on a file with Windows line endings, you expect to see a line of text, then a CR (13, or 0D in hexadecimal representation) and an LF (10, or 0A).  For some reason, on the broken lines, we saw a CR (0D), then another CR (0D), and then an LF (0A), giving us 0D0D0A.
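
For example, using the made-up line “Hello”, a correct Windows line ending and one of our broken line endings look like this in a hex editor:

```
48 65 6C 6C 6F 0D 0A      H e l l o CR LF      (correct: 0D0A)
48 65 6C 6C 6F 0D 0D 0A   H e l l o CR CR LF   (broken:  0D0D0A)
```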

Somehow, during the conversion process from one source control program to another, the conversion program decided that the files had Unix line endings, then went through and did a blind search-and-replace of every LF with CR LF - even when the LF already had a CR in front of it!  That explained everything.  Our editor was perfectly happy to treat the lone CR as one line break and the CR LF as another, which was why our code appeared to be double-spaced.  Contrariwise, the Visual Studio compiler was happy to interpret the CR LF combo as a newline, but errored out on the preceding lone CR.
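
We never saw the converter’s source code, but the bug presumably amounted to something like this naive replacement (a hypothetical reconstruction, not the actual conversion code):

```cpp
#include <string>

// Hypothetical sketch of the converter's bug: assume Unix line endings
// and blindly insert a CR before every LF - even when the LF already
// has one, which turns "\r\n" into "\r\r\n" (0D0D0A).
std::string ToWindowsLineEndings(const std::string& text)
{
    std::string result;
    result.reserve(text.size());
    for (char c : text)
    {
        if (c == '\n')
            result += '\r';  // no check for a preceding '\r'!
        result += c;
    }
    return result;
}
```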

I fixed this by writing a little program in C++.  It would iterate over our source code directory, open every text file, find patterns of 0D0D0A, and replace them with 0D0A.  We don’t expect to change source control systems again, so the code for this tool is lost to the sands of time.  (Ed. - Or, so we thought!  A drive containing the source for this program was discovered after this article was written, so we have uploaded the code to our repository for posterity.)
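
The recovered tool is in our repository, but a minimal sketch of the same approach might look like this (using C++17’s std::filesystem; the root path and extension filter here are illustrative):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Rewrite one file, dropping the first CR of every CR CR LF (0D 0D 0A)
// triple so that only the correct CR LF (0D 0A) remains.
void FixFile(const fs::path& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    in.close();

    std::string fixed;
    fixed.reserve(data.size());
    for (size_t i = 0; i < data.size(); ++i)
    {
        if (data[i] == '\r' && i + 2 < data.size() &&
            data[i + 1] == '\r' && data[i + 2] == '\n')
            continue;  // skip the extra CR
        fixed += data[i];
    }

    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(fixed.data(), static_cast<std::streamsize>(fixed.size()));
}

int main()
{
    // Walk the source tree and fix every text file we care about.
    for (const auto& entry : fs::recursive_directory_iterator("Source"))
    {
        const auto ext = entry.path().extension();
        if (entry.is_regular_file() && (ext == ".cpp" || ext == ".h"))
            FixFile(entry.path());
    }
}
```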

There were only two or three of us who worked on this particular issue, but you can get any of us to twitch a little just by saying “oh doh doa”.

Octothorpe Fixer


A couple of years ago our sound designer and our composer took a trip out to Bratislava, Slovakia to record a live orchestra for some of our music.  It was an awesome trip (so I’m told), and they got a lot done over the few days that they were there.

One of the outputs of this trip was a suite of content that we call “vzory” - Slovak for “patterns”.  These are small orchestral chunks of music that can be combined in myriad ways to create new music, and are recorded in various combinations of keys and notes.  The end result is that we have a particular pattern in G, in G#, in F, in F#, etc.

Our composer did the natural thing - he spent a bunch of time cutting and organizing all of this content, doled it out into folders and files matching the note that they were recorded at, and then imported the whole suite into the audio tool that we use, FMOD Studio.  FMOD is hooked up to our source control system, and it happily added all of the new files and then checked them in.

So far so good.  Until people started to get mysterious warnings about filenames when they synced the latest code and data through our source control system.  They were just warnings - they weren’t preventing anybody from working - but it was definitely something that we didn’t want to stick around.

The vzory tracks were tracked down as the culprit.  It turns out that our source control system doesn’t like it if you check in files with an octothorpe (‘#’, sometimes called a pound sign, hash mark, hash tag, number sign, or various other things) in the filename.  It will accept them, but complain loudly.  Our composer had named the directory and matching files for the vzory tracks in the key of A sharp with the name “A#” - naturally!  (The other sharp keys were set up this way as well.)

The source control system was most displeased with this choice.

Renaming the files wasn’t enough, because FMOD keeps track of file and directory metadata in XML files - each one with a GUID (a sequence of letters, numbers, and dashes) as the filename - and those XML files still referenced the old ‘#’ names.

Once again, code to the rescue.  This time it was a program written in C# (ironically) that would iterate over all of the files and subdirectories in the given path, find ones with an octothorpe in the name, and rename them to replace the ‘#’ with the word ‘sharp’.  So, ‘A#’ became ‘Asharp’.  Then it would iterate over the XML files in the path, find any that had an octothorpe in the file contents (which were therefore metadata about the files or directories that had been renamed), and replace the ‘#’ in that line with the word ‘sharp’.
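
The original tool was written in C#, but the approach is simple enough to sketch in C++ for consistency with the other examples here (the root path is illustrative, and the real tool also had to cope with FMOD’s GUID-named metadata files):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Replace every '#' in a string with the word "sharp".
std::string DeSharp(std::string s)
{
    size_t pos;
    while ((pos = s.find('#')) != std::string::npos)
        s.replace(pos, 1, "sharp");
    return s;
}

int main()
{
    const fs::path root = "FMODProject";  // illustrative path

    // Pass 1: rename files and directories with a '#' in the name,
    // deepest paths first so children are renamed before their parents.
    std::vector<fs::path> toRename;
    for (const auto& entry : fs::recursive_directory_iterator(root))
        if (entry.path().filename().string().find('#') != std::string::npos)
            toRename.push_back(entry.path());
    for (auto it = toRename.rbegin(); it != toRename.rend(); ++it)
        fs::rename(*it, it->parent_path() / DeSharp(it->filename().string()));

    // Pass 2: patch the XML metadata that still refers to the old names.
    for (const auto& entry : fs::recursive_directory_iterator(root))
    {
        if (!entry.is_regular_file() || entry.path().extension() != ".xml")
            continue;
        std::ifstream in(entry.path(), std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
        in.close();
        std::ofstream out(entry.path(), std::ios::binary | std::ios::trunc);
        out << DeSharp(data);
    }
}
```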

Other than telling our folks “don’t do that”, there’s not much that we had to do to prevent this from happening again.  This time we kept the source code, so if we make that particular mistake again then the tool to fix it is ready at hand.

POFixer


Localization and internationalization are important parts of any game project.  We use the Unreal engine, which has a suite of localization tools built-in.  By using a particular data structure in our data files, Unreal can find all of the localized lines in the game.  We can then export them into a standardized format called a “portable object” (.po) file, used by the GNU gettext tools, among others.
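
For those who haven’t seen one, a .po file is plain text: each entry pairs a source string with its translation, along with comments about where it came from.  A made-up entry, in roughly the shape of an Unreal export, looks like this:

```
#. Key: PlayButton
#. SourceLocation: /Game/UI/MainMenu
msgctxt "MainMenu,PlayButton"
msgid "Play"
msgstr "Jugar"
```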

This is a format that our translators have tools to deal with.  They grab the files, translate the lines, and send them back.  We can then import them to a particular locale and then Unreal will render the text.  All very neat, so long as you color within the lines and follow the way that Unreal expects you to work.

Naturally, we have built some of our own stuff which lives inside of Unreal’s systems and plays nicely with them, but is sufficiently “off to the side” that it is invisible to some of Unreal’s other systems.  One of those other systems is the localized string system, which didn’t see any of our fancy assets.

We wrote a tool that makes them visible, and called it a day.  Our first big batch of localization went out to the translators.  We went to import it...only to find that none of our strings got imported!

What happened?

Unreal allows you to identify each localized string by a pair of text strings: a category and an entry within that category.  If you don’t provide them, it will generate them for you.  It turns out that the tool that we had written to make our assets visible to the translators generated a new category and entry for every localized text string every time it was run, which meant that every text line would get a different identity every time we ran an import or an export.
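
In Unreal’s terms, the category and entry are the text’s namespace and key.  In C++ they can be pinned down explicitly with the engine’s NSLOCTEXT macro; text created without them gets an auto-generated identity, which is exactly what our tool was doing on every run:

```cpp
#include "CoreMinimal.h"

// Namespace ("category") and key ("entry") supplied explicitly - this
// string keeps the same identity across every export and import.
const FText PlayLabel = NSLOCTEXT("MainMenu", "PlayButton", "Play");
```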

Oh, dear.  We fixed the underlying problem and made the category/entry pairs consistent across runs, but we had this massive drop of strings in all of the languages that was incompatible with the fixed-up data!  We had to figure out how to run a one-time fixup on these strings to make them match.

Fortunately, each string came with a lot of metadata about its context and provenance.  Much of this metadata did not change, or at least changed in a predictable fashion.  This metadata turned out to be enough that we could compare an imported line and a newly-exported line and match up the strings.

As before, writing some code was the answer here.  We wrote a program (C++ again) to read in the translated file (containing the old, incorrect category/entry pairs) and a newly-exported English-language file (containing the new, correct category/entry pairs), match up the metadata, and then write out a fixed version of the file containing the translated text with the corrected category/entry pairs.
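
A heavily simplified sketch of that matching step (the field names and the single metadata string here are illustrative stand-ins, not the real tool’s data model):

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a .po entry: its category/entry identity,
// plus the metadata used for matching and the text itself.
struct PoEntry
{
    std::string Category;
    std::string Entry;
    std::string Metadata;  // source location, comments, etc.
    std::string Text;      // source or translated text
};

// Give every translated entry the category/entry pair from the
// newly-exported English entry whose metadata matches it.
std::vector<PoEntry> FixUp(const std::vector<PoEntry>& translated,
                           const std::vector<PoEntry>& freshExport)
{
    std::map<std::string, const PoEntry*> byMetadata;
    for (const PoEntry& entry : freshExport)
        byMetadata[entry.Metadata] = &entry;

    std::vector<PoEntry> fixed;
    for (PoEntry entry : translated)  // copy, then correct the identity
    {
        if (auto it = byMetadata.find(entry.Metadata); it != byMetadata.end())
        {
            entry.Category = it->second->Category;  // corrected pair,
            entry.Entry = it->second->Entry;        // translation kept
        }
        fixed.push_back(entry);
    }
    return fixed;
}
```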

Here is one situation where simply fixing up the data was insufficient.  We needed to solve the underlying problem first before we could write the tool to fix the data.

Conclusion


These problems all had a common theme: through some sequence of events - human error or machine error - a bunch of important files appeared that were all broken in some way.  Ultimately, to the people who are working with the data it doesn’t really matter why any of these things happened.  They just want to take their broken files and fix them.  It is always a worthwhile endeavor to figure out a root cause and prevent an issue from occurring again, but sometimes you just need to get to work.  All of the postmortem analysis and preventative work in the world won’t help people get their jobs done with the broken files that they already have.

The examples I have talked about here were all important work that needed to get done, but all three of those programs got run exactly once.  It turns out that many of the systems that we build are complex, and these sorts of issues crop up as a normal part of development as we discover some of the edge cases.

Sometimes, you need to write tools that you run exactly once, and that is not a problem - you just need to grit your teeth and shave that yak.

- Guy Somberg