Tuesday, January 20, 2009

Tilton's Law: Solve the Failure First

The team was at my throat.

"Just use the new search!," they bellowed.

The mission critical, project saving, do or die demo to upper management was eight hours away and we had not even begun the always dicey process of moving the software from the development system to one within reach of the Demomeister, and I was trying to find out why the old search was so slow.

"Soon," I replied.

We had a new search I was told was a screamer but I continued poking around putting in metrics trying to figure out why the old search was so slow. Had we not been a virtual remote telecommuting team I would not have lived to tell this tale but we were so they had no choice and I reassured then that "soon" meant ten minutes and they shut up.

Why was I still trying to understand the perplexing sloth of the old when a whole new replacement module was available and working fine and pretty much the demo on which all our jobs and a cool project depended was coming on like freight train?

Tilton's Law: Solve the failure first.

Early on we learned the other side of that coin: Solve the first problem. The commonality is...no, let's do the war story first, war stories are more fun than preaching.

Back we go a quarter of a century to my first contract with a client who would become my sole recurring client for the next decade. I was being hired to take over maintenance of an application whose author had been one of the first to die of AIDS. I was reminded of the whole business by a conversation with another developer recently about the nature of working on OPC. Other People's Code.

In my IT career I have worked always at the poles of software development, either writing new code or performing massive overhauls of OPC, never that relaxed zone between in which one simply maintains and extends in small ways a long-lived system. The second pole (OPC overhauls) always seemed to me an intimate one-way encounter with some anonymous predecessor, an encounter usually involving me roundly and steadily cursing them out. You can imagine then how eerie it was working on this system from this predecessor who was not so anonymous this time, especially when I learned that the poor guy was in bad shape during one stint but needed the money and so worked on the code I was now working on even as his fate rose up to meet him (this well before the days of the cocktails of today that make ones fate less certain). This guy I do not remember cursing out so much.

But I digress. Our lesson today is how to piss off your coworkers by insisting on solving a failure first, by which I mean even if you do decide to punt on X make sure you understand how X failed. I am not alone in this. In 2001 the movie when the crew determines that the unit Hal said was no good was fine he says fine let's put it back in and let it fail. Sure, he was really looking for a way to kill the crew but we learned in 2010 that Hal was just a computer system and I think the bit about putting the supposedly OK/not OK system back in to see if it failed was one of Hal's systems working nominally in accordance with Tilton's Law: we need to understand broken things.

[Aside: a good example of the Hell of Not Solving Failure.]

And now at long last, my unsolved failure. My predecessor's, actually. The application was a securities database with a nightly feed of data applied to the cumulative DB by a batch program. This is late 80s, primitive stuff. A security could have three IDs because three groups were tracking securities and each had their own ID system. We had tens of thousands of records in our VAX/VMS RMS file, and a separate RMS key for each of the three possible IDs. So far so yawn. Here comes the fun part.

Two of the IDs were populated all the time. The other one was populated five percent of the time. Big deal, right? Right, very big deal, the poster boy for Solve the Failure First. What happened was this quesswork reconstruction:

My predecessor Paul (I picked "Paul" because it easier to type than predecessor) had a problem. His program ran an initial test load of a hundred securities in a few seconds. Fine. Everything looked good. So then he ran it against a full daily feed, which would include news of every security traded that day so it would be -- OK, I confess I completely forget even the order of magnitude, let's say tens of thousands and declare up front that that is idiotic and I am sorry, but here is what happened: the damn thing ran forever. There probably was no immediate specific great mystery because Paul probably had the program printing something (a count, the last ID recorded, something) right to his VT-100 console as it went and he could see that the program had started out zooming along but then gradually got slower and slower until just adding one security to the database (and this is just good old ISAM, mind you) took... wait for it... twenty seconds. Oh. My. God. What on earth is happening?

Paul got a clue. Every once in a while two records were written out bang-bang, as fast as at the start. Dig dig dig puzzle puzzle...ah, there it is. Any record for which we have all three IDs is written out in nothing flat. Any record (you know, the ninety-five percent) with just two will (by the end of the run) be written out three per minute, 180/hour, or 1000/fuggedaboutit.

Paul realized what was going on. The ISAM file system had no problem storing data with duplicate keys, which was a good thing because Paul was storing a whole lot of data with one key 95% the same: spaces. Poor ISAM it seemed was chugging thru all the duplicates looking for the last one after which it would record the latest duplicate. And apparently it took twenty seconds back then to walk (effectively) the entire index of a hundred-thousand record file.

Now the good news is that we would never need to look something up using spaces as the key value sought, so....what can we do? Paul was no slouch. He popped open the RMS reference manual and to his delight discovered he was not the first to pass this way and gleefully added the option "NULL_VALUE=SPACES" (translated: "if the value is spaces, Just Don't Index this record on this key") to the key definitions in the file definition script he was using to initialize the file and recreated the file and re-ran the program from scratch.

The change did not help. At all. I think we all know that feeling, as visceral as a dropping elevator.

There it was, the option explicitly intended to solve the problem he had explicitly encountered, and it did not change a thing. Impossible. But this happens to us programmers all the time. We know what to do. Compile the damn code, because we made the edit change but forgot to compile. Or link. Or, in Lisp, to zap the faultily specialized method. Or something.

So Paul edited the definition again and checked that NULL_VALUE = SPACES was the right syntax and right spelling and on the right key -- to hell with that, he put it on all three damn keys -- and he saved it and checked the date and created the file again and ran his program again and you know it did not run any faster or I would not be telling this story.

OK, time to get serious. Or if he was good he did this without all the huffing and puffing of the preceding paragraph. He Just Typed In "rms/analyze sdb.dat". And RMS looked at the file itself (not the script used to create it) and confirmed that "NULL_VALUE = SPACES" was operative for all indexes.

Momma don't let your kids grow up to be programmers.

What comes next is hard to convey. I can tell you but if you have not worked on this code or (we will learn) run this batch application it is hard to convey how much blood, sweat, tears, CPU time, and delayed nightly batch closes for how many years resulted from Paul's not first solving the failure of NULL_VALUES=YES.

Well, maybe this is a fair glimpse of the enormity that followed: the problem got sorted out only because the head of operations and I got to talking one day and he pretty much ended up down on one knee begging me to find some way to eliminate the two-hour merge step that held up the nightly close every night. I had never heard this before, though I had been overhauling the system for months.

"It just sits there for two hours," he groaned. "It kills us every night. Please, if you can, please, do something to make this go away."

Whoa. I had inherited this system and been asked to enhance it but no one had said a word about this. The code was far and away the best OPC I had ever dealt with so everything got the benefit of the doubt, including the (soon-to-be explained) two hour merge. As in, if it is there, it must be there for a good reason. What was not there was The Story of the Unsolved Failure of NULL_VALUE=SPACES, but even if it had been I would have taken that at face value, too, because the NULL_VALUE option was unknown to me. But enough of this flash forward, let's get back to poor Paul.

NULL_VALUE was not working as it should. Software is like that. Good programmers do not let bad software stop them. Plan B. A rule is born: Thou shalt not write new securities to the securities database where the massive duplicates will make each write take twenty seconds. Paul decides to write them to a second file initialized empty on each run. Since we only got dozens of new securities in one batch, that file would never have the massive count of duplicates and writes would be lightning fast. Then we just do a sort/merge at the end of the batch to combine the new securities in with the old. Oops. "Just."

The funny "you can run but you cannot hide" moral within the moral being that I did the calculations one day and worked out that twenty seconds times the usual number of new securities in a day was exactly as long as the sort/merge that was just killing the folks down in operations. And I bet Paul realized that but only after writing all the crazy code he had to write to work with two files at once as if there were only one file and at that point he just gave up and moved the thing into production. Speaking of crazy code...

You should have seen it. Looking back I cannot recall why it should have been so hard, but I did overhaul that code and I was forever tripping over it. The idea is simple. To look up a security to see if we already have it, first look in the real DB and if it is not there look in the daily "new stuff" DB and if it is not there, ah, it is new. If it is, update it. Just remember to update the right file, because we can get data from two sources about the same new security.

Piece of cake, right? A bottleneck function for all reads and updates... anyway, it seemed like the issue was always getting underfoot as I worked, and just looking at the code one saw again and again this check here/there code, and both Paul and I were the kind of engineers always on the lookout for ways to make code non-redundant. I would think my memory was faulty but I also remember eliminating Paul's Plan B after solving his failure first and that was no picnic. It just permeated the application.

So what was the first failure and how did it get solved? First, I had noticed the issue myself while using Datatrieve to add a record to the securities DB for test purposes. I hit enter and thought I had crashed the system because it went away and never came back and like every egomaniacal programmer out there I always assumed that whenever a system stopped responding the last thing I had done must have broken it so there I sat in dread for twenty seconds until the system finally responds. Wow. Twenty seconds? And then I guess I added a record specifying all three keys and it responded instantly.

But this idea of null values not being recorded in an index was new to me, and we did not have the Internet back then where I could just ask the ether what was going on so it was only a coincidence that just after the guy in operations had begged me for a fix that I was visiting with the lads from a prior contract and I moaned that RMS sucked because it could not handle files with hundreds of thousands of records and they laughed at me and said they were handling millions with RMS.

I can actually remember the look on my face, a neat trick when you think on it.

I haul ass back to work and pull out the RMS reference manual and I can tell you that dead trees aside there is one good thing about paper documentation: right above the entry for NULL_VALUES close enough to catch my eyes was the entry for NULL_KEYS.

Yep. You need to specify both. Paul had specifed NULL_VALUE=SPACES. He had not specified NULL_KEYS=YES. The default for NULL_KEYS? Guess.

I kinda wretch inside even now thinking about the astonishing amount of money, work, debugging, and delayed batches that followed from one simple failure to understand one broken thing.

The metalesson shared with "Solve the First Problem"? In programming, never deal with the unknown. This game is hard enough.


Epilogue

The punchline is that I never solved the first failure from Scene I of this tragedy. As my father used to say, "Do as I say, not as I do." We did have a deadline, and I did narrow down the location of the problem in a way that reassured me somewhat that it would not jump up to bite the new code in the rear end. And even in its breech the law is confirmed: we do need to address the underlying problem which I have some confidence I now understand because it still presents problems for the software but it will go away only when bigger problems are solved and they are much bigger so I am keeping my sights set on them.










6 comments:

Simon Michael said...

Classic. :) Thank you for the inspiration.

Slobodan Blazeski said...

Very cool. We want more war stories.

Jon said...

Very nice story. Incredibly well written. 100% with you on the principle.

Anonymous said...

Now THAT is truely wisdom.

I too would enjoy more war stories.

Steev

Anonymous said...
This comment has been removed by a blog administrator.
Kenny Tilton said...

heh-heh, looks like you missed the point: the workaround was Hell to live with, not something someone does as a trick. Meanwhile, hunh, I know I mentioned it somewhere, but I do not see it in this blog piece: I got the gig because Paul had died of AIDs. In fact, he had worked on the code even after he had fallen ill, needing the money. I always get to know (for some values of "know") The Original Developer while working on Other People's Code, so working on this code was a bit sobering.