Monday, March 10, 2008

Tilton's Law: Solve the First Problem

This was such a weird project. Scheduled for five days altogether. My friend from the clinical drug trial venture was also a tech recruiter who got me about half my tech jobs over the years and this one was a real throwaway.

What we had was a mid-80s start-up in the educational software game producing exactly the kind of mind-numbing drill and practice software that was supposed to revolutionize education because Look, Ma! We used computers!

Now they were stuck on some software problem and needed help fast. Their stack was Tandy, Cobol, and some micro database package. My skills were Apple, Cobol, and ISAM and in those days that was a deerskin glove fit so off I went for a mutual look-see.

I was on the beach, why not?

The next morning I am walking up to an apartment building where this enterprise had wedged itself into what was meant to be doctor's offices. Inside I sit down with the top guy in his office and the entire company joins us.

The staff unleashes a thirty-minute nightmare tale of software crashes, dysfunctions, anomalies, and disrepair as each person takes turns reciting some utterly bizarre malfunction of the application, all with the database software as the likely culprit. It was a tag-team misery report, a through-the-looking-glass panoply of software non-determinism. It was wonderful.

A half dozen times I formulated "Explanatory Guess X" only to hear in the speaker's next sentence that they had thought it might be X, but no luck. I mean it was really wonderful and then finally it ended. My head was spinning.

"Have you worked with the Tandy OS?" the manager asked.
"No."
"Cobol?"
"Yes, but it does not sound like Cobol is your problem."
"No. I don't suppose you have worked with this DBMS?"
"No."
A pause.

"Can you help us?" See straw. Clutch.

I have no idea what to tell them.

"Is the DBMS any good?", I recover enough to ask.
"I checked it out pretty well. It got great reviews, it is supposed to be the best."
I look down at my shoes.

The contract was for five days. The longest any single glitch had stopped me was for five days. Do the arithmetic.

"Yes," I say.

It took seven. They paid up front for the first five, never paid for the last two probably because they did not have it or maybe because of the way things went. You'll see. And I am surprised it came to seven days, I only remember one or two. I never ran their software once and I do not remember even touching a computer. Here is what happened.

After signing on I took home the manuals for their DBMS and a listing of their schema definition. It took maybe a day to decide that everything looked right. The next day I ask Tom the programmer how hard it would be to just initialize an empty database and start over entering the data.

"Easy", says Tom.

Welcome to Tilton's Law: Solve the First Problem. They had described to me twenty distinct failures and that was too many for me, I am not smart like you guys, I cannot just figure these things out in the shower.

I wanted to turn the software off and turn it back on with a clean slate and see what went wrong first and stop right there. I just wanted to see what went wrong first and fix that. I suspect that needs no explanation, but what am I doing up on this soapbox if I am not going to explain these things?

Here goes. Once upon a time my sleazebag ward politician buddy and I were cruising the singles bars back when they had such things and he got nicely eviscerated by a woman we were chatting up. My buddy had said something cynical and she had challenged him on it.

"Oh, I have compromised my principles a few times," he conceded with a sly grin.

"You can only compromise your principles once," she replied. "After that you don't have any."

Software is the same. This stuff is hard enough to get right when things are working nominally, but once they go wrong we no longer have a system that even should work.

Back on the project, the next day I get a call.

"Bad news," Tom says. Uh-oh.

"What happened?"
"Same thing. Mary was entering the 118th record and the program crashed."

I pretty much fell out of my chair. Somewhere in the thirty minute firestorm of issues I had heard the number 118.

"118 sounds familiar."
"Yep," Tom moaned inconsolably. "That's what happened before. Sorry, no difference."
I was doing cartwheels.

"Tom, how hard would it be to write a program to just write out a couple hundred records, just put in dummy data, 1-2-3-4-5...?"
"That would be easy."
"Awesome, do that and let's see what happens in batch mode," says me.
"OK."
"And reinitialize the DB first, OK?"
"OK."

The next day I hear from Tom. Sounds like he is calling from the morgue.

"Bad news, Kenny."
Oh, no. It worked.

"What happened?"
"Same thing. The program wrote out 118 records and crashed. Sorry, Kenny."
Oh, yeah, I just hate easily reproducible errors. Not!

"Listen, Tom, let's try making the buffer allocation bigger."
"OK."

The next day, "Bad news. Same thing."
I am icing the champagne; this is one solid, reproducible bug. But what about the others?
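
The batch test that nailed this down is easy to picture in modern terms. What follows is a minimal sketch in Python (Tom's actual program would have been Cobol against the real DBMS), and `FlakyStore`, `first_failure`, and the failure count passed in are hypothetical stand-ins for illustration, not anything from the project. It starts from a clean store, writes numbered dummy records, and stops at the first failure.

```python
class FlakyStore:
    """Hypothetical stand-in for the DBMS: the `fail_at`-th write crashes."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.records = []

    def write(self, record):
        # Once the store has reached the failure point it stays broken.
        if len(self.records) + 1 >= self.fail_at:
            raise RuntimeError("store crashed")
        self.records.append(record)


def first_failure(make_store, total=200):
    """Write dummy records 1, 2, 3, ... into a freshly created store.

    Returns (n, error) for the first write that fails, or (None, None)
    if all `total` writes succeed. Nothing is attempted after a failure.
    """
    store = make_store()  # always start from a clean slate
    for n in range(1, total + 1):
        try:
            store.write({"id": n, "payload": f"dummy-{n}"})
        except Exception as exc:
            return n, exc
    return None, None


if __name__ == "__main__":
    # Two runs from scratch: a reproducible bug fails at the same record.
    for attempt in (1, 2):
        n, err = first_failure(lambda: FlakyStore(fail_at=118))
        print(f"attempt {attempt}: first failure at record {n}: {err}")
```

The point of the design is the `make_store` argument: every run builds a new store, so nothing left over from an earlier crash can muddy the result.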

"Tom, remember the first time this thing crashed, before I came on board?"
"Yeah."
"Did you start over from a fresh database or just resume working on the one that had been open when the DBMS had crashed?"
"We just continued working with the same DB."
"Oh. OK."

Tilton's Law (Solve the First Problem) had been broken as badly as broken can be. A DBMS had failed while writing data and they had tried to continue using the same physical DB. This transgression is so severe it almost does not count.

Normally Tilton's Law refers to two or three observed issues that do not necessarily seem even to be in the same ballpark. The law says pick out the one that seems most firstish and work on that and only that until it is solved. The other problems might just go away, and even if not, the last thing we need to do while working on one problem is to be looking over our shoulders at possible collateral damage from some other problem.

Two minutes later I am on the phone to DBMS tech support.

"Hi, we're reliably crashing after adding 118 records in one sitting."
"Yes, that is a known problem."
Oh. My. God.

"Would you like us to send you the patch for that?", she asks.
"That would be lovely."

This being before the advent of the Interweb we confirmed our mailing address and asked for it to be sent out ASAP by overnight delivery. But we are not done yet. Tilton's Law or no, all I have solved is P1, the first problem.

"One more thing," I say.
"Shoot."
"If we continue working with the DB after this crash..."
"Oh, no. Don't do that. It's hopelessly corrupted at that point."

Were some of the other issues unrelated to the first crash? I will let you know as soon as this test I have running to solve the halting problem finishes.

Meanwhile, the conversation had suggested how we might get them up and entering data now. Apparently we were crashing because of a bug that surfaced when more than so many records were being held in the buffer before being written out. We had tried making the buffer bigger, only making things worse.
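
A toy model makes that kind of bug easier to see. This is only a sketch in Python under loud assumptions: the real DBMS internals were never known to us, and `BufferedWriter`, `BUG_LIMIT`, and the capacities tried below are all invented for illustration. The model crashes when asked to hold too many unflushed records, whatever the buffer's nominal size.

```python
class BufferedWriter:
    """Toy record buffer that flushes to disk when it reaches `capacity`.

    BUG_LIMIT is an invented number: the model "crashes" when asked to hold
    that many unflushed records. The real limit and its cause are unknown.
    """

    BUG_LIMIT = 118

    def __init__(self, capacity):
        self.capacity = capacity  # records held in memory before a flush
        self.pending = []
        self.on_disk = 0

    def add(self, record):
        if len(self.pending) + 1 >= self.BUG_LIMIT:
            raise RuntimeError("crash: too many records in the buffer")
        self.pending.append(record)
        if len(self.pending) >= self.capacity:
            self.on_disk += len(self.pending)  # flush to disk
            self.pending.clear()


def records_before_crash(capacity, total=1000):
    """Return how many records were accepted before a crash, or `total` if none."""
    writer = BufferedWriter(capacity)
    for n in range(total):
        try:
            writer.add(n)
        except RuntimeError:
            return n
    return total


if __name__ == "__main__":
    # Hypothetical sizes: an original buffer, a bigger one, a smaller one.
    for capacity in (200, 400, 100):
        print(capacity, records_before_crash(capacity))
```

In this model any capacity at or above the limit fails on the 118th record, so a bigger buffer changes nothing, while a capacity below the limit flushes before the bad count is ever reached.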

"Tom, we can wait for the patch, but I have one last idea in mind that might get this thing working for you. Want to try one more thing?"
"Sure."
"Try making the buffer half the size it was when we started."
"OK."
A few minutes later he comes back.

"It works now."
"Yeah, baby!"
"I had it loop to one thousand. No problem."
"Cool. Let's tell the others and go get drunk."

Nope. Something is wrong. Tom is just standing in the doorway all deer and headlights.

"Can I ask you something?", Tom asks quietly.
"Sure."
"I do not understand why making the buffer smaller made the program work."
"Well there was this bug that had to do with being unable to keep more than so many records in memory and with a smaller buffer the software did not try to keep so many in memory."
Long pause.

"OK, but why does it work now?"
Hmm.

"Maybe 118 multiplied by the record size is more than 16,384 and somewhere in the DBMS logic there was an integer overflow so the problem does not come up if the cache is smaller and the software flushes the cache before it gets to 16,384."

"All right," says Tom. "But I do not understand why we make the buffer smaller and now the software works."

This was surreal. I try a different tack, a really dumb one, but sometimes when a grizzly bear has your back to the wall all you can do is tap dance.

"Look. There are multiple code paths in an application, right? Every conditional is a fork in the path. A bug exists in some branch or other out of all the code paths, right? By changing a fundamental parameter we send the code down a different code path. Avoiding the bug."
Pause.

"I just don't understand why making the buffer smaller makes the program work."

Then it came to me. I was Dr. Chandra in 2010 trying to get Hal to fire the rockets, and Tom was Hal stuck in a Mobius loop unable to resolve my understanding of the confusion with his confusion of the understanding.

"I don't know, Tom," I say. "I don't know why it works now."
Tom nods.

Suddenly Mike, the project lead, appears.

"Kenny, Tom. In my office. Now."
Whoa.

"OK, this has to stop. Kenny, I am paying you to solve this problem and you have Tom doing all your work. He has his own work to do. From now on you work on this problem and Tom you do what you are supposed to be doing. Have I made myself clear?"

Remember in Annie Hall when Woody Allen turns to the camera and asks, Why can't real life be like this?

"Actually...I think I'm done."

Leaving Mike and his facial expression frozen in spacetime, I turn to Tom with raised eyebrows for his assent and Tom nods. I turn back to Mike, who no longer knows where he is.

"It turns out this is a known bug. You'll have a patch tomorrow or the next day. In the meantime we found a workaround and you are up and running. Mary can start entering your data, um, now."

Mike recovers.

"So basically I am sitting here making a complete ass out of myself?"

Good for him. We all had a good laugh, shook hands and I was on my way and Tilton's Law of Programming was reaffirmed: Always solve the first problem. The corollary: there only ever is the first problem.

13 comments:

testseifried said...

A perfect example of this: I got a new computer last week, set it all up, wonderful. But then sound starts acting funny, getting a bit scratchy and then fading completely. So I check the settings, reboot a few times, toss in a new sound card, same problem, remove sound card, etc. Then I check the headphones... Plugged in to my iPod, same problem. Check the cable... Yup. Our new kitten has chewed the s*it out of it.

Anonymous said...

My first computer was a Performa I got in college. I set it up and dove in, willy-nilly setting up apps and tweaking prefs, moving far too fast for a newbie on the legendarily buggy System 7.5... and eventually it just started making the error 'bong' sound ad infinitum. And it was loud, the max volume the tinny little speaker could muster. And it was about 3 a.m., with my roommate trying to sleep.

I changed every possible sound setting. Bong. Restarted. Bong. Shutdown, started back up. Bong. I reinstalled the system. Bong. I wiped the hard drive. Bong. I tried different electrical outlets. Bong.

I finally looked at the two volume buttons on the front of the case, which I had used to hear a CD earlier that day. The right/up volume button had gotten stuck in a tight fit against the opening in the case.

Click button that controls sound. No more bong.

misterorange said...

Hahaha, wonderful story! Thanks for sharing :)

Neil Baylis said...

Yeah, it's a wise principle. Especially important when looking at the output of a C compiler.

"Oh, that first error.. it's just a nested comment. Nothing to worry about. It's only a warning. Let me look at these other, more interesting errors first.."

Anonymous said...

You came up with a law and named it after yourself just yesterday. You should probably start numbering them, especially if you're going to make a habit out of this.

Anonymous said...

Tilton's First Law = Wonderful story

Anonymous said...

"sometimes when a grizzly bear has your back to the wall all you can do is tap-dance"

The story is great, but that quote is just priceless.

bugeyedmonster said...

Brilliant post! I love seeing a precept that I hold so dear expressed so simply. I will, from now on, be able to smugly refer to this method as "Tilton's Law" and watch all the junior tw@t programmers and engineers secretly scratch their heads (as they would NEVER deign to admit they have no idea what I'm talking about). Nicely done.

ps - a word on your blog title. Lisp weenies and Mac enthusiasts; Is there any type but smug? :P

Kenny Tilton said...

Especially important when looking at the output of a C compiler.

OK, I'll tell that story, too. Soon.

Bong. I wiped the hard drive. Bong.

I like the part where I am sitting there while the Mac reboots thinking, This is OK, I could use a rest, for the twentieth time.

that [grizzly] quote is just priceless

I had so much fun writing that, and it was precisely how I felt when I went for the different code branches theory of why the code was working, or as they say in football, it was such a Hail Mary of an explanation.

watch all the junior tw@t programmers and engineers secretly scratch their heads

I guess there is no wi-fi in your imagined setting, or you have banned Interwebby-ready cell phones from your meetings: google already has it as the first hit. :)

Anonymous said...

I remember a classic moment of "duh", when our entire site was not updating because feed provider stopped sending the feed. We call, we ask, they say, all is well, we keep pumping stuff to you folks, it's on your end. We are checking software, logs are solid, receiver is running, no errors. After spending countless hours trying to troubleshoot, putting many heads together, getting on conference call department-wide, finally some doofus in data center said that he tripped and accidentally unplugged the dedicated line modem "a while ago"... and because this box's been sitting there for a while, he assumed it's not used, so he never plugged it back in :)

Chris said...

I don't think you explained the "buffer" solution very clearly to Tom. He obviously didn't understand that "the buffer" was only temporary storage, writing the records to permanent storage when full. Especially since the word "buffer" does not imply "temporary" at all..

It looks like you forgot to address that problem first ;)

Kenny Tilton said...

finally some doofus in data center said that he tripped and accidentally unplugged the dedicated line modem

Aw, give him some cred. I would have quietly plugged it back in and kept my mouth shut, you'd still be wondering what happened.

"the buffer" was only temporary storage, writing the records to permanent storage when full.

I see.... So why is the software working now?

Anonymous said...

-It almost sounds like you worked on mainframes...
Buffer management there is about a millennium ahead of PC memory management... Good Story...