The Glitch Detective

Dedicated to the art and science of bug fixing

I see that I’ve gained some new subscribers in the past week. Welcome! I’m sorry I haven’t given you anything recently to justify your subscription, but that will soon change. Sometimes it just takes a while to sort out the jumble of ideas in my head into something coherent. I think we’d all prefer to see the full JPG rather than the 1s and 0s, am I right?

In the meantime, I invite you to submit your own case studies. The best ones, to me, are not the ones strictly about code: there are a billion places on the web where people can mock you for your syntax errors. No, what’s more interesting are the logic errors or the hidden “gotchas” in the data, and the means and methods you use to discover them.

So, if you’ve got something to share, send it in. I look forward to hearing from you.

Years ago, long before Google saved us, I used to have to look up moonrise and moonset times and moon phases, which were then published in newspapers. One day, I received an angry call from an editor at, well, let’s just say a well-known, medium-large East Coast paper, complaining that I had mixed up the rise and set times. I had put that the moon rose at 7-something a.m. and set at 7-something p.m. Clearly, he fumed, this was wrong and made the paper look foolish.

I double-checked, and I assured the man that these times were indeed correct. This did not soothe him, since everyone knows that the moon is visible at night. I explained to him that no, this was the day of new moon.

Him: So?

Me: Well, at new moon the moon is on the same side of the Earth as the sun, so the moon can’t be in the visible sky until the same hour that we are facing the sun. Then of course you can’t actually see the moon because the sun is so bright…

Him: What the hell are you talking about? The moon never comes between the sun and the Earth!

Me: Uhhh, what do you think an eclipse is?

Him: (long pause) Ohhhhh.

I admit I may have been a bit snarky in my reply, but the guy was being kind of a dick about it. The point here is that it’s always dangerous to assume users know things that seem obvious to you. To that end, here’s a brief but excellent write-up on the phases of the moon and how those relate to rise and set times. Or you can just watch this:

This is exactly what the last Case Study was about as well. Those of us who deal with data all the time have an understanding of it that is much deeper than that of our users. That is as it should be, really. Users come to us for information, not data. No one wants to be the bore at the party who rambles on and on with useless minutiae:

You then run the risk of not including enough data, though. It’s a tricky balance to strike.

As you may recall, our case study begins with a complaint about the nighttime icon being shown as the “current condition” even though it was late in the morning. There are actually a couple of issues here, but let’s start with our first two examples from the other day:

Screen Shot of Weather app

Graphic showing current conditions in Cupertino

Now, the first thing is that these two locations are in Pacific Time, and the application is showing the time in Pacific. The phone’s default is showing Eastern Time. That’s a three-hour difference. So, if we correct for that, we see that the data is as of 7:11 a.m. and 7:10 a.m., respectively. From there, our friends at the US Naval Observatory can weigh in:


U.S. Naval Observatory
Astronomical Applications Department
 
Sun and Moon Data for One Day
The following information is provided for Cupertino, Santa Clara County, California (longitude W122.0, latitude N37.3):
        Wednesday
        13 January 2010       Pacific Standard Time
                         SUN
        Begin civil twilight       6:53 a.m.
        Sunrise                    7:22 a.m.  
        Sun transit               12:17 p.m.
        Sunset                     5:12 p.m.
        End civil twilight         5:41 p.m.

So there you have it: it’s still showing the nighttime icon because technically, it’s still nighttime.
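For the arithmetically inclined, that check can be sketched in a few lines of JavaScript. This is a minimal sketch, not the app’s actual code: the function names are made up for illustration, the 10:11 a.m. Eastern reading is inferred from the three-hour correction above, and the offset is assumed to be a fixed three hours with no daylight-saving wrinkles.

```javascript
// Hypothetical helpers for illustration; not the weather app's real code.

// Convert "h:mm a.m." / "h:mm p.m." into minutes after midnight.
function toMinutes(clock) {
  const match = /^(\d{1,2}):(\d{2}) (a\.m\.|p\.m\.)$/.exec(clock);
  if (!match) throw new Error("Unrecognized time: " + clock);
  const hour = Number(match[1]) % 12; // 12 o'clock counts as hour 0
  const offset = match[3] === "p.m." ? 12 * 60 : 0;
  return hour * 60 + Number(match[2]) + offset;
}

// Assumption: Eastern is always exactly 3 hours ahead of Pacific.
function easternToPacific(minutes) {
  return (minutes - 180 + 1440) % 1440;
}

// Daytime means sunrise <= time < sunset, all in local minutes.
function isDaytime(localMinutes, sunrise, sunset) {
  return localMinutes >= sunrise && localMinutes < sunset;
}

const sunrise = toMinutes("7:22 a.m."); // USNO, Cupertino, 13 January 2010
const sunset = toMinutes("5:12 p.m.");
const observed = easternToPacific(toMinutes("10:11 a.m.")); // 7:11 a.m. Pacific
const icon = isDaytime(observed, sunrise, sunset) ? "day" : "night";
```

Here `icon` comes out as "night", because 7:11 a.m. Pacific falls eleven minutes short of the 7:22 a.m. sunrise.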

The other example is a little bit different, and requires a deeper understanding of the data:

Graphic showing current conditions in San Mateo

Here, the time zones are all in alignment, but there’s no way the sun hadn’t risen in San Mateo by 7:59. Trust me, I checked! But the data is still correct, sort of, if you understand what’s being shown.

These “current conditions” reports are based on observations made by National Weather Service stations. For most locations, those observations are made only once an hour, at the top of the hour. Hence, the observation was taken at 7:00 a.m. Pacific, about 20 minutes before sunrise. The data itself is correct, but undone by a lack of precision in the interface.
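The gap between what the user expects and what the data says can be sketched like this. It is an illustration only, under stated assumptions: observations land exactly at the top of the hour, the Cupertino sunrise and sunset times stand in for San Mateo’s, and all of the names are hypothetical.

```javascript
// Hypothetical sketch: the icon reflects the observation time, not the clock time.
const SUNRISE = 7 * 60 + 22; // 7:22 a.m., borrowed from the USNO table for Cupertino
const SUNSET = 17 * 60 + 12; // 5:12 p.m.

// Assumption: observations are taken exactly at the top of each hour.
function lastObservation(nowMinutes) {
  return Math.floor(nowMinutes / 60) * 60;
}

// Pick the icon for a given time, in minutes after local midnight.
function iconAt(minutes) {
  return minutes >= SUNRISE && minutes < SUNSET ? "day" : "night";
}

const now = 7 * 60 + 59;                 // the user looks at the phone at 7:59 a.m.
const observedAt = lastObservation(now); // 420, i.e. 7:00 a.m.
const shown = iconAt(observedAt);        // what the app displays
const expected = iconAt(now);            // what the user expects to see
```

With these numbers `shown` is "night" while `expected` is "day". Both answers are correct for the time they describe; the interface just never says which time that is.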

You have only to take a look at the aesthetics of this blog to see that I have no clue about design. I’m not certain how this could be more clearly shown, especially given the constraints of the mobile platform. Clearly whoever designed the interface did not have a deep enough understanding of the data (nor should they). The ultimate breakdown here is in communication, both from the data people to the designer, and from the interface to the end user.

In my next post, I’ll start a discussion of the importance of recognizing what your users don’t know. Hope to see you there.

Boy, I’ve been slacking on the posts recently, haven’t I? It’s difficult to blog when the weather finally turns nice. Since that won’t be a factor in the next few days, expect to see some more riveting content.

This week I have a simple case study, although it’s got a lot of moving parts. I was given a trouble ticket about an application for a mobile phone. Specifically, the complaint was that the nighttime icon was still showing in the “current conditions” display, even though it was now daytime. I was given three examples of this problem:

Screen Shot of Weather app

Graphic showing current conditions in Cupertino

Graphic showing current conditions in San Mateo

If the problem has been escalated to me, that means they’ve already ruled out problems receiving the data and sending the data out. (OK, for the sake of argument, let’s just assume that the due diligence was performed.) That means folks believe that the problem lurks in the underlying data. That’s true, but it’s also not true. All the data shown is indeed correct. But there are some hidden factors that are giving the perception of a problem.

In the case of the first two, it’s fairly obvious (to me, at least) that there’s a question of time zones: the phone itself is in Eastern time, but the data is in Pacific. But it’s still showing nighttime at 7:11 and 7:10 a.m., respectively. And what’s up with our friends in San Mateo? No such time zone issue there. So what’s going on?

I will follow up with an answer, and what that answer tells us about the data and about the user interface, in a few days. But feel free to take a guess in the comments.

Frank Ahrens has a piece in the Washington Post today discussing the difficulties that the Toyota executives are having explaining their investigation to Congress. Among the choice exchanges is this:

An exchange between Del. Eleanor Holmes Norton (D-D.C.) and Toyota President Akio Toyoda illustrated the problem.

Toyoda said that when his company gets a complaint about a mechanical problem, engineers set to work trying to duplicate the problem in their labs to find out what went wrong.

Norton said: “Your answer — we’ll wait to see if this is duplicated — is very troublesome.” Norton asked Toyoda why his company waited until a problem recurred to try to diagnose it….

It’s obvious to those of us who investigate bugs that Norton completely misunderstood the process. It’s a challenge we all face when we’re trying to fix a problem when the users and/or our bosses are demanding answers. “Trying to duplicate the problem” is a shorthand that technical people understand, and often we lose sight of the fact that non-technical people don’t hear that phrase the same way we do.

It’s easy to dismiss this as ignorance on the part of the congressperson or on the part of the user. Of course there is a knowledge gap, but that’s not the fault of the users. They are not obligated or even supposed to know about the internal processes we all use. In some ways, this exchange illustrates a fundamental difference in the way the two groups (IT vs. users) see the world: the IT group looks at a result and sees a process, the user looks at a result and sees a result.

Mr. Ahrens makes the comparison to going to the doctor and the process of diagnosis. It’s an easy analogy, and it’s definitely correct in terms of process. But this too is fraught with peril in the perception: people don’t like to think of their doctor as doing trial-and-error as treatment.

OK, so how could Mr. Toyoda have said this differently? He needed to be a bit more long-hand, describing the process:

We try to determine the specific set of circumstances under which this problem will occur, and in the process rule out factors such as conditions, driver error, etc. Why, out of the billions of driver hours where no problem occurred, was this circumstance unique? Once we are able to replicate all the factors, we can then isolate causes and figure out strategies to correct them. As you might imagine, there are countless variables to consider, but the problem reports give us a place to begin research….

It’s long-winded, to be sure, but being able to translate the technical to the non-technical is a valuable skill.

You guys were super-fast on this one! As noted, the problem occurs here:

$       if ((f$length(outline)+f$length(word)).le.wrapLen).and.(word.nes." ")
$       then
$          outline = outline + word + " "
$          NUM = NUM +1
$          goto BREAKLINES

If the line plus the word is 80 characters, it is written out to the file, but a space character is appended to the end. That makes it 81 characters, and hence it is too long.
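To watch the off-by-one happen, here is a rough JavaScript port of the idea. It is not the original DCL and not necessarily how the routine was eventually fixed: `wrapBuggy` mirrors the length test in the snippet above (minus the check for a blank word), `wrapFixed` is one possible correction, and both assume that no single word is longer than 80 characters.

```javascript
// A JavaScript port for illustration only; names are hypothetical.
const WRAP_LEN = 80;

// Mirrors the DCL test: the trailing space is appended but never counted.
function wrapBuggy(text) {
  const lines = [];
  let outline = "";
  for (const word of text.trim().split(/\s+/)) {
    if (outline.length + word.length <= WRAP_LEN) {
      outline = outline + word + " ";
    } else {
      lines.push(outline);
      outline = word + " ";
    }
  }
  if (outline !== "") lines.push(outline);
  return lines;
}

// One possible fix: put the separator between words, never at the end.
function wrapFixed(text) {
  const lines = [];
  let outline = "";
  for (const word of text.trim().split(/\s+/)) {
    const candidate = outline === "" ? word : outline + " " + word;
    if (candidate.length <= WRAP_LEN) {
      outline = candidate;
    } else {
      if (outline !== "") lines.push(outline);
      outline = word;
    }
  }
  if (outline !== "") lines.push(outline);
  return lines;
}

// Seven 9-letter words plus one 10-letter word fill exactly 80 columns.
const sample = Array(7).fill("abcdefghi").concat(["abcdefghij", "xyz"]).join(" ");
```

With this sample, the first line from `wrapBuggy` is 81 characters long, while the first line from `wrapFixed` stops at 80. That also explains why the failure was so rare: it takes a run of words that lands on the limit exactly.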

In other news, I’ve posted this as a question on StackOverflow.com. I invite all of you versed in the fine art of DCL to rewrite this code. Warning: there are going to be lots and lots of anti-VMS trolls there. Don’t let their jealousy bother you: we all know that real programmers use a command-line interface.

I might have known my OpenVMS peeps would be all over this. In fact, when I wrote this up I was just discussing one problem, and they pointed out a second one which I hadn’t even considered! You guys rock.

Anyway, here is a sample of the text output where the problem occurs. It should be pretty obvious from here, so I’ll post my solution tomorrow.

Is there anything more annoying than sporadic errors? They defy investigation using the Five W’s, because they refuse to give up at least one key W. Today’s case study describes just such a situation: this is a subroutine that is called a few dozen times a day, and every now and then, inexplicably, it fails to do its job. For added fun, this is actually an old-school DCL command file from OpenVMS, the operating system of the true geek.

The Scenario

For reasons that are much too complicated to explain, what this subroutine is designed to do is take text strings from one file, which are free-form text, and break them into lines of no more than 80 characters, with hard returns. (Coding Horror has an excellent discussion of the hardships of end-of-line characters, btw.) The output file in turn is read by a different process, which complains loudly if any line has more than 80 characters.

Here’s how it works:

1) It reads in a line.
2) It normalizes the string, trimming leading and trailing whitespace, and then compressing so that only single spaces appear between words.
3) It steps through the string, word by word, writing to an output string.
4) If the length of the output string plus the length of the next word is greater than 80:
      a) It writes the output string to the output file (plus a space).
      b) It blanks out the output string, and writes the next word to it (plus a space).
      c) It goes back to step 3.
5) It gets a new input string, and repeats from step 2.

The Problem

This subroutine would work fine for weeks or months at a time, then suddenly it would fail. Just once. There was no set time of day, week, or month. The calling routine varied as well, so it wasn’t always the same input source. The logs showed no errors: the subroutine was called, ostensibly performed its duties, and exited without incident. Complicating matters was that when the error occurred the support staff would just manually edit the file, in the process destroying the offending text strings.

I looked through the code myself, and didn’t see anything obvious. You can do the same right here. (I should point out here that I’m honestly not sure if I wrote this originally or not: I might have, but it doesn’t look like my style.) My first guess was that it correctly parsed the first 80 characters, and then just wrote out the rest of the string without checking if it was longer than 80. But that wasn’t it. I could also rule out that the input string had more than 256 characters (the natural length imposed by DCL) because that would have caused an error which would be in the logs. Ditto for weird control characters or unexpected quotation marks.

When the process failed again, I managed to get a hold of the faulty output file, and the error was obvious. (In hindsight it was totally obvious before, but let’s see if you can get it.) I’ll post that output file in a couple of days as a clue. But feel free to leave your guess in the comments.
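Since the downstream process only cares about line length, a small checker that counts every character, including the ones you can’t see, is a handy thing to run against a suspect file. The sketch below is a hypothetical diagnostic in JavaScript rather than DCL; the function name and the middle-dot marker are my own choices, not part of the original tooling.

```javascript
// Hypothetical diagnostic: report every line longer than the limit, with
// spaces made visible so trailing whitespace cannot hide in an editor.
function findLongLines(fileText, limit = 80) {
  return fileText
    .split(/\r?\n/)
    .map((line, index) => ({ number: index + 1, length: line.length, line }))
    .filter((entry) => entry.length > limit)
    .map((entry) => ({
      number: entry.number,
      length: entry.length,
      visible: entry.line.replace(/ /g, "\u00b7"), // show spaces as middle dots
    }));
}

// Three lines: a short one, one of 80 x's plus a space, one of exactly 80 y's.
const report = findLongLines(
  "short line\n" + "x".repeat(80) + " \n" + "y".repeat(80)
);
```

Only the second line lands in `report`, with a length of 81 and a telltale dot at the end of its `visible` text.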

As I mentioned before, this particular case has a few things going on with it. Let’s quickly revisit that snippet of code from the first post:

// theColor is assumed to be set earlier on the page, from the server's data
var inColors = new Array("FF0000", "FFA500", "FFFF00", "00FF00", "0000FF",
                         "EE82EE", "000000", "FFFFFF");
var outColors = new Array("RED", "ORANGE", "YELLOW", "GREEN", "BLUE", "INDIGO",
                          "VIOLET", "BLACK", "WHITE");
document.write("<select name='userColor'>");
for (var x = 0; x < inColors.length; x++) {
  // Pre-select the option whose value matches theColor
  if (theColor == inColors[x]) {
    document.write("<option value='" + inColors[x] + "' selected>" + outColors[x] + "</option>");
  } else {
    document.write("<option value='" + inColors[x] + "'>" + outColors[x] + "</option>");
  }
}
document.write("</select>");

Take a look at the declaration of inColors and outColors. Still not seeing it? Count how many items are in each array. There’s one missing from inColors, which is the list of items expected from the server. This omission has two nasty effects, actually. First, if theColor is indigo (that’s the missing one), there’s no match for it, so that pull-down will have nothing selected. Second, if theColor is white, the loop will never get to it, because it’s controlled by the length of inColors. Hence, there will be nothing pre-selected in that case as well.
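One way to make this kind of mismatch harder to create is to keep each value next to its label, so the two lists cannot drift apart. The following is only a sketch of that idea, not the fix that was actually applied: `COLORS` and `buildColorSelect` are hypothetical names, and the hex value for indigo is assumed to be the standard CSS one, since the original list never included it.

```javascript
// Hypothetical rewrite: each hex value lives next to its label.
const COLORS = [
  { value: "FF0000", label: "RED" },
  { value: "FFA500", label: "ORANGE" },
  { value: "FFFF00", label: "YELLOW" },
  { value: "00FF00", label: "GREEN" },
  { value: "0000FF", label: "BLUE" },
  { value: "4B0082", label: "INDIGO" }, // assumed: the CSS value for indigo
  { value: "EE82EE", label: "VIOLET" },
  { value: "000000", label: "BLACK" },
  { value: "FFFFFF", label: "WHITE" },
];

// Build the pull-down as a string and report whether theColor was recognized.
function buildColorSelect(theColor) {
  const matched = COLORS.some((color) => color.value === theColor);
  const options = COLORS.map((color) => {
    const selected = color.value === theColor ? " selected" : "";
    return "<option value='" + color.value + "'" + selected + ">" + color.label + "</option>";
  });
  return {
    matched, // false means the server sent a value we do not know
    html: "<select name='userColor'>" + options.join("") + "</select>",
  };
}
```

The `matched` flag gives the calling code a chance to notice an unexpected value up front, instead of quietly rendering a pull-down with nothing selected.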

OK, so you have a pull-down with nothing selected. The user will just select one, that’s what they’re there for. Everyone’s happy. Except, if your chances for success depend on a human paying attention to what they’re doing every time, you’ve got yourself an escalating probability of catastrophic failure.

In the second act of our tragedy, inevitability strikes, and the user fails to select anything and then goes ahead and hits the “submit” button. One of the values being checked is undefined, and an error gets thrown. No harm, no foul, right? Oh no, my friends, because here is where Wednesday’s clue comes into play.

Originally, the framework had this as the submit button:

<input type="button" value="submit" onclick="validate_form_values(this.form)" />

But the programmer changed that to:

<input type="submit" value="submit" onclick="validate_form_values(this.form)" />

In either case, the form will not submit if the JavaScript returns a false (meaning validation failed). But if the JavaScript routine itself fails (has a fatal error), the script stops running, and the button performs its default action. The default action of a button type is to do nothing. What’s the default action of a submit type? (I had to step through this whole damn script line-by-line to find this out!)
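One defensive pattern here is to treat a crash inside the validator the same as a failed validation. The sketch below is an assumption-laden illustration rather than the framework’s real code: `safely` is a made-up wrapper, and `validateFormValues` is a simplified stand-in for validate_form_values that throws when an expected field is missing.

```javascript
// Hypothetical wrapper: a crash inside the validator counts as "invalid".
function safely(validator) {
  return function (form) {
    try {
      return validator(form) === true; // only an explicit true lets the form through
    } catch (error) {
      return false; // a thrown error must not fall through to the default submit
    }
  };
}

// Simplified stand-in for validate_form_values: throws if the field is missing.
function validateFormValues(form) {
  return form.userColor.value.length > 0;
}

const guardedValidate = safely(validateFormValues);
```

Wired up as `onclick="return guardedValidate(this.form)"`, the handler hands back false whenever the validator blows up, which should stop the submit button from carrying on with its default action.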

The lesson to be learned here (aside from the fact that JavaScript is evil, which you should already know) is to always check that your input data is indeed what you expect; you can never assume that.

I don’t talk a lot about security or vulnerability in this blog, for the simple reason that I tend to focus more on the arcane accidents rather than malicious intent. But, like anyone in the field, it’s something to always be mindful of when creating client-facing systems.

A group of experts, including the MITRE Corporation, the SANS Institute, and our very own DHS, recently released their “Top 25 Most Dangerous Programming Errors.” This document details some of the most common and dangerous mistakes programmers make in securing their sites. What I find striking is that most of these relate back to the same fundamental problem: your program is assuming way too much about the data stream it’s receiving. Most of these are in fact failures to validate data. Obvious things like failure to encrypt your database connection strings are just laziness, or perhaps a hopelessly optimistic view of the online world.

I hate to put it so pessimistically, but this one line sums it up best: “Assume all input is malicious.” Sorry kids, that’s the world we live in.
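In practice that usually means checking input against what you will accept, rather than trying to list everything that could be wrong with it. Here is a minimal sketch of an allow-list check for a single submitted field. The field, the list of values, and the function name are all hypothetical, and a real system would need the same treatment for every input it receives.

```javascript
// Hypothetical allow-list: the only values this field is permitted to carry.
const ALLOWED_COLORS = new Set(["FF0000", "00FF00", "0000FF", "000000", "FFFFFF"]);

// Return the normalized value if it is acceptable, or null if it is not.
function parseColor(input) {
  if (typeof input !== "string") return null;        // wrong type entirely
  const candidate = input.trim().toUpperCase();
  if (!/^[0-9A-F]{6}$/.test(candidate)) return null; // wrong shape
  return ALLOWED_COLORS.has(candidate) ? candidate : null; // right shape, unknown value
}
```

Anything that is not a string, not six hex digits, or not on the list comes back as null, so the caller has exactly one failure case to handle.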
