Debugging the Zune Blackout!

By Ed Felten - Posted on January 12th, 2009 at 10:46 am

On December 31, some models of the Zune, Microsoft's portable music player, went dark. The devices were unusable until the following day. Failures like this are sometimes caused by complex chains of mishaps, but this particular one is due to a single programming error that is reasonably easy to understand. Let's take a look.

Here is the offending code (reformatted slightly), in the part of the Zune's software that handles dates and times:

Code:
year = 1980;

while (days > 365) {
    if (IsLeapYear(year)) {
        if (days > 366) {
            days -= 366;
            year += 1;
        }
    } else {
        days -= 365;
        year += 1;
    }
}

At the beginning of this code, the variable days is the number of days that have elapsed since January 1, 1980. Given this information, the code is supposed to figure out (a) what year it is, and (b) how many days have elapsed since January 1 of the current year. (Footnote for pedants: here "elapsed since" actually means "elapsed including", so that days=1 on January 1, 1980.)

On December 31, 2008, days was equal to 10593. That is, 10593 days had passed since January 1, 1980. (The years 1980 through 2008 make 29 years, eight of them leap years, and 29 times 365 plus 8 is 10593.) It follows that 10227 days had passed since January 1, 1981. (Why? Because there were 366 days in 1980, and 10593 minus 366 is 10227.) Applying the same logic repeatedly, we can figure out how many days had passed since January 1 of each subsequent year. We can stop doing this when the number of remaining days is no more than a year -- then we'll know which year it is, and which day within that year.

This is the method used by the Zune code quoted above. The code keeps two variables, days and year, and it maintains the rule that days days have passed since January 1 of year. The procedure continues as long as there are more than 365 days remaining ("while (days > 365)"). If the current year is a leap year ("if (IsLeapYear(year))"), it subtracts 366 from days and adds one to year; otherwise it subtracts 365 from days and adds one to year.

On December 31, 2008, starting with days=10593 and year=1980, the code would eventually reach the point where days=366 and year=2008, which means (correctly) that 366 days had elapsed since January 1, 2008. To put it another way, it was the 366th day of 2008.

This is where things went horribly wrong. The code decided it wasn't time to stop yet, because days was more than 365. ("while (days > 365)") It then asked whether year was a leap year, concluding correctly that 2008 was a leap year. ("if (IsLeapYear(year))") It next determined that days was not greater than 366 ("if (days > 366)"), so that no arithmetic should be performed. The code had gotten stuck: it couldn't stop, because days was greater than 365, but it couldn't make progress, because days was not greater than 366. This section of code would keep running forever -- leaving the Zune seemingly dead in the water.
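The stall is easy to reproduce away from the device. What follows is a minimal sketch, not the driver itself: the quoted loop is wrapped in a hypothetical helper, run_year_loop, that gives up after a fixed number of passes so that the demonstration terminates. IsLeapYear is not shown in the article, so the version here assumes the usual Gregorian rule, and last_day_number is another made-up helper that counts days the same way the article does (January 1, 1980 is day 1).

```c
#include <stdbool.h>

/* Assumption: the article does not show IsLeapYear; this is the usual
 * Gregorian rule. */
bool IsLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Hypothetical helper: day number of December 31 of the given year,
 * counting January 1, 1980 as day 1. */
int last_day_number(int year)
{
    int total = 0;
    for (int y = 1980; y <= year; y++)
        total += IsLeapYear(y) ? 366 : 365;
    return total;
}

/* Hypothetical wrapper around the quoted loop. Returns true if the loop
 * finishes within max_passes and false if it is still spinning. Either
 * way, *days and *year hold the loop's state at that moment. */
bool run_year_loop(int *days, int *year, int max_passes)
{
    *year = 1980;
    for (int pass = 0; pass < max_passes; pass++) {
        if (*days <= 365)
            return true;        /* the original while condition is false */
        if (IsLeapYear(*year)) {
            if (*days > 366) {
                *days -= 366;
                *year += 1;
            }                   /* days == 366: nothing changes */
        } else {
            *days -= 365;
            *year += 1;
        }
    }
    return *days <= 365;        /* pass budget used up */
}
```

Feeding it last_day_number(2008) makes the helper return false with the state frozen at year 2008 and days 366. A day count one lower or one higher returns true, which matches the observation that the devices recovered on their own the next day.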

The only way out of this mess was to wait until the next day, when the computation would go differently. Fortunately, the same problem would not occur again until December 31, 2012 (the last day of the next leap year), and Microsoft has ample time to patch the Zune code by then.
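For illustration, here is one way the loop could be repaired. This is a sketch under my own assumptions, not the patch that Microsoft or Freescale actually shipped: split_days is a hypothetical name, and IsLeapYear again assumes the usual Gregorian rule. The idea is to compare days against the length of the current year in a single test, so that the decision to stop and the decision to subtract can never disagree.

```c
#include <stdbool.h>

/* Assumption: the usual Gregorian rule; the article does not show the
 * real IsLeapYear. */
bool IsLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Hypothetical replacement for the quoted loop. Splits a 1-based day
 * count (day 1 = January 1, 1980) into a year and a 1-based day within
 * that year. Assumes days >= 1. */
void split_days(int days, int *year, int *day_of_year)
{
    int y = 1980;
    for (;;) {
        int year_length = IsLeapYear(y) ? 366 : 365;
        if (days <= year_length)
            break;              /* days now falls inside year y */
        days -= year_length;
        y += 1;
    }
    *year = y;
    *day_of_year = days;
}
```

With this version, day 366 of a leap year ends the loop because 366 is not greater than that year's length, and every other input is handled as before.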

What lessons can we learn from this? First, even seemingly simple computations can be hard to get right. Microsoft's quality control process, which is pretty good by industry standards, failed to catch the problem in this simple code. How many more errors like this are lurking in popular software products? Second, errors in seemingly harmless parts of a program can have serious consequences. Here, a problem computing dates caused the entire system to be unusable for a day.

This story might help to illustrate why experienced engineers assume that any large software program will contain errors, and why they distrust anyone who claims otherwise. Getting a big program to run at all is an impressive feat of engineering. Making it error-free is too much to hope for. For the foreseeable future, software errors will be a fact of life.

Further investigation reveals that this code was first used by Freescale/Toshiba. It was probably reused by M$ without being tested properly. Way to go, M$!

One thing I didn't get: if this is the cause, shouldn't the same bug have appeared in previous leap years as well (e.g. 2000, 2004)? Why didn't it appear then?


One of the comments on that article:

I think it's even more interesting that the code started its life somewhere else -- Motorola / Freescale -- and then was (potentially) modified and used by Microsoft. So most likely the bug wasn't even Microsoft's, and other Freescale hardware using this same boilerplate code suffers the same fate. And, despite good QA, obviously no one tested this scenario. I suspect they didn't QA this code as much as what they engineered themselves, and I think it's an important thing to consider: code reuse is often a good idea but it has security and performance implications.

Another interesting comment:

The earlier anonymous commenter makes the most accurate observation. This was not Microsoft-written code. It was written by Motorola/Freescale and built into the product by Toshiba.

The first Zune was a modified Toshiba Gigabeat -- evidence easily found on the web shows that the same bug affected the Toshiba Gigabeat (there are just fewer of them, and Toshiba doesn't make the news the way Microsoft does).

The second-generation Zunes were designed by Microsoft, and don't have this bug. I'm pretty familiar with Microsoft's coding practices, and I can assure you, a bug this basic would never have passed code review. Date (and string) manipulations trigger automatic flags, because, as noted above, there are extremely well-tested libraries for these sorts of things that are required use.

Microsoft's failure was shipping code written by someone else without doing complete code reviews. This was probably due to the compressed schedule for the original Zune (conception to shipping in 8 months), and the team moving on to the next product as quickly as possible.

If there's a lesson here, it's that open source != bug free.





Here's the explanation from the guy who found the bug (from ZuneBoards):


After doing some poking around in the source code for the Zune's clock driver (available free from the Freescale website), I found the root cause of the now-infamous Zune 30 leap-year issue that struck everyone on New Year's Eve.

The Zune's real-time clock stores the time in terms of days and seconds since January 1st, 1980. When the Zune's clock is accessed, the driver turns the number of days into years/months/days and the number of seconds into hours/minutes/seconds. Likewise, when the clock is set, the driver does the opposite.
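The post does not quote the rest of the conversion, so the following is only a sketch of what those two steps could look like, not Freescale's code. The function names are hypothetical, and IsLeapYear assumes the usual Gregorian rule. It takes a day within a year (as produced by the year loop) and a count of seconds within a day.

```c
#include <stdbool.h>

/* Assumption: the usual Gregorian rule. */
bool IsLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Hypothetical: turns a 1-based day of year into a month (1-12) and a
 * day of month. Assumes day_of_year is valid for the given year. */
void day_of_year_to_month_day(int year, int day_of_year, int *month, int *day)
{
    static const int month_lengths[12] =
        { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    int m = 0;
    while (m < 11) {
        int len = (m == 1 && IsLeapYear(year)) ? 29 : month_lengths[m];
        if (day_of_year <= len)
            break;              /* the day falls inside month m */
        day_of_year -= len;
        m++;
    }
    *month = m + 1;
    *day = day_of_year;
}

/* Hypothetical: turns seconds since midnight (0-86399) into
 * hours, minutes and seconds. */
void seconds_to_hms(int seconds, int *h, int *m, int *s)
{
    *h = seconds / 3600;
    *m = (seconds % 3600) / 60;
    *s = seconds % 60;
}
```

Under these assumptions, day 366 of 2008 comes out as December 31, and 86399 seconds comes out as 23:59:59.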

The Zune frontend first accesses the clock toward the end of the boot sequence. Doing this triggers the code that reads the clock and converts it to a date and time. Below is the part of this code that determines the year component of the date:

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
{
    if (IsLeapYear(year))
    {
        if (days > 366)
        {
            days -= 366;
            year += 1;
        }
    }
    else
    {
        days -= 365;
        year += 1;
    }
}

Under normal circumstances, this works just fine. The function keeps subtracting either 365 or 366 until it gets down to less than a year's worth of days, which it then turns into the month and day of month. Thing is, in the case of the last day of a leap year, it keeps going until it hits 366. Thanks to the if (days > 366), it stops subtracting anything if the loop happens to be on a leap year. But 366 is too large to break out of the main loop, meaning that the Zune keeps looping forever and doesn't do anything else.
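A brute-force check makes the pattern visible. The sketch below is a test harness of my own, not part of the driver: loop_stalls and count_stalls are hypothetical names, IsLeapYear assumes the usual Gregorian rule, and ORIGINYEAR is written out as 1980. It runs the loop above on every day count in a range, with a cap on the number of passes, and records which inputs never get out.

```c
#include <stdbool.h>

/* Assumption: the usual Gregorian rule; the driver's own IsLeapYear is
 * not quoted in the post. */
bool IsLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Runs the quoted loop on one day count. Returns true if the loop is
 * still going after max_passes passes, i.e. it has stalled. */
bool loop_stalls(int days, int max_passes)
{
    int year = 1980;
    int passes = 0;
    while (days > 365) {
        if (++passes > max_passes)
            return true;
        if (IsLeapYear(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            }
        } else {
            days -= 365;
            year += 1;
        }
    }
    return false;
}

/* Counts the day numbers in 1..last_day that stall the loop, and
 * reports the first and last of them (0 if there are none). */
int count_stalls(int last_day, int *first, int *last)
{
    int count = 0;
    *first = 0;
    *last = 0;
    for (int d = 1; d <= last_day; d++) {
        if (loop_stalls(d, 1000)) {
            if (count == 0)
                *first = d;
            *last = d;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, scanning day 1 through day 12054 (December 31, 2012) flags nine day counts, one for the last day of each leap year from 1980 through 2012, starting with day 366. So the bug was always there for every leap year; the Zune 30 only went on sale in late 2006, which made December 31, 2008 the first such day its owners ran into.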

The unfortunate part is that there isn't anything that can be done to fix this besides somehow changing what the clock is set to (which is exactly what the battery disconnection trick ends up doing). On the other hand, it shows that Microsoft is correct: tomorrow, everyone's Zunes will operate normally again. However, if Microsoft doesn't fix this part of the firmware, the whole thing will happen all over again in four more years. Hopefully by then a fix will be in place.

Sources:

Debugging the Zune Blackout | Freedom to Tinker

Cause of Zune 30 leapyear problem ISOLATED! - Zune Boards
 