Thursday, July 31, 2008

Why would my program suddenly stop working?

[This was originally posted at]

Deterministic bugs are easy. When you write "ConvertCelsiusToFahrenheit", the debugging is simple. When things break, the failure is repeatable, and it's easy to step through the code in the debugger and see why. However, production code doesn't work this way. Sometimes your enterprise application will temporarily stop working, only to resume working correctly a little later. Why? Here are a few ideas:

  • Caching - something was cached, and the cache expired.

  • Session - the session expired.

  • External dependencies - a web service or database that your code depends on could be down.

  • Rare boundary condition - perhaps your code doesn't account for certain rare input (like nulls, or unescaped special characters).

  • Concurrency - perhaps the code works great in a single thread (which is how most code is tested), but doesn't handle being run concurrently. For example, one thread deadlocks, or another process locks a resource.

  • Too much load - perhaps a spike in load temporarily crashed something, for example by triggering an out-of-memory exception.

  • Randomness - maybe your code uses random numbers, and most of them work, but some of them don't, e.g. the code crashes when the random number is divisible by 111, or something really weird like that.

  • Incremental buildup with rounding error - perhaps every time the code runs, it adds to an incremental buildup somewhere, like inserting a row in a database table. As long as there are fewer than X rows, some calculation "rounds down" and everything works. However, once the table has X+1 rows, it "rounds up" and something fails. This is unusual, but certainly possible.
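
To make the first and third causes concrete, here is a minimal Python sketch of how an expiring cache can hide a dependency outage and then suddenly expose it. Everything in it is hypothetical and only for illustration: the `TtlCache` class, the `load_exchange_rate` function, the 60-second lifetime, and the fake clock are assumptions, not code from any real application.

```python
class TtlCache:
    """Tiny time-to-live cache. `clock` is injectable so expiry can be simulated."""

    def __init__(self, loader, ttl_seconds, clock):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is cold or expired: go back to the (possibly unavailable) source.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


def simulate():
    fake_time = [0.0]      # seconds, advanced by hand instead of sleeping
    service_up = [True]    # stands in for a remote web service or database

    def load_exchange_rate():
        if not service_up[0]:
            raise ConnectionError("rate service unavailable")
        return 1.25

    cache = TtlCache(load_exchange_rate, ttl_seconds=60, clock=lambda: fake_time[0])
    results = []

    results.append(cache.get())      # t=0: cache is cold, loads from the service

    service_up[0] = False            # the dependency goes down...
    fake_time[0] = 30
    results.append(cache.get())      # ...but the cached value hides the outage

    fake_time[0] = 61                # the cache entry has now expired
    try:
        results.append(cache.get())
    except ConnectionError:
        results.append("failed")     # the app "suddenly" stops working

    service_up[0] = True             # the dependency recovers
    results.append(cache.get())      # and the app resumes on its own
    return results
```

Calling `simulate()` walks through the whole story: two successful reads, one failure right after the cache expires, and a recovery once the service is back. Nothing in the calling code changed between the working and failing calls, which is what makes this kind of bug look random from the outside.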

There is almost always a specific cause behind the abnormal behavior. It helps for your app to have a good logger, so that you have clues for tracking down what that cause was. It also helps to have a QA environment that matches production, so that you can try to reproduce the steps yourself. Knowing that production errors are inevitable should encourage us to write good code upfront, so that we take care of the easy errors and these preventable bugs don't distract us from fixing the non-trivial ones.
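
As a rough sketch of what "good logging" can mean in practice, the Python snippet below records the offending input and the stack trace at the point of failure, using the standard `logging` module. The logger name `orders` and the `parse_quantity` function are made-up examples, and a real application would also configure handlers and formatting.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def parse_quantity(raw):
    """Parse a quantity field, logging the offending input before re-raising."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # logger.exception logs at ERROR level and attaches the current traceback,
        # so the log shows both what the input was and where the failure happened.
        logger.exception("could not parse quantity, raw input=%r", raw)
        raise
```

The point of the design is that the exception still propagates, so behavior is unchanged, but the log now carries the rare input (a null, an odd string) that would otherwise be lost by the time someone investigates.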
