Wednesday, July 20, 2011

Ghosts and Time Bombs

Having bugs that are reproducible on your local machine is a luxury. Enterprise production apps are often void of such luxuries. Indeed, often the reason the bug gets past developers, code reviews, QA, UAT, regression, and every other quality control measure is because it is not acting in an obviously  deterministic way. Two common types of such bugs are "Time Bombs" and "Ghosts".

A Time bomb works perfectly, only to explode at some point in the future because it depends on some external dependency that eventually changes. These are usually deterministic, and can be reproduced if you know what you're looking for, but it's very hard to trace the exact cause.  The temptation with the time bomb is that it's working perfectly right now, and everyone is always so busy, so they move on.
Examples of time bombs are:
·         Dependency on the clock – Y2K was the most famous case. Other examples include code that doesn't account for the new year (say it sorts by month, and doesn't realize that January 2012 is greater than December 2011), or storing a total milliseconds as in Int32 (and it overflows after a month).
·         Growing data – Say your system logs to a database table, and it works perfect on local and even QA tests where the database is constantly refreshed. But then in 6 months (after the developers have rolled off the project and no-one even knows about the log) the log file becomes so bloated that performance slows to a crawl and all the connections timeout.
·         Memory leak – Similar to the growing data.
·         Service contract changes or expires – In today's interconnected systems, it is common to have external data dependencies from third parties. What if a business owner or manager forgets to renew these contracts, or the schema of the contract fails, and hence the service constantly fails. Even worse – say you shell out to a third-party tool (with System.Diagnostics, and hide the window so there's no popup) that gives a visual EULA after such an expiration, and all you see if the process appears frozen because it's waiting for the (hidden) EULA?
·         Expiring Cache – What if you store critical data in the cache on startup, but that data eventually expires without any renewal policy and the app crashes without it?
·         Rare events with big impact – What if there's a annual refresh of an external data table? I've seen apps that work perfectly in prod for 8 months, processing some external file, and then unexpectedly explode because they're given an "annual refresh" file that is either too big, or has a different schema.

General ways to test for time bombs:
·         Leave the app running for days on end.
·         Forcibly kill the cache in the middle of running – will it recover?
·         Do load testing on the database tables.
·         Make sure you have archive and cleanup routines.
·         Set the system clock to various values.
·         Test for annual events.

Ghosts are bugs that work perfectly in every environment that you can control, but randomly show up in environments you can't. You can't reproduce them, and you don't have access to the environment, so it's ugly. Ghosts are tempting to ignore because they seem to go away. The problem is that if you can't control the ghost, then you can't control your own application, and that looks really bad to senior management. Examples of ghosts include:

·         Concurrency, threading, and deadlocks – Because 99% of devs test their code as a single user stepping through the debugger, they'll almost never see concurrency issues.
·         Environmental issues – For unknown reasons, the network has hiccups (limited bandwidth, so sometimes you app gets kicked out), or database occasionally runs significantly slower, causing your performance-tuned application to time out.
·         Another process overwrites your data – Enterprise apps are not a closed system – there could be other services, database triggers, or batch jobs randomly interfering with your data.
·         Hardware failures – What if the network is temporarily down, or the load balancer has a routing error (it was manually configured wrong during the last deploy?), or a disk is corrupt?
·         Different OS or Windows updates – Sometimes devs create (and debug) on one OS version, but the app actually runs on another. This is especially common with client apps where you could create it on Windows 7 Professional, but it runs on Windows Vista. Throw in service packs and even windows updates, and there can be a lot of subtle differences.
·         Load balancing – What if you have a web farm with 10 servers, and 9 work perfect, but the last one is broken (or deployed to incorrectly)? The app appears to work perfectly 90% of the time. Realistically, say it's a compound issue where the server only fails 10% of the time, then your app appears to work 99% of the time.
·         Tedious logic with too many inputs – A complex HR application could have hundreds of test cases that work perfectly, but say everyone missed the obscure case that only occurs when a twice-terminated user logs in and tries to change their email address.

General ways to test for ghosts:
·         Increase load and force concurrency (You can easily use a homegrown threading tool to make many web service or database calls at once, forcing a concurrency test).
·         Simulate hardware failures – unplug a network cable or temporarily turn off IIS in your QA environment. Does the app recover?
·         Allow developers some means for QA and Prod debug access –if you can finally reproduce that bug in prod (an nowhere else), the cheapest solution is to allow devs some means to troubleshoot it there. Perhaps they need to sit with a support specialist to use their security access, but find a way.
·         Have tracers and profilers on all servers, especially web and database servers.
·         Have a diagnostic  check for your own app. How do you know your app is healthy? Perhaps a tool that pings every webservice (on each machine in the webfarm), or ensures each stored proc is correctly installed?

No comments:

Post a Comment