Wednesday, July 27, 2011

Why you’re in trouble if you rely on 30-page SOPs

Every organization wants to have its development processes documented into Standard Operating Procedures (SOPs) for the obvious reasons – faster onboarding, standardization, auditing, etc… The Holy Grail is the potential to hire a bunch of contractors (or outsource), tell them to read a novel worth of documentation, and then they’re fully up to speed a week later. SOPs also imply that the team knows what it is doing and has a plan, which is one indicator of a mature organization. SOPs are also a prerequisite for outsourcing, an appealing option for large organizations.
While documentation has its benefits, you can’t rely solely on large documents to communicate process and onboard new people. Here are at least four common problems, and it will result in a frustrated and confused team:
  1. The doc itself will be wrong (or outdated), such as skipping steps or assuming institutional knowledge. This is especially common if you hire outside consultants (with no institutional knowledge of your systems) to document your process.
  2. It’s easier to bluff a doc – a busy tech writer will hurry the doc, thinking it looks done ("I’ve written 50 pages!"), but the content won’t be correct or specific enough.
  3. Screens will vary (example: the software is upgraded or the OS doesn’t match).
  4. People simply won’t read the docs, they’re skim and miss details.
Several ways to communicate an SOP instead of just 30-page MSWord docs:
  • Favor automation over documentation where possible. The best document is an automated script. A script is usually kept up to date (compared to a doc) because developers need the script to work. It’s also much faster (and less error-prone) for a new guy to kick off the script than to tediously step through 20 pages of instructions.
  • Lower the cost of documenting by leveraging a wiki. Developers are far more likely to update or correct a wiki then a big MSWord doc on SharePoint.
  • Favor simplifying the process so the doc itself is simpler.

Monday, July 25, 2011

If you're going to fail, fail big – example with dirty towels

After showering the other day, I needed a towel to dry off.  I saw a bunch of towels in a nearby laundry basket, but wasn't sure if they were dirty laundry going downstairs, or clean laundry coming upstairs. I was too lazy to walk down the hall to get a towel from the closet that I knew would be clean, and the towels in the basket looked clean, so I used the ones from the basket because they were immediately available. But it didn't feel "right". I started feeling bits of lint and loose hair on me, so I did the most thorough test I could find – I phoned my wife and asked her if the towels in the laundry basket were clean or dirty. She informed me that they were "of course" dirty towels used as floor mats.

Like every other aspect of normal life, I see this directly analogous to software coding. The dirty towel is like a defect. If it was obviously dirty, I never would have used it, hence saving myself the grossness of drying myself with a floor mat. Applied to software: Better to have a defect explode in your face so that you're forced to fix it, as opposed to a "half-bug" that continually bites you.

Wednesday, July 20, 2011

Ghosts and Time Bombs

Having bugs that are reproducible on your local machine is a luxury. Enterprise production apps are often void of such luxuries. Indeed, often the reason the bug gets past developers, code reviews, QA, UAT, regression, and every other quality control measure is because it is not acting in an obviously  deterministic way. Two common types of such bugs are "Time Bombs" and "Ghosts".

A Time bomb works perfectly, only to explode at some point in the future because it depends on some external dependency that eventually changes. These are usually deterministic, and can be reproduced if you know what you're looking for, but it's very hard to trace the exact cause.  The temptation with the time bomb is that it's working perfectly right now, and everyone is always so busy, so they move on.
Examples of time bombs are:
·         Dependency on the clock – Y2K was the most famous case. Other examples include code that doesn't account for the new year (say it sorts by month, and doesn't realize that January 2012 is greater than December 2011), or storing a total milliseconds as in Int32 (and it overflows after a month).
·         Growing data – Say your system logs to a database table, and it works perfect on local and even QA tests where the database is constantly refreshed. But then in 6 months (after the developers have rolled off the project and no-one even knows about the log) the log file becomes so bloated that performance slows to a crawl and all the connections timeout.
·         Memory leak – Similar to the growing data.
·         Service contract changes or expires – In today's interconnected systems, it is common to have external data dependencies from third parties. What if a business owner or manager forgets to renew these contracts, or the schema of the contract fails, and hence the service constantly fails. Even worse – say you shell out to a third-party tool (with System.Diagnostics, and hide the window so there's no popup) that gives a visual EULA after such an expiration, and all you see if the process appears frozen because it's waiting for the (hidden) EULA?
·         Expiring Cache – What if you store critical data in the cache on startup, but that data eventually expires without any renewal policy and the app crashes without it?
·         Rare events with big impact – What if there's a annual refresh of an external data table? I've seen apps that work perfectly in prod for 8 months, processing some external file, and then unexpectedly explode because they're given an "annual refresh" file that is either too big, or has a different schema.

General ways to test for time bombs:
·         Leave the app running for days on end.
·         Forcibly kill the cache in the middle of running – will it recover?
·         Do load testing on the database tables.
·         Make sure you have archive and cleanup routines.
·         Set the system clock to various values.
·         Test for annual events.

Ghosts are bugs that work perfectly in every environment that you can control, but randomly show up in environments you can't. You can't reproduce them, and you don't have access to the environment, so it's ugly. Ghosts are tempting to ignore because they seem to go away. The problem is that if you can't control the ghost, then you can't control your own application, and that looks really bad to senior management. Examples of ghosts include:

·         Concurrency, threading, and deadlocks – Because 99% of devs test their code as a single user stepping through the debugger, they'll almost never see concurrency issues.
·         Environmental issues – For unknown reasons, the network has hiccups (limited bandwidth, so sometimes you app gets kicked out), or database occasionally runs significantly slower, causing your performance-tuned application to time out.
·         Another process overwrites your data – Enterprise apps are not a closed system – there could be other services, database triggers, or batch jobs randomly interfering with your data.
·         Hardware failures – What if the network is temporarily down, or the load balancer has a routing error (it was manually configured wrong during the last deploy?), or a disk is corrupt?
·         Different OS or Windows updates – Sometimes devs create (and debug) on one OS version, but the app actually runs on another. This is especially common with client apps where you could create it on Windows 7 Professional, but it runs on Windows Vista. Throw in service packs and even windows updates, and there can be a lot of subtle differences.
·         Load balancing – What if you have a web farm with 10 servers, and 9 work perfect, but the last one is broken (or deployed to incorrectly)? The app appears to work perfectly 90% of the time. Realistically, say it's a compound issue where the server only fails 10% of the time, then your app appears to work 99% of the time.
·         Tedious logic with too many inputs – A complex HR application could have hundreds of test cases that work perfectly, but say everyone missed the obscure case that only occurs when a twice-terminated user logs in and tries to change their email address.

General ways to test for ghosts:
·         Increase load and force concurrency (You can easily use a homegrown threading tool to make many web service or database calls at once, forcing a concurrency test).
·         Simulate hardware failures – unplug a network cable or temporarily turn off IIS in your QA environment. Does the app recover?
·         Allow developers some means for QA and Prod debug access –if you can finally reproduce that bug in prod (an nowhere else), the cheapest solution is to allow devs some means to troubleshoot it there. Perhaps they need to sit with a support specialist to use their security access, but find a way.
·         Have tracers and profilers on all servers, especially web and database servers.
·         Have a diagnostic  check for your own app. How do you know your app is healthy? Perhaps a tool that pings every webservice (on each machine in the webfarm), or ensures each stored proc is correctly installed?

Monday, July 18, 2011

The exponential learning curve from studying off-hours

You work a full hard day, so why bother "studying" off-hours? Because, it has an exponential reward. The trick is to differentiate between daily "grunt" work that just takes time without improving you as a developer, vs. "learning" work, such as experimenting with new technology or patterns or tools.

I've seen endless resumes where the candidate says "I have 5 years experience in technology X", but it's really 1 year of experience repeated 5 times. They've sadly spent their career doing repetitious work, and have nothing new to show for it.

Here's a simplistic case: say you spend 9 hours a day ("45" is the new "40") at work, but 8 hours of that is grunt work – data access plumbing, fix a logic bug, attend yet another meeting where you just sit through it, fill out a timesheet – and you've snuck away only 1 hour to research some new data-performance prototype, that means about 90% of you day is grunt work, and about 10% is advancing your career. If you did 1 hour self-study at night, it's not that you go from 9 hours to 10 hours, but rather from 1 hour to 2 hours, i.e. that extra hour is "cool" stuff for self-study, so it gives your "self-study" a  100% return.

Because I love Excel, here's a chart. "Work hours" is split between "Grunt hours" and "self-study hours". Hence an extra hour or two at night could double your learning curve.
I realize this is very black and white, and live is grey (for example, our jobs aren't neatly divided into "grunt " and "self-study", and 1 hour of self-study may be split into a dozen 5-minute Google queries throughout the day), but the general idea still holds.

Work hours
Grunt hours
Self-study hours

Related to this is the Upward Spiral – that the extra hour of overtime helps you learn a technique that makes your day job much faster. For example, say you spend 4 hours a day writing plumbing code to filter C# objects, but then you study LINQ during the evenings and now can do a 4-hour job in 30 minutes. That theoretically gives you a "surplus" of 3.5 hours. Of course that time gets snatched up by other things, but in general it's reasonable to reinvest part of that time in yet further self-study, i.e.  compound interest for your development career.

Friday, July 15, 2011

Presentation: An Introduction to Practical Unit Testing

I had the opportunity to speak at the LCNUG yesterday about one of my favorite topics - unit testing.

For convenience, here are some of the links from the PowerPoint:

Database "Unit Testing"

Tuesday, July 12, 2011

Upcoming talk: July 14 - An introduction to practical unit testing

[UPDATE] - slides at:

I'll be presenting at the Lake County .Net Users Group this Thursday, July 14, on An Introduction to Practical Unit Testing.
Unit testing is one of those buzzwords that every developer hears about, but relatively few projects actually do in a way that adds value. Many developers view unit tests as some "tax" imposed by management. This session will show how to start using unit tests to add immediate value, as well as dispel several common myths and abuses of unit testing. It will explain how unit tests fit with other types of automated tests (integration, functional, UI, performance), as well as how unit tests are but one technique in a developers' tool belt to craft better code.
This is intended as a basic 101-level session.