timstall: 2009

Sunday, December 27, 2009

Estimating database table sizes using SP_SpaceUsed

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/estimating_database_table_sizes_using_sp_spaceused.htm]

One of Steve McConnell's tips from his great book on estimating (Software Estimation: Demystifying the Black Art) is that you should not estimate that which you can easily count. Estimating database table sizes is a great example of this. Sure, on one hand disk space is relatively cheap; on the other hand you want to know at least a ballpark estimate of how much space your app will need - will database size explode and no longer fit on your existing SAN?

Here's a general strategy to estimate table size:

1. Determine the general schema for the table

Note the column datatypes that could be huge (like varchar(2000) for notes, or xml, or blob)

2. Find out how many rows you expect the table to contain

Is the table extending an existing table, and therefore proportional to it? For example, do you have an existing "Employee" table with 100,000 records, and you're creating a new "Employee_Reviews" table where each employee has a 2-3 reviews (and hence you're expecting 200,000 - 300,000 records)? If the table is completely new, then perhaps you can guess the rowcount based on expectations from the business sponsors.

If the table has only a few rows (perhaps less than 10,000 - but this depends), the size is probably negligible, and you don't need to worry about it.

3. Write a SQL script that creates and populates the table.

You can easily write a SQL script to create a new table (and add its appropriate indexes), and then use a WHILE loop to insert 100,000 rows. This can be done on a local instance of SQL Server. Note that you're not inserting the total number of rows you estimated - i.e. if you estimated that table will contain 10M rows, you don't need to insert 10M rows - rather you'll want a "unit size", which you can then multiple by however many rows you expect. (Indeed, you don't want to wait for 10M rows to be inserted, and your test machine may not even have enough space for that much test data).

For variable data (like strings), use average sized data. For null columns, populate them based on how likely you think they're be used, but err on the side of more space.

Obviously, save your script for later.

4. Run SP_SPACEUSED

SP_SpaceUsed displays how much data a table is using. It shows results for both the data, as well as the indexes (never forget the index space).

You can run it as simply as:

exec SP_SPACEUSED 'TableTest1'

Now you can get a unit-size per row. For example, if the table has 3000KB for data, and 1500KB for indexes, and you inserted a 100K rows, then the average size per row is: (3000KB + 1500KB) / 100,000. Then, multiple that by however many rows you expect.

This may seem like a lot of work, and there are certainly ways to theoretically predict it by plugging into a formula. My concern is that it's too easy for devs to miscalculate the formula (like forgetting the indexes, not accounting the initial table schema itself, or just all the extra steps)

5. Estimate the expected growth

Knowing the initial size is great, but you also must be prepared for growth. We can make educated guesses based on the driving factors of the table size (maybe new customers, a vendor data feed, or user activity), and we can then estimate the size based on historical data or the business's expectations. For example, if the table is based on new customers, and the sales team expects 10% growth, then prepare for 10% growth. Of if the table is based on a vendor data feed, and historically the feed has 13% new records every year, then prepare for 13% growth.

Depending on your company's SAN and DBA strategy, be prepared to have your initial estimate at least include enough space for the first year of growth.

6. Add a safety factor

There will be new columns, new lookup and helper tables, a burst of additional rows, maybe an extra index - something that increases the size. So, always add a safety factor.

7. Prepare for an archival strategy

Some data sources (such as verbose log records) are prone to become huge. Therefore, always have a plan for archival - even if it's that you can't archive (such as it's a transactional table and the business requires regular transactions on historical data). However, sometimes you get lucky; perhaps the business requirements say that based on the type of data, you only legally need to carry 4 years worth of data. Or, perhaps after the first 2 years, the data can be archived in a data warehouse, and then you don't worry about it anymore (this just passes the problem to someone else).

Summary

Here's a sample T-SQL script to create the table and index, insert data, and then call SP_SpaceUsed:

USE [MyTest]
GO

if exists (select 1 from sys.indexes where [name] = 'IX_TableTest1')
    drop index TableTest1.IX_TableTest1

if exists (select 1 from sys.tables where [name] = 'TableTest1')
    drop table TableTest1

--=========================================
--Custom SQL table
CREATE TABLE [dbo].[TableTest1](
    [SomeId] [int] IDENTITY(100000,1) NOT NULL,
    [phone] [bigint] NOT NULL,
    [SomeDate] [datetime] NOT NULL,
    [LastModDate] [datetime] NOT NULL
) ON [PRIMARY]

--Index
CREATE UNIQUE NONCLUSTERED INDEX [IX_TableTest1] ON [TableTest1]
(
    [SomeId] ASC,
    [phone] ASC
) ON [PRIMARY]
--=========================================

--do inserts

declare @max_rows int
select @max_rows = 1000

declare @i as int
select @i = 1

WHILE (@i <= @max_rows)
BEGIN
    --=============
    --Custom SQL Insert (note: use identity value for uniqueness)
    insert into TableTest1 (phone, SomeDate, LastModDate)
    select 6301112222, getDate(), getDate()
    --=============

    select @i = @i + 1

END

--Get sizes
exec SP_SPACEUSED 'TableTest1'

Sunday, December 13, 2009

How not to estimate

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/how_not_to_estimate.htm]

Being responsible for the end-to-end solution really makes me think about how to estimate. The flipside is that it also makes me think about how not to estimate:

Wild guess - sometimes this seems like your only option, but most of the time there is ways to improve on the guess (base it on similar projects, or split it into components and estimate those individually).
What you think the boss wants to hear - This may seem like the easy way to initially win favor with the boss, but it will come back with a vengeance when the estimate drastically deviates from reality. Also, because the boss always wants to hear lower schedule times, this has a huge bias that will send you off course.
Base it on unrelated projects - Historical data is great, but don't compare apples to oranges. That a WinForm app took 3 months tells you almost nothing about how long an ASP.Net app will take.
Pick an arbitrary big number - During crunch time, it's easy to think that everything will magically be better "next week" or "next month" ("that will give us enough time to fix everything), but then that time rolls around and the project is still behind schedule.

All of these are bad estimation methods because they miss the fundamental point - how long will the project really take to build in the real world? Wild guesses or the boss's wishes are not necessarily grounded in reality, so basing estimates on them is barking up the wrong tree.

I realize it's easy to say "how not to do something". I'd recommend Steve McConnell's book, Software Estimation: Demystifying the Black Art, for how to do a great job of estimating.

Friday, November 27, 2009

BOOK: 97 things every software architect should know

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/book_97_things_every_software_architect_should_know.htm]

A few months back I finished the book 97 Things Every Software Architect Should Know. I've been slow on blogging, so I hadn't gotten around to my standard follow-up post for books I read.

The books consists of 97 short essays, by various accomplished architects, each with a quick insight. It was a casual and fun read, the kind of book that's easy to sneak in a few pages between changing the kids diapers and fixing the house. It's got a lot of good points. I especially recall an essay about how "the database is your fortress" - GUI and Middle Tier apps change, but you will always have your database.

For hard-core architecture, I thought that Microsoft .NET: Architecting Applications for the Enterprise and Patterns of Enterprise Application Architecture were much more systematic and thorough. But overall, it was still a fun read.

Thursday, October 8, 2009

Can you still be technical if you don't code?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/can_you_still_be_technical_if_you_dont_code.htm]

Can you still be technical if you don't code? A lot of developers have a passion for the technology, do a great job in their current role of implementing solutions (which requires coding), and then get "promoted" into some "big picture" role that no longer does implementation - ironically the thing they did so well at. These higher-level roles often do lots of non-trivial things, but not actual coding. For example:

Infrastructure (Servers, SAN space, database access, network access)
Design decisions
Dealing with legacy code
Handling outsourcing, insourcing, consultants
Build-vs-buy
Vendor evaluations/score card; integrate the vendor's product into your own
Coordinate large-scale integration of many apps from different environments
Coordination among multiple product life cycles
Writing guidance docs
Code reviews
Occasional prototypes
Configuration

On one hand, these types of tasks require technical knowledge in that you wouldn't expect a non-technical person to perform them. On the other hand, they don't seem in the same category as hands-on coding.

What do you think - can you be technical (or remain technical) without actually writing code?

Monday, October 5, 2009

Reasons to NOT put version history in the comments

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/reasons_to_not_put_version_history_in_the_comments.htm]

Some coding standards ask that developers add revision history to the top of the method. Not just the normal "Summary" and "Parameter" tags that can be used to automatically create documentation, but rather a full blown revision log with developer name, date, and comment. This was bigger in the 80's and 90's, when source control and refactoring weren't as common. However, in these days, putting revision history in your comments has some really big problems:

Extra effort - It requires extra effort from developers, and usually the tedious kind of effort. It requires manual developer discipline, so architects don't have a good way to enforce this.
Bad for refactoring - It discourages refactoring of methods. Say you split a method in two - how do you split up the commented version history?
Source control already provides this - It doesn't tell you anything that source control history won't already give you. But even worse, it could be misleading - source control is the true authority, a developer could accidentally type (or forget) the wrong comments. Also, by documenting at the top of the method, it is hard to indicate what changed in the middle (whereas source control diff would instantly tell you).

Perhaps for SQL, I can see the benefit, so that when the DBA runs sp_helptext on a stored proc, they get a quick history (and databases aren't usually refactored like C# code). However, for middle tier code, putting revision history in comments seems like an unwise use of time.

LINK: Comments as version control

Thursday, October 1, 2009

What makes something Enterprise?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_makes_something_enterprise.htm]

There's a world of difference between a prototype hammered out over a weekend, and an enterprise app ready for the harsh world of production. Here's a somewhat random brainstorm. In general (there's always an exception), Enterprise apps:

Are scalable - they handle large loads and can be called many times.
Have a retry strategy - for example, it tries pinging the external service three times before "failing".
Have a failover strategy, like an active-passive machine cluster for maximum uptime, and a disaster recovery site.
Send notifications.
Handles invalid data (like states, zip codes, and numbers).
Can integrate with other systems (perhaps providing web service wrappers, or command line APIs, or publicly accessible data repositories that other apps can modify) .
Are deployable - "it works on my machine" absolutely does not cut it.
Have Logging - this is especially useful for debugging in production, or measuring how many errors (and which types) are thrown.
Have long-running process (hours, days, or even weeks) - not just a single thread in memory.
Have async processes, which usually means concurrency and threading problems.
Support multiple instances of the app running. You can open two copies of word, or run two MSBuild scripts at the same time.
Handle product versioning.
Care about the hardware it's running on (enough CPU and memory).
Have a pluggable architecture - You may need to switch data providers (Oracle/SQL/Xml).
Have external data sources (web services, ftp file dumps, external databases).
Can scale out, such as adding more web servers, or splitting the database across multiple servers.
Have security constraints (both hacking, and functional).
Have process that are documented (not just for training, but also for legal auditing and compliance issues).

Much of this code isn't the fun, "glamorous" stuff. However, it's this kind of robustness that separates the "toys" from the enterprise workhorses.

Tuesday, September 29, 2009

A quick overview of enterprise object caching

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/a_quick_overview_of_enterprise_object_caching.htm]

Caching is a performance mechanism where you store an expensive-to-create value for future consumption. For example, you may cache dropdown values to spare yourself expensive database hits. Caching is one of those buzzwords that everyone knows your application should have, but surprisingly few really do.

Books have been written on caching, so this is just a quick brain dump based on my personal experiences. Also note that I refer to "the database" a lot because it's the main data dependency that most developers can relate to, but really, it could be anything (web service, external file, etc...).

Frontend and Backend Caching

	Frontend UI	Backend
Where is it located?	Local (in process)	Remote (external machines)
Pro	Faster because it's local - no remote hit	Handles updating stale data in distributed systems Handles any CLR serializable object, independent of the UI layer.
Con	Does not handle updating values - data may be stale Limited to just HttpContext.	Slower because it's remote - you still need to pay for the remote hit.
Example	Asp.Net HttpContext.Cache	Memcache, Velocity, others...

Obvious follow-up question: "Would you double cache something, taking data from the backend cache and saving it to the fronend cache?" Sure, if it benefited your specific scenario. Ideally the backend cache is totally encapsulated anyway, so your frontend UI developer wouldn't even know if the data they're working with came from a cache or not.

What is a good candidate for caching?

Any data that:

Takes a lot of time to create, either due to a remote hit (to a database or web service), or a large calculation time (like querying a million rows).
Has a small final result - you query a million rows just to return a single value.
Does not change - the problem with caching is stale data.
Has minimal dependencies. If your object touches 10 tables for creation, then there's a much greater chance of it becoming stale. This is one benefit of loosely coupled (and then batched) objects, instead of spaghetti code.
Has many reads, but very few writes. For example, system-level data (that everyone constantly requests) is good for caching, but employee-level data may not be.
Is retrieved externally and requires high uptime - for example you make a web service call to get some value, you call the service again 5 minutes later, the service is down, and you really wish you had even a "stale" copy of that data.

The canonical example would be something like city-state dropdowns. Say it's initially a remote database hit to some "City/State" tables, it returns a small amount of data, it changes infrequently (The US has had 50 states for the last half-century), so it's read many times but not prone to stale data.

What is a bad candidate for caching? Pretty much, the opposite of the good criteria.

Pitfalls with caching

Merely creating the dictionary isn't the problem. The problem will be integrating it (seamlessly) into your data persistence layer, and then ensuring the cache doesn't become stale - especially across a distributed environment - and then making sure it actually improves your performance instead of degrading it.

Stale data is the bane of caching. There are at least a few ways to deal with stale data:

Apply a time-out policy such that all data expires after N minutes, but that won't be acceptable for most scenarios.
If your app is changing the data (such as updating a details page), then have the DAL method (that calls the update SQL) also ping the cache to clear the stale object. This requires that your application somehow keeps track of which objects depend on which piece of data. It works great with a domain model, which requires more design upfront, but can have huge payoffs.
If someone else is directly changing the database, such as a DBA running ad-hoc SQL in production, then consider providing some admin console that lets them clear segments of the cache. For example, if the DBA did a mass-update of all salaries, then have the admin page allow you to flush all salary-related objects out of the cache. This requires some infrastructure for tagging each cached object (perhaps a master config file), and discipline from the DBA to check that admin page.
Beware of "database cache dependencies" (ASP.Net 2.0?), that claim to let you apply triggers to the database table, and then automatically clear the appropriate cache items when a specific row/column is updated. I've personally never gotten this to successfully work, have heard lots of horror stories (especially when deploying it across a DMZ), and it forces the domain design into the database instead of the middle tier. Although I'm all ears to anyone who's had a success story here.

Some other things to keep in mind:

Where should my cache tie in? Ideally, you'd want the backend cache abstracted via your dataAccess layer. This becomes very feasible with CodeGeneration. Whether data is pulled from cache or the live database is just a tuning option. Just like you don't want to put in-line SQL throughout your codebehind pages, you probably don't want to tightly-couple all your UI code to your cache provider. The frontend cache can be referenced in your UI, but again, be aware of too much plumbing code that "designs you into a corner".
Only temporary - A cache is not a persistent data store, it is not merely a "mirror" to scale out your database, as you must account for the cache being cleared. A data access method should always have a means to recreate the cached data.
Why use a remote cache? If both the cache and database servers both have remote hits, what's the difference? In its simplest form, this helps scale out the database (which is usually a bottleneck). Every hit on the cache is one less hit on the already-overloaded database. Even after the remote hit, the cache can have a much faster lookup time because while the cache stores a created object, the database may still need to query 1 million rows to collect the data.
Control Panel - You'll want to provide an easy way to flush the entire cache, or even segments of the cache, without restarting any servers. It's also great moral support to have a statistics page showing how many thousand (million) database calls have been spared.
Configuration - You'll want to provide a way to configure almost everything: the cache durations, what category of duration (short, medium, long), which objects get cached, and perhaps even turn off the entire cache for emergency troubleshooting. Caching is something that you want to tune based on actual production results. Ideally control of what gets cached is all abstracted to a few easy-to-manage config files. You do not want these config values hard-coded throughout your app.
Beware of over-caching - If done the wrong way, caching can actually screw your performance. Say you cache something that is continually obsolete, so instead of just doing the normal database hit, you continually also need to do the extra cache query.

Good things to read

There's an endless list of info on caching. Here are some that could be useful.

The Pragmatic Programmers have a whole book on caching with memcached! (I've personally used Memcached before, and liked it.)
MSDN: Caching Architecture Guide for .NET Framework Applications
Enterprise Library Caching Application Block

Yes, there's ton more that can be said about caching. Again, this is just a quick brain dump.

Tuesday, September 22, 2009

ConnectionTimeout vs. CommandTimeout

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/connectiontimeout_vs_commandtimeout.htm]

SQL timeouts can be very annoying, especially for internal development tools where performance isn't critical.

A lot of developers will try fixing this by modifying the connection string by appending "Connect Timeout=300". Normally this is easy because the connection string is stored in some config file.

However, it still usually fails. This is because there's a big difference between SqlConnection.ConnectionTimeout and SqlCommand.CommandTimeout.

If you're running a command, like a snippet of SQL or a stored proc, then your code needs to set the CommandTimeout. Something like so:

SqlConnection con = new SqlConnection(strDbCon);
SqlCommand cmd = con.CreateCommand();
cmd.CommandType = CommandType.Text;
cmd.CommandText = strText;
cmd.CommandTimeout = 10000; //no relation to con.ConnectionTimeout

Obviously, that's very minimalist code, but that's the general idea. For a robust data access layer, you'd make the timeout configurable.

Sunday, September 20, 2009

I don't have time to put on the parachute

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/i_dont_have_time_to_put_on_the_parachute.htm]

If you're jumping out of a plane at 5000 feet - you take the time to put on the parachute. Sure, you could save yourself a few seconds, and for the first 1000 feet having that parachute doesn't matter yet when you're still free-falling ("I saved schedule time by skipping the 'put-on-parachute' task!"). However, by the time you hit the ground, that parachute is life-saving.

Same thing with software projects and best practices. The project is like jumping out of a plane, and the best practices are like the parachute. Sure, some "best practices" are just useless marketing buzzwords. However, others - like unit testing - are the real deal. And a developer or manager who rejects unit testing (for new .Net code in the middle tier) because they "don't have time" is like jumping out of the plane without putting on your parachute. You save a bit of time upfront, but when the project goes to production and "crashes" into reality, the maintenance costs and untested boundary cases will kill it.

"I don't have time" sounds much more noble and business-like than "I don't understand that idea", or "I just don't feel like doing something new." But within the (very common) context of new .Net class-library development - given that testing frameworks are free (NUnit, MSTest), and that any developer can at least write their tests locally - regardless of management support, and that in certain scenarios it's actually faster to develop code by writing unit tests (because it stubs out the context), the "I don't have time" reply to unit testing certain scenarios is misguided. Sure there are always exceptions, and reasons that good devs may not write unit tests. Testing databases, or legacy code, or the UI, or integration is a different beast. However, in general, testing public methods in a C# class library, without external data dependencies, should be as reasonable as putting a parachute on while jumping out of a plane.

Wednesday, September 16, 2009

LCNUG - Sept 24 - C# 4.0

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/lcnug__sept_24__c_40.htm]

Mike Stall (MSFT) will present on the new features in C# 4.0, including named and optional parameters, dynamic support, scripting, office interop and No-PIA (Primary-Interop-Assemblies) support.

This is a great event for those interested in C# 4.0.

Tuesday, August 25, 2009

What is a number?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_is_a_number.htm]

Having an application collect a number from the user seems simple. Just whip out a textbox, maybe add an "integer validator", and ta da! Sounds great, but endless things can go wrong (things can always go wrong, like with states, zip codes, or even labels). I'm certainly not saying that every textbox in every line of business app should handle all of this, but here are some things to be aware of:

Do you allow for special characters, like dollar signs "$", percents "%", or financial negative numbers "( )"? [Seriously, try formatting a negative dollar amount in Excel]
Do you allow commas?
Do you allow decimals?
Do you handle globalization? Some countries use the "," for a decimal point, and the "." for a comma (the inverse of what America does).
Do you allow both "-" (for negative) and "+" for positive, or is the "+" just implied by default?
Do you allow pre and post zeros: "001.00". Keep in mind that the zeros on the left are mathematically redundant, but the zeros on the right may count as "significant digits", and denote a degree of precision. The extra zeros in "1.25000" implies a much more precise number than just "1.25". However, most numeric types in .programming languages will truncate the trailing zeros.
Do you allow exponents for large numbers, like "1E+12" (that's 1 with 12 zeros after it: "1,000,000,000,000")?
Do you allow abbreviations, like "10M" for "10,000,000"? (FWIW, I'm not sure of any app that actually does this.)
Do you have a clearly defined range? For example, a smallint (Int16) stores much smaller numbers than a long (Int64). Most business values can be stored with a double. However, if you're extending a legacy system that only provides a byte or smallint, but it expects occasionally huge values, then you could get into trouble.
Is your validation lenient? For example, if the user enters multiple commas "23,5,6" - do you simply remove all the commas and have "2346", or is that an error?
How do you handle nulls? Do you use a sentinel value (like Int32.MinValue), or use the new Nullable data types, or something else? If using a nullable value, does it "convert up" - i.e. Int32.MinValue is null if the type if Int32, but it would be a valid non-null value if the type was Int64. If you have multiple systems reading the same value, and one system uses Int32, and the other uses Int64, will your system still work (this can happen when you start doing things like letting the user dynamically create their own pages or reports).
Not to be smart-alecky about it, but I'd just assume that unless otherwise told, this accepts base-10 inputs. I.e. "A5" and "FF" are invalid values - although there may be apps where that is legit (like specify a color in HTML for some blog or content-management software).

Also, does your system handle transformations? That's where you collect the input, store the raw data as something else, and display a newly formatted value. Perhaps the most common example is with percents. A user may type "50" into a textbox (with the "%" in the label next to it), you may save it in the database as "0.5", and you may format it back as "50%" on a report. Or a user may enter "+0" in a textbox, and you simple store and display it as "0".

So, some numerical inputs to test with (besides the obvious invalid numeric inputs, like "xyz") :

-23,456.000
-23.456,00 //Globalization
$45,345.00
$(4.00)
The max/min values for each of the numeric types.
Fill the entire textbox with 9's.
000000000
+23
23 //try the simple case
-0
+0
1.25E+12
1.25 E + 12 //note the spaces

The things to look for are (1) will these prompt validation messages, (2) how will each value be stored in the database, and (3) how will the value be re-formatted when it's displayed.

Perhaps the best standard is to see how Excel handles it, as Excel is probably the most famous application in the world that handles numbers.

Thursday, August 20, 2009

10 tips to integrate CodeSmith into your processes

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/10_tips_to_integrate_codesmith_into_your_processes.htm]

Say you've theoretically seen why code generation is so profitable, so you've downloaded a free trial of CodeSmith, and banged out a few templates. In other words, you've got code generation working on a single developer machine. That's great, but it's even better to have it adopted by the entire department. Here are some practical tips on how to integrate CodeSmith into your processes.

Aim for active regeneration - There are two kinds of generation, Active and Passive. Active means that the code is actively regenerated on a regular basis. Passive means it was generated just once, and then modified manually thereafter. The problem with Passive generation is that it lets developers create tons of code upfront, but then people get trigger-happy and use the generator to produce even more code, and you're stuck now maintaining it all. It's like a trip with no return ticket. It also misses out on many of the other benefits of codeGen - like mass updating code with some new change.
Always have a batch script - Yes, people can integrate into VS, or use the CodeSmith IDE. But to enforce uniformity, ensure that the right properties are passed in, and hook into your Continuous Integration (CI) build, you'll need a batch.
Run the codeGen from your CI build - This enforces active regeneration.
Consider not checking generated scripts into source control - This prevents synchronization errors between local developers and the build server. Yes, all code should ultimately be checked into source control, which is why we still check the templates themselves in, from which you can deterministically recreate all the target code. Your automated checkout script, which gets the latest from source control, can then run the templates and recreate the target code. NOTE - this only works if you're not using merge regions (which mix generated and custom code in the same file). If you use merge regions, then you need to check in the generated files.
Avoid merge reasons where possible - CodeSmith has this powerful feature called "merge regions" which lets you mix both generated and custom code. Sometimes you need this, but if you have a choice - always opt to put generated code in its own, dedicated file. This prevents synchronization issues, is less likely to break, is easier to handle overwriting files in active generation, and is easier for most developers to understand and maintain.
Ensure that code generation can be run on every developer's machine - Because you'll want to actively re-generate the code, you'll need each developer to be able to run those generation batch scripts locally. That means each developer will need a license for CodeSmith. This is absolutely not the place to be stingy. If developers cannot simply make a change and have the code re-generated, they will revolt against using code generation.
Clearly identify the generated files - Make sure an average developer can quickly identify that a given file is code-generated. You could name the file with a "*.CodeGen.cs" extension, put a comment disclaimer at the top ("//This file is code-generated. Any changes will be overwritten"), and not check the target code into source control so that it doesn't have any overlay icons (like what SVN offers).
Know your overwrite strategy - If the target file already exists (because you're actively re-generating), make sure you know the expected behavior. If you don't use merge regions, you can simply overwrite the file. Source control should be smart enough to see that the file has the exact same content, and hence it shouldn't be a burden. Worst case, you can have your generator, before it writes out the generated context, detect if the target is the same or not, and handle appropriately (not write anything, have your build server throw a synchronization error if they're different, etc...)
Don't output the DateTime or user info into the code - When someone first uses a code generator, it can be tempting to add as much "free" details into the target file - like displaying "//This code was generated by Homer, on August 20,2009 at 11:34 pm". You'd never maintain that by hand, so it initially looks cool to see all that crisp information in your file. However, the problem is that every time the code is regenerated, that kind of information changes, and the code continually appears to be updated. Furthermore, such comments don't give you anything that you couldn't already get from source control.
Have a backup developer - Make sure that at least one other developer on the team can use the generating tool.

Monday, August 17, 2009

When should you use Code Generation?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/when_should_you_use_code_generation.htm]

Like everything, using code generation is a tradeoff. Basically, you want codeGen when the cost of writing the templates is less than the costs of writing what the template generates. For example, codeGen rocks at creating and maintaining tedious plumbing code - data access plumbing is the canonical example.

You should probably consider code generation when the target code:

Has similar patterns and deterministic rules, and little deviation from those rules. If there's an MS word doc or wiki page providing detailed instructions on how to write certain code, you may be able to send those instructions to the code generator instead.
Is large and brittle, and it can be described easily (i.e. 1 line in an xml config fiile to describe = 20 lines of C# and SQL coding).
Requires syncing with external data sources (like ADO.Net plumbing based on the database schema, such as var type, size, parameter order, count).
Requires multiple files to be kept in sync (like a SQL stored proc being in sync with C# ADO.Net wrapper).
Requires in-depth knowledge of a problem domain (like ADO.Net for data-access plumbing).
Is continually updated (like adding new classes to your DAL).
May need to be expanded in the future (like adding a whole new layer of webservices, or Audit triggers, to your DAL. Or, perhaps even migrating to a new language).

Code Generation is not a "Golden Hammer" - while it's great, it's not the perfect solution for everything. codeGen may not be the best solution if the target code:

Is very custom with no general pattern. If you can't abstract a pattern out of the target code, then you won't be able to write a generation template.
Is too small and trivial - In general, codeGen should decrease your total lines of code. So if you're writing a 50 line template to produce a single 30 line C# file, it's probably a bad ROI.
Can be refactored away, and doesn't need to exist in the first place.

Like everything else, in some contexts, it just isn't profitable - but in other contexts it's awesome. Some problems are cheaper to solve using other techniques, like unit testing, automation, open-source, DSLs, or other techniques; but every advanced developer should have code generation in their tool belt.

Monday, August 10, 2009

How Zip Codes can get complicated

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/how_zip_codes_can_get_complicated.htm]

I mentioned in a previous post that while states (in an address) seem simple - indeed most developers have made a dropdown to "pick your state" - in legacy apps they can quickly get complicated. Same thing applies to zip codes. It sounds like a secondary afterthought - "Oh, just add a field to the application so we can store numbers like 20500." However, it can quickly snowball:

Do you store the 4 digits at the end, like "20500-0003".
If you remove the spaces and dashes, you're left with just numbers - which seems easier to store and search on. So do you store it as an integer (205000003)? This might work if you're only looking at cities in the midwest or west coast, but some east coast states use zip codes that start with a "0", which would get truncated if stored as a number. Personally, I prefer to store them as a varchar, and then have a UI validation (for new) and backend scrubbing process (for existing) to standardize the format in the database.
Do you enforce valid zip codes only? Not every 9-digit combination of numbers is a valid zip code - i.e. there are not 1 trillion distinct codes that actually are used.
Many applications assume that one zip code belongs to one state - but there are scenarios where a single zip code is shared by multiple states (seriously).
Do you allow your zip code field to store international postal codes? Many US applications start off small, and only worry about the US market. Then some business sponsor says "we're missing out on the Country XYZ market, quick, update the app to handle foreign cities". This often causes changes to an address screen, and the quickest way to change it may be providing an "Out-Of-Country" option for the state dropdown, and allow the zip code to store international postal codes.
And do you handle all this in the UI with a rich control, or just use a "flexible" 10-character textbox?

Thursday, August 6, 2009

What can go wrong with changing a text label?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_can_go_wrong_with_changing_a_text_label.htm]

What could go wrong if the business sponsor just wants to change some little text on a label? It sounds like the simplest thing. And while it should be a simple change, it can sometimes get complicated (i.e. expensive).

Globalization - If the app needs to support multiple languages, then the new text needs to be translated into all those target languages.
Special Characters - The new text could support special characters that need to be encoded (like & or < or >).
Non-supported characters - The new label introduces a special character not in the original character set (such as some foreign language), and the application only supports basic ASCII.
Changes flow layout - The new label is longer or shorter, which changes the flow layout. For example, the new text is long enough to force wrapping (which pushes a row to high) or it doesn't wrap so it pushes a column to wide.
It's an image - Perhaps the label is actually an image (for fancier looking designs), not just text.
Text is not determined in the presentation layer - Perhaps the label is set dynamically through code or configs, such as pulling from a meta-data dictionary.
Text was dynamically built - Perhaps the label was dynamically concatenated via some existing logic (for example, it pluralizes the text if some count > 1), so you're not just setting a literal string anymore.
You don't own the text - Perhaps the label comes from an embedded, third-party component that you don't own and cannot easily change.
The label appears in multiple places - Perhaps the label appears in multiple places, and so all places need to be updated. For example, it also appears in a UI Report writer where you select the UI-friendly name instead of the database schema column/table.
(Bad design) - other code depends on the label text - Perhaps the application has bad design, and there's actually code that reads the value of the label to determine if it should do some other action. For example, if the label contains some word X, then hide section Y (as opposed to whatever method sets label to word X also hides section Y).
New font - Perhaps the new text actually introduces a new font that isn't available on the client (this isn't merely changing text, but it's associated with that kind of request).

Wednesday, July 22, 2009

Thinking that you can learn it all

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/thinking_that_you_can_learn_it_all.htm]

I think .Net is too huge for one person to learn it all, and it's just getting bigger - like a galaxy. However, sometimes an optimistic developer may get the temporary delusion that they can learn it all, or at least the parts that matter. How could someone become so optimistic?

Unexpected free time.
Something came easier than expected.
You're on a prestigious fast track project.
A real good teacher explained something very well (and quickly).
You're kidding yourself - you're just skimming, or only looking at buzzwords, not really digging into the tech.

Basically, if things are temporarily going well (i.e. you're absorbing new concepts really fast), it may be tempting to think that "ah ha, this learning thing has finally clicked, and it will always go fast from now on!" Oh, how I wish...

Sunday, July 19, 2009

Would you still write unit tests even if you couldn?t automatically re-run them tomorrow?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/would_you_still_write_unit_tests_even_if_you_couldnt_automa.htm]

I am constantly amazed at how difficult it is to encourage software engineering teams to adopt unit testing. Everyone knows (wink wink) that you should test your own code, and we all love automation, and all the experts are pushing for it, and we all know how expensive bug fixes are, etc... Yet, there are still many experienced and good-hearted developers who simply don't write unit tests.

I think a critical question may be "Would you still write unit tests even if you couldn’t automatically re-run them tomorrow?"

Here's why - most managers who push unit tests do so saying something like "Yeah, it's a lot of extra work to write all that testing code right now, but you'll sure be glad in a month when you can automatically re-run them. Oh, and by the way, you can't go home today until you fix these three production issues."

The problem is this demotes unit testing to yet another "invest now; reward later" methodology. This is a crowded field, so it's easy to ignore a new-comer like "unit testing". Obviously, most devs live in the here and now, and they'll just trying to survive today, so they care much more about "invest now, reward now".

The "trick" with unit testing - at least with basic unit testing to at least get your foot in the door - is that it adds immediate value today. Even if you can't automate those tests tomorrow, it can often still help get the current code done faster and better. How is this possible?

Faster to develop - Unit testing is faster to developer because it stubs out the context. Say you have some static method buried deep within your web application. If it takes you 5 minutes to set up the data, recompile the host app, navigate to the page, and do whatever action triggers your method being called - that's a huge lag time. If you could write a unit test that directly calls that method, such that you can bypass all that rigmarole and run the static method in 5 seconds - and now you need to test 10 different boundary cases - you've just saved yourself a good chunk of time.
Think through your own code - Unit testing forces you to dog food your own code (especially for class-library APIs). It also force you to think out boundary conditions - per the previous question, if it takes several minutes to test one usage of a function, and that function has many different boundary cases, a time-pressed developer simply won't test all the cases.
Better design - Testable code encourages a more modular design that is more flexible to change, and easier to debug. Think of it like this: in order to write the unit test, you've got to be able to call the code from a context-free, class library; i.e. if a unit test can call it, then so could a windows service, web service, console app, windows app, or anything else. Every external dependency (i.e. the things that usually break in production due to bad configuration) has been accounted for.

Even if you could never re-run those unit tests after the code was written, they are still a good ROI. The fact that you can automatically re-run them, and get all the additional benefits, is what makes unit testing such a winner for most application development.

Sunday, July 12, 2009

The address's State field may contain more than just the 50 states

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/the_addresss_state_field_may_contain_more_than_just_the_50_.htm]

Most business applications eventually ask the user to enter an address. There's the user's address, shipping address, their company's address, maybe an emergency contact's address, address history, travel-related addresses, financial addresses, etc... Most addresses have a city, state, and zip. While city and zip seem simple (more on that later), many devs initially expect the "State" field to be simple - perhaps a two-character column that can store the 50 US states. However, it can quickly balloon to something much more complicated (especially if you're troubleshooting some legacy app). Besides the standard states, it could contain:

Military codes (reference)

Armed Forces Africa	AE
Armed Forces Americas (except Canada)	AA
Armed Forces Canada	AE
Armed Forces Europe	AE
Armed Forces Middle East	AE
Armed Forces Pacific	AP

US Possessions (reference)

AMERICAN SAMOA	AS
DISTRICT OF COLUMBIA	DC
FEDERATED STATES OF MICRONESIA	FM
GUAM	GU
MARSHALL ISLANDS	MH
NORTHERN MARIANA ISLANDS	MP
PALAU	PW
PUERTO RICO	PR
VIRGIN ISLANDS	VI

Perhaps Canadian provinces? (reference)

Alberta	AB
British Columbia	BC
Manitoba	MB
New Brunswick	NB
Newfoundland and Labrador	NL
Northwest Territories	NT
Nova Scotia	NS
Nunavut	NU
Ontario	ON
Prince Edward Island	PE
Quebec	QC
Saskatchewan	SK
Yukon	YT

Generic codes to indicate international use?

Foreign Country	FC
Out of Country	OC
Not Applicable	NA

Or, specific applications may try their own proprietary international mapping, like "RS" = Russia. This might work if you're only doing business with a handful of countries, but it doesn't scale well to the 200+ (?) existing other countries (i.e. I would not recommend this. Use a "Country" field instead is feasible).

Special codes to indicate an unknown, or empty state?

Perhaps, for some reason, the application developer isn't storing just 2-char codes, but rather integer ids that map to another "States" table, so you see numbers like "32" instead of "NY" ("New York")?

Or, even worse, they're shoving non-state related information into the state column as a hack that "made something else easier".

How many distinct entries could you have?

With 26 letters, you've only got 26 ^ 2 = 676 options.
If you use numbers too, you've got (26 + 10) ^ 2 = 1296 options.
If you start using lower case letters (SQL is case-insensitive, but maybe this impacts managed code), then you've got (2 * 26 + 10) ^ 2 = 3844 options.
Add in some special characters (such as spaces, periods, asterisks, hyphen, underscores, etc...), maybe 10 of then (if the column isn't validated on strictly alpha-numeric), and you've got (2 * 26 + 10 + 10) ^2 = 5184 options.

That's potentially 100 time more than just the 50 US states. Of course, for new development, we'd all prefer some clearly-defined schema with referential integrity and a business-sensible range of values. However the real world of enterprise applications is messy, and you have to be prepared to see messy things.

It sounds simple, but that innocent "state" field can quickly get very complex.

Thursday, July 2, 2009

Why a manager may not want you to learn

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/why_a_manager_may_not_want_you_to_learn.htm]

I'm a huge advocate of learning. And it's natural for devs to want to pick up new stuff. However, many devs don't realize that they may report to a manager that actually wants to prevent them from learning new things - even on their own personal time. I think this type of manager is rare. However, it's good to be aware in case a manager is (perhaps unintentionally) "sabotaging" your learning.

I hesitated about writing this post lest it seem to cynical or jaded, but it's worth discussing as developers should be aware of such things. Note that there is not one specific person/event/incident that I have in mind, but rather glimpses of things over the last 10 years.

Their control - They may want to be in control, and you learning new things that they don't know takes away from their control.
- They may want to understand the entire architecture themselves. It's sort of a "not built here" applied to a personal level - "If I don't already know it, it must not be necessary."
- They may not want to learn the new stuff themselves. If you're a tech manager, and all your devs learn the next wave of technologies, it pressures you to learn the new wave as well, else you look obsolete.
- They don't want you exposing their mistakes. Say a senior developer wrote a bad messaging framework. As long as no other employee has a clue about messaging, no one knows that they made a bad mistake.
- They want you to "suffer" just like they did. Often new techs make it easier to do something, and rather than have the easy way out, you should do it the original way so you "understand what's really going on". Think using assembly language or C++ instead of a higher-level language like C#.
It doesn't support the immediate work
- They may think it's a waste of time - "We've already invested in this architecture, we don't need anything else." Even though it's your own time, they'd rather you spend overtime on "useful features", like copying and pasting tedious code.
- It competes with your day job - If you're researching some cool XNA technology, which is a lot more fun than the drudgery of some bad architecture, it may compete. Suppose you work at home, it might "distract" you.
- It may be misapplied. New stuff is risky, and could be buggy or applied incorrectly - which would hurt the project.
- Their afraid that "smart" developers are hard to manage. Smart developers can sometimes be total egomaniacs to work with (because they think they're so smart), and management may not want to even think about dealing with that.
You may leave
- You may outgrow your company and leave- If your company is stuck in the dark ages, they may want to keep everyone's technical skills "in the dark" as well, lest an employee "see the light" and leave.
- It makes your more marketable, and you may leave. If you're stuck with some niche technology on an obsolete framework, you aren't very marketable and hence can't get another job, and hence your boss has tremendous control over you.

Examples of how a manager might unintentionally discourage a developer from learning:

Financially reject anything (like buying new books or tools or paying for a class)
Undermine your confidence ("Why would you need that") or question your motives.
Deny you resources, such as preventing you from installing anything on your machine (open source code, tools, etc...)
Never affirm new learning or innovation. They tell you "good job" for getting that feature done, but won't affirm picking up new technologies.
Never provide their software engineers with a continuing-education plan. Ask yourself, how do developers go "to the next level" in your team? Does management help them?

It's sad, but some companies are structured where it's not in the manager's best interests for the employees to "wise up". The managers want hard-working, honest people who are easy to manage, but they don't want to deal with innovation or smart developers.

LINK: Does your Project encourage learning

Sunday, June 28, 2009

23 features of an enterprise data access layer

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/23_features_of_an_enterprise_data_access_layer_1.htm]

Most line of business applications will die unless they have a strong data access strategy. Enterprise apps simply cannot afford to hard-code thousands of in-line SQL calls to an aspx code behind; the maintenance and lack of reuse and testability will kill you. I realize entire books are written on data-access strategy (Fowler, Dino/Andrea), and by much smarter men than I, so I only offer this blog post as a summary and braindump. I'm sure I've inevitably missed several important aspects. I also realize that developers take their Data Access Layers (DAL) very seriously and personally, and may consider some features more or less important than others.

Must-have features - This will get you started.

CRUD - Give you at least the basic CRUD (Create, Read, Update, Delete) functionality
Sorted paging and filtering - Provide a simple way to handle sorted-paging and filtering
Automatically generated - For the love of all that is good, do NOT write tons of manual data-access plumbing code by hand. Either code generate it, or use a dynamic ORM (like NHibernate)
Serializable objects - Domain objects should be serializable so you can persist them across the wire (such as store them in a cache). Sometimes this is solved as easily as slapping on attributes to your objects.
Handles concurrency - Even a where-clause check that simply compares a version (or datetime) stamp.
Transactions - Support transactions across multiple tables, such as either using the SQL transaction keyword, or the ADO.Net transaction (or something else?)

Good-to-have features - When you start scaling up, you're really going to want these.

FK and unique-index lookups - Provide those extra automatically generated FK and unique-index lookups on your tables.
Meta-data driven - Perhaps you define your entity model in xml files, and your process generates the rest from that (tables, SP, DAL, entity C# classes, etc...)
Mocked / Isolation-Framework-friendly - It could provide support for a mock database, or at least create interfaces for all the appropriate classes so you program against the interface instead of concrete classes.
Batching - (Includes transaction management). If you don't have the ability to batch two DAL calls together, because remote calls are relatively expensive, you'll inevitably start squishing unrelated calls into single spaghetti-blobs for performance reasons.
Insert an entire grid at once - This could be done via batching, or perhaps SQL 2008's new table value parameter.
Handle database validation errors - Ability to capture database validation errors and return them to the business tier. For example, checking that a code must be unique. (See: Why put logic in SP?)
Caching - for performance reasons, you'll eventually want to cache certain types of data. Ideally your DAL reads some cache-object config file and abstracts this all away, so you don't litter your codeBehinds with hard-coded cache calls. [LINK: thoughts on caching]
Multiple types of databases - Access multiple different types of databases, such as main, historical, reporting, etc...
Scales out to multiple, partitioned, databases - For example, your main application data store may be partitioned by user SSN, and hence you can spread out the load across multiple databases instead of having one, giant bottleneck.
Integrate with a validation framework - Perhaps by applying attributes to the entity objects (like what Enterprise Library Validation Block does), you may want your generator to be able to read both database schema info and external override values from an xml file. For example, say you have an Employee object with a "FirstName" property which maps to the EmpInfo table's FirstName column, the generator could pull the varchar length and required attributes from the database column, and then possibly pull a required expression pattern from the override xml file.
Audit trail for changes made - The business sponsors are going to want to see change history of certain fields, especially security and financial related ones.
Create UI admin pages - Provide the ability to create the admin UI pages for easy maintenance of each table. Even if you don't actually deploy these, they're a great developer aid.

Wow - These are more advanced

Partial update of an object - say you have a reusable Employee object with 30 fields, but you only need half those fields in some specific context, it can be beneficial to have a DAL that can be "smart" and updates only the fields that you used in a given context. Perhaps you could add a csv list to the base domain object (that Employee inherits from), and every time a property in Employee is set, it adds the field to that CSV list. Then, it passes that CSV list to the data access strategy, which only updates the fields in that list.
Provide a data dictionary so it integrates into other processes. Building off the meta-data approach (where you can automatically generate lots of extra plumbing to assist with integration and abstraction layers), you can start doing some really fancy things:
1. See every instance in the UI where a DB field was ultimately used
2. Provide clients a managed abstraction layer that lets them write their own reports given the UI views - not the backend tables.
3. Provide clients a managed abstraction layer that lets them do mass updates of their own data (this is a validation and security nightmare).
N-level undo - I've never personally implemented or needed this, but I hear CLSA.Net has it.
Return deep object graphs - Having a domain model is great, but there's the classic OOP - relational data mismatch. ORM mappers explicitly help solve this. Without some sort of ORM mapper, most application inevitably "settle" (?) for a transaction script or table module/active record approach. A deep object graph also requires lazy loading.
Database independence - Configure your database for easy switch from SQL Server to Oracle. You could do this at compile time by re-writing your code-generator templates. I've heard some architects insist that you should be able to do this at run time as well via a provider model, and updating some information in the config file (I've never personally done this).

Data access is a re-occurring problem, so the community has evolved a lot of different solutions. Consider some of these:

ORM mappers
- NHibernate - A popular ORM mapper (and check out Hudson Akridge's blog for NHibernate expertise)
CodeSmith-related (generates code at compile time)
- CLSA.Net - Rockford Lhotka's super enterprise framework, which has CodeSmith support.
- .NetTiers - A set of CodeSmith templates to create the entire DAL and UI admin pages.
- Your own custom-built thing (via CodeSmith)
Microsoft solutions
- Strongly-typed .Net DataSets - This is what .Net first had.
- Linq-to-Sql - Great for RAD apps, but I'm personally not sure how well it scales. Be sure to run SQL-Profiler on its generated sql.
- Microsoft Entity Framework - Microsoft's offering for a domain-model approach (as opposed to their out-of-the-box typed datasets).
I've heard about, but never personally used these:
- Castle ActiveRecord
- (Probably several others too...)
In-line SQL from your Aspx codebehind - ha ha, just kidding. Don't even think about it. Seriously... don't.

Thursday, June 18, 2009

BOOK: Microsoft .NET: Architecting Applications for the Enterprise

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/book_microsoft_net_architecting_applications_for_the_ente.htm]

As the .Net platform matures (almost version 4.0!), I'm seeing more and more good .Net architecture books coming out. One such book is Microsoft .NET: Architecting Applications for the Enterprise, by Dino Esposito and Andrea Saltarello.

The first section focused heavily on architectural principles. The book was worth getting just for Chapter 3 alone (Design Principles and Patterns), which provided a survey of the various concepts required for high-level architecture, such as OOP, Design Patterns, Structured Design, Separation of Concerns, Dependency Injection, Testability, Security, and AOP.

I also liked their chapter on DataAccess. They made a well-reasoned plug for NHiberante and the maintenance benefits of auto-generated dynamic SQL for the data access layer. I admit that I personally have "grown up" with a bias for code-generated stored procedures, but I can see the changing winds.

Their book is very focused on the standard N-tier layers: DataAccess, BusinessFacade, Service, and Presentation. Here's the table of contents:

Chapter 1: Architects and Architecture Today
Chapter 2: UML Essentials
Chapter 3: Design Principles and Patterns
Chapter 4: The Business Layer
Chapter 5: The Service Layer
Chapter 6: The Data Access Layer
Chapter 7: The Presentation Layer
Final Thoughts
Appendix: The Northwind Starter Kit

The book didn't discuss much on messaging, caching, validation, logging, system integrations, configuration, or other architectural components. However, most applications make or break on the data access strategy, so I can see the focus there. And, you could have an encyclopedia if you wanted to cover every aspect of enterprise architecture.

I found it interesting comparing the book to Fowler's landmark Patterns of Enterprise Application Architecture. Indeed, Dino and Andrea continually refer back to patterns in Fowler. The Dino/Andrea book almost seems intended as a sequel to Fowler's - it adds value by specializing in .Net, having the benefit of almost 6 years of hindsight, and providing constant web references and practical tools (many which didn't exist when Fowler wrote his book). Overall, it's a good read for any .Net Architect or aspiring developer. It's an especially good read for those who grew up as architects in a single company, and therefore may only have exposure to one way of doing architecture.

Tuesday, June 16, 2009

Why share knowledge with others?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/why_share_knowledge_with_others.htm]

I'm a big advocate of knowledge sharing. However, I understand why some developers may be hesitant to do so. We live in an era of unprecedented competition. People (and therefore Companies) compete against one another for employment opportunities, promotions, recognition, mindshare, and plain-old-power. In today's cut-throat world, it almost seems counter-intuitive to "undermine" yourself by sharing your knowledge (read: "competitive advantage") with others (read: "competition").

While obviously you shouldn't share trade secrets and proprietary knowledge with the competition, I still think there are a lot of good reasons to share general knowledge with the community, and especially your coworkers.

It teaches you
- Explaining something to others forces you to better understand it yourself.
- It is self correcting - By sharing your knowledge, the other dev may be able to help you improve it by offering tweaks or suggesting gaps that you need to fill.
- It provides a chance to practice your communication. To communicate, you need a "receiver", someone who wants to catch (i.e. listen to) whatever message you're "throwing". This means that the message needs to be relevant. What's more relevant than a message that is solving the problem that the "receiver" actively wants solved?
Social benefits
- To have a friend, you need to be a friend. If you never share your stuff (knowledge, tricks, tools), people may be hesitant to share their stuff with you.
- Some people just enjoy helping others.
It's part of your job
- It frees up co-workers to do something useful (which could in turn benefit you, and more importantly, the company who is paying you), as opposed to them re-inventing what you've already done.
- It's your job as a good developer - Good developers share knowledge. Also, when a company is paying you to code, it's no longer "your code", but the company's code, and therefore the company has a right that tricks and knowledge be shared amongst the team.
Demonstrate leadership
- Knowledge-sharing increases your credibility. By sharing objective things that other devs can verify to be correct, these devs are more likely to trust you on subjective opinions or predictions that are much harder to verify.
- Become a thought-leader - Often consulting firms encourage their star consultants to demonstrate thought leadership by blogging, writing articles, or contributing to open source projects. Sometimes this is great marketing for that company, and even leads to sales. For example, thousands (millions?) of people use CruiseControl from ThoughtWorks, which in turn gives ThoughtWorks name recognition and marketing.
- It is necessary in order to be promoted. Leaders communicate, and usually the higher up you go, the wider the audience you need to share knowledge with. If you never practice knowledge sharing until you absolutely have to (via a job demand), then you will probably not be very good at it.
- It helps make your approach the official standard (which may be good or bad). If you hide your tricks and code, only you will use them. If someone publicizes and shares their code - even if it's worse than yours - their code will be used by more people and hence adopted as the "official" team approach.

There are many ways to share your knowledge, such as:

Wikis
Email
Automated governance
Sample applications
Team meetings
1-on-1 meetings (like a Code Review)
Blogging

Sunday, June 14, 2009

The benefits of technical blogging

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/the_benefits_of_technical_blogging.htm]

By some estimates, there are over 50 million blogs. Granted, many of these are barely maintained, and they cover all topics (not just software engineering), but clearly there's something to this whole blogging thing. Keep in mind that the vast majority of these blogs are not profitable - so the authors are writing as hobbyists and not paid professionals. Millions of people - and thousands of software engineers - blog because there are a lot of benefits to it:

Record smaller thoughts to build up for a larger project like an article or even a book.
Instant gratification - Blogs let you publish your writing instantly, which is gratifying (as opposed to the lag time for an article or book)
Ability to share your thoughts with the world - and likewise get feedback on them.
Practice at writing, beyond just the quick email or IM context.
Provides an outlet for all the miscellaneous thoughts and code a developer comes across
Gets you to think deeper about ideas, knowing that you need to at least polish it to the level of a few paragraphs for public consumption. This helps you learn the topic yourself, as one of the best ways to learn is to teach the content to others.
Thought Leadership - some people (especially consultants where this is a resume credential) really like this. Good blog posts also build credibility.
Confidence booster - putting your thoughts out there to potentially millions of people requires courage, and repeatedly doing that (and growing from the feedback) builds confidence.
Write anything on any topic, regardless of what you're currently working on.
It helps others. For example, sometimes I post the most obscure syntax errors I find - and other people were troubled with the exact same issue, and actually find the post useful.
Blogging acts like a personal journal. I've gotten a kick out of seeing how some of my thoughts have evolved over the last 5 years.
Low barrier to entry - because blogs are (effectively) free, you can write small chunks, you can write anything you'd like, and you can automatically publish it, there's a very low barrier to get started.

Should you blog even if you don't get traffic? I think yes - if you enjoy it (such as for any of the reasons mentioned above). Traffic comes and goes, and the weirdest entries get tons of traffic while the other entries (that I expect to be popular) are almost ignored.

Related: Do you have time to blog?

Wednesday, June 10, 2009

Real life: Avoiding customization to build a Sandbox

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/real_life_avoiding_customization_to_build_a_sandbox.htm]

I have three kids - two sons and a daughter. Cute little buggers, so I wanted to make them a sandbox to play in. I figure doing something physical and outdoors, as opposed to watching TV, would be good for them. Plus, I'm all for anything that even remotely encourages them to do engineering. However, I have very, very, minimal wood-working skills. When it comes to woodworking, I am (at best) a hobbyist - by no means an expert. This means I was just trying to build a simple sandbox that works - no fancy wood cutting, things that take big vocabulary to describe, or expensive tools required. I made a crude box-like design, drove to the local home depot, and got help picking out the wood (12x1x6 inch cedar). The only cuts I made where ones that the home depot guy could do in the store - so no triangle, notched, or diagonal cuts. I hauled my precious wood back to my garage (read: not professional tool workshop), applied one coat of polyurthethane something (read: I hope that helps protect against weathering), and hammered the boards together. After digging the box into the ground and filling it with sand, it was good enough and I was done.

Why dump such a story on my technical blog? Because my hobbyist mentality towards a wood sandbox is essentially the same as many "programmers" hobbyist mentality towards the craft of software engineering. We both just want to get it done, make the end users happy, and maybe enjoy it along the way. If we miss out on an optimal technique, that's okay. Working with other people, it can be useful to understand that mindset.

And yes, the kids loved it.

Tuesday, June 9, 2009

An easy way to hack unvalidated web input

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/an_easy_way_to_hack_unvalidated_web_input.htm]

Many security bugs get overlooked because hackers use "special" tools that give them options that the original developers anticipate. For example, most web developers only test their application through a browser. While this is sometimes sufficient to check different querystring inputs, it gives one an enormous false sense of security. Consider two cases:

A button is disabled, therefore the developer assumes you can't click it (Example: A Save button is disabled because some data isn't valid)
Data is stored in a hidden field, therefore the developer assumes that you can't change it. (Example: A record's ID is stored in a hidden field)

To some people, this appears safe because IE doesn't let you click disabled buttons or change the value of hidden fields. However, IE is only used to collect the data from the page and aggregate it into a post that it sends to the webserver. What if the hacker is using a different tool besides IE that gives them more control on how to assemble and send this post (or an extension to IE that lets them modify fields)?

How to hack it

You could crack both the above cases using Visual Studio Team Test (or I bet Firebug, or the IE8 dev tools). Here's how:

Open up VS Team Test
Create a Test Project, and add a new web test. This opens up a recorder in IE.
Navigate to the page you want to crack, such as something that stores data in a hidden field.
Perform a normal save action, such as saving the page (which uses data from the hidden field)
Stop the recording. Play back the test just to make sure it works in the normal case. Note how for each request, VS provides you a new row and lets you see the exact request and response, as well as what visually appears in the Web Browser.
Now start the hacking. Go to the request where you saved the page (that collected data from the hidden field), and unbind that field and change its value to whatever you want. In this case, I simply had a field called "Hidden1", and I changed it's value to "abc". Note that just like IE, the webtest sends a post to the server. But unlike IE, you can easily change the contents of that post.
Rerun the webtest. The new, hacked, value is sent to the server.
Note that you can also script VS WebTests with C#. So, you could have a for-loop that re-submits a 1000 requests, varying the post values each time.

So, what can we do?

As a developer, we always need to validate and do security checks on the server - merely JS side validation is not enough. For example, knowing that that hidden field could be hacked, always validate that data as if it were a free-form textbox.

The problem is that this degree of extra validation (1) costs much more time, and (2) could be a big performance hit - which is especially ironic because it's for something that 99% of users won't even know about.

Allotting time requires managerial approval. However, most managers see security as one of those "nice-to-have" buzzwords, that they'll have implemented later "if time permits".

I think a good solution to this is to give management a live demo of your site being hacked (perhaps try it on the QA machine, not necessarily production). To really make a point, see if you can hack into their profile. Hopefully that will convince management that it's worth the time.

About the performance hit - I don't see a silver bullet. Most application performance can be improved by adding more resources somehow - more hardware, better design (caching, batching), coding optimizations, etc... Because applications vary so much, there's not a one-size-fits-all solution.

Monday, June 8, 2009

BOOK: Pragmatic Thinking and Learning: Refactor Your Wetware

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/book_pragmatic_thinking_and_learning_refactor_your_wetware.htm]

I'm fascinated with learning how to learn, so I was excited to finally read Andy Hunt's Pragmatic Thinking and Learning: Refactor Your Wetware. Recall that Andy is one of the co-stars of the hit The Pragmatic Programmers.

This is a good example of a higher-level, "non-syntax" book, something that transcends the "How to program XYZ" genre. (Shameless plug: I had written my own book: A Crash Course in Reasoning, but I can see why Andy's is in the top 3000 Amazon sales rank, and mine is barely in the top 3 million).

My favorite chapter was "Journey from Novice to Expert", as there is such a huge productivity gap here. He also continually emphasized the differences between the two parts of the brain, comparing it to a dual CPU, single master bus design.

It was an enjoyable read, similar to picking desserts out of a buffet. He had a lot of good quotes throughout the book:

"... software development must be the most difficult endeavor ever envisioned and practiced by humans." (pg. 1)
"... it's not the teacher teaches; it's that the student learns." (pg. 3)
"Don't succumb to the false authority of a tool or a model." (pg. 41)
"If you don't keep track of great ideas, you will stop noticing that you have them." (pg. 53). This is huge. The "slow times" during the day (driving, waiting in line, burping a sleeping baby) are great for mulling over random ideas. It's almost like collecting raindrops. I used to do this, but got lazy the last few years. Andy's chapter inspired me to go out, get some pocket-sized notebooks, and start jotting down random thoughts again (read: future blog entries).
"Try creating your next software design away from your keyboard and monitor..." (pg. 72). It's ironic, but often sitting in front of the computer, with all the internet distractions, can kill one's creativity.
"So if you aren't pair programming, you definitely need to stop every so often and step away from the keyboard." (pg. 85). I've seen many shops that effectively forbid pair programming, so this at least is a way to partially salvage a bad situation.
"... until recently, one could provide for one's family with minimal formal education or training." (pg. 146)
"... relegating learning activities to your 'free time' is a recipe for failure." (pg. 154)
"... documenting is more important than documentation." (pg. 179). The act of documenting forces you to think through things, where design costs upfront are much cheaper than implementation costs later.
"... we learn better by discovery, not instruction." (pg. 194).
"It's not that we're out of time; we're out of attention." (pg. 211)

Perhaps the best effect from reading this kind of book is that it makes you more aware, such that your subconscious mind is constantly thinking about learning.