timstall: Estimating database table sizes using SP

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/estimating_database_table_sizes_using_sp_spaceused.htm]

One of Steve McConnell's tips from his great book on estimating (Software Estimation: Demystifying the Black Art) is that you should not estimate that which you can easily count. Estimating database table sizes is a great example of this. Sure, on one hand disk space is relatively cheap; on the other hand you want to know at least a ballpark estimate of how much space your app will need - will database size explode and no longer fit on your existing SAN?

Here's a general strategy to estimate table size:

1. Determine the general schema for the table

Note the column datatypes that could be huge (like varchar(2000) for notes, or xml, or blob)

2. Find out how many rows you expect the table to contain

Is the table extending an existing table, and therefore proportional to it? For example, do you have an existing "Employee" table with 100,000 records, and you're creating a new "Employee_Reviews" table where each employee has a 2-3 reviews (and hence you're expecting 200,000 - 300,000 records)? If the table is completely new, then perhaps you can guess the rowcount based on expectations from the business sponsors.

If the table has only a few rows (perhaps less than 10,000 - but this depends), the size is probably negligible, and you don't need to worry about it.

3. Write a SQL script that creates and populates the table.

You can easily write a SQL script to create a new table (and add its appropriate indexes), and then use a WHILE loop to insert 100,000 rows. This can be done on a local instance of SQL Server. Note that you're not inserting the total number of rows you estimated - i.e. if you estimated that table will contain 10M rows, you don't need to insert 10M rows - rather you'll want a "unit size", which you can then multiple by however many rows you expect. (Indeed, you don't want to wait for 10M rows to be inserted, and your test machine may not even have enough space for that much test data).

For variable data (like strings), use average sized data. For null columns, populate them based on how likely you think they're be used, but err on the side of more space.

Obviously, save your script for later.

4. Run SP_SPACEUSED

SP_SpaceUsed displays how much data a table is using. It shows results for both the data, as well as the indexes (never forget the index space).

You can run it as simply as:

exec SP_SPACEUSED 'TableTest1'

Now you can get a unit-size per row. For example, if the table has 3000KB for data, and 1500KB for indexes, and you inserted a 100K rows, then the average size per row is: (3000KB + 1500KB) / 100,000. Then, multiple that by however many rows you expect.

This may seem like a lot of work, and there are certainly ways to theoretically predict it by plugging into a formula. My concern is that it's too easy for devs to miscalculate the formula (like forgetting the indexes, not accounting the initial table schema itself, or just all the extra steps)

5. Estimate the expected growth

Knowing the initial size is great, but you also must be prepared for growth. We can make educated guesses based on the driving factors of the table size (maybe new customers, a vendor data feed, or user activity), and we can then estimate the size based on historical data or the business's expectations. For example, if the table is based on new customers, and the sales team expects 10% growth, then prepare for 10% growth. Of if the table is based on a vendor data feed, and historically the feed has 13% new records every year, then prepare for 13% growth.

Depending on your company's SAN and DBA strategy, be prepared to have your initial estimate at least include enough space for the first year of growth.

6. Add a safety factor

There will be new columns, new lookup and helper tables, a burst of additional rows, maybe an extra index - something that increases the size. So, always add a safety factor.

7. Prepare for an archival strategy

Some data sources (such as verbose log records) are prone to become huge. Therefore, always have a plan for archival - even if it's that you can't archive (such as it's a transactional table and the business requires regular transactions on historical data). However, sometimes you get lucky; perhaps the business requirements say that based on the type of data, you only legally need to carry 4 years worth of data. Or, perhaps after the first 2 years, the data can be archived in a data warehouse, and then you don't worry about it anymore (this just passes the problem to someone else).

Summary

Here's a sample T-SQL script to create the table and index, insert data, and then call SP_SpaceUsed:

USE [MyTest]
GO

if exists (select 1 from sys.indexes where [name] = 'IX_TableTest1')
    drop index TableTest1.IX_TableTest1

if exists (select 1 from sys.tables where [name] = 'TableTest1')
    drop table TableTest1

--=========================================
--Custom SQL table
CREATE TABLE [dbo].[TableTest1](
    [SomeId] [int] IDENTITY(100000,1) NOT NULL,
    [phone] [bigint] NOT NULL,
    [SomeDate] [datetime] NOT NULL,
    [LastModDate] [datetime] NOT NULL
) ON [PRIMARY]

--Index
CREATE UNIQUE NONCLUSTERED INDEX [IX_TableTest1] ON [TableTest1]
(
    [SomeId] ASC,
    [phone] ASC
) ON [PRIMARY]
--=========================================

--do inserts

declare @max_rows int
select @max_rows = 1000

declare @i as int
select @i = 1

WHILE (@i <= @max_rows)
BEGIN
    --=============
    --Custom SQL Insert (note: use identity value for uniqueness)
    insert into TableTest1 (phone, SomeDate, LastModDate)
    select 6301112222, getDate(), getDate()
    --=============

    select @i = @i + 1

END

--Get sizes
exec SP_SPACEUSED 'TableTest1'

timstall

Sunday, December 27, 2009

Estimating database table sizes using SP_SpaceUsed

No comments:

Post a Comment