Tuesday, July 22, 2008

Screen scraping the easy way with .Net

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/screen_scraping_the_easy_way_with_net.htm]

Sometimes you want to collect large amounts of data from many web pages, and the easiest way is to just screen-scrape it. For example, perhaps a site doesn't provide any other data export mechanism, or it only lets you look up one item at a time, but you really want to look up 1000 items. That's where you have an application request the HTML page, then parse the response to get the data you want. This is becoming rarer as RSS feeds and data exports become more popular. However, when you need to screen scrape, you really need to screen scrape. .Net makes it very easy:

// WebClient is in the System.Net namespace; strUrl holds the address of the page to fetch.
WebClient w = new WebClient();
string strHtml = w.DownloadString(strUrl);

Using the WebClient class (in the System.Net namespace), you can simply call the DownloadString method, pass in the URL, and get back the page's HTML as a string. From there, you can parse it with regular expressions, or perhaps an open-source HTML parser. It's almost too easy. Note that you don't need to call this from an ASP.Net web app; you can call it from any .Net app (console, service, Windows Forms, etc.). Scott Mitchell wrote a very good article about screen-scraping back in .Net 1.0, but I think new features since then have made it easier.
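
To make the download-then-parse flow concrete, here is a minimal sketch of the regular expression route. The class name ScrapeSketch, the method names, and the idea that the target page marks its values with <td class="price"> cells are all assumptions made up for illustration; a real page needs its own pattern, and a regex like this is fragile if the markup changes.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ScrapeSketch
{
    // Hypothetical pattern: captures the text inside each <td class="price"> cell.
    // Adjust it to whatever markup the page you are scraping actually uses.
    private static readonly Regex PricePattern = new Regex(
        "<td\\s+class=\"price\"[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Parsing step: takes the HTML text and returns the captured cell values.
    public static List<string> ExtractPrices(string html)
    {
        var prices = new List<string>();
        foreach (Match m in PricePattern.Matches(html))
        {
            prices.Add(m.Groups[1].Value.Trim());
        }
        return prices;
    }

    // Download step: the same WebClient.DownloadString call shown above,
    // wrapped in a using block so the client is disposed afterwards.
    public static List<string> ScrapePrices(string url)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            return ExtractPrices(html);
        }
    }
}
```

Keeping the parsing in its own method means you can try the pattern against a saved copy of the page before pointing it at the live site, and looking up 1000 items becomes a loop that calls ScrapePrices once per URL.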


You could also use this for a crude form of functional web testing (if you didn't use MVC, and you didn't have VS Testers edition with MSTest function tests), or to write other web analysis tools (is the rendered HTML valid, are there any broken links, etc.).
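
As one example of such an analysis tool, a broken-link check could be sketched roughly as follows. The names LinkCheckSketch, ExtractLinks, and IsReachable are hypothetical, the href pattern only handles quoted attribute values, and treating any failed HEAD request as a broken link is a simplifying assumption (some servers reject HEAD even for pages that exist).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkCheckSketch
{
    // Simplified pattern: matches href="..." or href='...' values only.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // Collects the http/https links on a page, resolving relative
    // links against the address the page was downloaded from.
    public static List<Uri> ExtractLinks(string html, Uri pageUri)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp
                    || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }
        return links;
    }

    // Sends a HEAD request; GetResponse throws WebException for
    // 4xx/5xx responses and for connection failures.
    public static bool IsReachable(Uri link)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(link);
            request.Method = "HEAD";
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode < 400;
            }
        }
        catch (WebException)
        {
            return false;
        }
    }
}
```

You would download the page with WebClient as before, pass the HTML and the page's Uri to ExtractLinks, and report every link for which IsReachable returns false.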

