Between a recent Edward Tufte talk in Denver and the Kindle release of Flowing Data’s new book, my distraction of choice lately has been data visualization. In particular, I’ve been on the hunt for data about my city. I had an idea for a chart displaying a map of houses in Denver color-coded by year built.

Now, this is simple and publicly available data. But it’s also not that easy to collect in bulk. The city of Denver has an online database of property records, but you have to search by address. You can’t just download a big table. Also, I don’t want my map to get cut off at the city limits, especially when what people think of as Denver extends far beyond these limits. Which means I’d have to figure out how to extract Denver’s data, and then do the same for every city in the metro area — assuming they even put their data online. This would be a serious pain.

Besides, someone has already gone through the pain for me: real estate websites. I won’t say exactly which site I went to because I’m pretty sure harvesting their data would be considered a terms of use violation. But let’s just say there’s a site out there (and there are several, actually) that serves up all kinds of basic, publicly available data about homes, and they make it easy to browse the entire Denver Metro Area from a single starting page. It couldn’t be more perfect for spidering and data-scraping.

This was my first time building a spider or scraping data, but it didn’t take much searching around to find Scrapy, a Python framework for doing just this. It’s a little complicated to get started with, but they have a good tutorial, as well as a live command line environment that lets you test out your extraction code before you run a full crawl.
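To give a feel for the extraction half of the job, here is a minimal sketch of pulling an address and a year built out of a listing page. It uses only Python's standard library rather than Scrapy itself, so it is a stand-in for what a Scrapy selector would do, not the code I actually ran. The markup, class names, and field names are all invented for illustration and don't come from any real site.

```python
from html.parser import HTMLParser

# Hypothetical listing markup; a real site will be structured differently.
SAMPLE_PAGE = """
<div class="listing">
  <span class="address">123 Example St, Denver, CO</span>
  <span class="year-built">1925</span>
</div>
<div class="listing">
  <span class="address">456 Sample Ave, Lakewood, CO</span>
  <span class="year-built">1978</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collect one dict per listing with its address and year built."""

    # Maps the (assumed) CSS class of a span to the output field name.
    FIELDS = {"address": "address", "year-built": "year_built"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "listing":
            self.records.append({})  # start a new record
        elif tag == "span" and css_class in self.FIELDS:
            self._field = self.FIELDS[css_class]

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    """Return a list of {'address': str, 'year_built': int} dicts."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    # Convert the year to an integer so it can be bucketed for color-coding.
    for record in parser.records:
        record["year_built"] = int(record["year_built"])
    return parser.records


if __name__ == "__main__":
    for row in parse_listings(SAMPLE_PAGE):
        print(row)
```

A real spider would also need to follow links from the starting page to each listing, which is the part a framework like Scrapy handles for you.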

In the end, it worked beautifully. If you want the data I scraped, it’s here: Denver Metro Property Data

Now if I can just figure out how to call Google’s geocoding service (the one that turns an address into map coordinates) more than 15,000 times a day…
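One way to live with a daily cap rather than beat it is to spread the lookups over several days and never look up the same address twice. The sketch below assumes a `geocode` function that takes an address and returns coordinates; that function is hypothetical and passed in by the caller, so nothing here depends on any particular service's API. The cache file name is also just a placeholder.

```python
import json
from pathlib import Path

DAILY_LIMIT = 15000  # the daily cap mentioned above


def load_cache(path):
    """Load previously geocoded addresses from a JSON file, if it exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_cache(cache, path):
    """Write the address -> coordinates mapping back to disk."""
    Path(path).write_text(json.dumps(cache))


def geocode_batch(addresses, cache, geocode, limit=DAILY_LIMIT):
    """Geocode at most `limit` addresses that are not already cached.

    `geocode` is a caller-supplied (hypothetical) function mapping an
    address string to JSON-serializable coordinates. Returns the number
    of addresses still waiting for a later run.
    """
    # dict.fromkeys drops duplicate addresses while keeping their order.
    pending = [a for a in dict.fromkeys(addresses) if a not in cache]
    for address in pending[:limit]:
        cache[address] = geocode(address)
    return max(0, len(pending) - limit)


if __name__ == "__main__":
    # Stand-in geocoder for demonstration; a real one would call a service.
    def fake_geocode(address):
        return [0.0, 0.0]

    cache = load_cache("geocode_cache.json")
    left = geocode_batch(["123 Example St", "456 Sample Ave"], cache, fake_geocode)
    save_cache(cache, "geocode_cache.json")
    print(f"{left} addresses left for tomorrow")
```

Run once a day, this works through the list in chunks and picks up where the previous run stopped, at the cost of taking several days to finish.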

