About Me
I'm Jeff Fal, and I'm an information architect and front-end developer working in Denver. I have been working on the Web since 1998 and am interested in learning at least a little about most Web technologies. I'm most experienced with HTML, CSS, JavaScript, and Flex. Soft Werewolf is my place to play.
Scrapy helps me scrape
Between a recent Edward Tufte talk in Denver and the Kindle release of Flowing Data’s new book, my distraction of choice lately has been data visualization. In particular, I’ve been on the hunt for data about my city. I had an idea for a map of houses in Denver, color-coded by the year each was built.
Now, this is simple, publicly available data, but it’s not easy to collect in bulk. The city of Denver has an online database of property records, but you have to search by address. You can’t just download a big table. I also don’t want my map to get cut off at the city limits, especially when what people think of as Denver extends far beyond them. That means I’d have to figure out how to extract Denver’s data and then do the same for every city in the metro area, assuming they even put their data online. This would be a serious pain.
Besides, someone has already gone through the pain for me: real estate websites. I won’t say exactly which site I went to because I’m pretty sure harvesting their data would be considered a terms of use violation. But let’s just say there’s a site out there (and there are several, actually) that serves up all kinds of basic, publicly available data about homes, and they make it easy to browse the entire Denver Metro Area from a single starting page. It couldn’t be more perfect for spidering and data-scraping.
This was my first time building a spider or scraping a site, but it didn’t take much searching around to find Scrapy, a Python framework for doing just this. It’s a little complicated to get started with, but they have a good tutorial, as well as a live command-line environment (the Scrapy shell) that lets you test out code.
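To give a feel for the extraction half of the job, here is a minimal sketch using only Python's standard library rather than Scrapy itself. It is not the spider I ran: the markup, the class names (`home`, `address`, `year-built`), and the sample addresses are all made up for illustration, and a real Scrapy spider would also handle following links from page to page.

```python
from html.parser import HTMLParser


class YearBuiltParser(HTMLParser):
    """Collect (address, year_built) pairs from a hypothetical listing page.

    Assumed markup (invented for this example):
        <li class="home"><span class="address">...</span>
                         <span class="year-built">1925</span></li>
    """

    FIELDS = ("address", "year-built")

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (address, year) tuples
        self._field = None    # which span we are currently inside, if any
        self._current = {}    # fields gathered so far for the current home

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Emit a row only when both fields were found for this home.
            if all(name in self._current for name in self.FIELDS):
                self.rows.append(
                    (self._current["address"], int(self._current["year-built"]))
                )
            self._current = {}


def extract_homes(html):
    """Return a list of (address, year_built) tuples found in `html`."""
    parser = YearBuiltParser()
    parser.feed(html)
    return parser.rows


# Made-up sample page standing in for one page of listings.
SAMPLE = """
<ul>
  <li class="home"><span class="address">123 Example St</span>
      <span class="year-built">1925</span></li>
  <li class="home"><span class="address">456 Sample Ave</span>
      <span class="year-built">1987</span></li>
</ul>
"""

print(extract_homes(SAMPLE))
```

The point is only that each listing boils down to a small record (address plus year built), which is all the map needs.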
In the end, it worked beautifully. If you want the data I scraped, it’s here: Denver Metro Property Data
Now if I can just figure out how to call Google’s geocoding service (the one that turns addresses into GPS coordinates) more than 15,000 times a day…
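One low-tech workaround, sketched below, is to spread the lookups over several days and never ask twice for the same address. This is only an assumption about how it could be organized: the 15,000 figure is the one I ran into above (the actual quota may differ), the helper names are hypothetical, and no real geocoding call is made here.

```python
from itertools import islice

DAILY_LIMIT = 15_000  # figure quoted in the post; the real quota may differ


def pending(addresses, cache):
    """Return the addresses that have no cached coordinates yet."""
    return [address for address in addresses if address not in cache]


def daily_batches(addresses, limit=DAILY_LIMIT):
    """Yield lists of at most `limit` addresses, one list per day's quota."""
    remaining = iter(addresses)
    while True:
        batch = list(islice(remaining, limit))
        if not batch:
            return
        yield batch


# Made-up addresses standing in for the scraped property list.
addresses = [f"{n} Example St" for n in range(40_000)]
cache = {"0 Example St": (39.7, -105.0)}  # pretend one lookup is already done

todo = pending(addresses, cache)
for day, batch in enumerate(daily_batches(todo), start=1):
    print(f"day {day}: {len(batch)} lookups")
```

Each day's run would geocode one batch, write the results into the cache, and pick up where it left off the next day.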