All of my database-driven web sites use MySQL, and for smaller databases you can run the site search directly against MySQL. But you soon find out that to deliver fast search on a site, or, as in my case, to offer a search service over hundreds of millions of crawled niche records, you need a scalable distributed storage system.
Recently Google hosted a Conference on Scalability in Seattle where they talked about MapReduce, BigTable, and other distributed systems for large datasets. Listed here are the talks, which are now available on Google Video:
- Keynote I: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets by Jeff Dean, Google, Inc.
- Keynote II: Scaling Google for Every User by Marissa Mayer, Vice President, Search Products & User Experience, Google, Inc.
- Lustre File System by Peter Braam, Founder and President, Cluster File Systems, Inc.
- SCTP's Reliability and Fault Tolerance by Brad Penoff, Mike Tsai, and Alan Wagner, The University of British Columbia Department of Computer Science.
- Scalable Test Selection Using Source Code Deltas by Ryan Gerard, Symantec Corporation.
- VeriSign's Global DNS Infrastructure by Patrick Quaid, Technical Director, and Scott Courtney, Principal Architect, VeriSign.
- Using MapReduce on Large Geographic Datasets by Barry Brummit, Software Engineer, Google, Inc.
- YouTube Scalability by Cuong Do, Engineering Manager, YouTube.
- Building a Scalable Resource Management by Khalid Ahmed, Platform Computing Corp.
- Lessons In Building Scalable Systems by Reza Behforooz, Google, Inc.
(Kudos to Greg Linden for compiling the list of videos.)
The videos provide some technical detail, while Marissa Mayer's keynote provides some insight into Google's big-picture plans.
Google's technology, however, is closed, so if you're interested in a solution that you can actually use, turning to open source projects is the way to go. And this is where Hadoop and HBase come in.
Hadoop is a framework for running applications on large clusters of commodity hardware. There's a lot of development going into Hadoop right now, mostly led by Doug Cutting and Owen O'Malley of Yahoo. In my experience, if you implement Hadoop you really need to stay on top of it and tweak it to suit your needs. To show how young Hadoop is, the current stable release is 0.13.0.
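To give a feel for the MapReduce programming model that Hadoop implements, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are made up for illustration, and a real cluster would spread the map and reduce calls across many machines instead of running them in one loop.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)
    # Each key is reduced independently, which is what lets a real
    # framework hand different keys to different machines.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Illustrative input only: two tiny "crawled pages".
documents = {
    "page1": "hadoop stores data",
    "page2": "hbase stores data on hadoop",
}
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

The point of the split is that the mapper only ever sees one document and the reducer only ever sees one key, so neither needs to know how big the whole dataset is.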
HBase is a distributed storage system for structured data, designed for storing very large amounts of data in a distributed environment. Its intent is to be similar in function to Google's Bigtable, which is used with the Google File System. HBase will provide Bigtable-like capabilities on top of Hadoop.
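As a rough picture of what "Bigtable-like" means, the data model can be thought of as a sparse, sorted map from (row key, column, timestamp) to a value. The sketch below is a toy in-memory version in Python, assuming that model; `ToyTable` and its methods are hypothetical names, not the HBase API, and nothing here is distributed or persistent.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy in-memory table: (row, column, timestamp) -> value."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), sorted by timestamp
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell without overwriting older ones."""
        bisect.insort(self._rows[row][column], (timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._rows.get(row, {}).get(column, [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield rows in sorted key order with start_row <= key < stop_row."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                latest = {col: vers[-1][1] for col, vers in self._rows[row].items()}
                yield row, latest


# Illustrative data only: row keys and column names are invented.
table = ToyTable()
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example/about", "anchor:home", "About us", timestamp=1)
table.put("org.sample/index", "contents:html", "<html>other</html>", timestamp=1)

latest = table.get("com.example/index", "contents:html")
older = table.get("com.example/index", "contents:html", timestamp=1)
example_rows = [row for row, _ in table.scan("com.example/", "com.example0")]
```

Keeping rows sorted by key is what makes range scans cheap, which is why crawled pages from one site are usually given keys that sort next to each other.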
While these projects are still in their infancy, the open source model is driving rapid development of these technologies.