Before the main event there was a preconference (with a fraction of the main attendees) exploring technical challenges and possible collaborations. There were also a couple of brief presentations; here are my raw notes on Kris Carpenter Negulescu, Director of the Internet Archive Web Group, talking about
"20th Century Find" using Amazon S3 & EC2
Internet Archive stats
3.5 PB
1.5 million downloads/day
this project is about providing full-text search of their 20th-century web archive (1996-2000), ~22 TB
NutchWAX = Nutch + Hadoop
focus moving to Nutch with plugins
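For context on what "Nutch + Hadoop" buys them: indexing runs as MapReduce jobs over the archive. The sketch below is my own minimal, Hadoop-Streaming-style illustration of that model in Python, not NutchWAX code; the record format and tokenization are made up.

```python
#!/usr/bin/env python
# Minimal Hadoop-Streaming-style mapper/reducer pair, illustrating the
# MapReduce model that Nutch/Hadoop-based indexing builds on.
# NOT NutchWAX code; the "url<TAB>text" record format is hypothetical.
import sys

def mapper():
    # Input: one "url\ttext" record per line; emit (term, url) pairs.
    for line in sys.stdin:
        url, _, text = line.rstrip("\n").partition("\t")
        for term in text.lower().split():
            print(f"{term}\t{url}")

def reducer():
    # Input: mapper output sorted by term; emit one posting list per term.
    current, urls = None, []
    for line in sys.stdin:
        term, _, url = line.rstrip("\n").partition("\t")
        if current is not None and term != current:
            print(f"{current}\t{','.join(sorted(set(urls)))}")
            urls = []
        current = term
        urls.append(url)
    if current is not None:
        print(f"{current}\t{','.join(sorted(set(urls)))}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Runs locally as `cat pages.tsv | python index_sketch.py map | sort | python index_sketch.py reduce`; on a cluster, Hadoop distributes the map and reduce phases across nodes, which is what the multi-node EC2 runs below rely on.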
Amazon S3
Amazon EC2 (beta)
there are now more EC2 node options: small (default), large, and extra large; extra large has 8 times the small-node performance and better I/O ($0.80/CPU-hour)
indexing began in October 2006
1996: indexed via 20 EC2 nodes in ~36 hours
1997: 100+ EC2 nodes
1998: 300+ EC2 nodes
1999: attempted in September 2007 using a cluster of ~270 EC2 nodes but halted due to lack of consistent CPU/IO across nodes
deployed (alpha) index is 1.35 TB in size, no compression, ~600 million docs
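A quick bit of arithmetic on those figures (my own back-of-envelope, not numbers from the talk):

```python
# Rough arithmetic on the deployed alpha index, derived from the figures
# above; these ratios are my own back-of-envelope, not from the talk.
index_tb, docs, archive_tb = 1.35, 600e6, 22

bytes_per_doc = index_tb * 1024**4 / docs   # ~2.4 KB of index per document
index_fraction = index_tb / archive_tb      # index is ~6% of the raw archive

print(f"{bytes_per_doc / 1024:.1f} KB/doc, {index_fraction:.1%} of archive size")
```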
Enhancements
* multiple instances of a page
* improved ranking of results
* handle dimension of time
* easy UI
Why Amazon Web Services
* pay as you go
* simple to provision (see the provisioning sketch after this list)
* committed to support
* ideal for indexing Web pages, providing offsite storage, reliable hosting
* great platform for experimentation, iteration
* geographically dispersed from Internet Archive ?Data Repository?
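To illustrate the "simple to provision" point, here is a minimal sketch using the present-day boto3 library (which postdates this talk); the AMI, key pair, and region are placeholders, and m1.small was the small node type of that era.

```python
# Minimal EC2 provisioning sketch using boto3 (a present-day library that
# postdates this talk). AMI ID, key pair, and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Start a batch of 20 small nodes for an indexing run, then list their IDs.
resp = ec2.run_instances(
    ImageId="ami-00000000",       # placeholder AMI with Hadoop preinstalled
    MinCount=20,
    MaxCount=20,
    InstanceType="m1.small",      # the "small (default)" node type of that era
    KeyName="indexing-keypair",   # placeholder key pair
)
print([i["InstanceId"] for i in resp["Instances"]])
```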
Cost Effective, budgeted $20k
* Note: fees can add up fast if not vigilant
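Some rough numbers on why the fees bear watching (my own back-of-envelope; the small-node rate is inferred from the quoted $0.80 extra-large rate, and the S3 storage rate is an assumed 2007-era price, not a figure from the talk):

```python
# Back-of-envelope cost sketch. The $0.10/hr small-node rate is inferred from
# the quoted $0.80/hr extra-large rate (8x small); the $0.15/GB-month S3 rate
# is an assumed 2007-era price, not a figure from the talk.
SMALL_RATE = 0.10    # $/node-hour (inferred)
S3_RATE = 0.15       # $/GB-month (assumed)

# 1996 indexing run: 20 small nodes for ~36 hours
run_1996 = 20 * 36 * SMALL_RATE        # ~$72

# Keeping the ~22 TB source archive in S3 for one month
storage_month = 22 * 1024 * S3_RATE    # ~$3,400/month

print(f"1996 run:  ${run_1996:,.0f}")
print(f"S3, 22 TB: ${storage_month:,.0f}/month")
```

Against a $20k budget, storage fees dominate the compute fees, which is presumably what "vigilant" means here.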
Working Well
* APIs
* tech support
* S3
* fee structure
* speed of provisioning
* S3 uniformity of nodes
Challenges - S3
Oct 2006 - June 2007
* (internal?) bandwidth availability
* no specific guarantees for data preservation
* issues related to popularity of the service
Fall 2007
* available bandwidth consistent (~4h to move 7.5TB into EC2)
Challenges - EC2
lots of issues Oct 2006 - June 2007
* location of S3 nodes relative to EC2 was a significant factor for large-scale data processing
July 2007 - present
* working well but hitting I/O and CPU constraints on small (basic, default) nodes; however, they will continue to use these small nodes
Consider Using AWS When
S3
* need cost-effective backup for data (see the upload sketch after this list)
* multi-provider preservation, geographically diverse
EC2
* you have spiky computing needs (occasional spikes in demand)
* you have available R&D resources
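For the backup case, a minimal upload sketch with the present-day boto3 library (which postdates this talk); the bucket name and file paths are placeholders:

```python
# Minimal S3 backup sketch using boto3 (a present-day library that postdates
# this talk). Bucket name and file paths are placeholders.
import boto3

s3 = boto3.client("s3")

# Copy a local archive file into a backup bucket, then verify it arrived.
s3.upload_file("arcs/1996-batch-0001.arc.gz",      # placeholder local file
               "example-webarchive-backup",        # placeholder bucket
               "1996/batch-0001.arc.gz")           # object key in S3
head = s3.head_object(Bucket="example-webarchive-backup",
                      Key="1996/batch-0001.arc.gz")
print(head["ContentLength"], "bytes stored")
```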
They will experiment with AWS for crawling and harvesting (Heritrix on AWS), starting January 2008.