November 05, 2007

Internet Archive 20th Century Search - DLF developers' preconference - Nov 3, 2007

Before the main event there was a preconference (with a fraction of the main attendees) exploring technical challenges and possible collaborations.  There were also a couple of brief presentations. Here are my raw notes on Kris Carpenter Negulescu, Director of the Internet Archive Web Group, talking about

"20th Century Find" using Amazon S3 & EC2

Internet Archive stats

3.5 PB
1.5 million downloads/day

this project is about providing full-text search of the 20th-century portion of their web archive, 1996-2000 (~22TB)

NutchWAX = Nutch + Hadoop
focus moving to Nutch with plugins

Amazon S3
Amazon EC2 (beta)

there are now more EC2 node options: small (default), large, and extra large; extra large has 8 times the performance of small and better I/O ($0.80/cpu hour)

indexing began in October 2006
1996: indexed via 20 EC2 nodes in ~36 hours
1997: 100+ EC2 nodes
1998: 300+ EC2 nodes

1999 was attempted in September 2007 using a cluster of ~270 EC2 nodes but halted due to lack of consistent CPU/IO across nodes.

deployed (alpha) index is 1.35TB in size, no compression, ~600 million docs
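
To get a feel for those index numbers, here is a quick back-of-the-envelope sketch of the average index footprint per document. It assumes decimal terabytes (10^12 bytes) and takes the ~600 million figure at face value; the function name is mine, not anything from the talk.

```python
def index_bytes_per_doc(index_bytes: float, doc_count: float) -> float:
    """Average uncompressed index size per document, in bytes."""
    return index_bytes / doc_count


TB = 10**12  # assumption: decimal terabytes; the notes do not say which unit was meant

# Figures from the notes: 1.35TB index, ~600 million documents.
per_doc = index_bytes_per_doc(1.35 * TB, 600e6)
print(f"~{per_doc:.0f} bytes of index per document")
```

So roughly a couple of kilobytes of index per archived document, before any compression.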

Enhancements
* multiple instances of a page
* improved ranking of results
* handle dimension of time
* easy UI

Why Amazon Web Services
* pay as you go
* simple to provision
* committed to support
* ideal for indexing Web pages, providing offsite storage, reliable hosting
* great platform for experimentation, iteration
* geographically dispersed from Internet Archive "Data Repository"

Cost Effective, budgeted $20k

* Note: fees can add up fast if not vigilant
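
A rough sketch of why fees add up quickly at this scale. The node counts, hours, and the $20k budget come from the notes; the hourly rate is the trap. The only rate mentioned in the talk was $0.80/hour for extra-large nodes, and the indexing runs used small nodes whose rate I did not write down, so treat the figure below as an upper bound chosen for illustration, not what the Archive actually paid. Function names are mine.

```python
def run_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Total compute charge for a cluster running for a fixed number of hours."""
    return nodes * hours * rate_per_node_hour


def hours_of_runway(budget: float, nodes: int, rate_per_node_hour: float) -> float:
    """How long a cluster of a given size can run before the budget is spent."""
    return budget / (nodes * rate_per_node_hour)


BUDGET = 20_000.0  # from the notes
RATE = 0.80        # assumption: extra-large rate used as an upper bound

# 1996 run: 20 nodes for ~36 hours.
print(f"1996 run: ${run_cost(20, 36, RATE):.2f}")

# 1999 attempt: ~270 nodes; how many hours would the whole budget cover?
print(f"270 nodes: {hours_of_runway(BUDGET, 270, RATE):.1f} hours of runway")
```

Even allowing for a much cheaper small-node rate, a few hundred nodes left running over a weekend is real money, which I take to be the point of the "be vigilant" note. Storage and bandwidth charges are ignored here.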

Working Well

* APIs
* tech support
* S3
* fee structure
* speed of provisioning
* S3 uniformity of nodes

Challenges - S3

Oct 2006 - June 2007

* (internal?) bandwidth availability
* no specific guarantees for data preservation
* issues related to popularity of the service

Fall 2007

* available bandwidth consistent (~4h to move 7.5TB into EC2)
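
For scale, the "~4h to move 7.5TB" figure works out to a sustained transfer rate that can be estimated as below. This assumes decimal terabytes and exactly 4 hours, both of which are my rounding of an already approximate number.

```python
def sustained_rate_bytes_per_sec(bytes_moved: float, seconds: float) -> float:
    """Average transfer rate over the whole copy, in bytes per second."""
    return bytes_moved / seconds


TB = 10**12  # assumption: decimal terabytes

# Figures from the notes: 7.5TB moved in ~4 hours.
rate = sustained_rate_bytes_per_sec(7.5 * TB, 4 * 3600)
print(f"~{rate / 1e6:.0f} MB/s, or ~{rate * 8 / 1e9:.2f} Gbit/s")
```

That is an aggregate figure, presumably spread across many nodes copying in parallel rather than a single stream.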

Challenges - EC2

lots of issues Oct 2006 - June 2007
* location of S3 nodes relative to EC2 was a significant factor for large-scale data processing

July 2007 - present

* working well but hitting IO and CPU constraints on small (basic, default) nodes; however, will continue to use these small nodes

Consider Using AWS When

S3
* need cost-effective backup for data
* multi-provider preservation, geographically diverse

EC2
* you have spiky computing needs (e.g. spikes in demand)
* you have available R&D resources

Will experiment with AWS for crawling and harvesting, starting Jan 2008, Heritrix/AWS.
