ApacheCon NA 2010 Session

Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond

In this talk, we'll reflect on six years of Nutch, the open source web-scale search engine framework. Nutch has grown up since its inception as a brainchild of Doug Cutting: it moved from SourceForge.net to Apache, pollinated and procreated the Apache Hadoop project (and its ecosystem), and ultimately reached its 1.0 release. I'll talk about what worked and what didn't, and discuss major lessons learned in developing open source search engine software at scale. With that out of the way, I'll tell you about the exciting near-term and longer-term plans and development in the works for Nutch2, (Chris’s) codename "delegator". The Nutch2 architecture will be highly modular, will interoperate with other pals in the Lucene, Hadoop, and Tika ecosystems, and will address many of the ease-of-use difficulties encountered in configuring and deploying Nutch1. I'll also discuss some use cases of Nutch in NASA planetary science, as well as the Nutch-related work of my graduate students at the University of Southern California and in the CS572: Search Engines and Information Retrieval course taught during Summer 2010.