Hadoop Camp | 6-7 November
Apache Hadoop is being used by organizations to deliver petabyte-scale computing and storage on commodity hardware, by academic and industrial research groups, and at universities for teaching data-parallel computing. Hadoop Camp will bring together leaders from the Hadoop developer and user communities to share their experiences. Camp sessions will cover extensions being developed for Hadoop and related sub-projects, case studies of applications built and deployed on Hadoop, and interactive discussions on future directions for the platform.
Work shoulder-to-shoulder with many of the leaders in the Hadoop Community and join in on the discussions. The Camp will feature six hours of all things Hadoop! Sessions will be presented by key technologists and architects from Facebook, Hewlett-Packard, IBM, Powerset, Sun and Yahoo.
THURSDAY: Session Abstracts and Speaker Bios
15:00 Hadoop at Yahoo! – Eric Baldeschwieler
- Eric Baldeschwieler is the Vice President of Grid Computing at Yahoo!. He manages a team that contributes to Apache Hadoop and operates very large scale Hadoop clusters for internal Yahoo! research and production projects. For the last twelve years he has worked in various roles on very large scale data processing and web search problems at Yahoo! and Inktomi. Previously he worked as a software engineer on video games, video post-production gear, computer graphics systems, and other projects. He is having a blast scaling Hadoop and watching it take root in companies and universities worldwide!
15:15 Hadoop Directions: A Futures Panel - Moderator: Ajay Anand, Yahoo! • Panelists: Sameer Paranjpye, Owen O'Malley, and Sanjay Radia, Yahoo! and Dhruba Borthakur, Facebook
An interactive panel discussion on future directions and proposed developments for Hadoop.
- Ajay Anand is Director of Product Management for Grid Computing at Yahoo!. Ajay was product manager of Sun's first high availability file and database servers and then worked in product and marketing management roles in the areas of storage management, middleware, and identity management. Previously he was Director of Product Management for SGI's storage products and Aspect's customer management middleware. Ajay holds an MS in Computer Engineering and an MBA from the University of Texas at Austin, and a BSEE from the Indian Institute of Technology.
- Owen O'Malley is a Software Architect on Yahoo's Grid Computing team and is the chair of the Project Management Committee for Apache Hadoop. He has been a Hadoop committer since March 2006, and more than 200 of his patches have been committed to Hadoop. Before working on Hadoop, he worked on Yahoo Search's Webmap, which builds and analyzes the graph of the World Wide Web. Prior to Yahoo, he worked at NASA Ames Research Center on software model checking and at Sun on a distributed version control system. He received his PhD in Software Engineering from the University of California, Irvine.
- Sanjay Radia leads the Hadoop Distributed File System project at Yahoo!, where it is in daily use on large clusters of several thousand machines. Previously he held senior positions at Cassatt, Sun Microsystems, and INRIA, where he developed systems software for distributed systems and grid/utility computing infrastructures. He has published numerous papers and holds several patents. Sanjay has a PhD in Computer Science from the University of Waterloo, Canada.
- Dhruba Borthakur has been one of the lead contributors to the Hadoop Distributed File System. He has been associated with Hadoop almost since its inception while working for Yahoo; he currently works for Facebook. Earlier, he was a Senior Lead Engineer at Veritas Software (since acquired by Symantec), responsible for the design and development of software for the Veritas SAN File System. He was the team lead for developing the Mendocino Continuous Data Protection software appliance at a startup named Mendocino Software. Prior to Mendocino Software, he was the Chief Architect at Oreceipt.com, an e-commerce startup based in Sunnyvale, California. Earlier, he was a Senior Engineer at IBM-Transarc Labs, responsible for development of the Andrew File System (AFS), which is part of IBM's e-commerce initiative, WebSphere. Prior to his experience in the United States, Dhruba developed call processing software for Digital Switching Systems at C-DOT Delhi. Dhruba has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science from the Birla Institute of Technology and Science (BITS), Pilani, India. He has 7 issued patents and 14 patents pending.
16:30 Using Hadoop for an Intranet Search Engine – Shivakumar Vaithyanathan, IBM
As a company spanning 80+ countries with geographically dispersed organizational structures, IBM's intranet poses a challenging search problem. To tackle this, Project ES2 deploys a combination of sophisticated offline analytics and intelligent runtime query matching. Initially ES2 was implemented using a home-grown crawler and a set of analysis components working over DB2 and Lucene. As the analytics increased in complexity, this ad hoc solution suffered from several scalability problems. Furthermore, as Project ES2 matures, besides intranet pages we expect to crawl numerous enterprise "deep web" repositories, resulting in even higher demands on the infrastructure. To address this we are currently in the process of moving each component of ES2 onto a Hadoop cluster. This talk will describe our experiences in accomplishing this migration while continuing to maintain a working system under active use.
- Shiv Vaithyanathan is the Senior Manager of Infrastructure for Intelligent Information Systems at the IBM Almaden Research Center. His research interests span information extraction, databases, machine learning, and search. He is an Associate Editor of the journal Statistical Analysis and Data Mining.
17:00 Cloud Computing Testbed – Thomas Sandholm, Hewlett-Packard
The first part of this talk will present the Hewlett-Packard, Intel, and Yahoo! cloud computing research test bed, which was announced in July this year. An overview will be given of the motivation, initial setup, core services, and technical issues we expect the research community to address on the test bed. The second part showcases work on an HP Labs research project targeted at leveraging the test bed: Hadoop as a Service. This project investigates the resource allocation issues involved in offering Hadoop on demand to users with MapReduce jobs. The talk will conclude with a demo and lessons learned from working with Hadoop.
- Thomas Sandholm – As a CORBA, Grid, Web services and now Cloud computing technologist, Thomas Sandholm has been a long-time contributor to open source projects, including Apache Axis and the Globus Toolkit. He holds a Ph.D. in Computer and Systems Sciences from Stockholm University and currently works as a research scientist in the Social Computing Lab at Hewlett-Packard Labs in Palo Alto, CA. His current research in the intersection of computer science and economics focuses on computational market resource allocation.
17:30 Improving Virtualization and Performance Tracing of Hadoop with OpenSolaris – George Porter, Sun
In this talk, I will outline some of the ongoing efforts at Sun Microsystems to improve the deployability and manageability of Hadoop. We have developed a distribution of OpenSolaris hosting a virtual Hadoop cluster. OpenSolaris' lightweight virtualization support requires a very small memory footprint, leaving more resources for Hadoop jobs. We are also working with the Hadoop community to develop better trace support within Hadoop. These traces can be coupled with systems measurement tools, such as DTrace, to gain better insight into the behavior of Hadoop at scale.
- George Porter is a member of Sun Microsystems, and his research interests include improving the reliability and usability of large-scale distributed systems, with a current focus on Cloud Computing and datacenter environments. Prior to joining Sun, George was a member of the RAD Lab, a multidisciplinary research center at U.C. Berkeley, where he co-developed a cross-layer network tracing framework called X-Trace. He received his B.S. in Computer Science from the University of Texas at Austin, and his Ph.D. from the University of California, Berkeley.
18:00 An Insight into Hadoop usage at Facebook – Dhruba Borthakur, Facebook
This talk gives a brief overview of the type of applications that are using Hadoop at Facebook, the configuration of hardware and software in our Hadoop cluster, size and volume of datasets, characteristics of jobs and the processes we have built on top of Hadoop to keep the data pipeline alive and active.
- Dhruba Borthakur (see speaker bio under Thursday's Hadoop Directions panel).
FRIDAY: Session Abstracts and Speaker Bios
09:00 Hadoop on Amazon Web Services – Jinesh Varia, Amazon
Developers are finding new and innovative ways to use Hadoop in conjunction with Amazon Elastic Compute Cloud, Elastic Block Store, and Amazon Simple Storage Service. In this session, Jinesh Varia, Technology Evangelist for Amazon Web Services, will discuss the various ways Hadoop is being used within the Amazon Web Services (AWS) environment: how different people are using Hadoop in Amazon's cloud computing environment, how you can use Hadoop for your own use case, why AWS is proving to be an ideal runtime environment to try, test, and deploy production Hadoop apps, and how AWS is supporting the Hadoop development community.
- Jinesh Varia, Technology Evangelist, Amazon Web Services - As a Technology Evangelist at Amazon, Jinesh Varia helps developers take advantage of disruptive technologies that are changing the way we think about computer applications and the way businesses compete in the new web world. Jinesh has spoken at more than 50 conferences and user groups. He is focused on furthering awareness of web services and often helps developers on a one-on-one basis implement their own ideas using Amazon's services. Jinesh has over 9 years of experience in XML and web services and has worked with standards-based working groups in XBRL. Prior to joining Amazon as an evangelist, he held several positions at UBmatrix, including Solutions Architect, Enterprise Team Lead, and Software Engineer, working on various financial services projects including the Call Modernization Project at the FDIC. He was also lead developer at the Penn State Data Center, Institute of Regional Affairs. Jinesh has published in ACM and IEEE venues. Jinesh is originally from India and holds a Master's degree in Information Systems from Penn State University. He plays tennis and loves to trek.
09:30 Hive – Ashish Thusoo, Facebook
Hive is an open-source data warehousing infrastructure built on top of Hadoop that allows SQL-like queries along with the ability to add custom transformation scripts at different stages of data processing. It includes language constructs to import data from various sources, support for object-oriented data types, and a metadata repository that structures Hadoop directories into relational tables and partitions with typed columns. Facebook uses this system for a variety of tasks - classic log aggregation, graph mining, text analysis, and indexing. Using Hadoop's map/reduce paradigm, Hive is able to provide SQL-like query interfaces over vast quantities of data, thereby unlocking the power of this data not just to engineers but to business users and analysts. In this talk I will give an overview of the Hive system, its query language, future roadmap, and usage statistics within Facebook.
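To give a flavor of the translation the abstract describes - a SQL-like query compiled down to Hadoop's map/reduce paradigm - here is a minimal Python sketch. It is illustrative only, not Hive's implementation; the table, column names, and data are invented:

```python
from collections import defaultdict

# Hypothetical page-view records standing in for rows of a Hive table.
records = [
    {"page": "home", "user": "a"},
    {"page": "home", "user": "b"},
    {"page": "search", "user": "a"},
]

# Map phase: emit (group-key, 1) pairs, roughly what a generated map task
# would do for a query like: SELECT page, COUNT(1) FROM views GROUP BY page
def map_phase(rows):
    for row in rows:
        yield (row["page"], 1)

# Shuffle + reduce phase: group the pairs by key and sum the counts.
def reduce_phase(pairs):
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(records))
print(counts)  # {'home': 2, 'search': 1}
```

The point of the sketch is the division of labor: the query author writes only the declarative statement, and the system supplies the map, shuffle, and reduce plumbing.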
- Ashish Thusoo is an engineer on the Facebook Data Infrastructure team. In the past he has worked at Oracle in the Parallel Query group and then in the XML DB group. He is interested in large-scale distributed systems for analytics, data warehousing, and data mining.
10:30 Hadoop Hack Revealed – Christophe Bisciglia, Cloudera
New to Hadoop? Have some data you want to analyze? Have an idea to improve performance? Just want to play with TBs of data for its own sake? Cloudera is providing access to an Apache Hadoop cluster in the cloud and awarding prizes for the coolest hacks and applications. This session will review the submissions received for the Hadoop Hack. Get details and register: cloudera.com
- Christophe Bisciglia joins Cloudera from Google, where he created and managed their Academic Cloud Computing Initiative. Starting in 2007, he began working with the University of Washington to teach students about Google's core data management and processing technologies - MapReduce and GFS. This quickly brought Hadoop into the curriculum, and has since resulted in an extensive partnership with the National Science Foundation (NSF) which makes Google-hosted Hadoop clusters available for research and education worldwide. Beyond his work with Hadoop, he holds patents related to search quality and personalization, and spent a year working in Shanghai. Christophe earned his degree, and remains a visiting scientist, at the University of Washington.
14:00 Pig – Alan Gates, Yahoo!
This talk will introduce Pig Latin, a dataflow programming language, and the Pig engine, which runs on top of Hadoop. Pig Latin is designed to occupy the sweet spot between SQL and map/reduce programming. It gives developers control of the data flow and the ability to inject their own code at any point, while allowing them to avoid the details of writing Java and interacting directly with map/reduce.
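The dataflow style the abstract describes - each statement consuming the previous statement's output - can be sketched in plain Python. This is illustrative only; the relation, field names, and data are invented, and the Pig Latin lines in the comments are informal pseudocode rather than exact syntax:

```python
# Hypothetical (name, age) tuples standing in for a loaded relation.
records = [
    ("alice", 25), ("bob", 17), ("carol", 31), ("dave", 29),
]

# Step 1, roughly:  adults = FILTER users BY age >= 18;
adults = [(name, age) for name, age in records if age >= 18]

# Step 2, roughly:  by_decade = GROUP adults BY decade-of-age;
by_decade = {}
for name, age in adults:
    by_decade.setdefault(age // 10, []).append(name)

print(by_decade)  # {2: ['alice', 'dave'], 3: ['carol']}
```

Each step names an intermediate relation, which is what gives the developer the step-by-step control over the data flow that the talk contrasts with writing raw map/reduce jobs.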
- Alan Gates is one of the committers on the Pig project. He has been involved in RDBMS and large data system development for 10 years. For the last 5 years he has been designing and developing query systems for very large data sets at Yahoo!.
14:45 Zookeeper, Coordinating the Distributed Application – Benjamin Reed, Yahoo!
Distributed applications need to coordinate the processes scattered across the network. Sometimes this coordination is as simple as knowing who is up and who is down (group membership) or who is the master in the system (leader election); sometimes it is a sophisticated scheme to manage shards of processing, locate shard masters, recover from master failures, propagate changes to system configuration, and so on. ZooKeeper gives developers a very simple interface to implement these primitives in a way that is highly resilient to failures. In this presentation I will go over the ZooKeeper interface, show some examples of how it is used, and show the performance that users can expect from ZooKeeper.
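One of the primitives the abstract mentions, leader election, is commonly built on sequentially numbered nodes: every client creates a numbered node under an election path, and whoever holds the lowest number is the leader. The Python sketch below simulates that recipe with a toy in-memory stand-in; it is not the ZooKeeper client API, and the `FakeZooKeeper` class and paths are invented for illustration:

```python
import itertools

class FakeZooKeeper:
    """Toy in-memory stand-in for a ZooKeeper namespace (illustration only)."""

    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # path -> owning client

    def create_sequential(self, prefix, owner):
        # Mimics creating a sequential node: the server appends a
        # monotonically increasing, zero-padded counter to the path.
        path = f"{prefix}{next(self._seq):010d}"
        self.nodes[path] = owner
        return path

    def leader(self):
        # The client holding the lowest sequence number wins the election.
        return self.nodes[min(self.nodes)]

zk = FakeZooKeeper()
for client in ("a", "b", "c"):
    zk.create_sequential("/election/n_", client)

print(zk.leader())  # a
```

In the real system the nodes are ephemeral, so a crashed leader's node disappears and the next-lowest client takes over; that failure handling is exactly what the simulation leaves out.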
- Benjamin Reed is a Research Scientist in the Scalable Computing group of Yahoo! Research. He has been working on distributed systems at Yahoo! for the last 2 years. He is an OSGi Fellow. Benjamin received his PhD from the University of California, Santa Cruz for his work on security for network attached storage.
15:30 Querying JSON Data on Hadoop using Jaql – Kevin Beyer, IBM
We introduce Jaql, a query language for the JSON data model. JSON (JavaScript Object Notation) has become a popular data format for many Web-based applications because of its simplicity and modeling flexibility. In contrast to XML, which was originally designed as a markup language, JSON was actually designed for data. JSON makes it easy to model a wide spectrum of data, ranging from homogeneous flat data to heterogeneous nested data, and it can do this in a language-independent format. We believe that these characteristics make JSON an ideal data format for many Hadoop applications and databases in general. This talk will describe the key features of Jaql and show how it can be used to process JSON data in parallel using Hadoop's map/reduce framework.
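The heterogeneous nested data the abstract mentions is the interesting case: records may or may not carry a given nested field, and a query must tolerate both. The Python sketch below shows such a transform as map and reduce steps; it is illustrative only, not Jaql syntax, and the records and field names are invented:

```python
import json

# Hypothetical heterogeneous JSON records: one lacks the nested array.
raw = """[
  {"user": "a", "clicks": [{"url": "x"}, {"url": "y"}]},
  {"user": "b"},
  {"user": "c", "clicks": [{"url": "x"}]}
]"""

records = json.loads(raw)

# Map step: flatten each record into (url, 1) pairs, tolerating records
# that have no "clicks" field at all.
pairs = [(click["url"], 1)
         for rec in records
         for click in rec.get("clicks", [])]

# Reduce step: count clicks per URL.
counts = {}
for url, n in pairs:
    counts[url] = counts.get(url, 0) + n

print(counts)  # {'x': 2, 'y': 1}
```

In a map/reduce deployment the flatten would run in parallel map tasks over splits of the input and the counting in reduce tasks, which is the parallel execution model the talk describes.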
- Kevin Beyer is a Research Staff Member at the IBM Almaden Research Center. His research interests are in information management, including query languages, analytical processing, and indexing techniques. He has been designing and implementing Jaql, in one form or another, for the past several years. Previously, he led the design and implementation of the XML indexing support in DB2 pureXML.
16:30 HBase – Michael Stack, Microsoft
HBase provides a highly scalable distributed structured data store on top of the Hadoop DFS. Come learn about the current state of the HBase project and the roadmap of future improvements.
- Michael Stack is a Senior Software Development Engineer at Microsoft, working for the Live Search team (and formerly Powerset). He is an HBase committer and a member of the Hadoop PMC. Previously, he worked at the Internet Archive.