Hadoop Camp | 6-7 November

Sponsored by Yahoo!

Apache Hadoop is being used by organizations to deliver petabyte-scale computing and storage on commodity hardware, by academic and industrial research groups, and at universities for teaching data-parallel computing. Hadoop Camp will bring together leaders from the Hadoop developer and user communities to share their experiences. Camp sessions will cover extensions being developed for Hadoop and related sub-projects, case studies of applications being built and deployed on Hadoop, and interactive discussions on future directions for the platform.

Work shoulder-to-shoulder with many of the leaders in the Hadoop Community and join in on the discussions. The Camp will feature six hours of all things Hadoop! Sessions will be presented by key technologists and architects from Facebook, Hewlett-Packard, IBM, Powerset, Sun and Yahoo.

THURSDAY: Session Abstracts and Speaker Bios

15:00  Hadoop at Yahoo! – Eric Baldeschwieler

15:15  Hadoop Directions: A Futures Panel – Moderator: Ajay Anand, Yahoo! • Panelists: Sameer Paranjpye, Owen O'Malley, and Sanjay Radia, Yahoo! and Dhruba Borthakur, Facebook

An interactive panel discussion on future directions and proposed developments for Hadoop.

16:30  Using Hadoop for an Intranet Search Engine – Shivakumar Vaithyanathan, IBM

As a company spanning 80+ countries with geographically dispersed organizational structures, IBM has an intranet that poses a challenging search problem. To tackle this, Project ES2 deploys a combination of sophisticated offline analytics and intelligent runtime query matching. Initially, ES2 was implemented using a home-grown crawler and a set of analysis components working over DB2 and Lucene. As the analytics increased in complexity, this ad hoc solution suffered from several scalability problems. Furthermore, as Project ES2 matures, we expect to crawl numerous enterprise "deep web" repositories in addition to intranet pages, placing even higher demands on the infrastructure. To address this, we are currently moving each component of ES2 onto a Hadoop cluster. This talk will describe our experiences in accomplishing this migration while continuing to maintain a working system under active use.


17:00  Cloud Computing Testbed – Thomas Sandholm, Hewlett-Packard

The first part of this talk will present the Hewlett-Packard, Intel, and Yahoo! cloud computing research testbed, which was announced in July this year. An overview will be given of the motivation, initial setup, core services, and technical issues we expect the research community to address on the testbed. The second part showcases work on an HP Labs research project targeted at leveraging the testbed: Hadoop as a Service. This project investigates the resource allocation issues involved in offering Hadoop on demand to users with MapReduce jobs. The talk will conclude with a demo and lessons learned from working with Hadoop.


17:30  Improving Virtualization and Performance Tracing of Hadoop with OpenSolaris – George Porter, Sun

In this talk, I will outline some of the ongoing efforts at Sun Microsystems to improve the deployability and manageability of Hadoop. We have developed a distribution of OpenSolaris hosting a virtual Hadoop cluster. OpenSolaris' lightweight virtualization support requires a very small memory footprint, leaving more resources for Hadoop jobs. We are also working with the Hadoop community to develop better trace support within Hadoop. These traces can be coupled with systems measurement tools, such as DTrace, to gain better insight into the behavior of Hadoop at scale.


18:00  An Insight into Hadoop usage at Facebook – Dhruba Borthakur, Facebook

This talk gives a brief overview of the types of applications that are using Hadoop at Facebook, the configuration of hardware and software in our Hadoop cluster, the size and volume of datasets, the characteristics of jobs, and the processes we have built on top of Hadoop to keep the data pipeline alive and active.


FRIDAY: Session Abstracts and Speaker Bios

09:00  Hadoop on Amazon Web Services – Jinesh Varia, Amazon

Developers are finding new and innovative ways to use Hadoop in conjunction with Amazon Elastic Compute Cloud (with Elastic Block Store) and Amazon Simple Storage Service. In this session, Jinesh Varia, Technology Evangelist for Amazon Web Services (AWS), will discuss the various ways Hadoop is being used within the AWS environment. We will learn how different people are using Hadoop within Amazon's cloud computing environment and how you can use Hadoop for your own use case. We will also learn how AWS is proving to be the ideal runtime environment to try, test, and deploy your production Hadoop applications, and how AWS is supporting the Hadoop development community.

09:30  Hive – Ashish Thusoo, Facebook

Hive is an open-source data warehousing infrastructure built on top of Hadoop that allows SQL-like queries and lets users add custom transformation scripts at different stages of data processing. It includes language constructs to import data from various sources, support for object-oriented data types, and a metadata repository that structures Hadoop directories into relational tables and partitions with typed columns. Facebook uses this system for a variety of tasks: classic log aggregation, graph mining, text analysis, and indexing. Using Hadoop's map/reduce paradigm, Hive is able to provide SQL-like query interfaces on vast quantities of data, thereby unlocking the power of this data not just for engineers but also for business users and analysts. In this talk I will give an overview of the Hive system, its query language, the future roadmap, and usage statistics within Facebook.


10:30  Hadoop Hack Revealed – Christophe Bisciglia, Cloudera

New to Hadoop? Have some data you want to analyze? Have an idea to improve performance? Just want to play with TBs of data for its own sake? Cloudera is providing access to an Apache Hadoop cluster in the cloud and awarding prizes for the coolest hacks and applications. This session will review the submissions received for the Hadoop Hack. Get details and register: cloudera.com

14:00  Pig – Alan Gates, Yahoo!

This talk will introduce Pig Latin, a dataflow programming language, and the Pig engine, which runs on top of Hadoop. Pig Latin is designed to occupy the sweet spot between SQL and map/reduce programming. It gives developers control of the data flow and the ability to inject their code at any point, while allowing them to avoid the details of writing Java and interacting directly with map/reduce.


14:45  ZooKeeper, Coordinating the Distributed Application – Benjamin Reed, Yahoo!

Distributed applications need to coordinate processes scattered across the network. Sometimes this coordination is as simple as knowing who is up and who is down (group membership) or who is the master in the system (leader election); sometimes it is a more sophisticated scheme for managing shards of processing, locating shard masters, recovering from master failures, propagating changes to system configuration, and so on. ZooKeeper gives developers a very simple interface to implement these primitives in a way that is highly resilient to failures. In this presentation I will go over the ZooKeeper interface, show some examples of how it is used, and show the performance that users can expect from ZooKeeper.
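
To make group membership and leader election concrete, here is a minimal in-memory sketch in Python. It is not the ZooKeeper API and involves no network: the ToyCoordinator class, its method names, and the worker names are all hypothetical. It assumes one common election rule, in which each process receives an increasing sequence number when it joins and the live process with the lowest number is the leader.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper API)."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # sequence number -> process name

    def join(self, name):
        """Register a live process and return its sequence number."""
        seq = next(self._seq)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Drop a process, as would happen when it crashes or disconnects."""
        self._members.pop(seq, None)

    def members(self):
        """Group membership: names of live processes in join order."""
        return [self._members[s] for s in sorted(self._members)]

    def leader(self):
        """Leader election: the live process with the lowest sequence number."""
        if not self._members:
            return None
        return self._members[min(self._members)]


coord = ToyCoordinator()
a = coord.join("worker-a")
b = coord.join("worker-b")
c = coord.join("worker-c")
first_leader = coord.leader()   # elected while all three are up
coord.leave(a)                  # the current leader goes down
second_leader = coord.leader()  # the next-lowest sequence number takes over
```

A real coordination service must also detect failures and notify the remaining processes; the sketch leaves both out and only shows the bookkeeping.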


15:30  Querying JSON Data on Hadoop using Jaql – Kevin Beyer, IBM

We introduce Jaql, a query language for the JSON data model. JSON (JavaScript Object Notation) has become a popular data format for many Web-based applications because of its simplicity and modeling flexibility. In contrast to XML, which was originally designed as a markup language, JSON was actually designed for data. JSON makes it easy to model a wide spectrum of data, ranging from homogeneous flat data to heterogeneous nested data, and it can do this in a language-independent format. We believe that these characteristics make JSON an ideal data format for many Hadoop applications and databases in general. This talk will describe the key features of Jaql and show how it can be used to process JSON data in parallel using Hadoop's map/reduce framework.
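
As a rough illustration of processing heterogeneous JSON records in a map/reduce style, here is a minimal Python sketch. It is not Jaql syntax and does not use Hadoop: the sample records, the field names, and the functions map_record and reduce_counts are all made up for illustration, and everything runs in a single process.

```python
import json
from collections import defaultdict

# Hypothetical input: one JSON record per line, with differing shapes.
RAW = [
    '{"user": "ann", "tags": ["hadoop", "pig"]}',
    '{"user": "bob", "tags": ["hadoop"], "profile": {"org": "IBM"}}',
    '{"user": "cy"}',
]


def map_record(line):
    """Map step: parse one record and emit a (tag, 1) pair per tag."""
    record = json.loads(line)
    for tag in record.get("tags", []):  # records without tags emit nothing
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [pair for line in RAW for pair in map_record(line)]
tag_counts = reduce_counts(pairs)
```

Because each record is parsed on its own, missing or extra fields in one record do not affect the others, which is the flexibility the abstract attributes to JSON.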


16:30  HBase – Michael Stack, Powerset

HBase provides a highly scalable distributed structured data store on top of the Hadoop DFS. Come learn about the current state of the HBase project and the roadmap of future improvements.