Pig - Making Hadoop Easy

Hadoop provides a powerful platform for scalable, fault-tolerant, data-centric computing. As a user-facing programming paradigm, however, it is too low-level: users must write a significant amount of custom code, re-implement common processing primitives (e.g. join), and worry about chaining jobs together. Pig is a Hadoop sub-project that provides a higher-level programming language for describing parallel computation. It takes care of implementing common relational operators such as join and filter, operator pipelining, and job chaining, while providing ways to incorporate custom user code via user-defined functions and streaming. The result is much simpler and more compact code, increased user productivity, and reduced maintenance time. At the same time, unlike SQL databases, which rely on a query optimizer to determine the execution strategy for a user program, Pig stays faithful to the spirit of map-reduce, whereby a user program specifies a simple sequence of steps for the system to follow. The talk will introduce Pig and its programming model, contrast it with Hadoop's model, and provide motivation for using Pig as the preferred programming paradigm for most applications. The performance tradeoffs will also be discussed.