ApacheCon NA 2013

Portland, Oregon

February 26th – 28th, 2013

Wednesday 1:45 p.m.–2:45 p.m.

Near-Realtime Processing over HBase

Ryan Brush

Track: Big Data
Audience level: Intermediate

Description

The better your Hadoop-based processing, the faster people want it. This is a case study of complementing MapReduce with stream-based processing of complex healthcare data. We start with raw input and end with rich, indexed content served in Solr. Along the way we look at how we use Hadoop, HBase, Crunch, and Twitter's Storm project to help make big data fast.

Abstract

Healthcare data is often fragmented across institutions or stored in formats that are not easily explored. Here we look at a system that securely brings together related pieces of health information and processes them into a variety of data models useful to clinicians. This talk will include:

  • Low-latency data ingestion from multiple sources into HBase
  • A reliable, scalable change notification system over HBase
  • Processing those incremental changes using the Storm project
  • Bulk processing of data using an incubator build of Apache Crunch
  • Building Solr indexes in MapReduce and incrementally updating them
  • Serving resulting data to clinical applications out of HBase and Solr

This talk draws from the "Hadoop, HBase, and Healthcare" talk given at the 2012 Hadoop World, but goes deeper into the technologies and techniques used. A basic understanding of Hadoop and MapReduce is assumed. Working knowledge of HBase and Solr may be helpful but is not required.