Apache Hadoop is one of the fastest-growing big data platforms for storing and analyzing arbitrarily structured data in search of business insights. Commodity infrastructure suitable for Hadoop has advanced greatly in recent years, yet there is little information to help the community optimally design and configure Hadoop infrastructure for specific requirements, and much of what is available is erroneous and needs to be debunked. For example: how many disks and controllers should you use? Should you buy processors with 4 or 6 cores, and which processor type? Do you need a 1GbE or 10GbE network? Should you use SATA or MDL SAS drives, in small or large form factors? How much memory do you need? How do you characterize your Hadoop workloads to determine whether you are I/O, CPU, network, or memory bound? And once you've answered all of that, how do you make infrastructure design trade-offs to keep costs as low as possible?
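To make the workload-characterization question concrete, here is a minimal sketch, not taken from the talk and not HP's methodology, that samples `vmstat` on a worker node and applies purely illustrative thresholds to guess whether the node is I/O, CPU, or memory bound. The thresholds and the `vmstat`-based approach are assumptions for illustration; network saturation would need separate counters (e.g. interface statistics) and is not covered here.

```python
# Rough, illustrative heuristic for spotting a node's dominant bottleneck.
# Assumes a Linux host with vmstat installed; thresholds are arbitrary examples.
import subprocess


def sample_vmstat(interval: int = 1, count: int = 5) -> dict:
    """Run `vmstat interval count` and return column averages over the samples."""
    out = subprocess.check_output(["vmstat", str(interval), str(count)], text=True)
    lines = out.strip().splitlines()
    header = lines[1].split()                       # column names (r, b, ..., us, sy, id, wa)
    rows = [line.split() for line in lines[2:]][1:] # drop the "since boot" first data row
    return {col: sum(float(r[i]) for r in rows) / len(rows)
            for i, col in enumerate(header)}


def classify(avg: dict) -> str:
    """Very coarse classification based on averaged vmstat columns."""
    if avg["wa"] > 30:                 # high I/O wait -> likely disk bound
        return "I/O bound"
    if avg["si"] + avg["so"] > 0:      # any sustained swapping -> memory pressure
        return "Memory bound"
    if avg["us"] + avg["sy"] > 80:     # CPUs mostly busy -> CPU bound
        return "CPU bound"
    return "No obvious single bottleneck (check network counters separately)"


if __name__ == "__main__":
    print(classify(sample_vmstat()))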
In this talk we'll discuss the lessons learned and outcomes from HP's work on optimally designing and configuring infrastructure for both MapReduce and HBase. The talk also serves as a crash course in infrastructure design for anyone working on scale-out architectures; no in-depth knowledge of hardware is required.