Hadoop-BAM is a library for distributed processing of genetic data from next-generation sequencing machines. It enables scalable manipulation of aligned reads in the Hadoop distributed computing framework, acting as an integration layer between analysis applications and the BAM (Binary Alignment/Map) files that are processed using Hadoop. Hadoop-BAM addresses the difficulties of BAM data access by providing a convenient API for implementing map and reduce functions in the Hadoop MapReduce framework.
The library builds on the popular Picard SAM-JDK, so tools that rely on the Picard API are expected to be easy to convert for large-scale distributed processing.
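To make the idea of map and reduce functions over aligned reads concrete, here is a minimal sketch in plain Python. It does not use the Hadoop-BAM API or Hadoop itself: it assumes the reads are already available as text-format SAM lines, and the sample records and the names map_read, reduce_counts, and run_job are hypothetical, chosen only for illustration. The job counts mapped reads per reference sequence.

```python
from collections import defaultdict

# Hypothetical sample data: three alignment records in text SAM format.
# The 11 mandatory tab-separated columns are
# QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
SAM_LINES = [
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "read3\t0\tchr2\t75\t60\t4M\t*\t0\t0\tGGCC\tIIII",
]

UNMAPPED_FLAG = 0x4  # SAM FLAG bit that marks an unmapped read


def map_read(sam_line):
    """Map step: emit (reference name, 1) for each mapped read."""
    fields = sam_line.split("\t")
    flag = int(fields[1])
    if flag & UNMAPPED_FLAG:
        return  # unmapped reads contribute nothing
    yield fields[2], 1  # column 3 is RNAME, the reference sequence name


def reduce_counts(reference, values):
    """Reduce step: sum the counts emitted for one reference."""
    return reference, sum(values)


def run_job(lines):
    """Stand-in for the framework: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_read(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = run_job(SAM_LINES)
print(counts)
```

In a real deployment the grouping done here by run_job would be handled by Hadoop, and the records would be read from BAM files through the library rather than from an in-memory list; only the map and reduce logic would be written by the tool author.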
Hadoop-BAM is available under the open source MIT license. For downloads, see the files listing on the project page.
Note the related project SeqPig, which enables the processing of sequence data in Hadoop via the popular Pig Latin scripting language.
Matti Niemenmaa, Aleksi Kallio, André Schumacher, Petri Klemelä, Eija Korpelainen, and Keijo Heljanko. Hadoop-BAM: Directly manipulating next generation sequencing data in the cloud. Bioinformatics. Available online via Open Access.