Homework 6. Comparing high level cloud query languages

Motivation

You're working for the Broad Institute's data mining division. One of your flagship software projects is the Genome Analysis Toolkit (GATK), a software library written in Java. GATK runs on the Hadoop platform, which provides reliable, scalable, distributed computing based on the MapReduce framework.

Your GATK-based applications are typically written by people with expertise in both genomics and Java programming, and this is turning into a problem because few geneticists are Java experts. You'd like to make GATK easier to use, by letting geneticists write their applications in a scripting language rather than in Java. The idea is that they'll need less programming experience that way.

Your boss suggests that you look at three high level query languages that have been built atop Hadoop: Hive's HiveQL, Jaql, and Pig's Pig Latin. She wants you to see which of these would be the most suitable for this application. Ideally she'd like you to translate a simple GATK or GATK-like application into these three candidate languages, and then run them, to see which works best. You have limited time, though, so you settle for just a paper study at first.

Assignment

Investigate the suitability of writing GATK-style applications in Hive, Jaql, and Pig, by showing how a similar example application would be specified in each of the three notations. Specify any assumptions you make in your translations, and any gaps or omissions in your resulting implementations.

Write an executive summary assessing the suitability of Hive, Jaql, and Pig for improving the usability of GATK. The summary should be in 10-point font or larger and should be at most two pages. You can put references and appendixes on later pages if there's not enough room on two pages; the appendixes should contain the source code or sketches for your example application. Your summary should focus on the technologies' effects on ease of use, flexibility, generality, performance, and reliability, compared to the existing GATK approach of using plain Java. The summary should be suitable for software executives, that is, for readers who have some expertise in software, particularly in managing software developers, but who are not experts in Hadoop or in the three scripting languages. Please keep the resources for written reports in mind, particularly their advice on citing the sources that you consulted.

Submit

Submit a file hw6.pdf containing your summary.


© 2011 Paul Eggert. See copying rules.
$Id: hw6.html,v 1.40 2011/11/22 08:43:37 eggert Exp $