You're working for Acme Data Miners (ADM), which specializes in quickly mining useful information about large quantities of data. For example, your company's software might look at your credit-card history, discover that you've bought a chrome-skull car accessory, decide that you're therefore probably not a good credit risk, and lower your credit limit. (See Charles Duhigg, "What does your credit-card company know about you?", New York Times, 2009-05-12.)
Currently your company's data analysis system is based on a lot of shell scripts that use pipes and temporary files. For example, it might use shell code like this:
tr -cs 'A-Za-z' '[\n*]' | sort -u | comm -23 - words
to do a simple spelling checker (see the CS 35L shell scripting assignment for discussion of this example). Your actual scripts are much more complicated than this, but this is an adequate (albeit small) example of what they're like.
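To make the pipeline's behavior concrete, here is a minimal sketch that runs it on a tiny input. The dictionary contents, the sample sentence, and the `spellcheck` function name are made up for illustration. The sketch assumes the dictionary file is already sorted and that sorting and comparison both use the C locale. Note that the pipeline as written is case-sensitive.

```shell
#!/bin/sh
# Sketch only: exercise the spelling pipeline on a tiny, made-up input.
# Use the C locale so that sort and comm agree on ordering.
export LC_ALL=C

tmpdir=$(mktemp -d)

# Hypothetical dictionary, one word per line, already sorted.
printf '%s\n' a is test this > "$tmpdir/words"

# spellcheck DICT: read text on stdin, print words that are not in DICT.
spellcheck() {
  # tr:   turn every run of non-letters into a single newline
  # sort: sort the words and drop duplicates
  # comm: keep only lines unique to stdin (suppress columns 2 and 3)
  tr -cs 'A-Za-z' '[\n*]' | sort -u | comm -23 - "$1"
}

result=$(printf 'this is a tset, a test\n' | spellcheck "$tmpdir/words")
echo "$result"

rm -rf "$tmpdir"
```

Each stage reads its whole input before the next stage can finish, and `sort` in particular needs to see every word, which is worth keeping in mind when thinking about how such a pipeline would be split across machines.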
Your company is doing well and you're getting a lot of work to do. You're running software on big machines with lots of processors. This works reasonably well, but the machines are expensive (especially their network file system), and you want to do the runs more cheaply, on a cluster or perhaps with cloud computing.
Your boss has heard good things about Hadoop for these kinds of environments. He suggests that you look into the possibility of translating your company's scripts into something that will run under Hadoop. Three possibilities suggest themselves: Java, Pig Latin, and Jaql.
Investigate the suitability of writing Hadoop-oriented programs in Java, Pig Latin, and Jaql, by translating the simple spelling example into each of the three notations. Specify any assumptions you make in your translations, and any gaps or incompletenesses in your resulting implementation.
Write a two-page executive summary assessing the suitability of Jaql, Pig Latin and Java for improving the performance of your company's software using Hadoop. The summary should be in 10-point font or larger. You can put references and appendixes on later pages if there's not enough room on two pages (the appendixes might be useful for containing source code that you wrote). Your summary should focus on the technologies' effects on performance, cost-effectiveness, reliability, portability (to future hardware), flexibility, and ease of use, compared to using shell scripts. It should be suitable for software executives, that is, for readers who have some expertise in software, particularly in managing software developers, but who are not experts in Hadoop or Jaql or Pig Latin. Please keep the resources for written reports in mind.
Pig Latin is installed on SEASnet, in /usr/local/cs/bin as usual; you can try it out with, for example, the command pig -x local.
Jaql 0.4 is not installed on SEASnet, as 0.4 is not yet released and the trunk version is not quite ready for prime time; you'll have to do a paper study with its syntax, or get it to run on your machine.
Submit a file hw6.pdf containing your summary.