hamake - Google Code
Description
The 'hamake' utility automates incremental processing of datasets stored on HDFS, using Hadoop tasks written in Java or PigLatin scripts. A dataset can be either an individual file or a directory containing a group of files. Files may be added or removed at arbitrary locations, which may trigger recalculation of the data that depends on them. In this respect it is similar to the Unix 'make' utility.
First, you formulate your processing model in terms of data locations (which can serve as inputs or outputs) and tasks.
Two types of tasks are currently supported. They are called "map" and "reduce", but they should not be confused with Hadoop's "map" and "reduce":
MAP - a task that maps a group of files at one location to one or more other locations. It assumes a 1-to-1 file mapping between locations and can process files incrementally, converting only those files that are present at the source location but missing from at least one of the destinations.
If we view MAP as a function, we can define it using Haskell language syntax as:
map :: Path -> [Path] -> [Path]
map source dependencies = ...  -- the result is the list of targets
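The incremental rule for MAP can be illustrated with a minimal sketch. It is not hamake's actual implementation: the name `pendingFiles` is hypothetical, locations are modelled as plain lists of file names, and it assumes that a converted file keeps the same name at every destination.

```haskell
-- Hypothetical helper for illustration; not part of hamake itself.
type FileName = String

-- | Files a MAP task still has to convert: those present at the source
-- but missing from at least one destination.
-- Assumption: a converted file has the same name at each destination.
pendingFiles :: [FileName] -> [[FileName]] -> [FileName]
pendingFiles source destinations =
  filter missingSomewhere source
  where
    -- True when at least one destination does not contain the file.
    missingSomewhere name = any (name `notElem`) destinations

main :: IO ()
main = do
  let source       = ["a.log", "b.log", "c.log"]
      destinations = [["a.log", "b.log"], ["a.log"]]
  -- "a.log" exists everywhere, so only the other two are pending.
  print (pendingFiles source destinations)  -- prints ["b.log","c.log"]
```

In this model a file is skipped only when every destination already holds it, which is why a single lagging destination is enough to schedule the file again.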
REDUCE - a task that takes a group of files as input and produces one or more outputs. All input files are treated as a single dataset, and if any of them is newer than the destination, recalculation is triggered.
If we view REDUCE as a function, we can define it using Haskell language syntax as:
reduce :: [Path] -> [Path]
reduce source = ...  -- the result is the list of targets
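The staleness check for REDUCE can be sketched in the same spirit. This is an illustration under stated assumptions, not hamake's code: `needsRecalculation` is a hypothetical name, modification times are modelled as plain integers, a task with no existing outputs is assumed to always run, and with several outputs the oldest one is used for the comparison.

```haskell
-- Hypothetical model for illustration; not part of hamake itself.
type Timestamp = Integer

-- | Decide whether a REDUCE task has to run again.
-- Assumptions: missing outputs always force a run, and with several
-- outputs the oldest one is the reference point.
needsRecalculation :: [Timestamp] -> [Timestamp] -> Bool
needsRecalculation inputs outputs
  | null outputs = True
  | otherwise    = any (> minimum outputs) inputs

main :: IO ()
main = do
  print (needsRecalculation [10, 20] [15])  -- True: one input is newer
  print (needsRecalculation [10, 12] [15])  -- False: output is up to date
  print (needsRecalculation [10] [])        -- True: no output exists yet
```

Unlike the MAP case, the inputs are never considered one by one: a single newer input invalidates the whole output.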
You describe your tasks, along with their input and output locations, in a 'hamakefile' using a simple XML syntax (see HaMakefileSyntax). 'hamake' reads this file, builds a dependency graph, and attempts to execute the tasks in an order that resolves all dependencies. (If you have a circular dependency, you can specify a "generation" attribute on an input or output.) hamake takes care of figuring out which tasks have to be executed and in what order. It can execute several tasks in parallel if they do not depend on each other. It
