12/31/2010

The underappreciated banana and its buddy monoid

Note: This is the blog post that goes with the Channel9 lecture "Going Bananas". You find all material (slides, code) linked up on the web site for my Channel9 series; see the section on "Going Bananas". I will add a link to the C9 website with the video as soon as it gets available.

I am really Ok. Well, but I do see bananas (folds) everywhere. This tells me that data processing is quite regular. For instance, MapReduce computations for parallel data processing are essentially folds that extract some intermediate data from the records of a voluminous input---subject to subsequent, possibly distributed and staged reduction. Also, in program transformation and analysis, many data processing steps are compositional, and thereby suggest themselves as being implemented through "large" bananas. Further, in XML processing, many operations are busy with folding over XML trees. There are bananas for everyone---not just for lists. If we were just acknowledging such omnipresence of fold, reasoning about our programs may be drastically simplified.

Many bananas team up with monoids. Again, this is great because this tells me that combination of intermediate results along recursive computations is typically quite regular, too. For instance, MapReduce computations for parallel data processing essentially use monoids to aggregate results and to build up indexes. (The monoidal properties enable parallel composition in so far that we can take any segment of input and process it on a separate machine, and such per-machine results can be combined, in fact, reduced later. We also see easily when we need more than just a monoid. For instance, we may need commutativity in order to be even more flexible with parallel processing schedules.) Also, in program analysis, many data processing steps recurse into substructures, and combine intermediate results in a monoidal fashion. It seems that some people mainly think of monoids as numbers and perhaps lists with the usual culprits for binary operation and identity. If you have never used a monoid for pairs or maps, then you should try it as soon as you can.

One may end up obfuscating programs by under-appreciating the fundamental concepts of bananas and monoids for recursion schemes and algebraic computations. For instance, suppose you design a domain-specific language for parallel data processing, and you suggest certain language concepts for data analysis. Wouldn't it be helpful to point out that the key underlying concept is essentially the one of a list homomorphism, which has been heavily studied from a point of view of data parallelism? This is not a conceived example. Sawzall describes a DSL just like that, and it took me a few hours to see that Sawzall's aggregators boil down to a programming scheme for list homomorphisms with interesting monoids. Likewise, the visitor pattern in OO programming is a bit of a chaos, and so it may help to see something as simple and powerful as a large bananas to consider switching paradigms or develop a simpler view on visitors.

I have compiled a lecture with accompanying code base, to appear on Channel9, which discusses a number of use cases for bananas and monoids. I begin with simple uses of foldr meant to show the generality and expressiveness of this great operator; this also includes a quick discussion of those type classes that can be used for folds on container types other than Haskell's concrete list type. (See code module Foldr.hs for this part.) Then I switch to parallel data processing à la MapReduce, and have a lot of fun with map, foldr, monoids, and friends. (See code module MapReduce.hs for this part.) Then, I switch from lists to heterogenous types, as they are used to model problem domains in programming, e.g., dataypes for ASTs used in interpretation, software transformation and analysis. On such AST types, I discuss large bananas, i.e., bananas that possibly handle many constructors, and the "Scrap Your Boilerplate" style of generic functional programming which involves yet other bananas. (See code modules Large.hs and SYB.hs for these two parts.) I had to be selective in the interest of staying close to a 1h lecture. For instance, folds in the sense of Church encodings, functorial style generic programming, or graph traversals remain unmentioned. I also fail to spend significant time on reasoning about the programs at hand, but I give some further reader pointers as usual. Enjoy, and please provide feedback.

Regards
Ralf

1 comment:

  1. Hi Ralf! Great lecture! Eye opening and exciting!

    ReplyDelete