Steps towards Haskell in the cloud

Introduction

I have a considerable interest in the use of massively parallel cloud systems for handling hard problems, but unfortunately such systems currently rely on extremely complex infrastructure that is quite intrusive when it comes to writing applications.  My goal, now as before, is to find a lightweight framework that provides cloud computing capability for Haskell applications while remaining simple to program against.

I have discussed elsewhere my initial implementation of MapReduce as a monadic operation that works in parallel processes on a single machine.  Here I take the first steps towards analysing what else is needed to turn this into a working cloud-based system.

The basics

As I showed in my paper on MapReduce, nearly all massively parallel algorithms are variations on the basic scatter-gather approach.  This works as follows:

foreach processing step
    divide data into chunks
    give chunks to processing units
    processing units transform data
    gather results from processing units
    concatenate to form next step's data
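
To make this concrete, here is a minimal single-machine sketch in Haskell, with the processing units simulated by plain function application; chunksOf, step and pipeline are names of my own invention, purely for illustration, not part of any existing library.

-- Divide a dataset into chunks of at most n elements (n assumed positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- One scatter-gather step: divide the data, let each (simulated)
-- processing unit transform its chunk, then gather and concatenate.
step :: Int -> ([a] -> [b]) -> [a] -> [b]
step n transform = concatMap transform . chunksOf n

-- A whole job is a sequence of steps, each consuming the
-- concatenated output of the previous one.
pipeline :: Int -> [[a] -> [a]] -> [a] -> [a]
pipeline n fs input = foldl (\dat f -> step n f dat) input fs

In a real system the map over chunks would of course be replaced by handing each chunk to a remote node, but the shape of the computation is exactly this.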

It’s useful to reformulate this from the processing unit’s point of view, so we get:

foreach processing step
    request chunk of data
    while chunk allocated
        get chunk
        transform chunk
        set transformed chunk
        request chunk of data

This makes it clear that scatter-gather requires two basic distributed services.

The services

The datastore

First we need a simple central datastore that:

  1. Accepts chunks and concatenates them together to form a complete dataset
  2. Can return the complete dataset when asked to

This service is, of course, little more than a mildly intelligent front end on a database table.

Note that in standard MapReduce, the partitioning of data into chunks is performed by the central datastore.  Since we want to run generalised MapReduce, where the partitioning cannot be fixed centrally in advance, we instead send the entire dataset to all processing nodes.  In fact, if this is handled sensibly (e.g. via IP multicast) then there is no difference in network loading at all.
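
As a sketch of the interface, here is an in-memory toy using an MVar; a real service would put the same two operations in front of a database table.  DataStore, putChunk and getDataset are illustrative names, not an existing API.

import Control.Concurrent.MVar

-- A toy in-memory datastore: chunks are concatenated as they arrive,
-- and the full dataset can be read back at any time.
newtype DataStore a = DataStore (MVar [a])

newDataStore :: IO (DataStore a)
newDataStore = fmap DataStore (newMVar [])

-- Accept a chunk and concatenate it onto the dataset so far.
putChunk :: DataStore a -> [a] -> IO ()
putChunk (DataStore var) chunk = modifyMVar_ var (\ds -> return (ds ++ chunk))

-- Return the complete dataset; in the scheme above this is what gets
-- sent (ideally multicast) to every processing node.
getDataset :: DataStore a -> IO [a]
getDataset (DataStore var) = readMVar var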

The broker

Next we need a broker, whose job is simply to wait for processing nodes to ask for chunks, and either to assign one or to inform them that the step is finished and they should proceed to the next step.  So the broker has to:

  1. Keep track of how many processing nodes there are
  2. Decide how many chunks to divide each step into
  3. Wait for nodes to request chunks and hand them off one by one
  4. Decide when the step is completed
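
Here is a toy sketch of the hand-off logic only (points 3 and 4); tracking nodes, choosing the chunk count and handling outstanding or failed chunks are all omitted, and the names are again purely illustrative.

import Control.Concurrent.MVar

-- A toy broker for a single step: it holds the chunks not yet assigned
-- and hands them off one at a time; Nothing signals that the step is done.
newtype Broker chunk = Broker (MVar [chunk])

newBroker :: [chunk] -> IO (Broker chunk)
newBroker cs = fmap Broker (newMVar cs)

-- Hand off the next unassigned chunk, or signal completion.
assignChunk :: Broker chunk -> IO (Maybe chunk)
assignChunk (Broker var) = modifyMVar var $ \cs -> case cs of
  []      -> return ([], Nothing)    -- no chunks left: step completed
  c : cs' -> return (cs', Just c)    -- assign one chunk to the caller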

There is also a processing node component to the broker, which in each step accepts the processing function, asks for a chunk and waits for a response.  If a chunk is forthcoming, it gets the data, applies the function, puts back the result and asks for another chunk.  If none is forthcoming it goes on to the next step, accepting a new processing function and proceeding as before.
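
Sketched in the same toy style, under the assumption that requestChunk and putResult wrap the real network calls to the broker and datastore:

-- A node's view of one step: keep requesting chunks until the broker
-- says the step is finished.
nodeStep :: IO (Maybe chunk)      -- ask the broker for a chunk
         -> (result -> IO ())     -- put a transformed chunk back
         -> (chunk -> result)     -- this step's processing function
         -> IO ()
nodeStep requestChunk putResult f = loop
  where
    loop = do
      mc <- requestChunk
      case mc of
        Nothing -> return ()                -- step finished; move on
        Just c  -> putResult (f c) >> loop  -- transform, put back, repeat

-- Across steps: accept each step's processing function in turn.
node :: [chunk -> result] -> IO (Maybe chunk) -> (result -> IO ()) -> IO ()
node fs requestChunk putResult = mapM_ (nodeStep requestChunk putResult) fs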

This structure suggests that what we have here is some kind of distributed monad or related structure.  Indeed, given that it is clearly stateful, something monadic is more or less inevitable.
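
Purely as a speculative illustration of the shape this might take, one could thread the dataset through the steps with a state monad (StateT from the mtl package); none of this is meant as the eventual design.

import Control.Monad.State

-- The dataset is the state; each processing step is a monadic action
-- that replaces it, so a whole job is just a sequence of binds.
type Job a = StateT [a] IO ()

runStep :: ([a] -> [a]) -> Job a
runStep = modify

job :: Job Int
job = do
  runStep (map (* 2))    -- first processing step
  runStep (filter even)  -- second processing step
-- e.g. execStateT job [1..5] returns [2,4,6,8,10]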

The next step

I have made some observations about implementation, noting that the datastore is a glorified database table, and the broker could be a kind of monad. In my next piece I will follow up on this, and discuss (at a high level) some aspects of how exactly the services could be built.
