A massively parallel sort algorithm


We present a simple sorting algorithm wellsuited to implementation on clusters of small FPGAs.  Its performance is O({N\over p}) where N is the number of items to be sorted and p+2 the number of FPGA nodes.

The Algorithm


Physical Architecture

We have p+2 nodes which are organised as follows:

  • 1 control node
  • 1 post-processing node (optional)
  • p processing nodes

The nodes are connected as follows:

  • All nodes share a common clock.
  • The processing nodes and control node are linked by high-speed unidirectional communications path in a ring topology, so node n receives data from node n-1 on a network called NetworkIn and sends data to node n+1 on a network called NetworkOut (where all arithmetic is taken modulo p+1).
  • The control node has an additional outward facing communications path called NetworkOutput, which is used for outputting the sorted data once the algorithm has completed.
  • Each processing node has a signal line, called, WorkDone that links to the control node.
  • Each processing node has attached a memory of size at least N\over p and an incoming communications path via which the memory is loaded (ideally the memory will be a FIFO, so one element can be removed at a time and the processing node will know when it is empty).
  • The post-processing node has one inbound high-speed communications path called NetworkIn, which is connected to the NetworkOutput path of the control node, and an outward bound high-speed communications path, called NetworkOutput.

Diagrammatically, it looks something like this:


The algorithm is initialised by filling the memories with the list divided into p chunks and then initialising the nodes. It continues until all data points have been placed in the sorted list, at which point the control node starts emitting the sorted list continuously (and not necessarily starting from the beginning). The function of the optional post-processing node is to pick out and emit complete copies of the sorted list (in fact, it emits every other complete copy – the algorithm could be changed to make it emit only one, but that would mean making it stateful, which I have tried to avoid).


We present the algorithms as JAVA pseudo-code. Note that the are clearly very well suited to implementation on small FPGAs, due to their flow-through nature. We assume, for sake of reference, that the data values are integers.

Also note one minor feature: if all the numbers in the list are identical, the post-processor, which selects one copy of the sorted list from the continuous stream output by the control node, will produce no output.

Processing Node

int value=Memory.pop();

void mainLoop
    int newIn=NetworkIn.read();
      if(Memory,empty()) WorkDone.assert();

Control Node

List<WorkDone> processingNodes;

boolean allProcessingNodesDone()
  for(WorkDone node : processingNodes) if(!node.asserted()) return false;
  return true;

void mainLoop
    int newIn=NetworkIn.read();
    if(allProcessingNodesDone()) NetworkOutput.write(newIn);

Post-processing Node

void mainLoop
  int lastValue;
  int currentValue=Integer.MIN_VALUE;
  boolean outputting=false;
    if(currentValue<lastValue) outputting=~outputting;
    if(outputting) NetworkOutput.write(currentValue);

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s