# Introduction

We present a simple sorting algorithm wellsuited to implementation on clusters of small FPGAs.  Its performance is $O({N\over p})$ where $N$ is the number of items to be sorted and $p+2$ the number of FPGA nodes.

# The Algorithm

## Physical Architecture

We have $p+2$ nodes which are organised as follows:

• $1$ control node
• $1$ post-processing node (optional)
• $p$ processing nodes

The nodes are connected as follows:

• All nodes share a common clock.
• The processing nodes and control node are linked by high-speed unidirectional communications path in a ring topology, so node $n$ receives data from node $n-1$ on a network called NetworkIn and sends data to node $n+1$ on a network called NetworkOut (where all arithmetic is taken modulo $p+1$).
• The control node has an additional outward facing communications path called NetworkOutput, which is used for outputting the sorted data once the algorithm has completed.
• Each processing node has a signal line, called, WorkDone that links to the control node.
• Each processing node has attached a memory of size at least $N\over p$ and an incoming communications path via which the memory is loaded (ideally the memory will be a FIFO, so one element can be removed at a time and the processing node will know when it is empty).
• The post-processing node has one inbound high-speed communications path called NetworkIn, which is connected to the NetworkOutput path of the control node, and an outward bound high-speed communications path, called NetworkOutput.

Diagrammatically, it looks something like this:

## Initialisation

The algorithm is initialised by filling the memories with the list divided into $p$ chunks and then initialising the nodes. It continues until all data points have been placed in the sorted list, at which point the control node starts emitting the sorted list continuously (and not necessarily starting from the beginning). The function of the optional post-processing node is to pick out and emit complete copies of the sorted list (in fact, it emits every other complete copy – the algorithm could be changed to make it emit only one, but that would mean making it stateful, which I have tried to avoid).

## Algorithm

We present the algorithms as JAVA pseudo-code. Note that the are clearly very well suited to implementation on small FPGAs, due to their flow-through nature. We assume, for sake of reference, that the data values are integers.

Also note one minor feature: if all the numbers in the list are identical, the post-processor, which selects one copy of the sorted list from the continuous stream output by the control node, will produce no output.

### Processing Node

int value=Memory.pop();

void mainLoop
{
while(true)
{
int newIn=NetworkIn.read();
if(value<newIn)
{
NetworkOut.write(value);
value=Memory.pop();
if(Memory,empty()) WorkDone.assert();
}
NetWorkOut.write(newIn);
}
}


### Control Node


List<WorkDone> processingNodes;

boolean allProcessingNodesDone()
{
for(WorkDone node : processingNodes) if(!node.asserted()) return false;
return true;
}

void mainLoop
{
while(true)
{
int newIn=NetworkIn.read();
if(allProcessingNodesDone()) NetworkOutput.write(newIn);
NetworkOut.write(newIn);
}
}


### Post-processing Node

void mainLoop
{
int lastValue;
int currentValue=Integer.MIN_VALUE;
boolean outputting=false;
while(true)
{
lastValue=currentValue;
currentValue=NetworkIn.read();
if(currentValue<lastValue) outputting=~outputting;
if(outputting) NetworkOutput.write(currentValue);
}
}

Advertisements