---+Super Scalar Sample Sort
=This is a paper I wrote for ETH, I thought I might as well publish it here=
---++Introduction
---+++Abstract
---++The Sample Sort Algorithm
---+++How it works
The input is an array of _n_ integers.
We take the first _k_ numbers in the array and use them as splitters.
Splitters are elements of the input that will divide the input into smaller pieces. It is fairly simple: You take out the splitters and sort them (using quicksort or inline algorithm). Between the splitters you put buckets. If our input are the numbers 1-100 and our splitters are 32,57 and 73 it will look like this:
<verbatim>
1			 32			 57			 73			 100
 \ Bucket /  \ Bucket /  \ Bucket /  \ Bucket /
  """"""""	 """"""""	 """"""""	 """"""""
</verbatim>
Once we have the splitters and the buckets, we run through the input sequentially and throw every element in to the according bucket. This is on step.<br>
Do the whole thing recursivly for every bucket (with the elements in the bucket as the input). The recursion stops when there is at most one element in every bucket. On the way, store the buckets and the splitters in sequential order.<br><br>
In the end we will have many small buckets in a sorted order that only contain at most one element. If we now run over all buckets and splitters sequentially we have the sorted input.<br>
[Image]
---++The Speed-Ups (Super Scalar Sample Sort)
Let us look a little more into processor architecture. In the following paragraphs the word processor is an Intel Pentium 4 (later 90nm generation) processor.<br><br>
An instruction is not executed by the processor in one step. In fact an instruction in a processor runs through 31 stages called the execution pipeline (reading the instruction, getting all the data from memory that is needed to execute this instruction, performing the instruction, writing the result of the instruction back, etc). It would not be very wise to wait until every instruction ran through all the execution stages before processing the next one. That's why processors use instruction parallelism, which means that they already start the next instruction, before the previous instruction has completly finished. This is not always trivial or possbile and can result in big delays, which unnecessarily slow down the algorithm.
---+++Conditional Branches
When the processor comes to a Conditional Branch it tries to predict its direction in order to keep the execution pipeline full. 
The prediction rule is simple: It always evaluates the condition as true and hence executes the fist branch. Should the processor finally evaluate the expression as false and hence mispredicted the branch it needs to flush the execution pipeline (which costs 31 cycles!) in order to be in a consistent state again.<br>
In many cases a programmer can predict the control flow of an application and can optimize the conditional branches, so that at execution time, the processor doesn't mispredict branches all the time.<br>
With sorting though the problem of predicting a branch becomes impossible. If we have a conditional branch and we have to predict whether an element will be smaller or greater than another element, we have a 50% chance of beeing correct. This means that every second time we compare two elements we loose 31 processor cycles. 
<verbatim>
element = randomNumber();
splitter = randomNumber();
if ( element < splitter ) {
	// The processor will predict this branch. There is a 50% chance of beeing correct
	LeftBucket.add(element)
} else {
	// If this is the correct branch, we lose 31 processor cycles.
	RightBucket.add(element)
}
</verbatim>
---+++The k-Splitters and the buckets
In order to get rid of conditional branches, we do a neat trick to place the elements in the correct bucket, without using conditional branches.<br>
---++++k = 1
If we have only one splitter ( _k_ = 1) it is fairly easy. We put the two buckets in an array and are using a boolean expression to find the correct bucket number in the array (assuming that =true==1= and =false==0=):
<verbatim>
element = randomNumber();
splitter = randomNumber();

BucketArray[0]=LeftBucket;
BucketArray[1]=RightBucket;

BucketArray[element>splitter].add(element);
</verbatim>
---++++any k
If we have any _k_ (preferably a power of 2 -1) we build a binary search tree out of the k splitters in order to find the correct bucket.[Image] If we decide k to be small enough to fit the _k_ splitters into the processors registers we do so, but usually we should choose _k_ = 255 to allow byte sized comparisons. We keep the 255 splitters in a buffered array.<br>
This is the example with _k_ = 3 splitters, which results in 4 buckets:
<verbatim>
element = randomNumber();
s1,s2,s3 = getSplitters();
BucketArray[] = getBuckets();

firstBranch= element > s2;
BucketArray[ firstBranch*2 
				 + (!fistBranch & element > s1)*1 
				 + (fistBranch & element > s3)*1 
			  ].add(element)
</verbatim>
Sanders does not do it exactly like that, he uses special _conditional move_ instructions that only modern processors support, but the result is the same.
---++++Pre-Allocating Memory
We can get an additional speed-up, by first allocating buckets with the correct bucket sizes in memory. In order to do that we need to know the bucket sizes in advance. <br>
Instead of throwing the elements directly into the buckets, we first run over all elements and tell them in which bucket they will be thrown and at the same time we keep track how big the buckets will get. After we processed all elements we allocate the buckets with the correct size and throw every element into the right bucket (it will remember in which bucket it belongs).<br><br>
Instead of always allocating each bucket individually we allocate an auxilary array in the beginning with size _n_ and use it for our buckets. After every iteration step this auxilary array becomes the input for the next iteration.
---++++Oversampling k
To have more equaly sized buckets we should oversample _k_ by a factor _a_ (look at _a*k_ possible splitters) and take the most equally distributed _k_ ones.
---+++Sorting smaller subproblems
Ofcourse it we would have a massive overhead if the buckets get really small and we still would use SSSS for sorting them.<br><br>
Sanders suggests the following:
| *#elements* | * Algorithm* |
| 0 - 5 | customized straight-line code |
| 5 - 100 | insertion sort |
| 100 - 1000 | quicksort |
| 1000 - ... | Super Scalar Sample Sort |
---++Things we get for free
---+++No Data Dependencies
---++++What it is
There is instruction _A_ that comes before an instruction _B_. Data Dendency occurs if _B_ gets stalled because it waits for the result of _A_, but _A_ is has not yet finished:
<verbatim>
c = a + b;	// Instruction A
d = c + 1;	// Instruction B
</verbatim>
---++++Why we don't have it
As soon as we have the _k_ -splitters fixed, all elements can be placed in the correct bucket independently.
---++++The Speed-Up
Since we don't have any data dependency we can interleave the corresponding instructions for several elements. The compiler will automaticly be able to do loop-unrollment and software-pipelining.
---+++Memory Hierarchies / Cache-Oblivious
_n_ = Number of input elements<br>
_k_ = Number of splitters<br>
_M_ = Number of elements that fit into primary memory<br>
_B_ = Number of elements that can be transfered from secondary to primary memory in one step.<br><br>

Since we can read through all the elements sequentially we have perfect spatial locality. Quicksort needs to write the entire input and output _log(n/m)_ times before a subproblem fits into _M_. But since Super Scalar Sample Sort is like a divide-and-conquer algorithm and splits the problem into subproblems of about size _n/k_ we only need to read and write the input _log.k.(n/m)_ times. In the I/O Model with an omniscent cache of size _M_ and transfers of size _B_, we have _k=O(M/B)_.
<br><br>
By appropratly choosing the size of _k_, this algorithm becomes cache oblivious, under the assumption that all k-splitters fit into memory and they are optimally distributed.
---++Experiments