Sorting algorithms: Merge sort

The next sorting algorithm in this series is the Merge Sort. Of the algorithms covered so far, this algorithm comes closest to the Shell Sort algorithm. The reason for this lies in how the comparisons are done in both algorithms: instead of working on the entire array, both the shell sort and the merge sort work by breaking the array to sort into smaller arrays and then sorting those first.

To be more specific, merge sort works by taking an array, breaking it down to the smallest parts, and then combining these parts. When combining the parts, there's a good chance that one part can just be appended to the other, which means that there's one comparison and a merge to be done. However, when the second part cannot be appended to the first part, the two parts must be merged together. Here, however, merge sort has an advantage in that both parts to be merged are already sorted, which makes the merge easier.

The algorithm looks roughly like this:

  1. Check if the array contains more than 1 element.
  2. If yes, break the array into two parts and go back to 1.
  3. If no, return the array - it is now sorted (there's 1 element in it, so it's trivially sorted).
  4. Compare the last element of the first returned array with the first element of the second return array.
  5. If the first array element from step 4 is less or equal to the second array element from step 4, then append the second array on to the first array.
  6. If the first array element from step 4 is not less or equal to the second array element, then merge the two arrays.
  7. Return the appended/merged array - it is now sorted.

The sub-part - the merging function - looks as follows:

  1. Initialize an empty array.
  2. Check if both arrays to merge have elements. If not, skip to 4.
  3. Compare the first elements of both arrays. Remove the lower one from it's array and append it to the result array. Continue from 2.
  4. If either array to be sorted still has elements left, append it to the result array.
  5. Return the result array - it's a sorted merge of the two provided arrays.

As can be gathered from this description, there are two parts to the algorithm: a recursive function, that breaks up the provided array into ever smaller parts, before recombining them, and a merge function, that sorts and merges the arrays to be recombined.

The Code

<?php

class MergeSort extends BaseSort
{
    public function sortFunction()
    {
        $this->store = $this->breakdownArray($this->store);
    }

    public function breakdownArray(array $array)
    {
        $count = count($array);
        if ($count > 1)
        {
            $half = $count / 2;
            $left = $this->breakdownArray(array_slice($array, 0, $half));
            $right = $this->breakdownArray(array_slice($array, $half));

            if (end($left) < reset($right))
            {
                return array_merge($left, $right);
            }
            else
            {
                return $this->arraySortMerge($left, $right);
            }
        }
        else
        {
            return $array;
        }

    }

    public function arraySortMerge(array $left, array $right)
    {
        $return = array();
        $count_left = count($left);
        $count_right = count($right);
        $l = $r = 0;
        while ($l < $count_left && $r < $count_right)
        {
            if ($left[$l] <= $right[$r])
            {
                $return[] = $left[$l];
                $l++;
            }
            else
            {
                $return[] = $right[$r];
                $r++;
            }
        }
        if ($l < $count_left)
        {
            return array_merge($return, array_slice($left, $l));
        }
        else
        {
            return array_merge($return, array_slice($right, $r));
        }
    }
}

Here the structure of the algorithm should be clear: the recursive function at the top, then the merge function below.

The most important part to optimize in this algorithm is the merge function - this is where most of the time in the sorting will be spent. When I wrote up the algorithm, this was the hardest part to create: a function that merges two already sorted arrays and takes advantage of this fact. First, I considered extracting the first variables of the arrays, and then reinsert the one not used (only one variable at a time should be append to the result array, the other should be checked again next loop). This was much slower than what the algorithm should run like, so I went over it a couple of times, trying to improve. The next version took advantage of the array_shift function - this function shifts out the first element of an array and then reorders the array, so the new first element can be reference via \$array[0]. This improved the speed of the algorithm by a factor 10, but it was still running very slow. In fact, the algorithm was running at O(n^2^) and not the O(n log n) it should. So, I went back to the code and rethought things: instead of changing the array, it might be faster to just use indexes for the merge - which indeed was the case. The code you can see above actually runs in much better time - it's in O(n log n) now.

The moral should be clear: if you're doing algorithms like this and you'll be throwing a lot of data after them, you need to do as little modifying to data structures as possible. Creating a new array by appending elements doesn't take a lot of time, but shifting elements out of one and reordering it will.

There's another lesson to be learned as well: the algorithm was running in O(n2) even with a proper merge function! So, while in theory the worst case scenario of merge sort is O(n log n) this very much depends upon the implementation.

Results

At 1,000 elements

1000 elements to sort.Sanity check. PHP native sort() algorithm took: 0.002649 seconds. Memory used: 48264 bytes
Running MergeSort sort. Algorithm took: 0.022447 seconds. Memory used: 48264 bytes

This is pretty good time, less than 10 times slower than PHP itself. Next, 10,000 elements:

10000 elements to sort.
Sanity check. PHP native sort() algorithm took: 0.035824 seconds. Memory used: 505704 bytes
Running MergeSort sort. Algorithm took: 0.268344 seconds. Memory used: 505744 bytes

About 11 times slower for 10 times as many elements. 100,000 elements:

100000 elements to sort.
Sanity check. PHP native sort() algorithm took: 0.541093 seconds. Memory used: 4924456 bytes
Running MergeSort sort. Algorithm took: 3.329802 seconds. Memory used: 4924504 bytes

About 12 times slower, at 10 times as many elements. The algorithm is running steady with running times incrementing as expected for an O(n log n) sort. Final test, 1,000,000 elements:

1000000 elements to sort.
Sanity check. PHP native sort() algorithm took: 7.504231 seconds. Memory used: 48194472 bytes
Running MergeSort sort. Algorithm took: 41.222942 seconds. Memory used: 48194504 bytes

Sorting one million elements in 41 seconds is not that bad, considering that the implementation is in PHP, an interpreted language!
Like the shell sort algorithm, this one could actually be used in a production environment (if you can't use the sort() functions in php). It's slightly faster than the shell sort as well, so we're clearly moving into the realm of effective algorithms here.

social