Friday, August 8, 2014

Sorting bubbles

Once upon a time ago, I needed to merge data from two text files: one file had household data, where each household had a unique id assigned to it; and another one had household members data, where each member was associated with its household through that unique id. Both files were big (over 2 million households and however many household members) but sorted, so it should have been easy. However, back in the days my naive approach and weak understanding of how to keep track of indices in a loop resulted in some hanging processes. A kind soul suggested that I used a binary search algorithm (since my data was sorted) but back then I have not even heard of it. (There was a happy end to the story).

So the lesson to be learned is you need to know your algorithms. As such, I am starting to review at least some basic ones and how they are different.

Bubble sort is usually taught as one of the first algorithms. It works by comparing two adjacent elements and swapping them if the first element is larger than the second (assuming we want to sort from lowest to highest). The example is below:

def bubble_sort(array)
    n = array.length
    begin
        (n...array.length-1).each do |i|
            if (array[i] > array[i+1])
                swap(array,i)
            end
        end
        n = n - 1
    end while n >= 0
    array
end

def swap(a,i)
    a[i], a[i+1] = a[i+1], a[i]
end 

In terms of Big O Notation, bubble sort's performance is described as having O(n^2) or quadratic runtime in the average and worst cases, and O(n) or linear runtime in the best case. Performance is determined by the number of swaps and comparisons to be made.

The best case is when the array is already sorted or nearly sorted. In this case, bubble sort just makes one iteration over the array and then stops. For example, we have an array like [1,2,3,4,5]. It compares 1 and 2 and finds that 1 and 2 are already sorted and do not need to be swapped, then moves onto comparing 2 and 3, 3 and 4, 4 and 5. Since all the elements are in a proper order, only one iteration of the array over n elements is needed, no swaps are needed and the number of comparisons to be made is equal to (n-1).

The worst case is when the smallest element is at the very end of the array; for example [2,3,4,5,1]. In this case, the smallest element will be changing its place once per each iteration until it is placed in a proper location:

[2,3,4,5,1]
[2,3,4,1,5]
[2,3,1,4,5]
[2,1,3,4,5]
[1,2,3,4,5]


As a result, in the worst case this requires one comparison and one swap per each iteration.

It also helps to see some real numbers for the visual purposes. Thus, I created two arrays with a length of 10,000 elements: both were sorted but in the second array the smallest element was placed at the end. Using the Benchmark library for Ruby, I got the following results:

array1: 2.3600000000000003 millisecs
array2: 4900.736           millisecs


That's quite a difference. Bubble sort doesn't perform well on large datasets, but it is quite efficient on small datasets and the nearly sorted ones.

No comments :

Post a Comment