Warning! Some information on this page is older than 5 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Tue

16

Nov 2010

Algorithm for comparing two sequences of some data is very useful. For example, version control systems like SVN or Git use it to highlight differences between two versions of a text file. But what if you need to code such algorithm from scratch? Today it was my task and it turned out to be not so simple, so I want to share my sample implementation in C++ which I came up with based on some readings from the Internet and my previous experiences with calculating Levenshtein distance.

What we need to compare is sometimes just lines of text files that are equal or not, but I needed something more complex, so let's say an item is defined as follows:

struct Item

{

unsigned ID;

const char *Text;

};

For items from opposite sides to be comparable they must have same ID values. Text may be different, but I'd prefer to match items with same Text (treated as NULL-terminated string) or at least same Text length.

Here are sample data:

const Item items1[] = {

{ 1, "Foo" },

{ 1, "Foo" },

{ 2, "Bar" },

{ 4, "Foobar" },

};

const Item items2[] = {

{ 1, "Foo" },

{ 1, "Firefox" },

{ 1, "Another one" },

{ 2, "Boo" },

{ 5, "Last one" },

};

const unsigned count1 = _countof(items1);

const unsigned count2 = _countof(items2);

The final result may look like this:

1 Foo <-> 1 Foo

-- <-> 1 Firefox

1 Foo <-> 1 Another one

2 Bar <-> 2 Boo

4 Foobar <-> --

-- <-> 5 Last one

To be able to print such result, I need to compute two arrays of bools - one for elements of items1 and one for elements of items2. Each bool will tell whether particular element matches an element from the opposite side. If not, then this element should be considered as inserted, without comparing it to any from the other array.

The "match" arrays for this example look like this:

match1:

[0] true

[1] true

[2] true

[3] false

match2:

[0] true

[1] false

[2] true

[3] true

[4] false

And the algorithm for printing the result is:

bool match1[count1], match2[count2];

CalcCompare(match1, match2, items1, items2, count1, count2);

// Print the final diff

unsigned index1 = 0, index2 = 0;

while (index1 < count1 && index2 < count2)

{

if (match1[index1])

{

if (match2[index2])

PrintCompareRow(&items1[index1++], &items2[index2++]);

else

PrintCompareRow(NULL, &items2[index2++]);

}

else

PrintCompareRow(&items1[index1++], NULL);

}

while (index1 < count1)

PrintCompareRow(&items1[index1++], NULL);

while (index2 < count2)

PrintCompareRow(NULL, &items2[index2++]);

Where the body of the PrintCompareRow function is:

void PrintCompareRow(const Item *lhsItem, const Item *rhsItem)

{

if (lhsItem)

std::cout << lhsItem->ID << ' ' << std::setw(16) << lhsItem->Text;

else

std::cout << "-- ";

std::cout << " <-> ";

if (rhsItem)

std::cout << rhsItem->ID << ' ' << std::setw(16) << rhsItem->Text;

else

std::cout << "--";

std::cout << std::endl;

}

Now it's time to present the main algorithm. It's enclosed in the CalcCompare function and all it does is filling match1 and match2 arrays. The idea of the algorithm is based on a vision of a 2D array with size (M+1)x(N+1), where M is the number of items in items1 and N is the number of items in items2. Element (0,0) represents a state where no items from either of two sequences have been processed and element (M, N) represents the end of the processing where all elements from both sequences have been processed (M from first and N from second one). It can be also viewed as a directed acyclic graph, so presenting the diff is the matter of traversing this graph from vertex (0,0) to (M,N) through the best possible route. Moving right means considering an element from items1 as inserted, without touching items2. Moving down means considering an element from items2 as inserted, without touching items1. Moving diagonally right-down means considering a single item from both sequences as same or similar and comparing them on the opposite sides of the diff. I hope you can now imagine now that when comparing to identifical sequences of elements, the best possible route should be moving diagonally all the time (comparing subsequents elements from both sequences) and the worst one is every route that moves only vertically and horizontally, thus considering each element of each sequence as inserted, without matching one from the opposite side.

How to find the optimal route in such graph? The idea is to calculate some weight for each vertex (i,j). If items items1[i] and items2[j] have same ID, I can consider them comparable. In this case I set the weight of (i,j) to the weight of (i-1,j-1) incremented by: 0 if their texts are also identical, 1 if they differ only by text content and 2 if their text lengths are different and only ID match. The weight of moving down or right is 3. I calculate weight that comes from all three possible directions (top, left, top-left) and take their minimum. I then save the information about the directions the minimum weight comes from. Here you can see all weights and directions:

In my code, I remember weights for only the current (currRowWeights) and previous row (lastRowWeights), while the directions are stored in full MxN array (directions). Notice that I don't store the 0th row or column of this imaginary matrix (outlined with gray on the diagram), but calculate their weights on demand.

After calculating the directions matrix, I traverse it from the bottom-right cell. Each step I take is the move in the direction pointed by the cell, which indicates the optimal path I was looking for (this path is marked with black arrows on the above diagram). The algorithm has O(M*N) time and memory complexity. I know that better algorithms exist, but they are even more complicated :)

I made a simple optimization: Before proceeding to the main M*N algorithm, I traverse both sequences from the beginning and mark their items as matching as long as I find identical items on both sides. Then I do the same at the end of both sequences. This way in the main algorithms I can consider only subset of items. For example, for the first array they are items from begIndex1 inclusive to endIndex1 exclusive.

Here is the final algorithm:

void CalcCompare(

bool *match1, bool *match2,

const Item *items1, const Item *items2,

unsigned count1, unsigned count2)

{

if (count1 == 0 || count2 == 0)

return;

for (unsigned i = 0; i < count1; ++i)

match1[i] = false;

for (unsigned i = 0; i < count2; ++i)

match2[i] = false;

// Mark identical items at the beginning of the sequences

// (here $begIndex1 == $begIndex2).

// This is just an optimization, not necessary.

unsigned begIndex1 = 0, begIndex2 = 0;

while (begIndex1 < count1 && begIndex2 < count2

&& items1[begIndex1].ID == items2[begIndex2].ID

&& strcmp(items1[begIndex1].Text, items2[begIndex2].Text) == 0)

{

match1[begIndex1] = true;

match2[begIndex2] = true;

++begIndex1;

++begIndex2;

}

// Mark identical items at the end of the sequences.

// This is just an optimization, not necessary.

unsigned endIndex1 = count1, endIndex2 = count2;

while (endIndex1 > begIndex1 && endIndex2 > begIndex2

&& items1[endIndex1-1].ID == items2[endIndex2-1].ID

&& strcmp(items1[endIndex1-1].Text, items2[endIndex2-1].Text) == 0)

{

match1[endIndex1-1] = true;

match2[endIndex2-1] = true;

--endIndex1;

--endIndex2;

}

unsigned middleCount1 = endIndex1 - begIndex1;

unsigned middleCount2 = endIndex2 - begIndex2;

unsigned middleCountProduct = middleCount1 * middleCount2;

if (middleCountProduct == 0)

return;

const unsigned insertWeight = 3;

unsigned *lastRowWeights = new unsigned[middleCount1];

unsigned *currRowWeights = new unsigned[middleCount1];

for (unsigned i = 0, weight = insertWeight

; i < middleCount1

; ++i, weight += insertWeight)

{

lastRowWeights[i] = weight;

}

// 0x1 = diagonal, 0x2 = left, 0x4 = top

unsigned *directions = new unsigned[middleCountProduct];

for (unsigned y = 0; y < middleCount2; ++y)

{

for (unsigned x = 0; x < middleCount1; ++x)

{

unsigned weightFromLeft = x == 0

? (y + 2) * insertWeight

: currRowWeights[x-1] + insertWeight;

unsigned weightFromTop = lastRowWeights[x] + insertWeight;

unsigned weightFromDiagonal;

if (items1[x+begIndex1].ID == items2[y+begIndex2].ID)

{

if (strlen(items1[x+begIndex1].Text)

== strlen(items2[y+begIndex2].Text))

{

if (strcmp(items1[x+begIndex1].Text, items2[y+begIndex2].Text) == 0)

weightFromDiagonal = 0;

else

weightFromDiagonal = 1;

}

else

weightFromDiagonal = 2;

weightFromDiagonal += x == 0

? (y + 1) * insertWeight

: lastRowWeights[x-1];

}

else

weightFromDiagonal = 0x7FFFFFFF;

unsigned minWeight = weightFromLeft;

if (weightFromTop < minWeight) minWeight = weightFromTop;

if (weightFromDiagonal < minWeight) minWeight = weightFromDiagonal;

currRowWeights[x] = minWeight;

unsigned direction = 0;

if (minWeight == weightFromDiagonal)

direction |= 0x01;

if (minWeight == weightFromLeft)

direction |= 0x02;

if (minWeight == weightFromTop)

direction |= 0x04;

directions[y * middleCount1 + x] = direction;

}

// DEBUG: Print weights.

/*std::cout << std::setw(3) << currRowWeights[0];

for (unsigned i = 1; i < middleCount1; ++i)

std::cout << ' ' << std::setw(3) << currRowWeights[i];

std::cout << std::endl;*/

// Swap lastRowWeights and currRowWeights so now current becomes last.

unsigned *tmp = lastRowWeights;

lastRowWeights = currRowWeights;

currRowWeights = tmp;

}

// DEBUG: Print directions

/*for (unsigned y = 0; y < middleCount2; ++y)

{

for (unsigned x = 0; x < middleCount1; ++x)

{

unsigned direction = directions[y * middleCount1 + x];

std::cout << (direction & 0x01 ? 'D' : '-');

std::cout << (direction & 0x02 ? 'L' : '-');

std::cout << (direction & 0x04 ? 'T' : '-');

std::cout << ' ';

}

std::cout << std::endl;

}*/

// Rewrite directions[y * middleCount1 + x] to match1[i] and match2[i].

int x = (int)middleCount1 - 1;

int y = (int)middleCount2 - 1;

unsigned index1 = endIndex1 - 1;

unsigned index2 = endIndex2 - 1;

while (x >= 0 && y >= 0)

{

unsigned direction = directions[y * middleCount1 + x];

if (direction & 0x01)

{

match1[index1--] = true;

match2[index2--] = true;

--x; --y;

}

else if (direction & 0x02)

{

--index1;

--x;

}

else // (direction & 0x04)

{

--index2;

--y;

}

}

delete [] directions;

delete [] currRowWeights;

delete [] lastRowWeights;

}

Here is the debug print of the weights and directions matrix (L=left, T=top, D=diagonal):

3 3 6 9

6 5 8 11

9 8 11 14

12 11 9 12

15 14 12 15

D-- D-- -L- -L-

--T D-- -L- -L-

--T D-T -LT -LT

--T --T D-- -L-

--T --T --T -LT

Copyright © 2004-2021