My Implementation of Diff Algorithm

Warning! Some information on this page is older than 3 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Nov 2010

My Implementation of Diff Algorithm

Algorithm for comparing two sequences of some data is very useful. For example, version control systems like SVN or Git use it to highlight differences between two versions of a text file. But what if you need to code such algorithm from scratch? Today it was my task and it turned out to be not so simple, so I want to share my sample implementation in C++ which I came up with based on some readings from the Internet and my previous experiences with calculating Levenshtein distance.

What we need to compare is sometimes just lines of text files that are equal or not, but I needed something more complex, so let's say an item is defined as follows:

struct Item
  unsigned ID;
  const char *Text;

For items from opposite sides to be comparable they must have same ID values. Text may be different, but I'd prefer to match items with same Text (treated as NULL-terminated string) or at least same Text length.

Here are sample data:

const Item items1[] = {
  { 1, "Foo" },
  { 1, "Foo" },
  { 2, "Bar" },
  { 4, "Foobar" },
const Item items2[] = {
  { 1, "Foo" },
  { 1, "Firefox" },
  { 1, "Another one" },
  { 2, "Boo" },
  { 5, "Last one" },
const unsigned count1 = _countof(items1);
const unsigned count2 = _countof(items2);

The final result may look like this:

1              Foo   <->   1              Foo
--                   <->   1          Firefox
1              Foo   <->   1      Another one
2              Bar   <->   2              Boo
4           Foobar   <->   --
--                   <->   5         Last one

To be able to print such result, I need to compute two arrays of bools - one for elements of items1 and one for elements of items2. Each bool will tell whether particular element matches an element from the opposite side. If not, then this element should be considered as inserted, without comparing it to any from the other array.

The "match" arrays for this example look like this:

[0]  true
[1]  true
[2]  true
[3]  false

[0]  true
[1]  false
[2]  true
[3]  true
[4]  false

And the algorithm for printing the result is:

bool match1[count1], match2[count2];
CalcCompare(match1, match2, items1, items2, count1, count2);

// Print the final diff
unsigned index1 = 0, index2 = 0;
while (index1 < count1 && index2 < count2)
  if (match1[index1])
    if (match2[index2])
      PrintCompareRow(&items1[index1++], &items2[index2++]);
      PrintCompareRow(NULL, &items2[index2++]);
    PrintCompareRow(&items1[index1++], NULL);
while (index1 < count1)
  PrintCompareRow(&items1[index1++], NULL);
while (index2 < count2)
  PrintCompareRow(NULL, &items2[index2++]);

Where the body of the PrintCompareRow function is:

void PrintCompareRow(const Item *lhsItem, const Item *rhsItem)
  if (lhsItem)
    std::cout << lhsItem->ID << ' ' << std::setw(16) << lhsItem->Text;
    std::cout << "--                ";

  std::cout << "   <->   ";

  if (rhsItem)
    std::cout << rhsItem->ID << ' ' << std::setw(16) << rhsItem->Text;
    std::cout << "--";

  std::cout << std::endl;

Now it's time to present the main algorithm. It's enclosed in the CalcCompare function and all it does is filling match1 and match2 arrays. The idea of the algorithm is based on a vision of a 2D array with size (M+1)x(N+1), where M is the number of items in items1 and N is the number of items in items2. Element (0,0) represents a state where no items from either of two sequences have been processed and element (M, N) represents the end of the processing where all elements from both sequences have been processed (M from first and N from second one). It can be also viewed as a directed acyclic graph, so presenting the diff is the matter of traversing this graph from vertex (0,0) to (M,N) through the best possible route. Moving right means considering an element from items1 as inserted, without touching items2. Moving down means considering an element from items2 as inserted, without touching items1. Moving diagonally right-down means considering a single item from both sequences as same or similar and comparing them on the opposite sides of the diff. I hope you can now imagine now that when comparing to identifical sequences of elements, the best possible route should be moving diagonally all the time (comparing subsequents elements from both sequences) and the worst one is every route that moves only vertically and horizontally, thus considering each element of each sequence as inserted, without matching one from the opposite side.

How to find the optimal route in such graph? The idea is to calculate some weight for each vertex (i,j). If items items1[i] and items2[j] have same ID, I can consider them comparable. In this case I set the weight of (i,j) to the weight of (i-1,j-1) incremented by: 0 if their texts are also identical, 1 if they differ only by text content and 2 if their text lengths are different and only ID match. The weight of moving down or right is 3. I calculate weight that comes from all three possible directions (top, left, top-left) and take their minimum. I then save the information about the directions the minimum weight comes from. Here you can see all weights and directions:

In my code, I remember weights for only the current (currRowWeights) and previous row (lastRowWeights), while the directions are stored in full MxN array (directions). Notice that I don't store the 0th row or column of this imaginary matrix (outlined with gray on the diagram), but calculate their weights on demand.

After calculating the directions matrix, I traverse it from the bottom-right cell. Each step I take is the move in the direction pointed by the cell, which indicates the optimal path I was looking for (this path is marked with black arrows on the above diagram). The algorithm has O(M*N) time and memory complexity. I know that better algorithms exist, but they are even more complicated :)

I made a simple optimization: Before proceeding to the main M*N algorithm, I traverse both sequences from the beginning and mark their items as matching as long as I find identical items on both sides. Then I do the same at the end of both sequences. This way in the main algorithms I can consider only subset of items. For example, for the first array they are items from begIndex1 inclusive to endIndex1 exclusive.

Here is the final algorithm:

void CalcCompare(
  bool *match1, bool *match2,
  const Item *items1, const Item *items2,
  unsigned count1, unsigned count2)
  if (count1 == 0 || count2 == 0)

  for (unsigned i = 0; i < count1; ++i)
    match1[i] = false;
  for (unsigned i = 0; i < count2; ++i)
    match2[i] = false;

  // Mark identical items at the beginning of the sequences
  // (here $begIndex1 == $begIndex2).
  // This is just an optimization, not necessary.
  unsigned begIndex1 = 0, begIndex2 = 0;
  while (begIndex1 < count1 && begIndex2 < count2
    && items1[begIndex1].ID == items2[begIndex2].ID
    && strcmp(items1[begIndex1].Text, items2[begIndex2].Text) == 0)
    match1[begIndex1] = true;
    match2[begIndex2] = true;

  // Mark identical items at the end of the sequences.
  // This is just an optimization, not necessary.
  unsigned endIndex1 = count1, endIndex2 = count2;
  while (endIndex1 > begIndex1 && endIndex2 > begIndex2
    && items1[endIndex1-1].ID == items2[endIndex2-1].ID
    && strcmp(items1[endIndex1-1].Text, items2[endIndex2-1].Text) == 0)
    match1[endIndex1-1] = true;
    match2[endIndex2-1] = true;

  unsigned middleCount1 = endIndex1 - begIndex1;
  unsigned middleCount2 = endIndex2 - begIndex2;
  unsigned middleCountProduct = middleCount1 * middleCount2;
  if (middleCountProduct == 0)

  const unsigned insertWeight = 3;
  unsigned *lastRowWeights = new unsigned[middleCount1];
  unsigned *currRowWeights = new unsigned[middleCount1];
  for (unsigned i = 0, weight = insertWeight
    ; i < middleCount1
    ; ++i, weight += insertWeight)
    lastRowWeights[i] = weight;
  // 0x1 = diagonal, 0x2 = left, 0x4 = top
  unsigned *directions = new unsigned[middleCountProduct];

  for (unsigned y = 0; y < middleCount2; ++y)
    for (unsigned x = 0; x < middleCount1; ++x)
      unsigned weightFromLeft = x == 0
        ? (y + 2) * insertWeight
        : currRowWeights[x-1] + insertWeight;
      unsigned weightFromTop = lastRowWeights[x] + insertWeight;

      unsigned weightFromDiagonal;
      if (items1[x+begIndex1].ID == items2[y+begIndex2].ID)
        if (strlen(items1[x+begIndex1].Text)
          == strlen(items2[y+begIndex2].Text))
          if (strcmp(items1[x+begIndex1].Text, items2[y+begIndex2].Text) == 0)
            weightFromDiagonal = 0;
            weightFromDiagonal = 1;
          weightFromDiagonal = 2;
        weightFromDiagonal += x == 0
          ? (y + 1) * insertWeight
          : lastRowWeights[x-1];
        weightFromDiagonal = 0x7FFFFFFF;

      unsigned minWeight = weightFromLeft;
      if (weightFromTop < minWeight) minWeight = weightFromTop;
      if (weightFromDiagonal < minWeight) minWeight = weightFromDiagonal;

      currRowWeights[x] = minWeight;

      unsigned direction = 0;
      if (minWeight == weightFromDiagonal)
        direction |= 0x01;
      if (minWeight == weightFromLeft)
        direction |= 0x02;
      if (minWeight == weightFromTop)
        direction |= 0x04;

      directions[y * middleCount1 + x] = direction;
    // DEBUG: Print weights.
    /*std::cout << std::setw(3) << currRowWeights[0];
    for (unsigned i = 1; i < middleCount1; ++i)
      std::cout << ' ' << std::setw(3) << currRowWeights[i];
    std::cout << std::endl;*/

    // Swap lastRowWeights and currRowWeights so now current becomes last.
    unsigned *tmp = lastRowWeights;
    lastRowWeights = currRowWeights;
    currRowWeights = tmp;

  // DEBUG: Print directions
  /*for (unsigned y = 0; y < middleCount2; ++y)
    for (unsigned x = 0; x < middleCount1; ++x)
      unsigned direction = directions[y * middleCount1 + x];
      std::cout << (direction & 0x01 ? 'D' : '-');
      std::cout << (direction & 0x02 ? 'L' : '-');
      std::cout << (direction & 0x04 ? 'T' : '-');
      std::cout << ' ';
    std::cout << std::endl;

  // Rewrite directions[y * middleCount1 + x] to match1[i] and match2[i].
  int x = (int)middleCount1 - 1;
  int y = (int)middleCount2 - 1;
  unsigned index1 = endIndex1 - 1;
  unsigned index2 = endIndex2 - 1;
  while (x >= 0 && y >= 0)
    unsigned direction = directions[y * middleCount1 + x];
    if (direction & 0x01)
      match1[index1--] = true;
      match2[index2--] = true;
      --x; --y;
    else if (direction & 0x02)
    else // (direction & 0x04)

  delete [] directions;
  delete [] currRowWeights;
  delete [] lastRowWeights;

Here is the debug print of the weights and directions matrix (L=left, T=top, D=diagonal):

  3   3   6   9
  6   5   8  11
  9   8  11  14
 12  11   9  12
 15  14  12  15

D-- D-- -L- -L-
--T D-- -L- -L-
--T D-T -LT -LT
--T --T D-- -L-
--T --T --T -LT

Comments (2) | Tags: algorithms | Author: Adam Sawicki | Share


2015-04-10 09:26:44
chanel bags,
flat iron,
converse shoes,
ferragamo outlet,
gucci belt,
hollister kids,
insanity workout calendar,
jimmy choo shoes,
air jordan shoes,
juicy couture,
juicy couture outlet,
kate spade,
louis vuitton outlet,
michael kors outlet online,
michael kors,
michael kors purses,
michael kors,
michael kors outlet online sale,
north face jackets,
oakley outlet,
pandora charms,
prada outlet,
prada shoes for men,
ray ban sunglasses outlet,
rolex watches,
swarovski outlet,
the north face outlet,
tiffany jewelry,
tiffany co,
true religion jeans outlet,
true religion outlet,
ugg boots,
uggs outlet,
ugg australia,
ugg boots clearance,
cheap oakley,
oakley vault,
ralph lauren outlet online,
ralph lauren outlet,,
toms shoes,
toms outlet,
christian louboutin shoes,
tory burch handbags,
gucci belt,
longchamp bags,,
hermes birkin bag,
cheap nfl jerseys,
beats headphones,
beats by dr dre,
hair straightener,
ray ban sunglasses,
michael kors bags,
michael kors uk,
louis vuitton outlet,
ralph lauren outlet,
nike running,
omega watches,
dresses for weddings,
pandora sale,
thomas sabo,
christian louboutin,
air huarache,
roshe run,
north face uk,
louis vuitton canada,
jordan shoes,
air jordan,
air jordan retro,
nike outlet,
cheap nike shoes,
nike sneakers,
air max 2015,
nike air max 2014,
nike air max 2015,
nike free,
nike running shoes,
supra shoes,
new balance,
mont blanc,
salvatore ferragamo outlet,
mcm bags,
nike mercurial,
celine bags,
roshe runs,
roshe run,
vans shoes,
timberland outlet,
iphone 4s cases,
softball bats,
babyliss pro,
louis vuitton handbags,
burberry outlet,
christian louboutin shoes,
louis vuitton outlet stores,
prada outlet,
michael kors outlet online sale,
michael kors handbags,
tommy hilfiger polos,
mcm handbags,
moncler outlet,
kate spade sale,
barbour quilted jacket,
salvatore ferragamo shoes,
canada goose outlet,
louis vuitton outlet online,
burberry outlet,
hollister clothes,
north face outlet online,
barbour jackets,
north face jackets outlet,
north face jackets women,
polo ralph lauren outlet online,
ralph lauren outlet,
ralph lauren outlet online,
beats by dre outlet,
louis vuitton outlet online,
gucci outlet,
longchamp outlet,
moncler clearance,
ralph lauren shirts,
hermes handbags,
beats by dre outlet,
coach factory outlet online,,,
michael kors outlet,,
oakley vault,
hermes bags,
marc jacobs bags sale,
nike air max,
woolrich clearance,
coach outlet store,
coach bags outlet,
coach handbags outlet,
gucci sale,
coach outlet online,
michael kors outlet store,
gucci handbags,
the north face outlet,
ugg boots clearance,
tommy hilfiger coupons,
uggs outlet,
louis vuitton outlet,
calvin kleins outlet,
christian louboutin,
jordan retro,
burberry outlet,
abercrombie fitch,
2015-07-12 10:15:02
Packers and movers @
Packers and Movers Chennai@
Packers and Movers Bangalore@
Packers and Movers delhi@
Packers and Movers Gurgaon@
Packers and Movers Hyderabad@
Packers and Movers Noida@
Packers and Movers Pune@
Packers and Movers Mumbai @

Post comment

Nick *
Your name or nickname
Your contact information (optional, will not be shown)
Text *
Content of your comment
Calculate *
(* - required field)
STAT NO AD [Stat] [Admin] [STAT NO AD] [pub] [Mirror] Copyright © 2004-2016 Adam Sawicki
Copyright © 2004-2016 Adam Sawicki