UtraDiff

The History of Diff Algorithms

From Levenshtein's edit distance in 1965 to modern AST-based semantic diffing — how the algorithms that power file comparison have evolved over six decades.

What Is File Comparison?

At its core, file comparison (or diffing) answers a simple question: given two versions of a file, what changed? The output is an edit script — a minimal set of instructions (insert this line, delete that line) that transforms one version into the other. Every code review, every git diff, every merge conflict you've seen relies on this operation.

Why is it hard?

The challenge is combinatorial explosion. For two files of length N and M, there are exponentially many ways to match them. A brute-force approach that tries every possible alignment is unusable for files longer than a few lines. The naive dynamic programming solution runs in O(MN) time and space — workable for small files, but prohibitively slow for the thousand-line source files developers compare daily.

Beyond raw performance, there's a deeper problem: which diff is the "right" one? Multiple minimal edit scripts often exist for the same input. One might group a function move as a clean block; another might scatter it across unrelated matches. Algorithms must balance minimality (fewest edits) with readability (changes that make sense to a human) — and these goals sometimes conflict.

The algorithms below represent six decades of progress on this problem, each building on the insights of its predecessors.

Edit Distance (Levenshtein)

1965 · O(MN) time, O(MN) space

In 1965, Soviet mathematician Vladimir Levenshtein formalized a deceptively simple question: given two strings, what is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one into the other? This number became known as the Levenshtein distance, or edit distance.

The algorithm builds a two-dimensional matrix where each cell represents the cost of transforming a prefix of one string into a prefix of the other. The bottom-right cell contains the final answer. By backtracking through the matrix, you can reconstruct the exact sequence of edits.

    ""  K  I  T  T  E  N
""   0  1  2  3  4  5  6
S    1  1  2  3  4  5  6
I    2  2  1  2  3  4  5
T    3  3  2  1  2  3  4
T    4  4  3  2  1  2  3
I    5  5  4  3  2  2  3
N    6  6  5  4  3  3  2
G    7  7  6  5  4  4  3  ← edit distance = 3

"SITTING" → "KITTEN" requires 3 edits:
  S→K (substitute), I→E (substitute), delete G

Edit distance became the bedrock of computational string comparison. It is used in spell checkers (suggesting corrections for misspelled words), DNA sequence alignment (measuring genetic similarity), and fuzzy search (finding approximate matches in databases).
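The matrix above is filled in by a simple dynamic-programming recurrence: each cell takes the cheapest of a deletion, an insertion, or a (possibly free) substitution. A minimal sketch, keeping only one matrix row at a time (the function name is illustrative, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the current prefix of a
    # and the first j characters of b (one matrix row at a time)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Run on the example above, `levenshtein("SITTING", "KITTEN")` returns 3.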

How UtraDiff uses this

The edit distance concept is the theoretical foundation of every comparison UtraDiff performs. Whether comparing text lines, JSON keys, or CSV rows, the core question is always: what is the minimum set of changes to transform A into B?

Longest Common Subsequence (LCS)

1970s · O(MN) time, O(MN) space

While edit distance counts how many changes are needed, the Longest Common Subsequence (LCS) finds the longest sequence of elements that appear in the same order in both inputs — without requiring them to be contiguous. The relationship is direct: the diff is everything not in the LCS.

The classic LCS algorithm uses dynamic programming, building a table where each cell records the length of the LCS for the corresponding prefixes. A diagonal move means a match (keep the line), a horizontal move means an insertion, and a vertical move means a deletion.

For diff tools, LCS is applied at the line level: each line is treated as a single element. The LCS of two files gives you all the unchanged lines; everything else is either added or removed. This is the fundamental insight behind all line-based diff algorithms.

Pseudocode

function LCS(X[1..m], Y[1..n]):
    // Build the DP table
    for i = 0 to m:
        for j = 0 to n:
            if i == 0 or j == 0:
                L[i][j] = 0
            else if X[i] == Y[j]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])

    // Backtrack to find the actual subsequence
    // Diagonal move = match (keep line)
    // Horizontal move = insertion (added line)
    // Vertical move = deletion (removed line)
    return backtrack(L, X, Y)
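The pseudocode maps almost line-for-line onto a runnable version. This sketch emits a diff-style list of kept, added, and removed lines (the two-character prefixes are chosen here for readability, not UtraDiff's actual output format):

```python
def lcs_diff(a: list[str], b: list[str]) -> list[str]:
    """Line-level diff: everything not in the LCS is added or removed."""
    m, n = len(a), len(b)
    # L[i][j] = length of the LCS of a[:i] and b[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack from the bottom-right corner to emit the edit script
    out, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            out.append("  " + a[i - 1]); i -= 1; j -= 1  # diagonal: keep
        elif j > 0 and (i == 0 or L[i][j - 1] >= L[i - 1][j]):
            out.append("+ " + b[j - 1]); j -= 1          # horizontal: added
        else:
            out.append("- " + a[i - 1]); i -= 1          # vertical: removed
    return out[::-1]
```

For example, `lcs_diff(["a", "b", "c"], ["a", "c", "d"])` yields `["  a", "- b", "  c", "+ d"]`.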

How UtraDiff uses this

UtraDiff's diff engine uses a dynamic programming LCS algorithm as its fallback when Myers' algorithm times out on complex inputs. This ensures UtraDiff always produces a result, even for very different files.

Hunt-McIlroy Algorithm

1976 · O(MN) worst case, fast in practice

In 1976, James W. Hunt and M. Douglas McIlroy at Bell Labs published the algorithm behind the first practical diff tool, which shipped with early research Unix. This was the algorithm that made source code comparison a daily developer activity.

Their key insight was to reduce the problem to LCS, but with a practical optimization: instead of comparing raw strings, hash each line first, then find matching line hashes. This made comparison much faster for typical source code files where most lines are shared between versions.

The algorithm works in two phases: first, build a table of matching line positions between the two files. Second, find the longest increasing subsequence of matches — this gives the LCS. Everything not in the LCS is a difference. The output format they established (< for removed, > for added) became a Unix standard.

While the worst-case complexity is O(MN), Hunt-McIlroy is fast in practice because real source code files share many lines. This algorithm proved that line-based comparison was the right abstraction for software development.
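The hashing idea can be sketched as simple line interning: each distinct line is mapped to a small integer id, so the expensive string comparisons in the matching phase become cheap integer comparisons. A hypothetical helper, with a Python dict standing in for the hash table:

```python
def intern_lines(a_lines: list[str], b_lines: list[str]):
    """Hunt-McIlroy-style preprocessing: replace each distinct line
    with a small integer id, so the matching phase compares ints
    instead of whole strings. (Names here are illustrative.)"""
    ids: dict[str, int] = {}
    def to_ids(lines: list[str]) -> list[int]:
        # setdefault assigns the next fresh id on first sight of a line
        return [ids.setdefault(line, len(ids)) for line in lines]
    return to_ids(a_lines), to_ids(b_lines)
```

For example, `intern_lines(["x", "y", "x"], ["y", "z"])` returns `([0, 1, 0], [1, 2])`: the repeated line "x" maps to the same id in both files.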

How UtraDiff uses this

Hunt-McIlroy established the line-based diff paradigm that every modern text comparison tool follows. When you compare two text files in UtraDiff, the fundamental approach — hash each line, find matching sequences, report differences — traces directly back to this 1976 algorithm.

Myers' Diff Algorithm

1986 · O(ND) time, O(N) space (linear for similar files)

In 1986, Eugene W. Myers published what became the most influential diff algorithm in computing history. His paper, "An O(ND) Difference Algorithm and Its Variations," introduced an algorithm whose running time depends not on the total file size, but on the number of differences.

The key insight is the edit graph. Imagine a grid where file A runs along the X-axis and file B runs along the Y-axis. Moving right means "delete a line from A," moving down means "insert a line from B," and moving diagonally means "keep a matching line." Finding the shortest diff is equivalent to finding the shortest path from the top-left to the bottom-right corner, where diagonal moves are free.

Myers' algorithm explores this graph in rounds. In round d, it finds all positions reachable with exactly d edits. When two files are similar (small D), the algorithm finishes quickly — often in near-linear time. This is why it dominates: most real-world diffs involve small changes to large files.

Edit Graph

        A   B   C   A   B   B   A      (File A)
      +---+---+---+---+---+---+---+
  C   |   |   |  ╲|   |   |   |   |
      +---+---+---+---+---+---+---+
  B   |   |  ╲|   |   |  ╲|  ╲|   |
      +---+---+---+---+---+---+---+
  A   |  ╲|   |   |  ╲|   |   |  ╲|
      +---+---+---+---+---+---+---+
  B   |   |  ╲|   |   |  ╲|  ╲|   |
      +---+---+---+---+---+---+---+
  A   |  ╲|   |   |  ╲|   |   |  ╲|
      +---+---+---+---+---+---+---+
  C   |   |   |  ╲|   |   |   |   |
      +---+---+---+---+---+---+---+
                                     (File B)

  → = delete from A     ╲ = match (free!)
  ↓ = insert from B     Goal: shortest path ↘

Pseudocode

function MyersDiff(A[1..N], B[1..M]):
    // The edit graph: move right = delete from A,
    // move down = insert from B,
    // move diagonal = keep (lines match)

    MAX = N + M
    V[1] = 0  // V[k] = furthest x on diagonal k

    for d = 0 to MAX:  // d = number of edits
        for k = -d to d step 2:
            // Choose: move down (insert) or right (delete)
            if k == -d or (k != d and V[k-1] < V[k+1]):
                x = V[k+1]      // move down
            else:
                x = V[k-1] + 1  // move right
            y = x - k

            // Follow diagonal (matching lines) as far as possible
            while x < N and y < M and A[x+1] == B[y+1]:
                x = x + 1
                y = y + 1

            V[k] = x

            // Reached the end? We found the shortest edit script
            if x >= N and y >= M:
                return trace_path(d, V)

    // d = total number of edits (insertions + deletions)
    // When files are similar, D is small → runs in O(ND) ≈ O(N)

Myers' algorithm is used by Git (its default diff algorithm), GNU diff, and — critically — UtraDiff's diff engine. It produces provably minimal diffs: no shorter edit script exists for the given input.
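Stripped of path reconstruction, the round-by-round search fits in a few lines of Python. This distance-only sketch returns D, the length of the shortest edit script; a dict stands in for the V array so negative diagonals index cleanly:

```python
def myers_distance(a, b) -> int:
    """Length D of the shortest edit script (insertions + deletions)."""
    n, m = len(a), len(b)
    V = {1: 0}  # V[k] = furthest x reached on diagonal k = x - y
    for d in range(n + m + 1):          # round d: paths with exactly d edits
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and V[k - 1] < V[k + 1]):
                x = V[k + 1]            # move down: insert a line from b
            else:
                x = V[k - 1] + 1        # move right: delete a line from a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1; y += 1          # free diagonal: matching lines
            V[k] = x
            if x >= n and y >= m:
                return d                # reached bottom-right corner
    return n + m
```

On the example from Myers' paper, `myers_distance("ABCABBA", "CBABAC")` returns 5. Note that this model counts only insertions and deletions (no substitutions), which is why the result differs from Levenshtein distance.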

How UtraDiff uses this

This is the algorithm that powers UtraDiff's text comparison. The diff engine implements Myers' algorithm as its primary strategy. When you drop two files into UtraDiff and see green and red highlighted lines, Myers' algorithm found that minimal edit script.

Patience Diff

2006 · O(N log N) for unique lines + recursive

In 2006, Bram Cohen (creator of BitTorrent) proposed an alternative approach. His observation: Myers produces minimal diffs, but minimal doesn't always mean readable. When a function is moved in source code, Myers might match closing braces across unrelated functions, producing a confusing diff.

Patience diff works in three steps. First, find all lines that appear exactly once in each file — these are your anchor points (typically function signatures, class declarations, and section headers). Second, compute the Longest Increasing Subsequence (LIS) of the matched positions of these unique lines — this is the "patience sorting" step, named after the card game. Third, use these anchors as fixed match points and recursively diff the gaps between them.

The result is diffs that better respect the logical structure of code. Function signatures stay matched to their correct functions, and changes are grouped more intuitively. Available in Git via git diff --patience.

Pseudocode

function PatienceDiff(A, B):
    // Step 1: Find lines unique to both files
    unique_A = lines appearing exactly once in A
    unique_B = lines appearing exactly once in B
    common_unique = intersection(unique_A, unique_B)

    // Step 2: Find the Longest Increasing Subsequence
    // of positions in B for matching unique lines
    // (This is the "patience sorting" step — like
    // the card game Patience/Solitaire)
    anchors = LIS(common_unique positions in B)

    // Step 3: Use anchors as fixed match points
    // Recursively diff the gaps between anchors
    for each gap between anchors:
        PatienceDiff(gap_in_A, gap_in_B)

    // Result: better semantic grouping of changes
    // because unique lines (function signatures,
    // section headers) act as stable anchors

How UtraDiff uses this

While UtraDiff's text mode uses Myers' algorithm, the patience approach influenced how structured diff tools think about matching. The idea of anchoring on unique, meaningful lines is conceptually similar to how UtraDiff's JSON diff anchors on unique keys.

Histogram Diff

2011 · O(N) average case for low-occurrence lines

Shawn Pearce, while working on JGit (the Java implementation of Git), created histogram diff as a practical improvement over patience diff. The idea: instead of only matching unique lines, build a histogram of how often each line appears and preferentially match low-occurrence lines first.

This handles a common weakness of patience diff: files with no unique lines (like test fixtures, configuration files, or highly repetitive markup). By using occurrence counts instead of a strict uniqueness requirement, histogram diff produces readable output for a wider range of inputs.

JGit adopted histogram as its default diff algorithm; in Git itself it is available via git diff --histogram or the diff.algorithm=histogram setting, while Myers remains the out-of-the-box default.
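The core preference can be illustrated with a tiny occurrence-count sketch (a hypothetical helper, not JGit's actual implementation): pick the common line with the lowest combined count as the next split point.

```python
from collections import Counter

def rarest_common_line(a: list[str], b: list[str]):
    """Histogram heuristic: prefer the lowest-occurrence common line
    as the next anchor (uniqueness not required)."""
    ca, cb = Counter(a), Counter(b)
    common = set(ca) & set(cb)
    if not common:
        return None  # no shared lines at all
    return min(common, key=lambda line: ca[line] + cb[line])
```

Given a repetitive file like `["x", "x", "sig", "x"]` versus `["x", "sig"]`, this picks "sig" as the anchor even though "x" appears in both files more often.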

How UtraDiff uses this

Git offers histogram diff as an option (git diff --histogram), and JGit uses it by default. If you compare that output with UtraDiff's text diff, you may notice small differences in how changes are grouped: UtraDiff uses Myers' algorithm, which guarantees a minimal edit script, while histogram trades strict minimality for more readable grouping, so the two may split changes at different boundaries.

How UtraDiff's Diff Engine Works

UtraDiff's text comparison engine uses a two-tier diff strategy that combines the algorithms described above.

Primary: Myers' Algorithm

The engine first runs Myers' O(ND) algorithm. For typical files (similar content, small changes), this completes in milliseconds and produces the provably minimal edit script.

Fallback: Dynamic Programming LCS

If Myers' algorithm exceeds a configurable timeout (typically for very large files or files with many differences), the engine falls back to the standard O(MN) dynamic programming approach. This is slower but guaranteed to complete within bounded time.

Both algorithms return a DiffAlgorithmResult containing change segments with offset ranges. The engine then performs a second pass within each changed block to compute character-level differences, giving you the precise inline highlights you see in UtraDiff's text comparison view.

The timeout mechanism is what makes this practical for real-time editing. Rather than blocking the UI while computing a massive diff, it returns the best result it can within the time budget — even if that means using the slower but more predictable DP algorithm.

How UtraDiff uses this

This is exactly what powers UtraDiff's text comparison. When you compare two files in text mode, the diff engine runs Myers' algorithm first. If it times out (for very large or very different files), it falls back to the DP-based LCS. The result is always the best diff it can compute within its time budget.

Structured Diff Approaches

All the algorithms above work on sequences — lines of text treated as opaque tokens. But many file formats have structure: JSON has nested objects, XML has a DOM tree, CSV has rows and columns. Line-based diff often produces misleading results for these formats (a reordered JSON key shows as a full block deletion + insertion).

JSON Diff

Libraries like jsondiffpatch parse JSON into its native tree structure, then walk both trees simultaneously. For each key, the algorithm classifies it as added, removed, changed, or unchanged. Nested objects are diffed recursively. Arrays use LCS on their elements (with configurable object identity functions) to detect moves and reorderings.
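The recursive walk can be sketched in a few lines. This toy `json_diff` only classifies keys as added, removed, or changed; it does not reproduce jsondiffpatch's delta format or its array/LCS handling:

```python
def json_diff(old, new, path=""):
    """Classify keys as added / removed / changed, recursing into dicts."""
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            p = f"{path}/{key}"
            if key not in new:
                changes.append(("removed", p, old[key]))
            elif key not in old:
                changes.append(("added", p, new[key]))
            else:
                # Recurse: nested dicts are diffed key-by-key
                changes += json_diff(old[key], new[key], p)
    elif old != new:
        changes.append(("changed", path, (old, new)))
    return changes
```

For example, diffing `{"a": 1, "b": {"c": 2}}` against `{"a": 1, "b": {"c": 3}, "d": 4}` yields a "changed" entry at `/b/c` and an "added" entry at `/d`.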

XML & HTML Diff

XML diffing is a tree edit distance problem — a generalization of string edit distance to ordered, labeled trees. The classic Zhang-Shasha algorithm (1989) computes the minimum number of node insertions, deletions, and relabelings to transform one tree into another. For HTML, practical implementations often simplify by comparing the DOM structure rather than computing full tree edit distance.

CSV & Tabular Data

Tabular data adds another dimension. Instead of comparing lines, the diff operates at the cell level. With a designated key column, rows can be matched by identity rather than position — enabling detection of row insertions, deletions, moves, and per-cell value changes. Without a key column, row matching falls back to positional LCS.
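With a designated key column, row matching reduces to dictionary lookups rather than sequence alignment. A minimal sketch, assuming rows are dicts (field names are illustrative):

```python
def keyed_row_diff(old_rows, new_rows, key):
    """Match rows by a key column instead of by position."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    added   = [k for k in new_by_key if k not in old_by_key]
    removed = [k for k in old_by_key if k not in new_by_key]
    # Rows present in both but with any differing cell values
    changed = [k for k in old_by_key
               if k in new_by_key and old_by_key[k] != new_by_key[k]]
    return added, removed, changed
```

Because rows are matched by identity, a row that moved to a different position but kept the same values is reported as unchanged, which positional line diff cannot do.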

How UtraDiff uses this

UtraDiff's structural mode parses files into their native data structures before diffing. JSON, YAML, TOML, and INI files are compared key-by-key using jsondiffpatch. CSV files use a custom cell-level algorithm with optional key-column matching. This means UtraDiff can compare across formats — e.g., a JSON config vs. its YAML equivalent.

Modern & Future Approaches

AST-Based Semantic Diff

Traditional diff tools see code as text. But if you parse code into an Abstract Syntax Tree (AST), you can diff at the semantic level: detecting that a function was renamed, a parameter was added, or a block was moved — not just that "line 42 changed." Tools like tree-sitter make this practical by providing fast, incremental parsers for dozens of programming languages.

AI-Assisted Diff

Large language models introduce a new dimension: explaining diffs rather than just displaying them. An AI-assisted diff can summarize what changed in natural language, flag potential bugs introduced by the change, or suggest that a superficial text difference is actually a semantic no-op (like reordering imports). This complements rather than replaces algorithmic diff.

Visual Pixel Diff

For images and rendered output, the comparison operates at the pixel level. Algorithms like pixelmatch compare two images pixel-by-pixel, computing a perceptual color distance (typically using the YIQ color space) and generating a heatmap of differences. This enables visual regression testing: comparing screenshots of a web page before and after a change.

How UtraDiff uses this

UtraDiff already supports visual pixel diff for images (using pixelmatch) and structural diff for data formats. AST-based semantic diff for programming languages is a natural next step that would bring code-aware diffing to UtraDiff's comparison toolkit.