# KVDiff: Compare Two Large Text Files by Key Columns

DONG Yuxuan @ Jun 29, 2019

## Abstract

An algorithm to compare two text files by key columns without loading the whole file into the memory

## Background

There’re two text files storing records and a record is a line separated into columns by delimiters. Some columns are key columns which identify a record and the rest of the columns are called the value columns. We want to compare these two files to see which records are inserted, deleted and modified. An inserted record is a record that its key appears only in the second file. A deleted record is a record that its key appears only in the first file. A modified record is a record that its key appears in both files but value columns are different. The following example is two files recording systems with versions, and the system column (column 1) is the key column.

before.txt

MacOS 10.12
Windows 10
Ubuntu 16


after.txt

MacOS 10.14
Windows 10
Manjaro 17


In this example, Manjaro with version 17 is inserted, Ubuntu is deleted, and MacOS has a version modification. A diff-like syntax could be used to describe it:

* MacOS 10.12
> MacOS 10.14
+ Manjaro 17
- Ubuntu 16


The problem is simple if files are small. All we need to do is loading files into a hash table in RAM and do the comparision. However, Files can be too large to load into a RAM, thus we need a fast algorithm which reads only a part of a file at one time.

## Algorithm

The algorithm first runs external sort on files by key columns. This can be done on most UNIX systems by running the following commands.

$sort -k1,1 before.txt > sorted_before.txt$ sort -k1,1 after.txt > sorted_after.txt


After the external sorting, the algorithm reads two sorted files line by line to run the following iteration.

• If the key of the current record of the first file is smaller than of the second file, the algorithm asserts that the current record of the first file is deleted and read the next record of the first file

• If the key of the current record of the first file is larger than of the second file, the algorithm asserts that the current record of the second file is inserted and read the next record of the second file

• If the keys of the current records of two files are equal, the algorithm reads next lines of both files, but before reading, it compares value columns of current records and asserts the record is modified if value columns are different

If there’re remaining records in the first file after the iteration, the algorithm asserts all these records are deleted. If there’re remaining records in the second file after the iteration, the algorithm asserts all these records are inserted.

## Analysis

The iteration is very similar to the merge algorithm in merge sort and obviously has the time complexity of $\Theta(n)$ where $n$ is the number of lines. External sort can be done in $\Theta(n\log n)$. Thus the time complexity of the algorithm is $\Theta(n\log n)$.

## Implement

Based on the algorithm, a UNIX utility KVDiff is developed. It can be installed by running pip install kvdiff and the usage can be got by running kvdiff ---help.