DONG Yuxuan @ Jun 29, 2019 Asia/Shanghai
An algorithm to compare two text files by key columns without loading the whole file into the memory.
There’re two text files storing records and a record is a line separated into columns by delimiters. Some columns are key columns which identify a record and the rest of the columns are called the value columns. We want to compare these two files to see which records are inserted, deleted and modified. An inserted record is a record that its key appears only in the second file. A deleted record is a record that its key appears only in the first file. A modified record is a record that its key appears in both files but value columns are different. The following example is two files recording systems with versions, and the system column (column 1) is the key column.
macOS 10.12 Windows 10 Ubuntu 16
macOS 10.14 Windows 10 Manjaro 17
In this example, Manjaro with version 17 is inserted, Ubuntu is deleted, and macOS has a version modification. A diff-like syntax could be used to describe it:
* macOS 10.12 > macOS 10.14 + Manjaro 17 - Ubuntu 16
The problem is simple if files are small. All we need to do is loading files into a hash table in RAM and do the comparision. However, Files can be too large to load into a RAM, thus we need a fast algorithm which reads only a part of a file at one time.
The algorithm first runs external sort on files by key columns. This can be done on most Unix systems by running the following commands.
$ sort -k1,1 before.txt > sorted_before.txt $ sort -k1,1 after.txt > sorted_after.txt
After the external sorting, the algorithm reads two sorted files line by line to run the following iteration.
If the key of the current record of the first file is smaller than of the second file, the algorithm asserts that the current record of the first file is deleted and read the next record of the first file
If the key of the current record of the first file is larger than of the second file, the algorithm asserts that the current record of the second file is inserted and read the next record of the second file
If the keys of the current records of two files are equal, the algorithm reads next lines of both files, but before reading, it compares value columns of current records and asserts the record is modified if value columns are different
If there’re remaining records in the first file after the iteration, the algorithm asserts all these records are deleted. If there’re remaining records in the second file after the iteration, the algorithm asserts all these records are inserted.
The iteration is very similar to the merge algorithm in merge sort and obviously has the time complexity of $\Theta(n)$ where $n$ is the number of lines. External sort can be done in $\Theta(n\log n)$. Thus the time complexity of the algorithm is $\Theta(n\log n)$.
Based on the algorithm, a Unix utility KVDiff is developed. It can be installed by running
pip install kvdiff and the usage can be got by running