how to compare 2 files efficiently?

sri_rng · Mar 8, 2008

Hi,
How do I compare the contents of 2 files most efficiently? Say the files contain strings.

Kumar · Mar 8, 2008

Don't know about files with strings, but isn't MD5 the best method?

linuxtechie · Mar 8, 2008

Mate, if you've got files containing strings then look no further. WinMerge is the king out there.

~LT

luster · Mar 8, 2008

Use Beyond Compare.

Nomad · Mar 8, 2008

Try WinHex (computer forensics and data recovery software, hex editor and disk editor).

sri_rng · Mar 9, 2008

^^ I want to compare files containing strings, like an ebook in txt format. I would like to see an algorithm rather than a tool. What is the complexity of this process?

rapt0r · Mar 9, 2008

What exactly do you want to do? Do you want to verify that both files are exactly the same and unaltered? Or do you want to check whether the strings in one file are present in the other? In the first case, calculating the MD5 hash of both files and comparing them would be the best way.
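For the first case, here is a minimal sketch of the hash comparison, assuming Python and its standard hashlib module. The function names file_digest and same_content are made up for illustration. Each file is read once, so the cost grows linearly with file size.

```python
import hashlib

def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    """True if both files produce the same digest."""
    return file_digest(path_a) == file_digest(path_b)
```

Equal digests only tell you the files match; they say nothing about where two different files differ.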

KingKrool · Mar 9, 2008

You can use the Unix diff command.

You really need to specify the use case a bit better. What do you want out of your comparison utility? What files are you comparing? Why?

iq6886 · Mar 9, 2008

sri_rng said:
^^ I want to compare files containing strings, like an ebook in txt format. I would like to see an algorithm rather than a tool. What is the complexity of this process?

You can consult Thomas Cormen for this. There are many algorithms available, like Rabin-Karp, Knuth-Morris-Pratt, Boyer-Moore, etc., and it can also be done via finite automata. Complexity is polynomial for every algorithm (not sure on this). You can Google it.

sri_rng · Mar 9, 2008

Well, I don't want to compare two files to check whether they are the same. One file is a reference file for me, and I want to compare the contents of another file against it. Say, compare an ebook to the words in the first file and see how similar they are, or how many words in the ebook are not in the first file. Hope I made this a bit clearer now.
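A rough sketch of the "words in the ebook that are not in the reference file" part, assuming Python, assuming a word is a run of letters or apostrophes, and assuming case does not matter. The names words and unseen_words are hypothetical. With a hash set, the expected cost is roughly linear in the total number of words.

```python
import re

def words(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def unseen_words(reference_text, ebook_text):
    """Return the distinct ebook words that never appear in the reference."""
    reference = set(words(reference_text))
    return {w for w in words(ebook_text) if w not in reference}
```

The size of the returned set, compared with the number of distinct words in the ebook, gives one crude measure of how far the ebook strays from the reference.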

KingKrool · Mar 9, 2008

There are sequence matching algorithms. You can try looking in an algorithms textbook in the dynamic programming chapter.

hammerhead · Mar 10, 2008

What you are looking for is not easy. I would suggest you look into books on pattern matching and pattern recognition. The popular methods are through fuzzy systems. Alternatively, you may want to search ACM or IEEE for similar papers.

If you want to compare in a simpler fashion, you first need to establish the parameters by which you will decide whether two files are similar, for example word count, frequency of words, their occurrence in the files, etc.

Based on these 'descriptors' you will then need some thresholds to classify whether the files match. For example, two files may be similar if the difference in their word counts is not more than 10 and the frequencies of some selected words are the same. You may require good tree structures to solve the problem efficiently.
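The example rule above could be sketched like this, assuming Python, whitespace-separated words, and case-insensitive matching. The function name looks_similar and its parameters are invented for illustration, and the default threshold of 10 is just the figure from the example.

```python
from collections import Counter

def looks_similar(text_a, text_b, selected_words, max_count_diff=10):
    """Apply the two descriptors: total word count and selected-word frequency."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # Descriptor 1: total word counts must be within the threshold.
    if abs(len(words_a) - len(words_b)) > max_count_diff:
        return False
    # Descriptor 2: each selected word must occur equally often in both texts.
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    return all(freq_a[w] == freq_b[w] for w in selected_words)
```

A Counter is a hash table rather than a tree, which is a swap on my part; it is usually enough for plain frequency lookups.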

Also look into edit distance and Levenshtein distance (see the Wikipedia articles on each).

Two files will be similar if the edit distance to convert one into the other is less than a certain threshold. That is the simplest algo I could think of.
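A small sketch of that idea, assuming Python and the standard dynamic-programming formulation of Levenshtein distance. The names edit_distance and is_similar are made up, and the threshold is something you would have to pick yourself. Time is proportional to len(a) * len(b), so for whole ebooks it may be more practical to pass lists of words instead of raw characters.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b, keeping two DP rows."""
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def is_similar(a, b, threshold):
    """True if a can be turned into b in fewer than threshold edits."""
    return edit_distance(a, b) < threshold
```

Because it works on any sequence, edit_distance("kitten", "sitting") compares characters while edit_distance(text_a.split(), text_b.split()) compares words.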

Btw, out of curiosity, why do you need to make such a program?

sri_rng · Mar 10, 2008

I am not looking for methods to check whether 2 files are similar. KingKrool got me right: I want to match a set of words in a file (say file1) and check how many times one of the strings in file1 occurs in the test file.
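If the strings are single words, one possible sketch (assuming Python, whitespace-separated words, and case-insensitive matching) is to count every word of the test file once and then look up each word from file1. The name count_reference_words is hypothetical.

```python
from collections import Counter

def count_reference_words(reference_words, test_text):
    """Map each reference word to its number of occurrences in test_text."""
    # One pass over the test file builds the frequency table.
    counts = Counter(test_text.lower().split())
    # Counter returns 0 for words that never occur.
    return {w: counts[w.lower()] for w in reference_words}
```

The expected cost is roughly linear in the number of words in both files. Punctuation stuck to a word ("cat," vs "cat") is not handled here, and multi-word strings would need a real string search instead.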

hammerhead · Mar 10, 2008

Sorry, I didn't catch you the first time. In that case do as iq6886 and KingKrool say. There are a few string search algorithms, like the Rabin-Karp string search algorithm (see its Wikipedia article).
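For reference, a minimal sketch of Rabin-Karp in Python, assuming a single pattern and a simple polynomial rolling hash. The base and modulus values are arbitrary choices for illustration, not anything prescribed. Expected running time is about O(n + m) for text length n and pattern length m, with O(n * m) as the worst case when many hash collisions occur.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)  # weight of the window's first character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    matches = []
    for start in range(n - m + 1):
        # Compare the actual substring on a hash match to rule out collisions.
        if p_hash == t_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches
```

Counting how often one string from file1 occurs in the test file is then len(rabin_karp(test_text, that_string)).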