how to compare 2 files efficiently?

sri_rng

Adept
Sep 3, 2007
512
22
32
36
^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?
 

rapt0r

Adept
Jun 15, 2006
699
11
0
What do you exactly want to do ? Do you want to verify if both files are exactly same and unaltered ? Or you want to check if strings in one file is present in other one or not ? In first case calculating md5 hash for both files and comparing would be best way .
 

KingKrool

Skilled
Mar 16, 2005
3,556
71
136
You can use the unix diff command.

You really need to specify the use case a bit better. What do you want out of your comparison utility? What files are you comparing? Why?
 

iq6886

Adept
Nov 25, 2007
243
7
32
37
Hyderabad
sri_rng said:
^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?

u cn consult thomas cormen for this.. there are many algos available like rabin-karp, knuth-morris algo , boyer-moore algo , etc.. & cn be done via fa also. Complexity is polnomial for every also ( not sure on this..) u cn google it....
 

sri_rng

Adept
Sep 3, 2007
512
22
32
36
well, I don't want to compare two files to check if they are same or not. 1 file is a reference file for me, I want to compare the contents of an other file. Say an ebook, to the words in first file and see how similar they are or how many words are there in the ebook which are not there in the first file..Hope I made this a bit clearer now..
 

KingKrool

Skilled
Mar 16, 2005
3,556
71
136
There are sequence matching algorithms. You can try looking in an algorithms textbook in the dynamic programming chapter.
 

hammerhead

Adept
Jun 26, 2006
683
26
0
37
www.teamreflex.org
What you are looking for is not easy. I would suggest you to look into books of pattern matching and pattern recognition. The popular methods are through Fuzzy sustems. Alternatively you may want to search acm or ieee for similar papers.

If you want to compare in a simpler fashion, you need to first establish the parameters based upon which you will decide if two files are similar or not, for example word count, frequency of words, their occurance in the files etc.

Based on these 'descriptors' you will need to further use some thresholds to classify if the files match or not. For example two files may be similar if the differnece in their word count is not more than 10 and the frequency of some selected words are same. You may require good trees to efficiently solve the problem.

Also look into Edit distance Edit distance - Wikipedia, the free encyclopedia
Levenshtein distance - Wikipedia, the free encyclopedia

two files will be similar if the edit distance to convert one to another is less than a certain threshold. That is the simplest also I could think of

btw out of curiosity, why do you need to make such a program?
 

sri_rng

Adept
Sep 3, 2007
512
22
32
36
I am not looking at methods to compare if 2 files are similar..KingKrool got me right, I wanna match a set of words in a file(say file1) and check how many times one of the strings in file1 occurs in test file.