how to compare 2 files efficiently?

^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?
 
What do you exactly want to do ? Do you want to verify if both files are exactly same and unaltered ? Or you want to check if strings in one file is present in other one or not ? In first case calculating md5 hash for both files and comparing would be best way .
 
You can use the unix diff command.

You really need to specify the use case a bit better. What do you want out of your comparison utility? What files are you comparing? Why?
 
sri_rng said:
^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?

u cn consult thomas cormen for this.. there are many algos available like rabin-karp, knuth-morris algo , boyer-moore algo , etc.. & cn be done via fa also. Complexity is polnomial for every also ( not sure on this..) u cn google it....
 
well, I don't want to compare two files to check if they are same or not. 1 file is a reference file for me, I want to compare the contents of an other file. Say an ebook, to the words in first file and see how similar they are or how many words are there in the ebook which are not there in the first file..Hope I made this a bit clearer now..
 
What you are looking for is not easy. I would suggest you to look into books of pattern matching and pattern recognition. The popular methods are through Fuzzy sustems. Alternatively you may want to search acm or ieee for similar papers.

If you want to compare in a simpler fashion, you need to first establish the parameters based upon which you will decide if two files are similar or not, for example word count, frequency of words, their occurance in the files etc.

Based on these 'descriptors' you will need to further use some thresholds to classify if the files match or not. For example two files may be similar if the differnece in their word count is not more than 10 and the frequency of some selected words are same. You may require good trees to efficiently solve the problem.

Also look into Edit distance Edit distance - Wikipedia, the free encyclopedia
Levenshtein distance - Wikipedia, the free encyclopedia

two files will be similar if the edit distance to convert one to another is less than a certain threshold. That is the simplest also I could think of

btw out of curiosity, why do you need to make such a program?
 
I am not looking at methods to compare if 2 files are similar..KingKrool got me right, I wanna match a set of words in a file(say file1) and check how many times one of the strings in file1 occurs in test file.
 
Back
Top