^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?
What do you exactly want to do ? Do you want to verify if both files are exactly same and unaltered ? Or you want to check if strings in one file is present in other one or not ? In first case calculating md5 hash for both files and comparing would be best way .
^^ I want to compare files containing strings. Like an ebook in txt format.. I would like to see an algorithm rather than a tool.. What is the complexity of this process?
u cn consult thomas cormen for this.. there are many algos available like rabin-karp, knuth-morris algo , boyer-moore algo , etc.. & cn be done via fa also. Complexity is polnomial for every also ( not sure on this..) u cn google it....
well, I don't want to compare two files to check if they are same or not. 1 file is a reference file for me, I want to compare the contents of an other file. Say an ebook, to the words in first file and see how similar they are or how many words are there in the ebook which are not there in the first file..Hope I made this a bit clearer now..
What you are looking for is not easy. I would suggest you to look into books of pattern matching and pattern recognition. The popular methods are through Fuzzy sustems. Alternatively you may want to search acm or ieee for similar papers.
If you want to compare in a simpler fashion, you need to first establish the parameters based upon which you will decide if two files are similar or not, for example word count, frequency of words, their occurance in the files etc.
Based on these 'descriptors' you will need to further use some thresholds to classify if the files match or not. For example two files may be similar if the differnece in their word count is not more than 10 and the frequency of some selected words are same. You may require good trees to efficiently solve the problem.
I am not looking at methods to compare if 2 files are similar..KingKrool got me right, I wanna match a set of words in a file(say file1) and check how many times one of the strings in file1 occurs in test file.