Algorithm to Detect and Compare Phrases
- Take all your texts and build a list of the words. Easy way: take all the words. Hard way: take only the relevant ones (e.g. in English, "the" is never a pertinent word, as it is used too often). Let's say you have V words in your vocabulary.
- For each text, build an adjacency matrix A of size V*V. The row A(i) states how close the words in your vocabulary are to the i-th word V(i). For example, if V(i)="skiing", then A(i,j) is how close the word V(j) is to the word "skiing". Since the matrix has V*V entries, you'd prefer a small vocabulary!
Technical details: There are several ways to get a good vocabulary. Unfortunately, I can't remember the names. One of them consists of deleting words that appear often and everywhere. Conversely, you should keep rare words that appear in only a few texts. However, there is no point in keeping words that appear in exactly one text, since they cannot contribute to any comparison between texts.
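As a rough illustration of that filtering idea, here is a minimal sketch that keeps a word only if it shows up in at least two texts but not in almost all of them. The function name `build_vocabulary` and the thresholds `min_docs` and `max_doc_ratio` are my own assumptions, not a named standard method, and the texts are assumed to be already split into words.

```python
from collections import Counter

def build_vocabulary(texts, min_docs=2, max_doc_ratio=0.8):
    """texts: list of token lists (one per text).

    Keep a word if it appears in at least `min_docs` texts and in at most
    `max_doc_ratio` of all texts. Both thresholds are illustrative guesses.
    """
    doc_freq = Counter()
    for tokens in texts:
        # Count each word once per text, however often it occurs there.
        doc_freq.update(set(tokens))
    n_texts = len(texts)
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_docs and df / n_texts <= max_doc_ratio
    )

texts = [
    "the fishing and the skiing".split(),
    "the hiking and the fishing".split(),
    "the skiing or the hiking".split(),
    "the maiden".split(),
]
vocab = build_vocabulary(texts)
print(vocab)
```

Here "the" is dropped because it is in every text, while "or" and "maiden" are dropped because they are in only one.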
For the adjacency matrix, adjacency is measured by counting how far apart the two words you are considering are (i.e. the number of words separating them). For example, let's use your very text =)
One method of comparing style is to look for similar phrases. If I find in one book "fishing, skiing and hiking" a couple of times and in another book "fishing, hiking and skiing" the similarity in style points to one author. I need to also be able to find "fishing and even skiing or hiking" though. Ideally I would also find "angling, hiking and skiing" but because they are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.
With that text, you would get updates like these (the values are entirely made up):
A(method, comparing) += 1.0
A(method, similarity) += 0.5
A(method, Greek) += 0.0
You mainly need a "typical distance". For example, you can decide that once two words are separated by more than 20 words, they can't be considered adjacent anymore.
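To make the "typical distance" concrete, here is a minimal sketch of how such a matrix could be filled in. It assumes a linear decay (weight 1.0 for neighbouring words, falling to 0.0 at `max_gap` separating words) and a symmetric matrix; both are my assumptions, and any other decreasing function would do. The name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(tokens, vocab, max_gap=20):
    """Build a V*V matrix of closeness weights for one text.

    tokens: the text as a list of words; vocab: list of vocabulary words.
    Assumption: weight = 1 - gap / max_gap, where gap is the number of
    words separating the pair; pairs with gap >= max_gap contribute nothing.
    """
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    A = [[0.0] * size for _ in range(size)]
    for p, word in enumerate(tokens):
        i = index.get(word)
        if i is None:
            continue  # word is not in the vocabulary
        # Only look ahead as far as the typical distance allows.
        for q in range(p + 1, min(p + max_gap + 1, len(tokens))):
            j = index.get(tokens[q])
            if j is None:
                continue
            gap = q - p - 1  # number of separating words
            weight = 1.0 - gap / max_gap
            A[i][j] += weight
            if i != j:
                A[j][i] += weight  # keep the matrix symmetric
    return A

vocab = ["fishing", "hiking", "skiing"]
tokens = "fishing skiing and hiking".split()
A = adjacency_matrix(tokens, vocab, max_gap=4)
for row in A:
    print(row)
```

Making the matrix symmetric means "fishing, skiing and hiking" and "fishing, hiking and skiing" give similar matrices, which seems to be what you want for comparing style.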
After a bit of normalization, just compute the L2 distance between the adjacency matrices of two texts to see how close they are. You can do fancier stuff afterwards, but this should yield acceptable results. Now, if you have synonyms, you can update the adjacency in a nice way. For example, if the input contains "beautiful maiden", then
A(beautiful, maiden) += 1.0
A(magnificent, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, maiden) += 0.8
...
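The comparison and the synonym update could look roughly like the sketch below. It is only one way to do it: the normalization (scaling each matrix to unit norm) is my assumption, since "a bit of normalization" could mean several things, and the synonym table with its factors is invented for the example. The names `normalize`, `l2_distance` and `add_with_synonyms` are hypothetical.

```python
import math

def normalize(A):
    """Scale a matrix so that the square root of its summed squares is 1."""
    norm = math.sqrt(sum(x * x for row in A for x in row))
    if norm == 0:
        return [row[:] for row in A]  # empty matrix: nothing to scale
    return [[x / norm for x in row] for row in A]

def l2_distance(A, B):
    """L2 distance between two matrices of the same size."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def add_with_synonyms(A, index, word, other, weight, synonyms):
    """Add `weight` to A(word, other), plus a discounted weight for each
    synonym of `word`. `synonyms` maps a word to {synonym: factor}."""
    candidates = [(word, 1.0)] + list(synonyms.get(word, {}).items())
    for candidate, factor in candidates:
        if candidate in index and other in index:
            A[index[candidate]][index[other]] += weight * factor

vocab = ["beautiful", "fair", "magnificent", "maiden", "sublime"]
index = {word: i for i, word in enumerate(vocab)}
# Made-up synonym factors, matching the made-up values above.
synonyms = {"beautiful": {"magnificent": 0.9, "fair": 0.8, "sublime": 0.8}}

A = [[0.0] * len(vocab) for _ in vocab]
add_with_synonyms(A, index, "beautiful", "maiden", 1.0, synonyms)

B = [[0.0] * len(vocab) for _ in vocab]
add_with_synonyms(B, index, "fair", "maiden", 1.0, {})

print(l2_distance(normalize(A), normalize(B)))
```

Because the text with "beautiful maiden" also gets some weight on (fair, maiden), it ends up closer to a text containing "fair maiden" than it would without the synonym update.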