How to auto-tag content, algorithms and suggestions needed
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. a sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have functions in-built for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
- http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
- http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
You should use a metric such as tf-idf to get the tags out:
- Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
- Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
- Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(D) + 1).
- For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
If you want to do better than this, use language models. That requires some knowledge of probability theory.