How to select stop words using tf-idf? (non english corpus)

Stop-words are those words that appear very commonly across the documents, therefore loosing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter those that appear in more than 50% of them, or the top 500 or some type of threshold that you will have to tune.

The best (as in more representative) terms in a document are those with higher tf-idf because those terms are common in the document, while being rare in the collection.

As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop-words) produce very low tf-idf anyway. However, they will change some computations and this would be wrong if you assume they are pure noise (which might not be true depending on the task). In addition, if they are included your algorithm would be slightly slower.

edit: As @FelipeHammel says, you can directly use the IDF (remember to invert the order) as a measure which is (inversely) proportional to df. This is completely equivalent for ranking purposes, and therefore to select the top "k" terms. However, it is not possible to use it to select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple thresholding will fix that (i.e., selecting terms with idf lower than a specific value). In general, a fix number of terms is used.

I hope this helps.


From "Introduction to Information Retrieval" book:

tf-idf assigns to term t a weight in document d that is

  1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.

So words with lowest tf-idf can considered as stop words.