What is the difference between HashingTF and CountVectorizer in Spark?

The hashing trick is actually the other name of feature hashing.

I'm citing Wikipedia's definition :

In machine learning, feature hashing, also known as the hashing trick, by analogy to the kernel trick, is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

You can read more about it in this paper.

So actually actually for space efficient features vectorizing.

Whereas CountVectorizer performs just a vocabulary extraction and it transforms into Vectors.


The answers are great. I just want to emphasize this API difference:

  • CountVectorizer must be fit, which produces a new CountVectorizerModel, which can transform
  • vs HashingTF does not need to be fit, HashingTF instance can transform directly

For example

 CountVectorizer(inputCol="words", outputCol="features")
      .fit(original_df)
      .transform(original_df)

vs:

 HashingTF(inputCol="words", outputCol="features")
      .transform(original_df)

In this API difference CountVectorizer has an extra fit API step. Maybe this is because CountVectorizer does extra work (see accepted answer):

CountVectorizer requires additional scan over the data to build a model and additional memory to store vocabulary (index).

I think you can also skip the fitting step if you are able to create your CountVectorizerModel directly, as shown in example:

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)

Another big difference!

  • HashingTF may create collisions! This means two different features/words are treated as the same term.
  • Accepted answer says this:

    a source of the information loss - in case of HashingTF it is dimensionality reduction with possible collisions

This is particularly a problem with an explicit low numFeatures value (pow(2,4), pow(2,8)); the default value is quite high (pow(2,20)) In this example:

wordsData = spark.createDataFrame([([
    'one', 'two', 'three', 'four', 'five', 
    'six',  'seven', 'eight', 'nine', 'ten'],)], ['tokens'])
hashing = HashingTF(inputCol="tokens", outputCol="hashedValues", numFeatures=pow(2,4))
hashed_df = hashing.transform(wordsData)
hashed_df.show(truncate=False)


+-----------------------------------------------------------+
|hashedValues                                               |
+-----------------------------------------------------------+
|(16,[0,1,2,6,8,11,12,13],[1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0])|
+-----------------------------------------------------------+

The output contains 16 "hash buckets" (because I used numFeatures=pow(2,4))

...16...

While my input had 10 unique tokens, the output only creates 8 unique hashes (due to hash collisions);

....v-------8x-------v....
...[0,1,2,6,8,11,12,13]...

Hash collisions mean 3 distinct tokens are given the same hash, (even though all tokens are unique; and should only occur 1x)

...---------------v
... [1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0] ...

(So leave the default value, or increase your numFeatures to try to avoid collisions :

This [Hashing] approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table.

Some other API differences

  • CountVectorizer constructor (i.e. when initializing) supports extra params:

    • minDF
    • minTF
    • etc...
  • CountVectorizerModel has a vocabulary member, so you can see the vocabulary generated (especially useful if you fit your CountVectorizer):

    • countVectorizerModel.vocabulary
    • >>> [u'one', u'two', ...]
  • CountVectorizer is "reversible" as the main answer says! Use its vocabulary member, which is an array mapping the term index, to the term (sklearn's CountVectorizer does something similar)

A few important differences:

  • partially reversible (CountVectorizer) vs irreversible (HashingTF) - since hashing is not reversible you cannot restore original input from a hash vector. From the other hand count vector with model (index) can be used to restore unordered input. As a consequence models created using hashed input can be much harder to interpret and monitor.
  • memory and computational overhead - HashingTF requires only a single data scan and no additional memory beyond original input and vector. CountVectorizer requires additional scan over the data to build a model and additional memory to store vocabulary (index). In case of unigram language model it is usually not a problem but in case of higher n-grams it can be prohibitively expensive or not feasible.
  • hashing depends on a size of the vector , hashing function and a document. Counting depends on a size of the vector, training corpus and a document.
  • a source of the information loss - in case of HashingTF it is dimensionality reduction with possible collisions. CountVectorizer discards infrequent tokens. How it affects downstream models depends on a particular use case and data.

As per the Spark 2.1.0 documentation,

Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing.

To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the columns. The default feature dimension is 2^18=262,144. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

Sample code

from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import CountVectorizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")],
 ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="Features", numFeatures=100)
hashingTF_model = hashingTF.transform(wordsData)
print "Out of hashingTF function"
hashingTF_model.select('words',col('Features').alias('Features(vocab_size,[index],[tf])')).show(truncate=False)
    

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="Features", vocabSize=20)

cv_model = cv.fit(wordsData)

cv_result = model.transform(wordsData)
print "Out of CountVectorizer function"
cv_result.select('words',col('Features').alias('Features(vocab_size,[index],[tf])')).show(truncate=False)
print "Vocabulary from CountVectorizerModel is \n" + str(cv_model.vocabulary)

Output is as below

enter image description here

Hashing TF misses the vocabulary which is essential for techniques like LDA. For this one has to use CountVectorizer function. Irrespective of the vocab size, CountVectorizer function estimates the term frequency without any approximation involved unlike in HashingTF.

Reference:

https://spark.apache.org/docs/latest/ml-features.html#tf-idf

https://spark.apache.org/docs/latest/ml-features.html#countvectorizer