PySpark vs sklearn TFIDF
That's because the IDFs are calculated a little differently between the two.

From sklearn's documentation (with the default smooth_idf=True):

idf(t) = ln((1 + n) / (1 + df(t))) + 1

Compare to pyspark's documentation:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

Besides the addition of the 1 in the IDF, sklearn's TF-IDF uses the l2 norm by default, which pyspark does not:
TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
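To see the formula difference concretely, here is a minimal sketch (plain Python on a toy corpus; the corpus and variable names are my own, not from either library) that computes both IDF variants by hand. sklearn's smoothed IDF comes out exactly 1 higher than pyspark's for every term:

import math

corpus = [["i", "heard", "about", "spark"],
          ["i", "wish", "java", "could", "use", "case", "classes"],
          ["logistic", "regression", "models", "are", "neat"]]

n = len(corpus)                                      # number of documents
vocab = {term for doc in corpus for term in doc}

for term in sorted(vocab):
    df = sum(term in doc for doc in corpus)          # document frequency
    sklearn_idf = math.log((1 + n) / (1 + df)) + 1   # smooth_idf=True default
    pyspark_idf = math.log((n + 1) / (df + 1))       # Spark's documented IDF
    print(f"{term:12s} sklearn={sklearn_idf:.4f} pyspark={pyspark_idf:.4f}")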
Both the sklearn and pyspark implementations produce the same tf-idf scores. Refer to the same sklearn document, in particular its description of the norm parameter. The key difference between them is that sklearn uses the l2 norm by default, which is not the case with pyspark. If we set the norm to None, we will get the same result in sklearn as well.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = ["I heard about Spark",
          "I wish Java could use case classes",
          "Logistic regression models are neat"]
corpus = [sent.lower().split() for sent in corpus]  # pre-tokenize each sentence

def dummy_fun(doc):
    # identity function so the vectorizer accepts the pre-tokenized input as-is
    return doc

tfidfVectorizer = TfidfVectorizer(norm=None, analyzer='word',
                                  tokenizer=dummy_fun, preprocessor=dummy_fun,
                                  token_pattern=None)
tf = tfidfVectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_df = pd.DataFrame(tf.toarray(), columns=tfidfVectorizer.get_feature_names_out())
tf_df
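For comparison, here is a sketch of the pyspark side (assuming a local SparkSession; I use CountVectorizer rather than HashingTF so the resulting columns correspond to a readable vocabulary, which is my choice rather than anything the question requires):

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF

spark = SparkSession.builder.master("local[1]").appName("tfidf-demo").getOrCreate()

corpus = [["i", "heard", "about", "spark"],
          ["i", "wish", "java", "could", "use", "case", "classes"],
          ["logistic", "regression", "models", "are", "neat"]]
df = spark.createDataFrame([(words,) for words in corpus], ["words"])

# raw term counts; CountVectorizer keeps an inspectable vocabulary,
# unlike HashingTF, which hashes terms into buckets
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(df)
tf = cv_model.transform(df)

# Spark applies IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) with no normalization
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)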
Refer to my answer here to understand how the norm works with the tf-idf vectorizer.
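As a quick illustration of the norm (a sketch; the corpus and identity tokenizer below are just placeholders): dividing each row of the unnormalized tf-idf matrix by its Euclidean length reproduces sklearn's norm='l2' output.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [["i", "heard", "about", "spark"],
        ["i", "wish", "java", "could", "use", "case", "classes"]]
identity = lambda doc: doc  # pass pre-tokenized docs straight through

unnormalized = TfidfVectorizer(norm=None, tokenizer=identity, preprocessor=identity,
                               token_pattern=None).fit_transform(docs).toarray()
normalized = TfidfVectorizer(norm='l2', tokenizer=identity, preprocessor=identity,
                             token_pattern=None).fit_transform(docs).toarray()

# row-wise l2 normalization of the unnormalized matrix
manual = unnormalized / np.linalg.norm(unnormalized, axis=1, keepdims=True)
print(np.allclose(manual, normalized))  # True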