PySpark vs sklearn TFIDF
That's because the IDFs are calculated a little differently between the two.

From sklearn's documentation (with the default smooth_idf=True):

idf(t) = ln((1 + n) / (1 + df(t))) + 1

Compare to pyspark's documentation:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

Besides the addition of the 1 in the IDF, sklearn's TF-IDF uses the l2 norm by default, which pyspark does not:
TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
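To see the formula difference concretely, here is a minimal sketch (plain Python on a toy corpus; the corpus and variable names are my own, not from either library) that computes both IDF variants by hand. sklearn's smoothed IDF comes out exactly 1 higher than pyspark's for every term:

import math

corpus = [["i", "heard", "about", "spark"],
          ["i", "wish", "java", "could", "use", "case", "classes"],
          ["logistic", "regression", "models", "are", "neat"]]

n = len(corpus)                                      # number of documents
vocab = {term for doc in corpus for term in doc}

for term in sorted(vocab):
    df = sum(term in doc for doc in corpus)          # document frequency
    sklearn_idf = math.log((1 + n) / (1 + df)) + 1   # smooth_idf=True default
    pyspark_idf = math.log((n + 1) / (df + 1))       # Spark's documented IDF
    print(f"{term:12s} sklearn={sklearn_idf:.4f} pyspark={pyspark_idf:.4f}")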
Both the sklearn and pyspark implementations produce the same tf-idf scores. Refer to the same sklearn document, in particular its description of the norm parameter. The key difference between them is that sklearn uses the l2 norm by default, which is not the case with pyspark. If we set the norm to None, we will get the same result in sklearn as well.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = ["I heard about Spark",
          "I wish Java could use case classes",
          "Logistic regression models are neat"]
corpus = [sent.lower().split() for sent in corpus]  # pre-tokenize each sentence

def dummy_fun(doc):
    # identity function so the vectorizer accepts the pre-tokenized input as-is
    return doc

tfidfVectorizer = TfidfVectorizer(norm=None, analyzer='word',
                                  tokenizer=dummy_fun, preprocessor=dummy_fun,
                                  token_pattern=None)
tf = tfidfVectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_df = pd.DataFrame(tf.toarray(), columns=tfidfVectorizer.get_feature_names_out())
tf_df
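For comparison, here is a sketch of the pyspark side (assuming a local SparkSession; I use CountVectorizer rather than HashingTF so the resulting columns correspond to a readable vocabulary, which is my choice rather than anything the question requires):

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF

spark = SparkSession.builder.master("local[1]").appName("tfidf-demo").getOrCreate()

corpus = [["i", "heard", "about", "spark"],
          ["i", "wish", "java", "could", "use", "case", "classes"],
          ["logistic", "regression", "models", "are", "neat"]]
df = spark.createDataFrame([(words,) for words in corpus], ["words"])

# raw term counts; CountVectorizer keeps an inspectable vocabulary,
# unlike HashingTF, which hashes terms into buckets
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(df)
tf = cv_model.transform(df)

# Spark applies IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) with no normalization
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)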
Refer to my answer here to understand how the norm works with the tf-idf vectorizer.
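As a quick illustration of the norm (a sketch; the corpus and identity tokenizer below are just placeholders): dividing each row of the unnormalized tf-idf matrix by its Euclidean length reproduces sklearn's norm='l2' output.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [["i", "heard", "about", "spark"],
        ["i", "wish", "java", "could", "use", "case", "classes"]]
identity = lambda doc: doc  # pass pre-tokenized docs straight through

unnormalized = TfidfVectorizer(norm=None, tokenizer=identity, preprocessor=identity,
                               token_pattern=None).fit_transform(docs).toarray()
normalized = TfidfVectorizer(norm='l2', tokenizer=identity, preprocessor=identity,
                             token_pattern=None).fit_transform(docs).toarray()

# row-wise l2 normalization of the unnormalized matrix
manual = unnormalized / np.linalg.norm(unnormalized, axis=1, keepdims=True)
print(np.allclose(manual, normalized))  # True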