How are TF-IDF values calculated by the scikit-learn TfidfVectorizer?
scikit-learn's TfidfVectorizer computes TF-IDF in several steps. It inherits from CountVectorizer and uses a TfidfTransformer internally.
Here is a summary of the steps, to make it more straightforward:
- tfs are calculated by CountVectorizer's fit_transform()
- idfs are calculated by TfidfTransformer's fit()
- tfidfs are calculated by TfidfTransformer's transform()
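The three steps above can be reproduced by hand. This is only a minimal sketch: the corpus `docs` is a made-up example (not the one from the question), and default settings are assumed everywhere.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "a string of text",
    "another string",
    "more text here",
    "nothing in common",
]

# Step 1: raw term counts (tf) from CountVectorizer.
counts = CountVectorizer().fit_transform(docs)

# Step 2: fit() learns the idf vector from the counts.
transformer = TfidfTransformer()
transformer.fit(counts)

# Step 3: transform() multiplies tf by idf (and normalizes, by default).
two_step = transformer.transform(counts)

# TfidfVectorizer runs the same steps in a single call.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

With matching defaults, both routes should give the same matrix.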
You can check the source code here.
Back to your example. Here is the calculation that is done for the tfidf weight for the 5th term of the vocabulary, 1st document (X_mat[0,4]):
First, the tf for 'string', in the 1st document:
tf = 1
Second, the idf for 'string', with smoothing enabled (default behavior):
df = 2
N = 4
idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
And finally, the tfidf weight for (document 0, feature 4):
tfidf(0,4) = tf * idf = 1 * 1.5108256238 = 1.5108256238
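The arithmetic above can be checked in plain Python. The variable names are mine; the numbers (N = 4, df = 2, tf = 1) are the ones from the calculation.

```python
import math

n_docs = 4  # N: number of documents in the corpus
df = 2      # number of documents containing the term
tf = 1      # raw count of the term in document 0

# Smoothed idf: add one to both numerator and denominator, then add 1.
idf = math.log((n_docs + 1) / (df + 1)) + 1

# Unnormalized tf-idf weight for this (document, term) pair.
tfidf = tf * idf

print(idf, tfidf)
```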
I noticed you chose not to normalize the tfidf matrix. Keep in mind that normalizing the tfidf matrix is a common and usually recommended approach, since many models work better when the feature matrix (or design matrix) is normalized.
TfidfVectorizer will L2-normalize the output matrix by default, as a final step of the calculation. Each row (document) then has unit Euclidean length, so every weight lies between 0 and 1.
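To see what the normalization step does, the sketch below divides each row of the unnormalized matrix by its Euclidean norm and compares the result with the default output. The corpus `docs` is a made-up example, chosen so that no row is all zeros.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog sat down", "a bird flew"]

# norm=None switches the final normalization step off.
raw = TfidfVectorizer(norm=None).fit_transform(docs).toarray()

# Default behavior: norm='l2'.
normalized = TfidfVectorizer().fit_transform(docs).toarray()

# Manual L2 normalization: divide every row by its Euclidean norm.
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
manual = raw / row_norms

print(np.allclose(manual, normalized))
print(normalized.max())
```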
The precise computation formula is given in the docs:
The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead of tf * idf
and
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.
That means 1.51082562 is obtained as 1.51082562 = 1 + ln((4 + 1) / (2 + 1)).
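The effect of smoothing can be checked directly on TfidfTransformer. As a sketch, the count matrix below is hand-built (an assumption for illustration): column 0 appears in 2 of 4 documents, column 1 in all 4.

```python
import math
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hand-built term counts: 4 documents (rows), 2 terms (columns).
counts = csr_matrix(np.array([[1, 1],
                              [1, 2],
                              [0, 1],
                              [0, 1]]))

# Smoothed idf (default): ln((N + 1) / (df + 1)) + 1
smooth = TfidfTransformer(smooth_idf=True).fit(counts).idf_

# Unsmoothed idf: ln(N / df) + 1
plain = TfidfTransformer(smooth_idf=False).fit(counts).idf_

print(smooth)
print(plain)
```

A term that occurs in every document gets an idf of exactly 1 in both cases, so it is down-weighted but never dropped.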
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
print(corpus)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
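# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
z = X.toarray()
# print the term frequencies (raw counts)
print(z)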
vectorizer1 = TfidfVectorizer(min_df=1)
X1 = vectorizer1.fit_transform(corpus)
idf = vectorizer1.idf_
# print the idf of each term
print(dict(zip(vectorizer1.get_feature_names_out(), idf)))
# print the tfidf matrix (L2-normalized by default)
print(X1.toarray())
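# idf formula, worked for the term 'first' (column 2 in this corpus):
# df = 2
# N = 4
# idf = ln((N + 1) / (df + 1)) + 1 = ln(5 / 3) + 1 = 1.5108256238
# tfidf formula, before the default L2 normalization:
# tfidf(0,2) = tf * idf = 1 * 1.5108256238 = 1.5108256238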
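To confirm the formula against the library on this corpus, here is a small check for the term 'first', which occurs in 2 of the 4 documents. It is a sketch that assumes norm=None, so the matrix holds the raw tf * idf products; the names `raw_vectorizer`, `col` and `expected_idf` are mine.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# norm=None disables the final L2 normalization.
raw_vectorizer = TfidfVectorizer(norm=None)
X_raw = raw_vectorizer.fit_transform(corpus)

# Column index of the term 'first' in the learned vocabulary.
col = raw_vectorizer.vocabulary_['first']

# Smoothed idf computed by hand: N = 4 documents, df = 2.
expected_idf = math.log((4 + 1) / (2 + 1)) + 1

print(raw_vectorizer.idf_[col], expected_idf)
# 'first' occurs once in document 0, so tf = 1 and tfidf equals the idf.
print(X_raw[0, col])
```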