scikit-learn TfidfVectorizer meaning?

TfidfVectorizer - Transforms text to feature vectors that can be used as input to estimator.

vocabulary_ Is a dictionary that converts each token (word) to feature index in the matrix, each unique token gets a feature index.

What is?(e.g.: u'me': 8 )

It tells you that the token 'me' is represented as feature number 8 in the output matrix.

is this a matrix or just a vector?

Each sentence is a vector, the sentences you've entered are matrix with 3 vectors. In each vector the numbers (weights) represent features tf-idf score. For example: 'julie': 4 --> Tells you that the in each sentence 'Julie' appears you will have non-zero (tf-idf) weight. As you can see in the 2'nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5'th element scored 0.51785612 - the tf-idf score for 'Julie'. For more info about Tf-Idf scoring read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

So tf-idf creates a set of its own vocabulary from the entire set of documents. Which is seen in first line of output. (for better understanding I have sorted it)

Click to copy

{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6,  u'loves': 7, u'me': 8, u'more': 9, u'than': 10, }

And when the document is parsed to get its tf-idf. Document:

He watches basketball and baseball

and its output,

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

is equivalent to,

[baseball basketball he jane julie likes linda loves me more than]

Since our document has only these words: baseball, basketball, he, from the vocabulary created. The document vector output has values of tf-idf for only these three words and in the same sorted vocabulary position.

tf-idf is used to classify documents, ranking in search engine. tf: term frequency(count of the words present in document from its own vocabulary), idf: inverse document frequency(importance of the word to each document).

scikit-learn TfidfVectorizer meaning?

Tags:

Nlp

Machine Learning

Document Classification

Feature Extraction

Scikit Learn

Related

Recent Posts