scikit-learn TfidfVectorizer meaning?
TfidfVectorizer - Transforms text to feature vectors that can be used as input to estimator.
vocabulary_
Is a dictionary that converts each token (word) to feature index in the matrix, each unique token gets a feature index.
What is?(e.g.: u'me': 8 )
It tells you that the token 'me' is represented as feature number 8 in the output matrix.
is this a matrix or just a vector?
Each sentence is a vector, the sentences you've entered are matrix with 3 vectors. In each vector the numbers (weights) represent features tf-idf score. For example: 'julie': 4 --> Tells you that the in each sentence 'Julie' appears you will have non-zero (tf-idf) weight. As you can see in the 2'nd vector:
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]
The 5'th element scored 0.51785612 - the tf-idf score for 'Julie'. For more info about Tf-Idf scoring read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
So tf-idf creates a set of its own vocabulary from the entire set of documents. Which is seen in first line of output. (for better understanding I have sorted it)
{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10, }
And when the document is parsed to get its tf-idf. Document:
He watches basketball and baseball
and its output,
[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]
is equivalent to,
[baseball basketball he jane julie likes linda loves me more than]
Since our document has only these words: baseball, basketball, he, from the vocabulary created. The document vector output has values of tf-idf for only these three words and in the same sorted vocabulary position.
tf-idf is used to classify documents, ranking in search engine. tf: term frequency(count of the words present in document from its own vocabulary), idf: inverse document frequency(importance of the word to each document).