Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?
Here is my suggestion:
- We don't have to fit the model twice. we could reuse the same vectorizer
- text cleaning function can be plugged into
TfidfVectorizer
directly usingpreprocessing
attribute.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)
def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
"""
vectorizer: TfIdfVectorizer model
docs_tfidf: tfidf vectors for all docs
query: query doc
return: cosine similarity between query and all docs
"""
query_tfidf = vectorizer.transform([query])
cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
return cosineSimilarities