ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error
That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:
A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.
Now, TfidfVectorizer requires a single iterable of strings as input (that is, a 1-d array of strings). But since you are passing a list of column names to ColumnTransformer (even though that list contains only a single column), a 2-d array is passed to TfidfVectorizer, hence the error.
Change that to:
clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")
To see the difference, try both selectors on a pandas DataFrame and check the format (dtype, shape) of the returned data when you do:
dataset['a']
vs
dataset[['a']]
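A minimal sketch of that comparison, assuming a small sample frame (the `dataset` values here are illustrative, not taken from the question):

```python
import pandas as pd

# Illustrative sample data (an assumption for this sketch)
dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                        "b": [" gone fhgf wild", "gone with wind"],
                        "c": [1, 2]})

as_series = dataset["a"]    # scalar selector -> 1-d pandas Series
as_frame = dataset[["a"]]   # list selector   -> 2-d pandas DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)

# Iterating a Series yields the strings themselves, while iterating
# a DataFrame yields its column labels.
print(list(as_series))  # ['word gone wild', 'word gone with wind']
print(list(as_frame))   # ['a']
```

The last line hints at why the vectorizer ends up with an empty vocabulary: when it iterates over the 2-d selection it sees only the column label "a", which the default token pattern (two or more word characters) discards.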
Update: @SergeyBushmanov, Regarding your comment on the other answer, I think that you are misinterpreting the documentation. If you want to do tfidf on two columns, then you need to pass two transformers. Something like this:
tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"),
                          ("tfidf_2", tfidf_2, "b")
                          ],
                         remainder="passthrough")
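For reference, a self-contained version of the two-vectorizer setup might look like the sketch below. The sample `dataset` is an assumption for illustration, and the vectorizers use default settings rather than `min_df=0`, since recent scikit-learn releases may reject an integer `min_df` below 1.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sample data (an assumption for this sketch)
dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                        "b": [" gone fhgf wild", "gone with wind"],
                        "c": [1, 2]})

# One vectorizer per text column; each column is selected with a scalar
# string so the vectorizer receives a 1-d sequence of documents.
clmn = ColumnTransformer(
    [("tfidf_1", TfidfVectorizer(), "a"),
     ("tfidf_2", TfidfVectorizer(), "b")],
    remainder="passthrough",
)

features = clmn.fit_transform(dataset)

# Each column gets its own vocabulary, so the feature blocks are independent
vocab_a = clmn.named_transformers_["tfidf_1"].vocabulary_
vocab_b = clmn.named_transformers_["tfidf_2"].vocabulary_
print(features.shape, len(vocab_a), len(vocab_b))
```

With this data each column contributes five tfidf features and column "c" is passed through, so the result has 11 columns.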
We can create a custom tfidf transformer that takes an array of columns and joins them before applying .fit() or .transform().
Try this!
from sklearn.base import BaseEstimator, TransformerMixin
class custom_tfidf(BaseEstimator, TransformerMixin):
    def __init__(self, tfidf):
        self.tfidf = tfidf
    def fit(self, X, y=None):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)
        return self
    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        return self.tfidf.transform(joined_X)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","word gone with wind"],
                        "b":[" gone fhgf wild","gone with wind"],
                        "c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a','b'])],remainder="passthrough")
clmn.fit_transform(dataset)
#
array([[0.36439074, 0.51853403, 0.72878149, 0. , 0. ,
0.25926702, 1. ],
[0. , 0.438501 , 0. , 0.61629785, 0.61629785,
0.2192505 , 2. ]])
P.S.: You might instead want to create a tfidf vectorizer for each column, then build a dictionary with the column name as key and the fitted vectorizer as value. This dictionary can then be used when transforming the corresponding columns.
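The per-column idea from the P.S. could be sketched roughly as follows. The class name `PerColumnTfidf` and its `vectorizers_` attribute are hypothetical names chosen for this illustration, and the sketch assumes the transformer receives a pandas DataFrame of text columns.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer


class PerColumnTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fits one vectorizer per DataFrame column."""

    def __init__(self, tfidf=None):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        base = self.tfidf if self.tfidf is not None else TfidfVectorizer()
        # Dictionary: column name -> vectorizer fitted on that column only
        self.vectorizers_ = {col: clone(base).fit(X[col]) for col in X.columns}
        return self

    def transform(self, X):
        # Transform each column with its own vectorizer, then stack the
        # resulting sparse blocks side by side.
        blocks = [vec.transform(X[col]) for col, vec in self.vectorizers_.items()]
        return hstack(blocks).tocsr()


# Illustrative sample data (an assumption for this sketch)
dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                        "b": [" gone fhgf wild", "gone with wind"],
                        "c": [1, 2]})

clmn = ColumnTransformer([("tfidf", PerColumnTfidf(), ["a", "b"])],
                         remainder="passthrough")
features = clmn.fit_transform(dataset)
print(features.shape)
```

Unlike the joined-text approach, this keeps a separate vocabulary for each column, so the same word in "a" and "b" maps to two different features.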