Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required

Unfortunately, those two are incompatible. A CountVectorizer produces a sparse matrix, and the RandomForestClassifier requires a dense matrix. You can convert with X.todense() (or X.toarray(), which returns a plain NumPy array instead of a numpy.matrix), but doing so will substantially increase your memory footprint, because every zero entry is then stored explicitly.
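
To get a feel for the memory cost, here is a rough sketch that compares the storage used by a sparse matrix with that of its dense equivalent. The matrix shape and density are made-up illustrative values, not figures from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical bag-of-words matrix: 1,000 documents x 5,000 terms, ~0.1% non-zero.
X_sparse = sparse.random(1000, 5000, density=0.001, format="csr",
                         dtype=np.float64, random_state=0)

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

# The dense copy stores every entry, zeros included.
X_dense = X_sparse.toarray()
dense_bytes = X_dense.nbytes

print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

The gap grows with vocabulary size, which is why the dense conversion can become a problem on real text corpora.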

Below is sample code for a transformer that performs the dense conversion as a pipeline stage.

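from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    """Pipeline step that converts a sparse matrix to a dense array."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; the conversion is stateless.
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() gives a plain ndarray; todense() returns numpy.matrix,
        # which recent scikit-learn versions reject.
        return X.toarray()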

Once you have your DenseTransformer, you are able to add it as a pipeline step.

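from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier())
])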
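
Putting the pieces together, a minimal end-to-end sketch might look like the following. The toy documents and labels are invented for illustration, and the transformer here uses toarray() rather than todense(), since it returns a plain NumPy array, which newer scikit-learn releases expect.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert a SciPy sparse matrix to a dense NumPy array."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        return X.toarray()


# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("to_dense", DenseTransformer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)

predictions = pipeline.predict(["good film", "bad film"])
print(predictions)
```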

Another option would be to use a classifier meant for sparse data like LinearSVC.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])

The tersest solution is to use a FunctionTransformer to convert to dense: it automatically provides the fit, transform and fit_transform methods that David's answer writes by hand. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function, which makes the model description more compact:

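from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
pipeline = make_pipeline(
    CountVectorizer(),
    # toarray() yields an ndarray; todense() would yield numpy.matrix
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
    RandomForestClassifier()
)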
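
To see what the FunctionTransformer step does on its own, here is a small sketch that applies it to a hand-built sparse matrix. The helper name to_dense and the example values are hypothetical; a named function is used instead of a lambda only because it keeps the transformer picklable.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: return a dense ndarray copy of a sparse matrix."""
    return X.toarray()


# accept_sparse=True tells the transformer that sparse input is expected.
densify = FunctionTransformer(to_dense, accept_sparse=True)

X_sparse = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]]))
X_dense = densify.fit_transform(X_sparse)

print(type(X_dense), X_dense.shape)
```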

As of scikit-learn 0.16-dev, random forests accept sparse data, so the dense conversion step is no longer required.
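
With a scikit-learn version that includes this change, a sketch like the following should fit the forest directly on the vectorizer's sparse output. The toy corpus and labels are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# No densifying step: the sparse matrix goes straight into the forest.
sparse_pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
sparse_pipeline.fit(docs, labels)

result = sparse_pipeline.predict(["good film"])
print(result)
```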