spark.ml StringIndexer throws 'Unseen label' on fit()

Okay I think I got this. At least I got this working.

Caching the dataframe(including train/test partes) solves the problem. That's what I found in this JIRA issue: https://issues.apache.org/jira/browse/SPARK-12590.

So it's not a bug, just the fact that randomSample might yield a different result on the same, but differently partitioned dataset. And apparently, some of my munging functions (or Pipeline) involve repartition, therefore, results of the trainset recomputation from its definition might diverge.

What still interests me - it's the reproducibility: it's always 'pl-PL' row that gets mixed in the wrong part of the dataset, i.e. it's not random repartition. It's deterministic, just inconsistent. I wonder how exactly it happens.


Unseen label is a generic message which doesn't correspond to a specific column. Most likely problem is with a following stage:

StringIndexer(inputCol='lang', outputCol='lang_idx')

with pl-PL present in train("lang") and not present in test("lang").

You can correct it using setHandleInvalid with skip:

from pyspark.ml.feature import StringIndexer

train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
test = sc.parallelize([(3, "foo"), (4, "foobar")]).toDF(["k", "v"])

indexer = StringIndexer(inputCol="v", outputCol="vi")
indexer.fit(train).transform(test).show()

## Py4JJavaError: An error occurred while calling o112.showString.
## : org.apache.spark.SparkException: Job aborted due to stage failure: 
##   ...
##   org.apache.spark.SparkException: Unseen label: foobar.

indexer.setHandleInvalid("skip").fit(train).transform(test).show()

## +---+---+---+
## |  k|  v| vi|
## +---+---+---+
## |  3|foo|1.0|
## +---+---+---+

or, in the latest versions, keep:

indexer.setHandleInvalid("keep").fit(train).transform(test).show()

## +---+------+---+
## |  k|     v| vi|
## +---+------+---+
## |  3|   foo|0.0|
## |  4|foobar|2.0|
## +---+------+---+