Example: a regex-based tokenizer that extracts tokens, either by splitting the text on the provided regex pattern (the default) or by repeatedly matching the regex when gaps is false. Optional parameters also allow filtering tokens by a minimum length; the output is an array of strings that can be empty.
import tempfile

from pyspark.ml.feature import RegexTokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A B c",)], ["text"])
reTokenizer = RegexTokenizer(inputCol="text", outputCol="words")
reTokenizer.transform(df).head()
# Change a parameter on the tokenizer itself.
reTokenizer.setParams(outputCol="tokens").transform(df).head()
# Temporarily override a parameter for a single transform call.
reTokenizer.transform(df, {reTokenizer.outputCol: "words"}).head()
reTokenizer.transform(df).head()  # the override above was temporary; outputCol is still "tokens"
# Params must be passed as keyword arguments; a positional call raises a TypeError.
reTokenizer.setParams(inputCol="text")
# Save the tokenizer to a temporary directory and reload it.
temp_path = tempfile.mkdtemp()
regexTokenizerPath = temp_path + "/regex-tokenizer"
reTokenizer.save(regexTokenizerPath)
loadedReTokenizer = RegexTokenizer.load(regexTokenizerPath)
# The reloaded tokenizer carries the same param values as the original.
assert loadedReTokenizer.getMinTokenLength() == reTokenizer.getMinTokenLength()
assert loadedReTokenizer.getGaps() == reTokenizer.getGaps()
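
The example above only reads back the default minTokenLength and gaps values after reloading. As a minimal sketch of how those params are typically customized (pattern, gaps, minTokenLength, and toLowercase are RegexTokenizer params; the concrete values here are illustrative assumptions), the tokenizer can be switched from splitting on the pattern to matching tokens directly:

# Match word characters directly (gaps=False) instead of splitting on whitespace,
# keep only tokens of length >= 2, and preserve the original case.
wordTokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                               pattern="\\w+", gaps=False,
                               minTokenLength=2, toLowercase=False)
# With the single-letter tokens in df above, minTokenLength=2 filters everything
# out, so the resulting words column is an empty array.
wordTokenizer.transform(df).head()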