Code example: a tokenizer that converts the input string to lowercase and then splits it by whitespace
Example: using Tokenizer to transform a DataFrame, change and temporarily override its parameters, and save and reload it
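The snippet below assumes an active SparkSession bound to the name spark and a writable directory temp_path; neither is created by the snippet itself. A minimal setup sketch (the app name and the use of a temporary directory are arbitrary choices):

import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tokenizer-example").getOrCreate()
temp_path = tempfile.mkdtemp()  # any writable directory works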
from pyspark.ml.feature import Tokenizer

df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).head()  # Row(text='a b c', words=['a', 'b', 'c'])
# Permanently change a parameter.
tokenizer.setParams(outputCol="tokens").transform(df).head()  # Row(text='a b c', tokens=['a', 'b', 'c'])
# Temporarily override a parameter for a single transform call.
tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()  # Row(text='a b c', words=['a', 'b', 'c'])
tokenizer.transform(df).head()  # the permanent value applies again: Row(text='a b c', tokens=['a', 'b', 'c'])
# setParams forces keyword arguments; tokenizer.setParams("text") raises a TypeError.
# Save the tokenizer and load it back.
tokenizerPath = temp_path + "/tokenizer"
tokenizer.save(tokenizerPath)
loadedTokenizer = Tokenizer.load(tokenizerPath)
loadedTokenizer.transform(df).head().tokens == tokenizer.transform(df).head().tokens  # True
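The input "a b c" is already lowercase, so the lowercasing step named in the title is invisible above. A short sketch with mixed-case input makes it explicit (the df2 name and sample text are illustrative):

df2 = spark.createDataFrame([("Hello World FOO",)], ["text"])
Tokenizer(inputCol="text", outputCol="words").transform(df2).head().words
# ['hello', 'world', 'foo']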