Is there a way to measure string similarity in Google BigQuery?

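Levenshtein distance implemented in a JavaScript UDF would be the way to go. You can use the algorithm to get the absolute edit distance between two strings, or convert it to a similarity between 0 and 1 by calculating (strlen - distance) / strlen, where strlen is the length of the longer string.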
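
As a minimal sketch of the idea above, here is a plain JavaScript Levenshtein distance plus a normalized similarity. The function names levenshtein and similarity are illustrative and not part of BigQuery, and the normalization by the longer string's length is one common choice among several. The comparison is case-sensitive, so lowercase both inputs first if case should not matter.

```javascript
// Dynamic-programming Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] holds the distance between the first i-1 chars of a
  // and the first j chars of b.
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match when cost is 0)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
// Assumption: normalize by the length of the longer string.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return (longest - levenshtein(a, b)) / longest;
}
```

The body of levenshtein should be usable more or less as-is inside a JavaScript UDF, since it relies only on core language features.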

The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.

To invoke it (this is the legacy SQL UDF call style), you'd then pass in the two URLs as columns aliased to 'a' and 'b':

SELECT a, b, distance
FROM
  Levenshtein(
     SELECT
       some_url AS a, other_url AS b
     FROM
       your_table
  )

If you're familiar with Python's fuzzywuzzy, you can use the same functions in BigQuery through fuzzball, its JavaScript port, loaded as an external library from Google Cloud Storage (GCS).

Steps:

1. Download the JavaScript version of fuzzywuzzy (fuzzball).
2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball.js).
3. Upload it to a Google Cloud Storage bucket.
4. Create a temp function that uses the library in your query (set the library path in OPTIONS to the file's location in your bucket).

CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

WITH data AS (SELECT "my_test_string" AS a, "my_other_string" AS b)

SELECT a, b, token_set_ratio(a, b) FROM data

Ready to use shared UDFs - Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240
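
For readers curious how codes like these come about, here is a minimal sketch of the standard American Soundex rules in plain JavaScript. It is not the source of fhoffa.x.soundex, which is not shown here; the function name and the handling of non-letter characters are assumptions of this sketch.

```javascript
// American Soundex: first letter plus up to three digits, padded with zeros.
function soundex(word) {
  const codes = {
    B: '1', F: '1', P: '1', V: '1',
    C: '2', G: '2', J: '2', K: '2', Q: '2', S: '2', X: '2', Z: '2',
    D: '3', T: '3',
    L: '4',
    M: '5', N: '5',
    R: '6'
  };

  // Assumption: ignore anything that is not an ASCII letter.
  const letters = word.toUpperCase().replace(/[^A-Z]/g, '');
  if (letters.length === 0) return '';

  let result = letters[0];
  let prevCode = codes[letters[0]] || '';

  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    const code = codes[ch] || '';
    // Emit a digit only for coded letters that differ from the previous code.
    if (code !== '' && code !== prevCode) result += code;
    // Vowels reset the previous code, so a repeated code after a vowel is
    // emitted again; H and W are skipped without resetting it.
    if (ch !== 'H' && ch !== 'W') prevCode = code;
  }
  return result.padEnd(4, '0');
}

console.log(soundex('felipe'), soundex('googgle'), soundex('guugle'));
```

Because 'googgle' and 'guugle' differ only in their vowels, which Soundex drops, both collapse to the same code.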

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny
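
One way to think about "choose the best match from a list" is to score every candidate and keep the highest scorer. The sketch below does that with a normalized Levenshtein similarity. This is an assumption made for illustration: the internals of fhoffa.x.fuzzy_extract_one are not shown here and may use a different scoring method, and the names fuzzyExtractOne and levenshtein are hypothetical.

```javascript
// Levenshtein distance with two rolling rows (see any standard reference).
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the candidate with the highest similarity to the query,
// or null when the candidate list is empty. Ties keep the earliest candidate.
function fuzzyExtractOne(query, choices) {
  let best = null;
  let bestScore = -1;
  for (const choice of choices) {
    const longest = Math.max(query.length, choice.length);
    // Assumption: similarity = 1 - distance / length of the longer string.
    const score = longest === 0 ? 1 : 1 - levenshtein(query, choice) / longest;
    if (score > bestScore) {
      bestScore = score;
      best = choice;
    }
  }
  return best;
}

console.log(fuzzyExtractOne('jony', ['john', 'johnny', 'jonathan', 'jonas']));
```

With the small commented-out list from the query above, 'john', 'johnny' and 'jonas' are all two edits away from 'jony', and normalizing by length is what breaks the tie in favour of the longer 'johnny'.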

How-to:

https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
