Wordcount in a field (all and unique) - is there a more elegant/optimal way?
The first step, obviously, is to create the table and data (as per the question):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
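For comparison, a minimal sketch of why dollar quoting helps: with standard single quotes, every embedded quote must be doubled, while a dollar-quoted string goes in as-is:
SELECT 'It''s a ''quoted'' string';  -- standard quoting: embedded quotes doubled
SELECT $$It's a 'quoted' string$$;   -- dollar quoting: no escaping needed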
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word) AS word_count,
COUNT(DISTINCT word) AS distinct_count,
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id  word_count  distinct_count  description
1   4           4               What a great day
2   7           6               This is a product. It is useful
3   2           2               ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are treated as different words in this case. If this is an issue, the simple addition of an UPPER() function would solve it - it's not core to the question.
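For instance, the distinct count in the final query above could be case-folded like this (a minimal sketch, with cte1 as defined there):
SELECT id,
COUNT(word) AS word_count,
COUNT(DISTINCT UPPER(word)) AS distinct_count,  -- fold case: "It" and "it" now match
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;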
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-word, non-space characters):
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g')
FROM wordcount;
-- Remove everything that is neither a word character nor whitespace. Otherwise the
-- words "product" and "product." would be counted as different! Again, keeping the
-- id field makes things clearer, even if not strictly necessary for purists.
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 {What,a,great,day}
2 {This,is,a,product,It,is,useful}
3 {a}
3 {"","",""}
3 {"","",""}
3 {"",b}
Finally, the answer itself: UNNEST the array, then select those words with LENGTH > 0, grouping by id and description.
I.e. SELECT the necessary fields from the following cte (Common Table Expression). The cte isn't strictly necessary - I could have used the UNNEST(...) expression throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.
Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...
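For reference, those extra cases could be added like this (a sketch; the NOT NULL constraint from the original DDL has to be dropped to allow the NULL row):
ALTER TABLE wordcount ALTER COLUMN description DROP NOT NULL;
INSERT INTO wordcount (description) VALUES (''), (NULL), ('"§$%'), ('-');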
Regular expression based string processing
Similar to your solution, with some alternative suggestions:
SELECT id
, COALESCE(cardinality(arr), 0) AS word_count
, unique_word_count
, description
FROM (
SELECT *
, string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
FROM wordcount
) a
LEFT JOIN LATERAL (
SELECT count(DISTINCT elem) AS unique_word_count
FROM unnest(arr) elem
) b ON true;
db<>fiddle here (extended test case)
The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes in the manual. This removes all noise early in the game. Followed up by a cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.
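For illustration, here is the prepared expression applied to one of the sample rows (a quick sketch):
SELECT string_to_array(trim(regexp_replace($$This is a product. It is useful$$, '\W+', ' ', 'g')), ' ');
-- {This,is,a,product,It,is,useful}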
Get word_count from the array directly. Again: cheap.
The unique_word_count comes from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.
The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
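A quick illustration of why the COALESCE matters (assuming a recent PostgreSQL version, where string_to_array('', ' ') yields an empty array):
SELECT cardinality(string_to_array('', ' '));          -- 0: empty string, empty array
SELECT cardinality(string_to_array(NULL::text, ' '));  -- NULL: propagates without COALESCE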
Or (faster in a quick test with short strings):
SELECT id
, count(*) AS word_count
, count(DISTINCT elem) AS unique_word_count
, description
FROM (
SELECT id, description
, unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
FROM wordcount
) sub
GROUP BY id, description;
This drops rows with 0 words like your answer does.
(Ab-)using text search parser
Using the text search function ts_parse() is simpler. It may or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":
SELECT * FROM ts_token_type('default');
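For illustration, ts_parse() returns (tokid, token) pairs; in the default parser, tokid 1 is 'asciiword'. A quick sketch against one of the sample strings:
SELECT tokid, token
FROM ts_parse('default', 'This is a product. It is useful');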
For only "ASCII Words":
(Unlike above, the underscore (_
) is not treated as word character here):
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
To keep _ from separating words, use a simple replace() first:
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
Again, to keep all rows:
SELECT w.id
, count(token) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w
LEFT JOIN LATERAL (
SELECT t.token
FROM ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
) t ON true
GROUP BY w.id;
db<>fiddle here
Related:
- Efficient merging (removing duplicates) of arrays
- How do I select arrays that are not empty?
- What is the difference between LATERAL and a subquery in PostgreSQL?