Wordcount in a field (all and unique) - is there a more elegant/optimal way?
The first step, obviously, is to create the table and data (as per the question):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
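For comparison, a minimal sketch of why dollar quoting helps: with standard single quotes, every embedded quote must be doubled, while a dollar-quoted string goes in as-is:
SELECT 'It''s a ''quoted'' string';  -- standard quoting: embedded quotes doubled
SELECT $$It's a 'quoted' string$$;   -- dollar quoting: no escaping needed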
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word) AS word_count,
COUNT(DISTINCT word) AS distinct_count,
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id  word_count  distinct_count  description
1   4           4               What a great day
2   7           6               This is a product. It is useful
3   2           2               ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are treated as different words in this case. If this is an issue, the simple addition of an UPPER() function would solve it - it's not core to the question.
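For instance, the distinct count in the final query above could be case-folded like this (a minimal sketch, with cte1 as defined there):
SELECT id,
COUNT(word) AS word_count,
COUNT(DISTINCT UPPER(word)) AS distinct_count,  -- fold case: "It" and "it" now match
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;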
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-word, non-space characters):
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g')
FROM wordcount;
-- Remove everything that is neither a word character nor whitespace. Otherwise the
-- words "product" and "product." would be counted as different! Again, keeping the
-- id field makes things clearer, even if not strictly necessary for purists.
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 {What,a,great,day}
2 {This,is,a,product,It,is,useful}
3 {a}
3 {"","",""}
3 {"","",""}
3 {"",b}
Finally, the answer itself: UNNEST the array, then select those words with LENGTH > 0, grouping by id and description.
I.e. SELECT the necessary fields from the following cte (Common Table Expression). The cte isn't strictly necessary - I could have used the UNNEST(...) expression throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.
Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...
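For reference, those extra cases could be added like this (a sketch; the NOT NULL constraint from the original DDL has to be dropped to allow the NULL row):
ALTER TABLE wordcount ALTER COLUMN description DROP NOT NULL;
INSERT INTO wordcount (description) VALUES (''), (NULL), ('"§$%'), ('-');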
Regular expression based string processing
Similar to your solution, with some alternative suggestions:
SELECT id
, COALESCE(cardinality(arr), 0) AS word_count
, unique_word_count
, description
FROM (
SELECT *
, string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
FROM wordcount
) a
LEFT JOIN LATERAL (
SELECT count(DISTINCT elem) AS unique_word_count
FROM unnest(arr) elem
) b ON true;
db<>fiddle here (extended test case)
The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes in the manual. This removes all noise early in the game. Followed up by a cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.
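For illustration, here is the prepared expression applied to one of the sample rows (a quick sketch):
SELECT string_to_array(trim(regexp_replace($$This is a product. It is useful$$, '\W+', ' ', 'g')), ' ');
-- {This,is,a,product,It,is,useful}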
Get word_count from the array directly. Again: cheap.
The unique_word_count comes from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.
The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
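A quick illustration of why the COALESCE matters (assuming a recent PostgreSQL version, where string_to_array('', ' ') yields an empty array):
SELECT cardinality(string_to_array('', ' '));          -- 0: empty string, empty array
SELECT cardinality(string_to_array(NULL::text, ' '));  -- NULL: propagates without COALESCE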
Or (faster in a quick test with short strings):
SELECT id
, count(*) AS word_count
, count(DISTINCT elem) AS unique_word_count
, description
FROM (
SELECT id, description
, unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
FROM wordcount
) sub
GROUP BY id, description;
This drops rows with 0 words like your answer does.
(Ab-)using text search parser
Using the text search function ts_parse() is simpler. It may or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":
SELECT * FROM ts_token_type('default');
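For illustration, ts_parse() returns (tokid, token) pairs; in the default parser, tokid 1 is 'asciiword'. A quick sketch against one of the sample strings:
SELECT tokid, token
FROM ts_parse('default', 'This is a product. It is useful');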
For only "ASCII Words":
(Unlike above, the underscore (_
) is not treated as word character here):
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
To keep _ from separating words, use a simple replace() first:
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
Again, to keep all rows:
SELECT w.id
, count(token) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w
LEFT JOIN LATERAL (
SELECT t.token
FROM ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
) t ON true
GROUP BY w.id;
db<>fiddle here
Related:
- Efficient merging (removing duplicates) of arrays
- How do I select arrays that are not empty?
- What is the difference between LATERAL and a subquery in PostgreSQL?