How do I use an Ispell dictionary with Postgres text search?
This example uses the Canadian English dictionary, but you can try it with others as well.
These are the steps needed for Windows:
- Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.dic
- Select all of the text, copy it, and paste it to Text Mechanic: http://textmechanic.co/Sort-Text-Lines.html. Add a line break at the end.
- Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.dic_delta
- Select all of the text, copy it, and paste it below the previously pasted text in Text Mechanic.
- Scroll to the top, and select and cut the first line (should be a five digit number), and get rid of the line break.
- Click the Alphabetical button, and wait for the text to sort.
- Select all of the text and copy it to the clipboard
- Open Windows Notepad as an administrator
- Paste the text from Step 7 into Notepad
- Save the file as en_ca.dict (with UTF-8 encoding) to your Postgres text search folder. Mine is C:\Program Files\PostgreSQL\9.3\share\tsearch_data .
- Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.aff , select all, copy, and paste to Notepad. Save the file as en_ca.affix to your Postgres text search folder.
In PgAdmin, run the following SQL:
create text search dictionary ispell_en_ca (
template = ispell,
dictfile = en_ca,
afffile = en_ca,
stopwords = english
);
--make sure it works:
select * from ts_lexize('ispell_en_ca', 'colours');
/*
result:
ts_lexize
text[]
{coloured,colour}
*/
You will need to create a new text search configuration to use the dictionary.
I've written the following script to install an en_us dictionary on Ubuntu 14.04 running PostgreSQL 9.4. It should be fairly easy to modify for most situations.
#!/bin/bash
cd /usr/share/postgresql/9.4/tsearch_data
wget http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_US.dic
wget http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_US.dic_delta
wget http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_US.aff -O en_us.affix
# Remove first line
sed -i 1d en_US.dic
# Concat the dic and dic_delta, sort alphabetically and remove the leading blank line (leaves the ending newline intact)
cat en_US.dic en_US.dic_delta | sort > en_us.dict
sed -i 1d en_us.dict
# Set permissions
chown -R postgres:postgres *
sudo -u postgres psql -c "CREATE TEXT SEARCH DICTIONARY ispell_en_us (template = ispell, dictfile = en_us, afffile = en_us, stopwords = english);"
# Clean up source files
rm en_US*
Download and install the necessary files:
english.sh
#!/bin/sh
# https://www.cs.hmc.edu/~geoff/ispell.html
wget http://www.cs.hmc.edu/~geoff/tars/ispell-3.4.00.tar.gz
tar xvzf ispell-3.4.00.tar.gz
cat ispell-3.4.00/languages/english/english.{0,1,2,3} | sort > english.dic
cp ispell-3.4.00/languages/english/english.aff ./
#iconv -f ISO_8859-1 -t UTF-8 -o english.affix english.aff
#iconv -f ISO_8859-1 -t UTF-8 -o english.dict english.dic
iconv -f ISO_8859-1 -t UTF-8 english.aff > english.affix
iconv -f ISO_8859-1 -t UTF-8 english.dic > english.dict
sudo cp english.{affix,dict} `pg_config --sharedir`/tsearch_data
Then create the dictionary and configuation:
english.sql
-- first change the database name in the alter database command below
-- create dictionary using files installed by english.sh script
create text search dictionary english_ispell (
template = ispell,
dictfile = english, -- english.dict
afffile = english, -- english.affix
stopwords = english -- english.stop (should exist in default installation)
);
-- create new configuation using the default as a template
create text search configuration public.english ( copy = pg_catalog.english );
-- show current configuration
\dF+ public.english
-- alter appropriate mappings to use new dictionary
alter text search configuration public.english
alter mapping for asciiword, asciihword, hword_asciipart, word, hword, hword_part
with english_ispell, english_stem;
-- show changed configuration
\dF+ public.english
-- test that the new configuation works
select * from ts_debug(
'public.english',
'PostgreSQL, the highly scalable, SQL compliant, open source object-relational database management system, is now undergoing beta testing of the next version of our software.'
);
-- make database use new config
alter database <name> set default_text_search_config to 'public.english';
-- make session use new config
set default_text_search_config to 'public.english';
-- test that the new config is used by default
select * from ts_debug(
'PostgreSQL, the highly scalable, SQL compliant, open source object-relational database management system, is now undergoing beta testing of the next version of our software.'
);