How to add polish dictionary to postgres
Public
14 Aug 09:07

PostgreSQL dictionary consists of 3 files:

  • lang.dict
  • lang.affix
  • lang.stop

Generating dictionary files

  1. Download the most recent polish Ispell dictionary
    ISPELL_FILE=`curl https://sjp.pl/slownik/ort/ | grep -o ">sjp-ispell-pl-.*-src.tar.bz2" | cut -c2-`
    wget https://sjp.pl/slownik/ort/`echo $ISPELL_FILE`
    
  2. Unpack the file
    tar xvjf sjp-ispell-pl-*-src.tar.bz2
    
  3. Change directory
    cd sjp-ispell-pl-[0-9]*
    
  4. Convert encodings to UTF-8 and give proper extensions to files
    iconv -f ISO_8859-2 -t utf-8 polish.aff > polish.affix
    iconv -f ISO_8859-2 -t utf-8 polish.all > polish.dict
    
  5. Download polish stopwords
    curl https://raw.githubusercontent.com/dominem/stopwords/master/polish.stopwords.txt > polish.stop
    
  6. Copy files to valid directory
    cp polish.affix `pg_config --sharedir`/tsearch_data/
    cp polish.dict `pg_config --sharedir`/tsearch_data/
    cp polish.stop `pg_config --sharedir`/tsearch_data/
    

Complete script

ISPELL_FILE=`curl https://sjp.pl/slownik/ort/ | grep -o ">sjp-ispell-pl-.*-src.tar.bz2" | cut -c2-`
wget https://sjp.pl/slownik/ort/`echo $ISPELL_FILE`
tar xvjf sjp-ispell-pl-*-src.tar.bz2
cd sjp-ispell-pl-[0-9]*
iconv -f ISO_8859-2 -t utf-8 polish.aff > polish.affix
iconv -f ISO_8859-2 -t utf-8 polish.all > polish.dict
curl https://raw.githubusercontent.com/dominem/stopwords/master/polish.stopwords.txt > polish.stop
cp polish.affix `pg_config --sharedir`/tsearch_data/
cp polish.dict `pg_config --sharedir`/tsearch_data/
cp polish.stop `pg_config --sharedir`/tsearch_data/

Creating new dictionary and text search configuration

The following SQL does the trick:

BEGIN;
      CREATE TEXT SEARCH CONFIGURATION public.polish ( COPY = pg_catalog.english );

      CREATE TEXT SEARCH DICTIONARY polish_ispell (
        TEMPLATE = ispell,
        DictFile = polish,
        AffFile = polish,
        StopWords = polish
      );

      ALTER TEXT SEARCH CONFIGURATION polish
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
        WITH polish_ispell;
COMMIT;

You’ll probably want to execute it in database migration. For down direction the following SQL should be queried:

BEGIN;
      DROP TEXT SEARCH CONFIGURATION IF EXISTS public.polish;
      DROP TEXT SEARCH CONFIGURATION IF EXISTS polish;
      DROP TEXT SEARCH DICTIONARY         IF EXISTS polish_ispell;
COMMIT;

Testing

If everything went well, the following query:

SELECT to_tsvector('polish', 'Czuję się mniej więcej tak, jak ktoś, kto bujał w obłokach i nagle spadł.');

Will output:

                                     to_tsvector                                     
------------------------------------------------------------------------------------
'bujać':9 'czuć':1 'jaka':6 'mniej':3 'nagły':13 'obłok':11 'obłoki':11 'spaść':14

Comments

Joe
wolf
Updated for 2022: ``` ISPELL_FILE=$(curl -L https://sjp.pl/slownik/ort/ | grep -o ">sjp-ispell-pl-.*-src.tar.bz2" | cut -c2-) curl -LO "https://sjp.pl/slownik/ort/${ISPELL_FILE}" tar xvjf sjp-ispell-pl-*-src.tar.bz2 cd sjp-ispell-pl-[0-9]*/ iconv -f ISO_8859-2 -t utf-8 polish.aff > polish.affix iconv -f ISO_8859-2 -t utf-8 polish.all > polish.dict curl https://raw.githubusercontent.com/dominem/stopwords/master/polish.stopwords.txt > polish.stop cp polish.affix `pg_config --sharedir`/tsearch_data/ cp polish.dict `pg_config --sharedir`/tsearch_data/ cp polish.stop `pg_config --sharedir`/tsearch_data/ ``` Thank you @ownadi!