xindex - sorting local characters (ÆØÅæøå)

I've just created a new package, Lua-UCA, that adds support for the Unicode Collation Algorithm to LuaTeX. It already supports a few languages, such as Czech, German, and Norwegian. We can use it instead of Xindex's built-in sorting mechanism.

Try the following version of xindex-norsk.lua:

-----------------------------------------------------------------------
--         FILE:  xindex-norsk.lua
--  DESCRIPTION:  configuration file for xindex.lua
-- REQUIREMENTS:  
--       AUTHOR:  Herbert Voß
--     MODIFIED:  Sveinung Heggen (2020-01-02)
--      LICENSE:  LPPL1.3
-----------------------------------------------------------------------

if not modules then modules = { } end modules ['xindex-cfg'] = {
      version = 0.20,
      comment = "configuration to xindex.lua",
       author = "Herbert Voss",
    copyright = "Herbert Voss",
      license = "LPPL 1.3"
}

local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"
local collator_obj = collator.new(ducet)

local language = "en" -- default language
-- language specified on the command line doesn't seem to be available
-- in the config file, so we just try to find it ourselves
for i, a in ipairs(arg) do
  if a == "-l" or a=="--language" then
    language = arg[i+1]
    break
  end
end

if languages[language] then
  print("[Lua-UCA] Loading language: " .. language)
  collator_obj = languages[language](collator_obj)
end

local upper = unicode.utf8.upper


escape_chars = { -- by default " is the escape char
  {'""', "\\escapedquote",      '\"{}' },
  {'"@', "\\escapedat",         "@"    },
  {'"|', "\\escapedvert",       "|"    },
  {'"!', "\\escapedexcl",       "!"    },
  {'"(', "\\escapedparenleft",  "("   },
  {'")', "\\escapedparenright", ")"  }
}

itemPageDelimiter = ","     -- Hello, 14
compressPages     = true    -- something like 12--15, instead of 12,13,14,15. the |( ... |) syntax is still valid
fCompress         = true    -- 3f -> page 3, 4 and 3ff -> page 3, 4, 5
minCompress       = 3       -- 14--17 or 
numericPage       = true    -- for non-numerical page numbers, like "VI-17"
sublabels         = {"", "---\\,", "--\\,", "-\\,"} -- for the (sub(sub(sub-items  first one is for item
pageNoPrefixDel   = ""     -- a delimiter for page numbers like "VI-17"
indexOpening      = ""     -- commands after \begin{theindex}
rangeSymbol       = "--"
idxnewletter      = "\\textbf"  -- Only valid if -n is not set

folium = { 
  de = {"f.", "ff."},
  en = {"f.", "ff."},
  fr = {"\\,sq","\\,sqq"},
  no = {"\\,f.","\\,ff."},
} 

function UTFCompare(a,b)
  local A = a["SortKey"]
  local B = b["SortKey"]
  return collator_obj:compare_strings(A,B)
end

function SORTendhook(list)
  -- get the headers for letter groups
  for k,v in ipairs(list) do 
    -- collator:get_lowest_char returns the character at the given
    -- position, lowercased and with accents removed.
    local codepoints = collator_obj:string_to_codepoints(v.Entry)
    local codes = collator_obj:get_lowest_char(codepoints, 1)
    local sort_char = utf8.char(table.unpack(codes))
    v.sortChar = upper(sort_char) -- use unicode.utf8.upper to make the char uppercase
  end
  return list
end

--[[
    Each character's position in this array-like table determines its 'priority'.
    Several characters in the same slot have the same 'priority'.
]]

alphabet_lower = { --   for sorting
    { ' ' },  -- only for internal tests
    { 'a', 'á', 'à', },
    { 'b' },
    { 'c', 'ç' },
    { 'd' },
    { 'e', 'é', 'è', 'ë', 'ê' },
    { 'f' },
    { 'g' },
    { 'h' },
    { 'i', 'í', 'ì', 'î', 'ï' },
    { 'j' },
    { 'k' },
    { 'l' },
    { 'm' },
    { 'n', 'ñ' },
    { 'o', 'ó', 'ò', 'ô' },
    { 'p' },
    { 'q' },
    { 'r' },
    { 's', 'š', 'ß' },
    { 't' },
    { 'u', 'ú', 'ù', 'û' },
    { 'v' },
    { 'w' },
    { 'x' },
    { 'y', 'ý', 'ÿ', 'ü' },
    { 'z', 'ž' },
    { 'æ', 'œ', 'ä' },
    { 'ø', 'ö' },
    { 'å' }
}
alphabet_upper = { -- for sorting
    { ' ' },
    { 'A', 'Á', 'À', 'Â'},
    { 'B' },
    { 'C', 'Ç' },
    { 'D' },
    { 'E', 'È', 'É', 'Ë', 'Ê' },
    { 'F' },
    { 'G' },
    { 'H' },
    { 'I', 'Í', 'Ì', 'Ï', 'Î' },
    { 'J' },
    { 'K' },
    { 'L' },
    { 'M' },
    { 'N', 'Ñ' },
    { 'O', 'Ó', 'Ò', 'Ô' },
    { 'P' },
    { 'Q' },
    { 'R' },
    { 'S', 'Š' },
    { 'T' },
    { 'U', 'Ú', 'Ù', 'Û' },
    { 'V' },
    { 'W' },
    { 'X' },
    { 'Y', 'Ý', 'Ÿ', 'Ü' },
    { 'Z', 'Ž' },
    { 'Æ', 'Œ', 'Ä' },
    { 'Ø', 'Ö' },
    { 'Å' }
}

The relevant code is this:

local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"
local collator_obj = collator.new(ducet)
local language = "en" -- default language
-- language specified on the command line doesn't seem to be available
-- in the config file, so we just try to find it ourselves
for i, a in ipairs(arg) do
  if a == "-l" or a=="--language" then
    language = arg[i+1]
    break
  end
end

if languages[language] then
  print("[Lua-UCA] Loading language: " .. language)
  collator_obj = languages[language](collator_obj)
end

local upper = unicode.utf8.upper

function UTFCompare(a,b)
  local A = a["SortKey"]
  local B = b["SortKey"]
  return collator_obj:compare_strings(A,B)
end

function SORTendhook(list)
  -- get the headers for letter groups
  for k,v in ipairs(list) do 
    -- collator:get_lowest_char returns the character at the given
    -- position, lowercased and with accents removed.
    local codepoints = collator_obj:string_to_codepoints(v.Entry)
    local codes = collator_obj:get_lowest_char(codepoints, 1)
    local sort_char = utf8.char(table.unpack(codes))
    v.sortChar = upper(sort_char) -- use unicode.utf8.upper to make the char uppercase
  end
  return list
end

It loads the needed libraries, creates the collator object, and applies the rules for the language given on the command line (Norwegian in this case). Xindex uses the UTFCompare function for sorting, so we redefine it to use our collator. I've found that the sorting works, but there is one problem: the first letters are not handled correctly, so Xindex produces separate headings for uppercase and lowercase letters. The SORTendhook function takes care of this.
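
To illustrate what the heading fix amounts to, here is a minimal sketch in plain Lua 5.3 that does not use Lua-UCA. The names `upper_map` and `heading_for` are hypothetical and only stand in for `unicode.utf8.upper` and for the assignment to `v.sortChar`. Unlike `get_lowest_char`, this sketch does not strip accents. It only shows that entries starting with "æ" and "Æ" end up under the same heading.

```lua
-- Hypothetical stand-in for unicode.utf8.upper; covers only the letters used here.
local upper_map = { ["æ"] = "Æ", ["ø"] = "Ø", ["å"] = "Å" }

-- Return the heading letter for an index entry: its first UTF-8 character,
-- uppercased. Assumes the entry is valid UTF-8.
local function heading_for(entry)
  local ch = entry:match("^" .. utf8.charpattern) or ""
  -- string.upper handles ASCII letters; upper_map handles the Norwegian ones.
  return upper_map[ch] or ch:upper()
end

for _, entry in ipairs({ "ære", "Ære", "øl", "apple" }) do
  print(entry, heading_for(entry))
end
```

In the real configuration the collator does this work, which is why accented and lowercase variants are grouped correctly for every supported language and not just for a hand-written map.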

This is the result:

[Image: the resulting index]

With the current xindex (version 0.23) and

xindex -u -l no -c norsk <file>

you'll get

[Image: the resulting index]
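
The `-l no` option in the command above is what the configuration file looks for when it scans `arg`. As a sketch, that scan can be written as a small function. The name `find_language` is hypothetical and not part of xindex. Unlike the loop in the config, it falls back to the default when `-l` is the last argument and has no value after it.

```lua
-- Return the value following -l or --language, or the default if it is absent.
local function find_language(args, default)
  for i, a in ipairs(args) do
    if a == "-l" or a == "--language" then
      return args[i + 1] or default
    end
  end
  return default
end

print(find_language({ "-u", "-l", "no", "-c", "norsk", "file.idx" }, "en"))
print(find_language({ "-u", "file.idx" }, "en"))
```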

Inserted by Sveinung 4.6.2020

Sorting order table for Nordic characters according to Norwegian rules (including Sami):

A   Á   B   C   Č   D   Ð   E   F   G   H   I   J   K   L   M   N   Ŋ   O   P   Q   R   S   Š   T   Ŧ   U   V   W   X   Y   Z   Ž   Æ   Ä   Ø   Ö   Å   Aa  
1   3   5   7   9   11  13  15  17  19  21  23  25  27  29  31  33  35  37  39  41  43  45  47  49  51  53  55  57  59  61  63  65  67  69  71  73  75  75  
a   á   b   c   č   d   đ   e   f   g   h   i   j   k   l   m   n   ŋ   o   p   q   r   s   š   t   ŧ   u   v   w   x   y   z   ž   æ   ä   ø   ö   å   aa  
2   4   6   8   10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  76  

A   1
a   2
Á   3
á   4
B   5
b   6
C   7
c   8
Č   9
č   10
D   11
d   12
Ð   13
đ   14
E   15
e   16
F   17
f   18
G   19
g   20
H   21
h   22
I   23
i   24
J   25
j   26
K   27
k   28
L   29
l   30
M   31
m   32
N   33
n   34
Ŋ   35
ŋ   36
O   37
o   38
P   39
p   40
Q   41
q   42
R   43
r   44
S   45
s   46
Š   47
š   48
T   49
t   50
Ŧ   51
ŧ   52
U   53
u   54
V   55
v   56
W   57
w   58
X   59
x   60
Y   61
y   62
Z   63
z   64
Ž   65
ž   66
Æ   67
æ   68
Ä   69
ä   70
Ø   71
ø   72
Ö   73
ö   74
Å   75
Aa  75
å   76
aa  76
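
The table can be turned into a comparison function directly. The following is only a sketch in plain Lua 5.3, not how Lua-UCA or xindex work internally, and the names `rank`, `keys`, and `norwegian_less` are hypothetical. It compares letters first without regard to case and uses case only as a tie-breaker (uppercase first, as in the table). It also replaces every "aa" with "å", which is an assumption that will be wrong for words where "aa" is not a spelling of "å".

```lua
-- Letters in the order of the table above: A=1, a=2, ..., Å=75, å=76.
local letters = "AaÁáBbCcČčDdÐđEeFfGgHhIiJjKkLlMmNnŊŋOoPpQqRrSsŠšTtŦŧUuVvWwXxYyZzŽžÆæÄäØøÖöÅå"

local rank = {}
local n = 0
for _, cp in utf8.codes(letters) do
  n = n + 1
  rank[cp] = n
end

-- Convert a string to a list of ranks; "Aa"/"aa" count as "Å"/"å".
local function keys(s)
  s = s:gsub("Aa", "Å"):gsub("aa", "å")
  local out = {}
  for _, cp in utf8.codes(s) do
    -- Characters outside the table sort after the alphabet, by code point.
    out[#out + 1] = rank[cp] or (1000 + 2 * cp)
  end
  return out
end

local function norwegian_less(a, b)
  local ka, kb = keys(a), keys(b)
  -- First pass: ignore case; an upper/lowercase pair shares (rank + 1) // 2.
  for i = 1, math.min(#ka, #kb) do
    local pa, pb = (ka[i] + 1) // 2, (kb[i] + 1) // 2
    if pa ~= pb then return pa < pb end
  end
  if #ka ~= #kb then return #ka < #kb end
  -- Second pass: same letters, so the uppercase variant (lower rank) goes first.
  for i = 1, #ka do
    if ka[i] ~= kb[i] then return ka[i] < kb[i] end
  end
  return false
end

local words = { "ål", "zebra", "øl", "ære", "Aalborg", "apple" }
table.sort(words, norwegian_less)
print(table.concat(words, ", "))
```

With this comparator, "Aalborg" is ordered as if it were written "Ålborg" and therefore lands at the end of the alphabet, next to the other entries starting with "å".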
