xindex - sorting local characters (ÆØÅæøå)
I've just created a new package that adds support for the Unicode collation algorithm for LuaTeX - Lua-UCA. I've already added support for few languages, like Czech, German or Norwegian. We can use it instead of Xindex
built in sorting mechanism.
Try the following version of xindex-norsk.lua
:
-----------------------------------------------------------------------
-- FILE: xindex-norsk.lua
-- DESCRIPTION: configuration file for xindex.lua
-- REQUIREMENTS:
-- AUTHOR: Herbert Voß
-- MODIFIED: Sveinung Heggen (2020-01-02)
-- LICENSE: LPPL1.3
-----------------------------------------------------------------------
if not modules then modules = { } end modules ['xindex-cfg'] = {
version = 0.20,
comment = "configuration to xindex.lua",
author = "Herbert Voss",
copyright = "Herbert Voss",
license = "LPPL 1.3"
}
local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"
local collator_obj = collator.new(ducet)
local language = "en" -- default language
-- language specified on the command line doesn't seem to be available
-- in the config file, so we just try to find it ourselves
for i, a in ipairs(arg) do
if a == "-l" or a=="--language" then
language = arg[i+1]
break
end
end
if languages[language] then
print("[Lua-UCA] Loading language: " .. language)
collator_obj = languages[language](collator_obj)
end
local upper = unicode.utf8.upper
escape_chars = { -- by default " is the escape char
{'""', "\\escapedquote", '\"{}' },
{'"@', "\\escapedat", "@" },
{'"|', "\\escapedvert", "|" },
{'"!', "\\escapedexcl", "!" },
{'"(', "\\escapedparenleft", "(" },
{'")', "\\escapedparenright", ")" }
}
itemPageDelimiter = "," -- Hello, 14
compressPages = true -- something like 12--15, instead of 12,13,14,15. the |( ... |) syntax is still valid
fCompress = true -- 3f -> page 3, 4 and 3ff -> page 3, 4, 5
minCompress = 3 -- 14--17 or
numericPage = true -- for non-numerical page numbers, like "VI-17"
sublabels = {"", "---\\,", "--\\,", "-\\,"} -- for the (sub(sub(sub-items first one is for item
pageNoPrefixDel = "" -- a delimiter for page numbers like "VI-17"
indexOpening = "" -- commands after \begin{theindex}
rangeSymbol = "--"
idxnewletter = "\\textbf" -- Only valid if -n is not set
folium = {
de = {"f.", "ff."},
en = {"f.", "ff."},
fr = {"\\,sq","\\,sqq"},
no = {"\\,f.","\\,ff."},
}
function UTFCompare(a,b)
local A = a["SortKey"]
local B = b["SortKey"]
return collator_obj:compare_strings(A,B)
end
function SORTendhook(list)
-- get the headers for letter groups
for k,v in ipairs(list) do
-- the collator:get_lowest_char will return character on the given
-- position. It will be lowercase and without accents.
local codepoints = collator_obj:string_to_codepoints(v.Entry)
local codes = collator_obj:get_lowest_char(codepoints, 1)
local sort_char = utf8.char(table.unpack(codes))
v.sortChar = upper(sort_char) -- use unicode.utf8.upper to make the char uppercase
end
return list
end
--[[
Each character's position in this array-like table determines its 'priority'.
Several characters in the same slot have the same 'priority'.
]]
alphabet_lower = { -- for sorting
{ ' ' }, -- only for internal tests
{ 'a', 'á', 'à', },
{ 'b' },
{ 'c', 'ç' },
{ 'd' },
{ 'e', 'é', 'è', 'ë', 'ê' },
{ 'f' },
{ 'g' },
{ 'h' },
{ 'i', 'í', 'ì', 'î', 'ï' },
{ 'j' },
{ 'k' },
{ 'l' },
{ 'm' },
{ 'n', 'ñ' },
{ 'o', 'ó', 'ò', 'ô' },
{ 'p' },
{ 'q' },
{ 'r' },
{ 's', 'š', 'ß' },
{ 't' },
{ 'u', 'ú', 'ù', 'û' },
{ 'v' },
{ 'w' },
{ 'x' },
{ 'y', 'ý', 'ÿ', 'ü' },
{ 'z', 'ž' },
{ 'æ', 'œ', 'ä' },
{ 'ø', 'ö' },
{ 'å' }
}
alphabet_upper = { -- for sorting
{ ' ' },
{ 'A', 'Á', 'À', 'Â'},
{ 'B' },
{ 'C', 'Ç' },
{ 'D' },
{ 'E', 'È', 'É', 'Ë', 'Ê' },
{ 'F' },
{ 'G' },
{ 'H' },
{ 'I', 'Í', 'Ì', 'Ï', 'Î' },
{ 'J' },
{ 'K' },
{ 'L' },
{ 'M' },
{ 'N', 'Ñ' },
{ 'O', 'Ó', 'Ò', 'Ô' },
{ 'P' },
{ 'Q' },
{ 'R' },
{ 'S', 'Š' },
{ 'T' },
{ 'U', 'Ú', 'Ù', 'Û' },
{ 'V' },
{ 'W' },
{ 'X' },
{ 'Y', 'Ý', 'Ÿ', 'Ü' },
{ 'Z', 'Ž' },
{ 'Æ', 'Œ', 'Ä' },
{ 'Ø', 'Ö' },
{ 'Å' }
}
The relevant code is this:
local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"
local collator_obj = collator.new(ducet)
local language = "en" -- default language
-- language specified on the command line doesn't seem to be available
-- in the config file, so we just try to find it ourselves
for i, a in ipairs(arg) do
if a == "-l" or a=="--language" then
language = arg[i+1]
break
end
end
if languages[language] then
print("[Lua-UCA] Loading language: " .. language)
collator_obj = languages[language](collator_obj)
end
local upper = unicode.utf8.upper
function UTFCompare(a,b)
local A = a["SortKey"]
local B = b["SortKey"]
return collator_obj:compare_strings(A,B)
end
function SORTendhook(list)
-- get the headers for letter groups
for k,v in ipairs(list) do
-- the collator:get_lowest_char will return character on the given
-- position. It will be lowercase and without accents.
local codepoints = collator_obj:string_to_codepoints(v.Entry)
local codes = collator_obj:get_lowest_char(codepoints, 1)
local sort_char = utf8.char(table.unpack(codes))
v.sortChar = upper(sort_char) -- use unicode.utf8.upper to make the char uppercase
end
return list
end
It loads the needed libraries, creates the sorting object and applies the Norwegian rules. The UTFSort
function is used by Xindex
. We redefine it to use our sorting function. I've found that sorting works, but there is one problem - the first letters are not handled correctly, so Xindex
produced separate headings for uppercase and lowercase letters. This is handled in the SORTendhook
function.
This is the result:
With the current xindex
(version 0.23) and
xindex -u -l no -c norsk <file>
you'll get
Inserted by Sveinung 4.6.2020
Sorting order table for Nordic character according to Norwegian rules (including Sami):
A Á B C Č D Ð E F G H I J K L M N Ŋ O P Q R S Š T Ŧ U V W X Y Z Ž Æ Ä Ø Ö Å Aa
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 75
a á b c č d đ e f g h i j k l m n ŋ o p q r s š t ŧ u v w x y z ž æ ä ø ö å aa
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 76
A 1
a 2
Á 3
á 4
B 5
b 6
C 7
c 8
Č 9
č 10
D 11
d 12
Ð 13
đ 14
E 15
e 16
F 17
f 18
G 19
g 20
H 21
h 22
I 23
i 24
J 25
j 26
K 27
k 28
L 29
l 30
M 31
m 32
N 33
n 34
Ŋ 35
ŋ 36
O 37
o 38
P 39
p 40
Q 41
q 42
R 43
r 44
S 45
s 46
Š 47
š 48
T 49
t 50
Ŧ 51
ŧ 52
U 53
u 54
V 55
v 56
W 57
w 58
X 59
x 60
Y 61
y 62
Z 63
z 64
Ž 65
ž 66
Æ 67
æ 68
Ä 69
ä 70
Ø 71
ø 72
Ö 73
ö 74
Å 75
Aa 75
å 76
aa 76