How to generate an automatic index (concordance) in a large file?
The output file should look like this: http://www.gurt-der-wahrheit.org/files/konkordanz_schlachter_1951_A4.pdf
Sure, it is possible. How about this?
A complete concordance in that style can be generated by the following file (compile with lualatex rather than pdflatex):
\documentclass[a4paper]{article}
\usepackage{luatex85} % a4paper doesn't seem to take effect otherwise
\usepackage[margin=1cm, top=0.5cm, footskip=0.5cm]{geometry}
\usepackage{fontspec}
\setmainfont{Arial}
\begin{document}
\parindent=0pt \twocolumn \scriptsize
\pretolerance=-1 \sloppy % Skip the first pass, and avoid overfull boxes
\spaceskip=\fontdimen2\font plus 2\fontdimen3\font minus \fontdimen4\font % Fewer underfull warnings, by allowing more stretch
\directlua{dofile('concordance.lua')}
\directlua{words, locations = concordance('bibschl.txt', 'latin1')}
\directlua{printConcordance(words, locations, {minLength=2, otherExclusions={'Aaron','RECHT','zorn'}})}
\end{document}
where concordance.lua is the script that generates the concordance (for each word, it finds all the places where that word occurs) and typesets the individual entries (bold keys, semicolons separating the locations, etc.):
function concordance(filename, encoding)
  -- Given a file with the following structure:
  --   <blank line>
  --   BookName<space>ChapterNumber
  --   VerseNumber<space>Verse
  --   VerseNumber<space>Verse
  --   ...
  --   <blank line>
  --   BookName<space>ChapterNumber
  --   ...
  -- (each verse being a sequence of space-separated words, compared ignoring
  -- case and surrounding punctuation), returns two tables: (1) the words, in
  -- sorted order, and (2) a map from each word to its locations (book,
  -- chapter, verse).
  local readBookNext = true -- whether the *next* line contains Book & Chapter
  local currentBook = ''
  local currentChapter = 0
  local concordanceTable = {}
  for line in io.lines(filename) do
    line = makeUTF8(line, encoding) -- just in case encoding='latin1'
    if line == '' then
      readBookNext = true
    elseif readBookNext then
      currentBook, currentChapter = string.match(line, '^(.*) ([0-9]*)$')
      readBookNext = false
    else
      local verseNumber, verse = string.match(line, '^([0-9]*) (.*)$')
      for word in string.gmatch(verse, '%S+') do
        addWordToConcordance(word, currentBook, currentChapter, verseNumber, concordanceTable)
      end
    end
  end
  local keys = {}
  for word, _ in pairs(concordanceTable) do table.insert(keys, word) end
  table.sort(keys)
  return keys, concordanceTable
end
local badFirsts = {['¶']=true, ['«']=true, ['-']=true, ['<']=true, ['(']=true, [',']=true}
local badLasts = {['.']=true, [',']=true, [':']=true, ['!']=true, [';']=true, ['?']=true,
                  ['»']=true, [')']=true, ["'"]=true, ['>']=true, ['`']=true}
function addWordToConcordance(origWord, book, chapter, verse, concordanceTable)
  -- Adds (book, chapter, verse) to the entry for the word in `concordanceTable`.
  local word = unicode.utf8.upper(origWord)
  while badFirsts[unicode.utf8.sub(word, 1, 1)] do word = unicode.utf8.sub(word, 2) end -- strip leading punctuation
  while badLasts[unicode.utf8.sub(word, -1)] do word = unicode.utf8.sub(word, 1, -2) end -- strip trailing punctuation
  if string.match(word, '^[0-9-]*B?$') then return end -- ignore empty words and words like "42-3" or "29-39B"
  local list = concordanceTable[word] or {}
  table.insert(list, {book=book, chapter=chapter, verse=verse})
  concordanceTable[word] = list
end
function makeUTF8(line, encoding)
  -- Converts `line` from latin1 (ISO-8859-1) to UTF-8, if necessary.
  if encoding == 'utf8' or encoding == nil then
    return line
  elseif encoding == 'latin1' then
    local utf8Line = ''
    for c in string.gmatch(line, '.') do
      utf8Line = utf8Line .. unicode.utf8.char(string.byte(c))
    end
    return utf8Line
  else
    error(string.format('Unknown encoding "%s"', encoding))
  end
end
--------------------------------------------------------------------------------
-- Above are functions that generate the concordance; below are functions for injecting that into TeX
--------------------------------------------------------------------------------
function printConcordance(words, locations, options)
  options = options or {}
  local includeThreshold = options['includeThreshold'] or 300 -- words that occur too often are dropped
  local breakThreshold = options['breakThreshold'] or 1000 -- a "paragraph" break is added after enough entries
  local minLength = options['minLength'] or 1 -- words shorter than this length are dropped
  local otherExclusions = {} -- words in this table are dropped
  for _, ex in ipairs(options['otherExclusions'] or {}) do
    otherExclusions[unicode.utf8.upper(ex)] = true
  end
  local numPrinted = breakThreshold + 100 -- more than breakThreshold: we want a "break" before the first word
  for _, word in ipairs(words) do
    tex.print([[\hskip 1.5\fontdimen2\font plus 5\fontdimen3\font minus \fontdimen4\font]])
    local n = #locations[word]
    if n > includeThreshold then
      print(string.format('Dropping word %s (occurs %d times)', word, n))
    elseif unicode.utf8.len(word) < minLength then
      print(string.format('Dropping word %s (its length %d is less than %d)', word, unicode.utf8.len(word), minLength))
    elseif otherExclusions[word] ~= nil then
      print(string.format('Dropping word %s (it was specified as an exclusion)', word))
    else
      if numPrinted > breakThreshold then
        tex.print(string.format([[\par\underline{\textbf{%s}}\par]], word))
        numPrinted = 0
      else
        tex.print(string.format([[\textbf{%s}]], word))
      end
      numPrinted = numPrinted + n
      for i, v in ipairs(locations[word]) do
        if i > 1 then tex.sprint('; ') end
        tex.sprint(string.format('%s%s:%s', abbrev(v.book), v.chapter, v.verse))
      end
    end
  end
end
-- Abbreviations that aren't just the first 3 letters
local knownBooks = {Richter='Ri', Ruth='Rt', Hiob='Hi', Psalmen='Ps', Hohelied='Hld', Klagelieder='Klg',
                    Amos='Am', Zephania='Zef', Matthäus='Mt', Lukas='Lk', Apostelgeschichte='Apg', Philemon='Phm'}
function abbrev(book)
  if knownBooks[book] ~= nil then return knownBooks[book] end
  -- First 3 letters, but if the 2nd character is a space then skip it.
  if string.sub(book, 2, 2) == ' ' then
    knownBooks[book] = unicode.utf8.sub(book, 1, 1) .. unicode.utf8.sub(book, 3, 4)
  else
    knownBooks[book] = unicode.utf8.sub(book, 1, 3)
  end
  return knownBooks[book]
end
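To make the abbreviation rule concrete, here is a standalone sketch of the same logic that runs in plain Lua. Since unicode.utf8 is LuaTeX's slnunicode library and is not available outside LuaTeX, plain string.sub stands in for it, which is fine for the ASCII book names used here:

```lua
-- Standalone sketch of abbrev() for plain Lua (ASCII-only stand-in:
-- string.sub replaces the unicode.utf8.sub calls used in concordance.lua).
local knownBooks = {Psalmen='Ps', Lukas='Lk'}
local function abbrev(book)
  if knownBooks[book] ~= nil then return knownBooks[book] end
  -- First 3 letters, but if the 2nd character is a space then skip it.
  if string.sub(book, 2, 2) == ' ' then
    knownBooks[book] = string.sub(book, 1, 1) .. string.sub(book, 3, 4)
  else
    knownBooks[book] = string.sub(book, 1, 3)
  end
  return knownBooks[book]
end
print(abbrev('Psalmen')) -- Ps  (known abbreviation)
print(abbrev('Josua'))   -- Jos (first three letters)
print(abbrev('1 Mose'))  -- 1Mo (the space after the leading digit is skipped)
```

Note that computed abbreviations are cached back into knownBooks, so each book name is analysed only once.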
To change what TeX sees, you can change the printConcordance function. Its options parameter lets you control various things:

- includeThreshold, for dropping words that occur too frequently. (For example, “UND” occurs over 42000 times and you surely don't want to index it; what about “DAVID”, which occurs 862 times?) The default threshold is 300 occurrences.
- minLength, for the “restrictions from minimal word length” mentioned in the question.
- otherExclusions, for the “restrictions to specific words to exclude” mentioned in the question.
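For instance, to raise the frequency cutoff, break more often, and require three-letter words, the last \directlua call in the document could read as follows (the option values and excluded words here are just illustrative, not recommendations):

```latex
\directlua{printConcordance(words, locations,
  {includeThreshold=500, breakThreshold=800, minLength=3,
   otherExclusions={'HERR', 'GOTT'}})}
```

Any option you omit falls back to the defaults shown in printConcordance.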
And of course you can edit the function yourself to change the behaviour further. When you compile the above file, the terminal output will tell you which words are being dropped, and why. On my laptop it takes about 5 seconds to generate the concordance, and about 25 seconds to typeset it, so the whole run takes about 30 seconds total.
Because you asked for links to other programs: there is a solution, albeit a commercial one. First, create your PDF file using LaTeX. Second, run that file through PDF Index Generator: https://www.pdfindexgenerator.com/. It lets you edit the word list so as to remove common words, and produces a file which you can append to your LaTeX-generated PDF. One caveat is that the generated list will not be in the same format, in terms of fonts and styles, as your LaTeX-generated text; to achieve that you would have to extract the concordance text from the PDF and then append it to your LaTeX source. I have no connection with PDF Index Generator other than being a user.
A freeware list of common words ("stop words") for German and other languages can be found at: https://github.com/Alir3z4/stop-words
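Such a list could be wired into the otherExclusions option of the Lua solution above. The files in that repository hold one word per line, so a small helper suffices; a sketch (the filename 'german.txt' is an assumption, check the repository for the actual name):

```lua
-- Sketch: read a one-word-per-line stop-word file into a list suitable
-- for the otherExclusions option of printConcordance.
local function readStopWords(filename)
  local words = {}
  for line in io.lines(filename) do
    line = line:gsub('%s+$', '') -- strip trailing whitespace/CR
    if line ~= '' then table.insert(words, line) end
  end
  return words
end
-- Usage inside the document (filename is hypothetical):
-- printConcordance(words, locations, {otherExclusions=readStopWords('german.txt')})
```

printConcordance already upper-cases each exclusion, so the stop words can stay lowercase as distributed.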