XeTeX text layout strangely sensitive to spaces
I don't have the first or last font. However, Polyglossia works correctly for me. (I assume it would probably also work with just the correct font configuration, but I did it this way as this is presumably what you want in the end.)
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{kannada}
\setotherlanguage[variant=british]{english}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\kannadafontsf{Noto Sans Kannada}[Script=Kannada]
\tracinglostchars=2 % https://tex.stackexchange.com/a/41235/48
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}
\begin{document}
% \fontspec{Arial Unicode MS} \testtext
\testtext
\sffamily \testtext
% \fontspec{Kedage} \testtext
\end{document}
(Sharing what I understood as a result of all this.)
Solutions
Firstly, the solutions to the problem:
- As @cfr's answer pointed out, I should have used [Script=Kannada] for this font, as documented in the fontspec and polyglossia manuals. And when it's used, everything works as expected: with the space or without, the whole text is rendered as appropriate for the Kannada script.
- Additionally, we actually don't want the non-Kannada characters like the R rendered in the Kannada script: the different-script characters like R must be marked as being in a different language or at least a different font (see below for how to do this).
So is this a bug, either in XeTeX or some library it uses? No, I'd say it's a user error. Still, the fact that everything works fine when there are spaces between words (without having to specify the script) perhaps makes this user error more likely.
Explanation
What explains this discrepancy in behaviour depending on the space (just what is going on)? And can this behaviour be changed in XeTeX? What I found is the following.
The library used by XeTeX for text layout, namely HarfBuzz (which is used in Firefox, Chrome, LibreOffice, etc.; see What is Harfbuzz?), comes with a command-line program called hb-view, which can be invoked with a font and a string of text. With it I get the following output:
hb-view NotoSansKannada-Regular.ttf "ಶ್ರೀ"
[output image], and with --script=knda: [output image]
hb-view NotoSansKannada-Regular.ttf " ಶ್ರೀ"
[output image], and with --script=knda: [output image]
hb-view NotoSansKannada-Regular.ttf "Rಶ್ರೀ"
[output image], and with --script=knda: [output image]
hb-view NotoSansKannada-Regular.ttf "R ಶ್ರೀ"
[output image], and with --script=knda: [output image]
What this shows is that the output is correct if either the first non-space character is from the right script, or the script is specified explicitly.
So the behaviour seen in XeTeX (the difference between "Rಶ್ರೀ" and "R ಶ್ರೀ") is explained by what @Ulrike Fischer pointed out in The XeTeX companion:
XeTeX’s approach is the following:
the typesetting process collects runs of characters (words) whose widths are obtained via the API to the system libraries […] to determine the widths,
a XeTeX paragraph is a sequence of word nodes separated by glue.
Thus XeTeX’s typesetting engine places words rather than glyphs, the latter being drawn by the font rendering engine.
(The “system libraries” and “font rendering engine” above are HarfBuzz now (thanks to Khaled Hosny); they used to be ICU earlier.) So:
- with “Rಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz to render that whole string as one unit, which fails (as seen in the hb-view experiments above) because it neither starts with a character from the desired script nor did we specify the script correctly, while
- with “R ಶ್ರೀವತ್ಸ”, XeTeX asks HarfBuzz separately for each of the two words, and in this case the second word is correctly rendered (even if we didn't specify the script) because it starts with a character from the correct script.
Still, it seems best not to rely on such guessing, and to specify the script explicitly instead.
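To see the guessed and the explicit behaviour side by side, a small test file along the following lines may help. It is only a sketch: it assumes Noto Sans Kannada is installed, and the family names \guessedfont and \explicitfont are made up for illustration. Going by the hb-view experiments above, I would expect only the run without a space in the first line to come out wrong.

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% Hypothetical family names, chosen only for this comparison.
\newfontfamily\guessedfont{Noto Sans Kannada}                   % script left to guessing
\newfontfamily\explicitfont{Noto Sans Kannada}[Script=Kannada]  % script stated explicitly
\def\testtext{R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ}
\begin{document}
{\guessedfont \testtext}\par   % second item is a single run starting with R
{\explicitfont \testtext}      % same text, shaped with the Kannada script tag
\end{document}
```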
Working with both scripts
To have both scripts work smoothly, we ought to specify that the characters like R are in a different language. We could do this by writing \textenglish{R}ಶ್ರೀವತ್ಸ instead of Rಶ್ರೀವತ್ಸ. If we don't want to change the input though, there is a way to do this using the ucharclasses package.
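The first option, marking up the input by hand, could look roughly like this. It is a minimal sketch that reuses the Polyglossia setup from the earlier answer and assumes Noto Serif Kannada is installed; since no \englishfont is defined here, the R should simply fall back to the document's default font.

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{kannada}
\setotherlanguage{english}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\begin{document}
% The Latin letter is explicitly marked as English, with or without a space after it.
\textenglish{R} ಶ್ರೀವತ್ಸ \quad \textenglish{R}ಶ್ರೀವತ್ಸ
\end{document}
```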
I wasn't able to get ucharclasses to work for some reason, so I just did it manually (referring to the example in texdoc xetex and a post from the author of ucharclasses, and with 255 changed to 4095 as mentioned in, for example, this answer):
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\newfontfamily\kannadafont{Noto Serif Kannada}[Script=Kannada]
\newfontfamily\englishfont{Georgia}
\setdefaultlanguage{kannada}
\setotherlanguage{english}
\XeTeXinterchartokenstate = 1 % Enable the character classes functionality
\newXeTeXintercharclass \CharEnglish
\XeTeXcharclass `R = \CharEnglish
\XeTeXinterchartoks 0 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks 4095 \CharEnglish = {\selectlanguage{english}}
\XeTeXinterchartoks \CharEnglish 0 = {\selectlanguage{kannada}}
\XeTeXinterchartoks \CharEnglish 4095 = {\selectlanguage{kannada}}
\begin{document}
R ಶ್ರೀವತ್ಸ \quad Rಶ್ರೀವತ್ಸ
\end{document}
This changes the language every time we move between an English character (only R above) and either a word boundary (4095) or a regular (not specified to be English) character (0).
For my original document, to deal with all the English characters, I wrote a loop to do the equivalent of \XeTeXcharclass `R = \CharEnglish for every uppercase and lowercase letter of the alphabet:
\newcount\tmpchar
\tmpchar = `A
\loop
\ifnum \tmpchar < `[ % [ comes just after Z
\XeTeXcharclass \tmpchar = \CharEnglish
\XeTeXcharclass \lccode \tmpchar = \CharEnglish
\advance \tmpchar by 1
\repeat
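A quick way to check that the loop assigned the classes as intended is to write a few of them to the log. This is just a sketch: it assumes \CharEnglish was allocated with \newXeTeXintercharclass as in the document above, and it relies on \XeTeXcharclass being readable with \the in the same way as other code tables such as \catcode.

```latex
% Place after the loop, in the preamble; the lines appear in the terminal and the .log file.
\typeout{Class number of CharEnglish: \the\CharEnglish}
\typeout{Class of A: \the\XeTeXcharclass`A}  % should match CharEnglish
\typeout{Class of z: \the\XeTeXcharclass`z}  % should match CharEnglish
\typeout{Class of [: \the\XeTeXcharclass`[}  % not touched by the loop
```

If the first three numbers agree and the last one differs, the loop covered both cases of the alphabet without spilling past Z.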