ucharclasses misbehaves with Spacing Modifier Letters and Combining Diacritical Marks
Mixing Unicode blocks within a word is something human writers do; setting a font on entering (or leaving) a Unicode block is what ucharclasses does.
So English and Vietnamese aren't distinguishable by which block a character belongs to, since they both share the Latin block. But English and Old Persian are distinguishable by character class.
The Combining Diacritical Marks block is a different block from Basic Latin, so, yes, it is possible to switch font for the marks alone, and even for several marks stacked on one base character:
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage[BasicLatin, CombiningDiacriticalMarks]{ucharclasses}
\usepackage{xcolor}
\setmainfont{Noto Serif}
\newfontfamily\fdiac[Colour=red,Scale=1.5]{Fira Sans Black}
\setTransitionTo{BasicLatin}{\normalfont}
\setTransitionTo{CombiningDiacriticalMarks}{\fdiac}
\begin{document}
\large
a a\symbol{"0302} xyẑ abc \ \ o\symbol{"0302}\symbol{"0344}o\symbol{"0302}\symbol{"0321}\symbol{"0325}\symbol{"032C}
\end{document}
"Disjoint" means that ucharclasses can produce only one output at a time, not two or more, which in turn means that the sets of characters to process should not overlap or share elements.
Edited to add:
These combining marks could be really useful: say, for a sign for the "Hm, oh, er, um, that's a really nice..." conversation filler, as used in polite baboon social interactions among deferential individuals.
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage[BasicLatin, CombiningDiacriticalMarks]{ucharclasses}
\usepackage{xcolor}
\setmainfont{Noto Serif}
\newfontfamily\fdiac[Colour=red,Scale=1.5]{Fira Sans Black}
\newfontfamily\fdiacb[Colour=blue,Scale=2.5]{Gentium Plus}
\setTransitionTo{BasicLatin}{\normalfont}
\setTransitionTo{CombiningDiacriticalMarks}{\fdiac}
\begin{document}
\large
(o\symbol{"0302}\symbol{"032B}{\let\fdiac\fdiacb\symbol{"0308}\symbol{"036A}}o\symbol{"0302}\symbol{"0321}\symbol{"0325}\symbol{"032C})
\end{document}
Further edit
About transitioning:
On the presumption that switching requires a (sequential?) transition, insert one, using either {} or a zero-width non-joiner, U+200C (both lie outside the relevant code blocks).
A diacritical mark and its base character function as a unit (in a sense), font-wise, so insert the transition after the base character.
\documentclass{article}
\usepackage{xcolor}
\usepackage[Latin, Phonetics, Diacritics, SpacingModifierLetters]{ucharclasses}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]
\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}
\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}
\newcommand\zwnj{^^^^200c}
\begin{document}
thaaw [tʰ{}ɑɑɯ] [tɑɑɯ] [tʰ{}ɑ́{}ɑɯ] [tɑ́{}ɑɯ] thaaw
\normalfont
thaaw [t^^^^02b0\zwnj ɑɑɯ] [tɑɑɯ] [tʰ\zwnj ɑ́\zwnj ɑɯ] [tɑ́\zwnj ɑɯ] thaaw
\end{document}
Although, keeping the units of meaning and display synchronized would be less of a cognitive load on the reader:
\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]
\newcommand\ph[1]{[{\dejavuserif #1}]}
\begin{document}
thaaw \ph{tʰɑɑɯ} \ph{tɑɑɯ} \ph{tʰɑ́ɑɯ} \ph{tɑ́ɑɯ} thaaw
\end{document}
On the matter of stacking diacritics, the font designer's hand and choices come into play.
Some random fonts, to illustrate:
Noto Serif
Acariya
Ajoure
Andika
Arial
DejaVu Serif
Looping
Hypothesis: the root cause is that counting starts from 1 and goes upwards, once only, so the last font-switch command put into the typesetting stream is the one that has a visible effect.
When A-block text and B-block text are typed next to each other with no separator, the transition code loops through all the blocks. If the A block is examined first, it finds that A is ending and outputs the "coming out of A block" code, then finds that B is starting and outputs the "going into B block" code.
If the A block has a higher Unicode start/end point than the B block, the loop instead finds first that B is starting and outputs the "going into B block" code, then finds that A is ending and outputs the "coming out of A block" code. The user is surprised: we have gone back to the normal font (for example).
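The two orderings can be reproduced by hand, without ucharclasses at all. The following is only a sketch of what the hypothesis says ends up in the token stream at the A-B boundary; it is not taken from the package source, and the font choices and the name \fb are illustrative assumptions.

```latex
\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
% \fb stands in for the "going into B block" font (illustrative choice)
\newfontfamily\fb{DejaVu Serif}[Colour=blue]
\begin{document}
% Exit code first, then entry code: "cd" is set in \fb, as intended.
ab\normalfont\fb cd

\normalfont
% Entry code first, then exit code: \normalfont comes last and wins,
% so "cd" falls back to the main font.
ab\fb\normalfont cd
\end{document}
```

Whichever font switch is issued last is the one in force when the next glyph is typeset, which matches the "last command wins" reading above.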
In real life, the normal separator between blocks (intended as script blocks) is a (Latin) space, which TeX converts to glue, but zero-width characters from the punctuation block, as above, can also act as 'separators' between other blocks (technically classes, not blocks).
Higher classes trump lower classes.
Ideally, explicitly specifying all the entry/exit pairwise combinations of code block transitions (where the code blocks are contiguous text) would cover the general case - except for cross-Unicode block text.
\documentclass{article}
\usepackage{xcolor}
\usepackage[Latin, Cyrillic, Cuneiform, Coptic]{ucharclasses}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\fa{Noto Sans Coptic}[Colour=red]
\newfontfamily\fb{Noto Serif}[Colour=blue]
\newfontfamily\fc{Noto Sans Cuneiform}[Colour=green]
\setTransitionsFor{Coptic}{\fa}{\normalfont}
\setTransitionsFor{Cyrillic}{\fb}{\normalfont}
\setTransitionsFor{Cuneiform}{\fc}{\normalfont}
\newcommand\zwnj{^^^^200c}
\begin{document}
ⲀⲁⲂⲃⲄⲅxАБВГДЕxxⲀⲁⲂⲃⲄⲅ
ⲀⲁⲂⲃⲄⲅАБВГДЕⲀⲁⲂⲃⲄⲅ
ⲀⲁⲂⲃⲄⲅ АБВГДЕ ⲀⲁⲂⲃⲄⲅ
xАБВГДЕxⲀⲁⲂⲃⲄⲅxxⲀⲁⲂⲃⲄⲅ
АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ
АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ
АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ
АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ
\end{document}
Using \XeTeXinterchartoks transitions directly
Another way to have single-point transitions, instead of multiple ones, is to put (in this specific case) all three code blocks (IPAExtensions, CombiningDiacriticalMarks, and SpacingModifierLetters) into the same class; ucharclasses is not needed.
But that still leaves the semantic ambiguity that phonetic t and non-phonetic t are the same glyph.
(Code adapted from an answer by Jonathan Kew on the TUG mailing list, 2008.)
\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]
\newcount\n
\n=`\ɐ \loop \XeTeXcharclass \n=4 \ifnum\n<`\ʯ \advance\n by 1 \repeat
%\n=`\a \loop \XeTeXcharclass \n=4 \ifnum\n<`\z \advance\n by 1 \repeat
\n=`\ʰ \loop \XeTeXcharclass \n=4 \ifnum\n<`\˿ \advance\n by 1 \repeat
\n=`\̀ \loop \XeTeXcharclass \n=4 \ifnum\n<`\ͯ \advance\n by 1 \repeat
% when we encounter class 4, we'll do \startling
\XeTeXinterchartoks 0 4 {\startling}
\XeTeXinterchartoks 4095 4 {\startling}
% and when we encounter class 0, we'll do \finishling
\XeTeXinterchartoks 4095 0 {\finishling}
\XeTeXinterchartoks 4 0 {\finishling}
%\newif\ifling
\newcommand\startling{\dejavuserif}
\newcommand\finishling{\normalfont}
\XeTeXinterchartokenstate=1
\begin{document}
thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] thaaw
\end{document}
Edit: More on looping
Changing the sequence of the \setTransitionsFor commands affects the outcome.
ucharclasses wasn't designed for multiply-overlapping transitions: it is more of an 'into Greek, switch to a Greek font; into Cyrillic, switch to a Cyrillic font; etc.' tool.
The transitions are (leaving aside CJK matters, which take up classes 1,2,3):
(a) from/to class 0 (any glyph not defined in a class),
(b) from/to class 4095 (any non-glyph = glue, maths, boxes: collectively called 'boundary', as in word boundary; space becomes glue during typesetting, so that's why spaces are what I've been calling 'separators').
(c) any pair-wise transitions between user-defined classes (presumably 5,6,7,...)
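As a minimal sketch of cases (a) and (b) for a single user-defined class: the class number can be allocated with the LaTeX kernel's \newXeTeXintercharclass instead of being hard-coded. The name \ipaclass is hypothetical, and the sketch assumes a XeTeX recent enough that the boundary class is 4095.

```latex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\dejavuserif{DejaVu Serif}
% Let the kernel pick a free class number (hypothetical name \ipaclass).
\newXeTeXintercharclass\ipaclass
% Put the IPA Extensions block, U+0250..U+02AF, into that class.
\newcount\n
\n="0250 \loop \XeTeXcharclass\n=\ipaclass \ifnum\n<"02AF \advance\n by 1 \repeat
% (a) from/to class 0, i.e. any character not assigned to a class
\XeTeXinterchartoks 0 \ipaclass = {\dejavuserif}
\XeTeXinterchartoks \ipaclass 0 = {\normalfont}
% (b) from/to the boundary class 4095 (glue, maths, boxes)
\XeTeXinterchartoks 4095 \ipaclass = {\dejavuserif}
\XeTeXinterchartoks \ipaclass 4095 = {\normalfont}
\XeTeXinterchartokenstate=1
\begin{document}
thaaw [tɑɑɯ] thaaw
\end{document}
```

Case (c) only arises once a second user-defined class exists, at which point each ordered pair of classes needs its own \XeTeXinterchartoks entry.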
So, reducing the complexity by having just three named ucharclasses classes, xxxClass, where xxx is the code block name (which makes things easier coding-wise, because we don't need to work out what the class numbers are), we have 12 'single' transitions: 3 into our classes from class 0, 3 into our classes from class 4095, and the 6 corresponding transitions out of our classes into classes 0 and 4095.
%singles ========================
%entering
%encountering our 3 classes
\XeTeXinterchartoks 0 \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks 0 \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks 0 \CombiningDiacriticalMarksClass = {\dejavuserif}
% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks 4095 \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks 4095 \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks 4095 \CombiningDiacriticalMarksClass = {\dejavuserif}
%leaving
%encountering everything else
\XeTeXinterchartoks \SpacingModifierLettersClass 0 = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 0 = {\normalfont}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 0 = {\normalfont}
% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks \SpacingModifierLettersClass 4095 = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 4095 = {\normalfont}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 4095 = {\normalfont}
Next, we have the pairwise-combinations of transitions into/out of our three classes, with respect to each other: 3x2=6 of them.
%pairs ===============
\XeTeXinterchartoks \SpacingModifierLettersClass \CombiningDiacriticalMarksClass = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \CombiningDiacriticalMarksClass = {\dejavuserif}
\XeTeXinterchartoks \SpacingModifierLettersClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \SpacingModifierLettersClass = {\dejavuserif}
giving:
Full MWE:
\documentclass[varwidth,border=6pt]{standalone}
\usepackage{xcolor}
\usepackage[
%Latin,
%Phonetics,
%Diacritics,
SpacingModifierLetters,
CombiningDiacriticalMarks,
IPAExtensions,
]{ucharclasses}
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase,Mapping=tex-text}
\newfontfeature{IPA}{+mgrk}
\setmainfont[IPA]{DejaVu Sans}
\newfontfamily\dejavuserif[IPA]{DejaVu Serif}[Colour=red]
%\setTransitionsFor{CombiningDiacriticalMarks}{\dejavuserif}{\normalfont}
%\setTransitionsFor{SpacingModifierLetters}{\dejavuserif}{\normalfont}
%\setTransitionsFor{IPAExtensions}{\dejavuserif}{\normalfont}
%singles ========================
%entering
%encountering our 3 classes
\XeTeXinterchartoks 0 \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks 0 \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks 0 \CombiningDiacriticalMarksClass = {\dejavuserif}
% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks 4095 \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks 4095 \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks 4095 \CombiningDiacriticalMarksClass = {\dejavuserif}
%leaving
%encountering everything else
\XeTeXinterchartoks \SpacingModifierLettersClass 0 = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 0 = {\normalfont}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 0 = {\normalfont}
% glue, maths, boxes etc = `boundary'
\XeTeXinterchartoks \SpacingModifierLettersClass 4095 = {\normalfont}
\XeTeXinterchartoks \IPAExtensionsClass 4095 = {\normalfont}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass 4095 = {\normalfont}
%pairs ===============
\XeTeXinterchartoks \SpacingModifierLettersClass \CombiningDiacriticalMarksClass = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \CombiningDiacriticalMarksClass = {\dejavuserif}
\XeTeXinterchartoks \SpacingModifierLettersClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \IPAExtensionsClass = {\dejavuserif}
\XeTeXinterchartoks \IPAExtensionsClass \SpacingModifierLettersClass = {\dejavuserif}
\XeTeXinterchartoks \CombiningDiacriticalMarksClass \SpacingModifierLettersClass = {\dejavuserif}
\begin{document}
thaaw [tʰɑɑɯ] [tɑɑɯ] [tʰɑ́ɑɯ] [tɑ́ɑɯ] [tʰɑ́ɑɯʰ] thaaw
\end{document}
Adding further blocks/classes, like Latin or Phonetics, will increase the number of combinations and permutations to be covered (if a space, or something else from class 4095 or what remains of class 0, is not going to be used to trigger a transition).
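Because the number of entries grows quickly, the singles and pairs could be generated with a loop rather than typed out. This is only a sketch: it assumes, as in the MWE above, that ucharclasses defines a \<block name>Class token for each loaded block, and the names \ipablocks, \blk and \other are hypothetical helpers.

```latex
\documentclass{article}
\usepackage{xcolor}
\usepackage[SpacingModifierLetters,CombiningDiacriticalMarks,IPAExtensions]{ucharclasses}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\dejavuserif{DejaVu Serif}[Colour=red]
\makeatletter
% Hypothetical helper: the block names that should all share one font.
\newcommand*\ipablocks{SpacingModifierLetters,IPAExtensions,CombiningDiacriticalMarks}
\@for\blk:=\ipablocks\do{%
  % singles: entering from class 0 or from the boundary class 4095
  \XeTeXinterchartoks 0 \csname\blk Class\endcsname = {\dejavuserif}%
  \XeTeXinterchartoks 4095 \csname\blk Class\endcsname = {\dejavuserif}%
  % singles: leaving to class 0 or to the boundary class 4095
  \XeTeXinterchartoks \csname\blk Class\endcsname 0 = {\normalfont}%
  \XeTeXinterchartoks \csname\blk Class\endcsname 4095 = {\normalfont}%
  % pairs: every ordered pair of two different blocks from the list
  \@for\other:=\ipablocks\do{%
    \ifx\blk\other\else
      \XeTeXinterchartoks \csname\blk Class\endcsname
                          \csname\other Class\endcsname = {\dejavuserif}%
    \fi
  }%
}
\makeatother
% Harmless if ucharclasses has already switched the mechanism on.
\XeTeXinterchartokenstate=1
\begin{document}
thaaw [tʰɑɑɯ] [tɑ́ɑɯ] thaaw
\end{document}
```

For three blocks the loop sets the same 12 singles and 6 pairs as the hand-written list; adding a fourth name to \ipablocks would extend both sets without further typing.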
In that multi-coloured example using Coptic, Cyrillic, Cuneiform and Latin, the lines with the text strings next to each other with no spaces were corrected by specifying all the combinations:
Classes 4, 5, 6 and 7 were chosen arbitrarily to classify the glyphs manually.
\documentclass[varwidth,border=6pt]{standalone}
\usepackage{xcolor}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}
\newfontfamily\fa{Noto Sans Coptic}[Colour=red]
\newfontfamily\fb{Noto Serif}[Colour=blue]
\newfontfamily\fc{Noto Sans Cuneiform}[Colour=green]
\newcount\n
%===
%latin
\n=`\A \loop \XeTeXcharclass \n=4 \ifnum\n<`\Z \advance\n by 1 \repeat
\n=`\a \loop \XeTeXcharclass \n=4 \ifnum\n<`\z \advance\n by 1 \repeat
% when we encounter class 4, we'll do \startling
\XeTeXinterchartoks 0 4 {\startling}
\XeTeXinterchartoks 4095 4 {\startling}
% and when we encounter class 0, we'll do \finishling
\XeTeXinterchartoks 4095 0 {\finishling}
\XeTeXinterchartoks 4 0 {\finishling}
%\newif\ifling
\newcommand\startling{\normalfont}
\newcommand\finishling{}
%===
%cyrillic
\n=`\Ѐ \loop \XeTeXcharclass \n=5 \ifnum\n<`\ӿ \advance\n by 1 \repeat
% when we encounter class 5, we'll do \startling
\XeTeXinterchartoks 0 5 {\startlingcyr}
\XeTeXinterchartoks 4095 5 {\startlingcyr}
% and when we encounter class 0, we'll do \finishling
%\XeTeXinterchartoks 4095 0 {\finishlingcyr}
\XeTeXinterchartoks 5 0 {\finishlingcyr}
%\newif\ifling
\newcommand\startlingcyr{\fb}
\newcommand\finishlingcyr{\normalfont}
%===
%cuneiform
\n="12000 \loop \XeTeXcharclass \n=6 \ifnum\n<"12399 \advance\n by 1 \repeat
\XeTeXinterchartoks 0 6 {\startlingcun}
\XeTeXinterchartoks 4095 6 {\startlingcun}
\XeTeXinterchartoks 6 0 {\finishlingcun}
\newcommand\startlingcun{\fc}
\newcommand\finishlingcun{\normalfont}
%===
%coptic
\n=`\Ⲁ \loop \XeTeXcharclass \n=7 \ifnum\n<`\⳿ \advance\n by 1 \repeat
\XeTeXinterchartoks 0 7 {\startlingcop}
\XeTeXinterchartoks 1 7 {\startlingcop}
\XeTeXinterchartoks 2 7 {\startlingcop}
\XeTeXinterchartoks 3 7 {\startlingcop}
\XeTeXinterchartoks 5 7 {\startlingcop}
\XeTeXinterchartoks 6 7 {\startlingcop}
\XeTeXinterchartoks 4095 7 {\startlingcop}
\XeTeXinterchartoks 4095 0 {\finishlingcop}
\XeTeXinterchartoks 5 6 {\finishlingcyrc}
\XeTeXinterchartoks 7 0 {\finishlingcop}
\XeTeXinterchartoks 7 5 {\finishlingcopb}
\XeTeXinterchartoks 6 5 {\finishlingcopb}
\XeTeXinterchartoks 7 6 {\finishlingc}
\XeTeXinterchartoks 4 5 {\finishlingcopb}
\XeTeXinterchartoks 4 6 {\finishlingc}
\XeTeXinterchartoks 4 7 {\startlingcop}
\XeTeXinterchartoks 7 4 {\startling}
\XeTeXinterchartoks 6 4 {\startling}
\XeTeXinterchartoks 5 4 {\startling}
\newcommand\startlingcop{\fa}
\newcommand\finishlingcop{}
\newcommand\finishlingcyrc{\fc}
\newcommand\finishlingc{\fc}
\newcommand\finishlingcopb{\fb}
\XeTeXinterchartokenstate=1
\begin{document}
ⲀⲁⲂⲃⲄⲅxАБВГДЕxxⲀⲁⲂⲃⲄⲅ
ⲀⲁⲂⲃⲄⲅАБВГДЕⲀⲁⲂⲃⲄⲅ
ⲀⲁⲂⲃⲄⲅ АБВГДЕ ⲀⲁⲂⲃⲄⲅ
xАБВГДЕxⲀⲁⲂⲃⲄⲅxxⲀⲁⲂⲃⲄⲅ
АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ
АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ
АБВГДЕⲀⲁⲂⲃⲄⲅⲀⲁⲂⲃⲄⲅ
АБВГДЕ ⲀⲁⲂⲃⲄⲅ ⲀⲁⲂⲃⲄⲅ
\end{document}