Unicode math at TeX primitive level
The key information in the accepted answer is: load the Unicode math font (UnMaFo) as family 2 and the same font as family 3. The TeX engine (XeTeX or LuaTeX) recalculates the appropriate fontdimens for family 2 and family 3 from UnMaFo when these families are set.
After understanding this, I was able to write simple plain TeX macros for Unicode math. I did this in the uni-math.tex file, newly released in the csplain package. It was mirrored from my web pages to CTAN yesterday and to TL-pretest today. uni-math.tex uses straightforward plain TeX macros (including loading fonts at arbitrary sizes; this feature is used in the OPmac macros, for example). You can compare the 270 clear lines of plain TeX code in uni-math.tex with the 5671 lines in unicode-math.sty. Or compare the ten million lines of log file produced when a simple LaTeX document (with fontspec and unicode-math) is traced by \tracingall with the 262k lines of log when the same is done using uni-math.tex. UnMaFo is nothing mystical.
I will try to summarize the basic information for macro programmers. If I am wrong, please correct me.
There is no difference between XeTeX and LuaTeX. If LuaTeX is used, you only need to do \input luafonts, which executes one \directlua that redefines the \font primitive. The \font primitive in LuaTeX then has a somewhat more extended syntax than in XeTeX, but the XeTeX syntax is sufficient and works in both engines.
Load UnMaFo in family 2 and the same font in family 3, as mentioned above. You can load it in family 1 too and set all \Umathcodes to family 1 by default. The font features must be extended by mode=base;script=math; and the fonts for the script and scriptscript sizes can have the additional font features +ssty=0; and +ssty=1; respectively. UnMaFo may be able to provide optical sizes for these cases.
Set \Umathcodes for all codes from MathClass.txt as \Umathcode<code> = <type> 1 <code>, where <type> is the math object type from TeX's point of view (0 is Ord, 1 is Op, 2 is Bin, etc.). Use the following conversion table from the letters used in MathClass.txt to TeX types: L=1, B=2, V=2, R=3, N=0, U=0, F=0, O=4, C=5, P=6, A=7. For codes of type O, C and F set \Udelcode<code> = 1 <code>. They are extensible delimiters; these codes must be able to extend vertically if UnMaFo is correctly prepared. Note the 1 in these settings: this is family 1, where UnMaFo is loaded.
Now the math typesetting is prepared. But you must use the correct Unicode codes between
$...$, no control sequences. In particular, you must use direct codes for math italic, because the codes `A-`Z, `a-`z are set as upright roman in the Unicode table. This is not comfortable. Moreover, we (TeX users) are lazy and we write {\cal A} instead of selecting the proper Unicode calligraphic A in a text editor when preparing the source of the document.
To prepare math alphabet selectors like
\cal we must know that all math alphabets are in one font at given codes, so switching by \fam=something is the wrong approach. We use "base-code sets": `A-`Z, `a-`z for Latin letters, `0-`9 for digits and "391-"3D5 for Greek characters. Macros like \cal change the \Umathcodes of the characters in the appropriate base-code set. So the user can write A from the base-code set and it produces a calligraphic A when \cal was used.
There are the following math alphabets in UnMaFo: rm, bf, it and bi of Latin and Greek; sans, bfsans, itsans and bisans of Latin; bfsans and bisans of Greek; cal, bfcal, frak, bffrak of Latin; doublestroked of Latin; rm, bf, sans, bfsans, doublestroked of digits; typewriter of Latin and digits; see http://www.unicode.org/charts/PDF/U1D400.pdf. When preparing
\cal-like macros, you must set new codes for the whole base-code set using \Umathcode <base-code> = 7 1 <new-code> in a loop. But there is a small problem with absurd holes in the math alphabets in the Unicode table (see the document above), so a little macro programming must be done. For example, see the macro \umathcharholes in uni-math.tex.
If you have prepared
\itlatin, \itgreek and \rmGreek macros similar to \cal mentioned above, then you can set \itlatin \itgreek \rmGreek as default. This is the normal behaviour in TeX.
Typical control sequences used in TeX's math mode (like
\sum, \pm, \oplus) can be scanned from the unicode-math-table.tex file. You can set most of these control sequences as an equivalent of the direct \Umathcode by \Umathcharnumdef<sequence>=\Umathcodenum<code>.
The codes with type L (declared in
MathClass.txt) have two (or more) sizes in a well-prepared UnMaFo, so different sizes will be used in \textstyle and \displaystyle automatically.
The control sequences declared as
\mathopen and \mathclose in unicode-math-table.tex can be defined as macros with \Udelimiter, which simply points to the code. So \langle, \rangle, \lbrace etc. will work in the context of \left...\right. See the scanning process of MathClass.txt and unicode-math-table.tex in the uni-math.tex file for more details.
The codes declared as D (diacritics) in
MathClass.txt can be used as the \Umathaccent argument. By default this primitive declares an extensible math accent (like \widetilde). The extensibility feature is prepared in UnMaFo for such codes. Non-extensible accents must be declared with the keyword "fixed"; extensible accents placed at the bottom take the "bottom" keyword in \Umathaccent. Define "fixed" math accents for all sequences from unicode-math-table.tex where the \mathaccent type is mentioned. Next, define the more flexible accents using \Umathaccent for the control sequences \overbrace, \underbrace, \overparen, \underparen, \overbracket, \underbracket, \widehat, \widetilde, \overleftharpoon, \overrightharpoon, \overleftarrow, \overrightarrow, \overleftrightarrow.
Define
\sqrt, \cuberoot and \fourthroot by the \Uradical primitive, which points to the codes "221A, "221B, "221C. A well-prepared UnMaFo includes more variants of these characters for different sizes. (Xe/Lua)TeX uses the proper size and adds the horizontal bar above the radicand, as normal TeX does when the \radical primitive is used.
Declare
\Umathcode `- = 2 1 "2212 because we use the hyphen character as the minus symbol in math mode in TeX.
Define
\let\intop=\int \def\int{\intop \nolimits} in order to keep the plain TeX behavior of the integral operator. Similarly, do this for other integral-like symbols if you want. Declare the integral code as math-active and use a similar trick if you want to use the direct integral symbol instead of \int in the document.
More fonts can be loaded into further families 4, 5, 6, etc. It is irrelevant what type of font it is: UnMaFo, a Unicode text font or a classical 8-bit font. You can then switch to characters from such fonts by \fam=number, if the given code has type 7.
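The alphabet switching described in the points above can be sketched as follows. This is only an illustration under stated assumptions, not the code from uni-math.tex: the helper \maprange and the counters \basecode and \newcode are hypothetical names, and UnMaFo is assumed to be loaded in family 1. The target codes are the Unicode math italic and script blocks, with the "holes" patched by hand.

```tex
\newcount\basecode  \newcount\newcode

% \maprange{<first base code>}{<last base code>}{<first new code>}
% assigns consecutive new codes (type 7, family 1) to a base-code range.
\def\maprange#1#2#3{%
  \basecode=#1\relax \newcode=#3\relax
  \loop
    \Umathcode \basecode = 7 1 \newcode
    \ifnum\basecode<#2\relax
      \advance\basecode by 1
      \advance\newcode by 1
  \repeat}

% Math italic Latin: U+1D434.. (capitals), U+1D44E.. (small letters).
% U+1D455 is a hole; italic h lives at U+210E.
\def\itlatin{%
  \maprange{`A}{`Z}{"1D434}%
  \maprange{`a}{`z}{"1D44E}%
  \Umathcode `h = 7 1 "210E }

% Script (calligraphic) capitals: U+1D49C.. with eight holes,
% which are filled from the Letterlike Symbols block.
\def\cal{%
  \maprange{`A}{`Z}{"1D49C}%
  \Umathcode `B = 7 1 "212C
  \Umathcode `E = 7 1 "2130
  \Umathcode `F = 7 1 "2131
  \Umathcode `H = 7 1 "210B
  \Umathcode `I = 7 1 "2110
  \Umathcode `L = 7 1 "2112
  \Umathcode `M = 7 1 "2133
  \Umathcode `R = 7 1 "211B }
```

With these definitions, \itlatin at the top level sets the default, and ${\cal A}$ changes the codes only inside the group, because \Umathcode assignments are local.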
Unfortunately, there is no complete bold math version of the common UnMaFo; only specific math alphabets have a bf variant, the other symbols do not. We need complete bold math in titles where a bold text font is used. For this case I used "poor bold", implemented by the 2 Tr PDF operator.
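A minimal sketch of such a "poor bold" for XeTeX follows. It is an assumption of how the idea can be realized, not the code from uni-math.tex: the macro name \poorbold and the stroke width 0.3 are made up for illustration, and the \special syntax is the one understood by the xdvipdfmx driver. Text rendering mode 2 fills and strokes each glyph, so the stroke thickens the outlines.

```tex
% Hypothetical helper: fake bold by filling and stroking the glyphs.
\def\poorbold#1{%
  \special{pdf:literal direct 2 Tr 0.3 w}% fill+stroke, stroke width 0.3bp
  #1%
  \special{pdf:literal direct 0 Tr 1 w}}% back to fill only, PDF default width

% Usage:
% \poorbold{$a^2 + b^2 = c^2$}
```

The closing special restores the PDF defaults rather than the previous state, which is good enough when nothing else in the document changes these two settings.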
In classical TeX a number of math mode fonts are used to supply the output glyphs based on the input and, as observed in the question, the relevant \mathcode of the input token. In contrast, when using a Unicode math mode font only one font is used to supply all of the glyphs. As such, rather than the limited number of slots available in a TeX font, there are a large number of math mode-specific entries in a Unicode font.
Both Unicode engines (XeTeX and LuaTeX) provide the primitive \Umathcode for setting the extended math codes required for this to work. Details are available in both the XeTeX and LuaTeX manuals: the syntax is
\Umathcode ⟨char slot⟩ [=] ⟨math type⟩ ⟨fam.⟩ ⟨glyph slot⟩
Notice that there is a requirement to supply a family here but that these will all be the same.
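As a short illustration of this syntax, the lines below assign math codes to a few symbols and give two of them names. This is a sketch that assumes the Unicode math font has been loaded as family 1; the choice of symbols is arbitrary.

```tex
% \Umathcode <char slot> = <math type> <family> <glyph slot>
\Umathcode "2211 = 1 1 "2211   % U+2211 summation: large operator (type 1)
\Umathcode "00B1 = 2 1 "00B1   % U+00B1 plus-minus: binary operation (type 2)
\Umathcode "2264 = 3 1 "2264   % U+2264 less-than or equal: relation (type 3)

% The same triple can be attached to a control sequence:
\Umathchardef\sum = 1 1 "2211
\Umathchardef\pm  = 2 1 "00B1
```

After this, both the directly typed character and the control sequence select the same glyph with the same spacing class.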
To set up the font dimensions required for math mode, the engine or a suitable loader has to read the table supplied by the font. In XeTeX this happens as part of the (extended) \font primitive, for example
\font\lmmx = "[latinmodern-math.otf]/OT:mode=base;script=math;"
whilst in LuaTeX a Lua-based loader is required to extend the \font primitive (which out of the box is identical to that in TeX90). (Realistically the font loader to use with LuaTeX is luaotfload, which is based on that written for ConTeXt but loadable with plain, LaTeX, etc. There is work ongoing to use the HarfBuzz shaper with LuaTeX but this is not at present usable to my knowledge.)
As only one font is in use, conversion between input and output glyphs requires some differences from classical TeX. For example, input such as
$y = mx + c$
will not give italic letters unless they have the correct \Umathcode to point to the 'correct' codepoint. For example, we need
\Umathcode `\y = "7 "1 "1D466
(I'm assuming that we will use family 1 for all glyphs: this is not required.)
Operators in Unicode math are scaled by the font shaper directly rather than needing extensible parts. As such, something like \int is defined for Unicode use by
\let\int=∫
with the correct math code then chosen
\Umathcode `∫= "1 "1 `∫
Both XeTeX and LuaTeX have the \Uradical primitive for radicals: LuaTeX also has \Uroot.
An important consequence of using only one font is that, for example, making symbols bold requires that all of the relevant math codes change. Thus setting up something like \bf requires that we map over all code points affected and alter their \Umathcode.
Whilst only one font is required, it is necessary to define math families two and three to satisfy the engine that sufficient math parameters are available. (This may change, certainly in LuaTeX, as it seems to be a hold-over of code paths from TeX90.) At the same time, script fonts need to be loaded telling the loader what they are. This leads to a minimal font loading set up something like
\font\lmmx = "[latinmodern-math.otf]/OT:mode=base;script=math;" %
\font\lmmvii = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=0;" at 7pt %
\font\lmmv = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=1;" at 5pt %
\textfont1 = \lmmx
\textfont2 = \lmmx
\textfont3 = \lmmx
\scriptfont1 = \lmmvii
\scriptfont2 = \lmmvii
\scriptfont3 = \lmmvii
\scriptscriptfont1 = \lmmv
\scriptscriptfont2 = \lmmv
\scriptscriptfont3 = \lmmv
(Again, I am assuming XeTeX font syntax here.)
As noted in comments, there are a large number of additional font dimensions in Unicode math fonts. LuaTeX gives these names (all listed in the LuaTeX manual), whilst for XeTeX they have numbers and are accessed using \fontdimen.
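As a small check in XeTeX, the two classical parameters below are among those that should be available once families 2 and 3 point at the Unicode math font. This is a sketch that assumes the font set-up shown above has already been executed; the numbers of the additional, Unicode-specific dimensions are not shown here and should be taken from the XeTeX documentation.

```tex
% Assumes \textfont2 and \textfont3 are set to the Unicode math font as above.
\message{axis height: \the\fontdimen22\textfont2}
\message{default rule thickness: \the\fontdimen8\textfont3}
```

Both values are written to the terminal and the log file, which makes it easy to compare them with those of the traditional Computer Modern set-up.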
The TeX90 primitives \delimiter, \mathaccent and \radical all have extended Unicode versions: \Udelimiter, \Umathaccent and \Uradical. Unlike the TeX90 versions, \Udelimiter and \Uradical do not need to point to multiple glyph slots: only one slot is needed and the font shaper is responsible for growing the glyph as required. The syntax of \Umathaccent is significantly extended compared to \mathaccent, certainly for LuaTeX. All three primitives are described in the LuaTeX manual and to a lesser extent in the XeTeX one.
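The following definitions sketch how the three primitives might be used. They assume that the Unicode math font is in family 1; the use of class 7 for the accents simply mirrors plain TeX's \mathaccent"705E and is a choice made here, not a requirement.

```tex
% \Udelimiter <class> <family> <slot>: a single slot, the font grows the glyph.
\def\langle{\Udelimiter 4 1 "27E8 }   % opening delimiter
\def\rangle{\Udelimiter 5 1 "27E9 }   % closing delimiter

% \Uradical <family> <slot>
\def\sqrt{\Uradical 1 "221A }

% \Umathaccent [<keyword>] <class> <family> <slot>
\def\hat{\Umathaccent fixed 7 1 "0302 }   % fixed-size accent
\def\widehat{\Umathaccent 7 1 "0302 }     % stretches over its argument

% Usage:
% $\left\langle \sqrt{x+y} \right\rangle + \widehat{abc} + \hat a$
```

The same combining circumflex slot serves both accent macros; only the "fixed" keyword decides whether the engine may pick or build a wider variant.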