Unicode math at TeX primitive level

The key information in the accepted answer is: load a Unicode math font (UnMaFo) as family 2 and the same font as family 3. The TeX engine (XeTeX or LuaTeX) recalculates the appropriate \fontdimen parameters for families 2 and 3 from the UnMaFo when these families are set.

After understanding this, I was able to write simple plain TeX macros for Unicode math. They are in the uni-math.tex file, newly released in the csplain package. It was mirrored from my web pages to CTAN yesterday and to TL-pretest today. The uni-math.tex file uses straightforward plain TeX macros (including loading fonts at arbitrary sizes; this feature is used in the OPmac macros, for example). You can compare the 270 clear lines of plain TeX code in uni-math.tex with the 5671 lines in unicode-math.sty. Or compare the ten million lines of log file produced when a simple LaTeX document (with fontspec and unicode-math) is traced by \tracingall with the 262k lines of log when the same is done using uni-math.tex. UnMaFo is nothing mystical.

I will try to summarize the basic information for macro programmers. If I am wrong, please correct me.

There is no difference between XeTeX and LuaTeX. If LuaTeX is used, you only need to do \input luafonts, which executes one \directlua that redefines the \font primitive. The \font primitive in LuaTeX then has a somewhat more extended syntax than in XeTeX, but the XeTeX syntax is sufficient and works in both engines.
Load the UnMaFo in family 2 and the same font in family 3, as mentioned above. You can load it in family 1 too and set all \Umathcodes to family 1 by default. The font features must be extended by mode=base;script=math; and the fonts for the script and scriptscript sizes can have the additional font features +ssty=0; and +ssty=1; respectively. The UnMaFo may then be able to supply optical variants for these sizes.
Set \Umathcodes for all codes from MathClass.txt as \Umathcode<code> = <type> 1 <code>, where <type> is the math object type from TeX's point of view (0 is Ord, 1 is Op, 2 is Bin, etc.). Use the following conversion table from the letters used in MathClass.txt to TeX types: L=1, B=2, V=2, R=3, N=0, U=0, F=0, O=4, C=5, P=6, A=7. For codes of type O, C and F, also set \Udelcode<code> = 1 <code>. They are extensible delimiters, and they can extend vertically if the UnMaFo is correctly prepared. Note the 1 in these settings: it is family 1, where the UnMaFo is loaded.
Math typesetting is now set up. But you must use the correct Unicode characters between $...$, with no control sequences. In particular, you must type the math italic characters directly, because the codes `A-`Z, `a-`z are upright roman in the Unicode table. This is not comfortable. Moreover, we (TeX users) are lazy and write {\cal A} instead of selecting the proper Unicode calligraphic A in the text editor when preparing the document source.
To prepare math alphabet selectors like \cal, we must know that all math alphabets are in one font at given codes, so switching by \fam=something is a bad idea. We use "base-code sets": `A-`Z, `a-`z for roman Latin characters, `0-`9 for digits and "391-"3D5 for Greek characters. Macros like \cal change the \Umathcodes of the characters in the appropriate base-code set. So the user can write A from the base-code set and it produces a calligraphic A when \cal is in effect.
The following math alphabets are in a UnMaFo: rm, bf, it and bi of Latin and Greek; sans, bfsans, itsans and bisans of Latin; bfsans and bisans of Greek; cal, bfcal, frak, bffrak of Latin; doublestruck of Latin; rm, bf, sans, bfsans, doublestruck of digits; typewriter of Latin and digits; see http://www.unicode.org/charts/PDF/U1D400.pdf. When preparing \cal-like macros, you must set new codes for the whole base-code set using \Umathcode <base-code> = 7 1 <new-code> in a loop. But there is a small problem with the absurd holes in the math alphabets in the Unicode table (see the document above), so a little macro programming is needed. For example, see the macro \umathcharholes in uni-math.tex.
Once you have prepared \itlatin and \itgreekrmGreek macros similar to the \cal macro mentioned above, you can make \itlatin \itgreekrmGreek the default. This is the normal behaviour in TeX.
The typical control sequences used in TeX's math mode (like \sum, \pm, \oplus) can be scanned from the unicode-math-table.tex file. You can define most of these control sequences as equivalents of the direct \Umathcode by \Umathcharnumdef<sequence>=\Umathcodenum<code>.
The codes with type L (as declared in MathClass.txt) have two (or more) sizes in a well-prepared UnMaFo, so different sizes are used in \textstyle and \displaystyle automatically.
The control sequences declared as \mathopen and \mathclose in unicode-math-table.tex can be defined as macros with \Udelimiter, which simply points to the code. So \langle, \rangle, \lbrace, etc. will work in the context of \left...\right. See the scanning of MathClass.txt and unicode-math-table.tex in the uni-math.tex file for more details.
The codes declared as D (diacritics) in MathClass.txt can be used as the argument of \Umathaccent. By default this primitive produces an extensible math accent (like \widetilde); the extensibility is prepared in the UnMaFo for such codes. Non-extensible accents must be declared with the keyword "fixed", and extensible accents placed below take the keyword "bottom" in \Umathaccent. Define "fixed" math accents for all sequences from unicode-math-table.tex where the \mathaccent type is mentioned. Next, define the more flexible accents using \Umathaccent for the control sequences \overbrace, \underbrace, \overparen, \underparen, \overbracket, \underbracket, \widehat, \widetilde, \overleftharpoon, \overrightharpoon, \overleftarrow, \overrightarrow, \overleftrightarrow.
Define \sqrt, \cuberoot and \fourthroot by the \Uradical primitive, which points to the codes "221A, "221B, "221C. A well-prepared UnMaFo includes several variants of these characters for different sizes. (Xe/Lua)TeX chooses a proper size and adds the rule over the radicand, as normal TeX does when the \radical primitive is used.
Declare \Umathcode `- = 2 1 "2212 because we use the hyphen character as the minus sign in math mode in TeX.
Define \let\intop=\int \def\int{\intop \nolimits} in order to keep the plain TeX behavior of the integral operator. Do the same for other integral-like symbols if you want. If you want to use the direct integral character instead of \int in the document, declare the integral code as math-active and apply a similar trick.
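
The \Umathcode-related steps of this list can be sketched for a handful of code points as follows. This is only an illustration under the same assumption as the text (the UnMaFo is loaded in family 1); the entries are written by hand here, whereas uni-math.tex obtains them by scanning MathClass.txt and unicode-math-table.tex.

```tex
% Sketch only: a few hand-written entries.
% Assumption: family 1 holds the Unicode math font.

% MathClass.txt class L (large operator) -> TeX type 1 (Op)
\Umathcode "2211 = 1 1 "2211   % U+2211 N-ARY SUMMATION
\Umathcode "222B = 1 1 "222B   % U+222B INTEGRAL
% MathClass.txt class R (relation) -> TeX type 3 (Rel)
\Umathcode "2264 = 3 1 "2264   % U+2264 LESS-THAN OR EQUAL TO

% Classes O and C (opening/closing): math code plus delimiter code
\Umathcode "27E8 = 4 1 "27E8  \Udelcode "27E8 = 1 "27E8  % left angle bracket
\Umathcode "27E9 = 5 1 "27E9  \Udelcode "27E9 = 1 "27E9  % right angle bracket

% Control sequences as synonyms of the math codes set above
\Umathcharnumdef\sum = \Umathcodenum "2211
\Umathcharnumdef\le  = \Umathcodenum "2264
\Umathcharnumdef\int = \Umathcodenum "222B

% A hyphen typed in math mode prints U+2212 MINUS SIGN as a binary operator
\Umathcode `\- = 2 1 "2212

% Keep the plain TeX convention that \int has no limits by default
\let\intop=\int  \def\int{\intop\nolimits}
```

With these settings, input such as $\sum_{i\le n} a_i - \int f$ takes all of its symbols from family 1, and the sum and the integral use their larger variants in display style if the font provides them.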

More fonts can be loaded into further families 4, 5, 6, etc. It is irrelevant what type of font it is: a UnMaFo, a Unicode text font or a classical 8-bit font. You can then switch to characters from such a font by \fam=number, provided the given code has type 7.

Unfortunately, there is no complete bold math version of the common UnMaFo fonts; only specific math alphabets have a bf variant, not the other symbols. We need complete bold math in titles where a bold text font is used. For such cases I used "poor man's bold", implemented by the 2 Tr PDF operator.

In classical TeX a number of math mode fonts are used to supply the output glyphs based on the input and, as observed in the question, the relevant \mathcode of the input token. In contrast, when using a Unicode math mode font only one font is used to supply all of the glyphs. As such, rather than the limited number of slots available in a TeX font, there are a large number of math mode-specific entries in a Unicode font.

Both Unicode engines (XeTeX and LuaTeX) provide the primitive \Umathcode for setting the extended math codes required for this to work. Details are available in both the XeTeX and LuaTeX manuals: the syntax is

\Umathcode ⟨char slot⟩ [=] ⟨math type⟩ ⟨fam.⟩ ⟨glyph slot⟩

Notice that there is a requirement to supply a family here but that these will all be the same.

To set up the font dimensions required for math mode working, the engine or a suitable loader has to read the table supplied by the font. In XeTeX this happens as part of the (extended) \font primitive, for example

\font\lmmx = "[latinmodern-math.otf]/OT:mode=base;script=math;"

whilst in LuaTeX a Lua-based loader is required to extend the \font primitive (which out-of-the-box is identical to that in TeX90). (Realistically the font loader to use with LuaTeX is luaotfload, which is based on that written for ConTeXt but loadable with plain, LaTeX, etc. There is work ongoing to use the HarfBuzz shaper with LuaTeX but this is not at present usable to my knowledge.)
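
For plain LuaTeX, a minimal sketch of the loading step could look like this. It assumes that luaotfload is installed and that latinmodern-math.otf can be found by file name; the feature string is written in luaotfload's own form without the XeTeX-specific /OT selector, which is my choice for the illustration rather than something the engines require.

```tex
% plain LuaTeX sketch: load luaotfload first, so that \font
% understands OpenType font specifications
\input luaotfload.sty

% Assumption: latinmodern-math.otf is installed and found by file name
\font\lmmx = "[latinmodern-math.otf]:mode=base;script=math;"

% Use the font as the text-size font of families 1, 2 and 3
\textfont1=\lmmx  \textfont2=\lmmx  \textfont3=\lmmx
```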

As only one font is in use, conversion between input and output glyphs requires some differences from classical TeX. For example, input such as

$y = mx + c$

will not give italic letters unless they have the correct \Umathcode to point to the 'correct' codepoint. For example, we need

\Umathcode `\y =  "7 "1 "1D466

(I'm assuming that we will use family 1 for all glyphs: this is not required.)

Operators in Unicode math are scaled by the font shaper directly rather than needing extensible parts. As such, something like \int is defined for Unicode use by

\let\int=∫

with the correct math code then chosen

\Umathcode `∫= "1 "1 `∫

Both XeTeX and LuaTeX have the \Uradical primitive for radicals: LuaTeX also has \Uroot.

An important consequence of using only one font is that, for example, making symbols bold requires that all of the relevant math codes change. Thus setting up something like \bf requires that we map over all code points affected and alter their \Umathcode.
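
A minimal sketch of such a mapping, assuming that family 1 holds the Unicode math font. The helper \mapmathrange and the selector name \mathbf are made up for this illustration. The targets are the Mathematical Bold letters (from U+1D400) and bold digits (from U+1D7CE), which are contiguous ranges; alphabets with holes, such as italic or script, need extra assignments for the missing slots.

```tex
\newcount\basecode  \newcount\targetcode

% \mapmathrange{first}{last}{target} (hypothetical helper):
% map the base codes first..last to consecutive slots starting
% at target, as type 7 (variable family) in family 1.
\def\mapmathrange#1#2#3{%
  \basecode=#1 \targetcode=#3
  \loop
    \Umathcode \basecode = 7 1 \targetcode
    \ifnum \basecode<#2
      \advance\basecode by 1 \advance\targetcode by 1
  \repeat}

% Hypothetical bold selector: remaps A-Z, a-z and 0-9.
% The assignments are local, so {\mathbf ...} limits their scope.
\def\mathbf{%
  \mapmathrange{`\A}{`\Z}{"1D400}%
  \mapmathrange{`\a}{`\z}{"1D41A}%
  \mapmathrange{`\0}{`\9}{"1D7CE}}
```

Used as $x + {\mathbf x}$, the second x is looked up at U+1D431 while the first keeps whatever math code was in force outside the group.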

Whilst only one font is required, it is necessary to define math families two and three to satisfy the engine that sufficient math parameters are available. (This may change, certainly in LuaTeX, as it seems to be a hold-over of code paths from TeX90.) At the same time, script fonts need to be loaded, telling the loader what they are. This leads to a minimal font loading setup something like

\font\lmmx   = "[latinmodern-math.otf]/OT:mode=base;script=math;" %
\font\lmmvii = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=0;" at 7pt %
\font\lmmv   = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=1;" at 5pt %
\textfont1 = \lmmx
\textfont2 = \lmmx
\textfont3 = \lmmx
\scriptfont1 = \lmmvii
\scriptfont2 = \lmmvii
\scriptfont3 = \lmmvii
\scriptscriptfont1 = \lmmv
\scriptscriptfont2 = \lmmv
\scriptscriptfont3 = \lmmv

(Again, I am assuming XeTeX font syntax here.)
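
Combining the font setup with a few math codes gives a complete file that can be tried with plain XeTeX. It is only a sketch: it assumes that latinmodern-math.otf is installed where XeTeX can find it, uses family 1 throughout, and sets math codes for just the characters in the test formula.

```tex
% Try with: xetex <file>.tex   (plain XeTeX format)
% Assumption: latinmodern-math.otf is installed and found by file name.
\font\lmmx   = "[latinmodern-math.otf]/OT:mode=base;script=math;"
\font\lmmvii = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=0;" at 7pt
\font\lmmv   = "[latinmodern-math.otf]/OT:mode=base;script=math;+ssty=1;" at 5pt

\textfont1=\lmmx  \scriptfont1=\lmmvii  \scriptscriptfont1=\lmmv
\textfont2=\lmmx  \scriptfont2=\lmmvii  \scriptscriptfont2=\lmmv
\textfont3=\lmmx  \scriptfont3=\lmmvii  \scriptscriptfont3=\lmmv

% Letters of the test formula -> Mathematical Italic code points
\Umathcode `\y = 7 1 "1D466
\Umathcode `\m = 7 1 "1D45A
\Umathcode `\x = 7 1 "1D465
\Umathcode `\c = 7 1 "1D450

% Relation and binary operator taken from family 1 as well
\Umathcode `\= = 3 1 `\=
\Umathcode `\+ = 2 1 `\+

$y = mx + c$ and in a subscript: $a_{x}$
\bye
```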

As noted in comments, there are a large number of additional font dimensions in Unicode math fonts. LuaTeX gives these names (all listed in the LuaTeX manual), whilst for XeTeX they have numbers and are accessed using \fontdimen.

The TeX90 primitives \delimiter, \mathaccent and \radical all have extended Unicode versions: \Udelimiter, \Umathaccent and \Uradical. Unlike the TeX90 versions, \Udelimiter and \Uradical do not need to point to multiple glyph slots: only one slot is needed and the font shaper is responsible for growing the glyph as required. The syntax of \Umathaccent is significantly extended compared to \mathaccent, certainly for LuaTeX. All three primitives are described in the LuaTeX manual and to a lesser extent in the XeTeX one.
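
As a sketch of how the three primitives might be used, again assuming that family 1 holds the Unicode math font, the following definitions replace a few plain TeX macros. The choice of macros and of math type 0 for the accents is mine for the illustration; the code points are those listed for these symbols in Unicode.

```tex
% Assumption: family 1 holds the Unicode math font.

% Delimiters: <math type> <family> <slot>; usable after \left and \right
\def\langle{\Udelimiter 4 1 "27E8 }  % U+27E8, type 4 (Open)
\def\rangle{\Udelimiter 5 1 "27E9 }  % U+27E9, type 5 (Close)

% Radical: <family> <slot>; the engine picks a size and adds the rule
\def\sqrt{\Uradical 1 "221A }        % U+221A SQUARE ROOT

% Accents: [keyword] <math type> <family> <slot>
\def\widehat{\Umathaccent 0 1 "0302 }    % stretches over its argument
\def\hat{\Umathaccent fixed 0 1 "0302 }  % "fixed": never stretches
```

For example, $\left\langle \sqrt{x+y} \right\rangle$ and $\widehat{xyz} \ne \hat x$ exercise all five definitions.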
