input tokens through the unicode position
udpated:
Since 2018/04/01 LaTeX release, inputenc+utf8 is active (sic) by default, and this broke the v3 approach below, because the ascii bytes in range 128-255 are now active and they were assumed here of catcode 12.
I have made a small edit to my v3 code to let it be usable both with or without inputenc loaded for old LaTeX or with new LaTeX.
(pdflatex only answer)
Table of contents
v1 and v2 exploit already existing macros from the LaTeX kernel to produce the macros ultimately expected by the LaTeX font selection system.
v3 is completely different. It produces expandably the UTF-8 bytes. It does not use inputenc at all but does produce active leading bytes so that it is "inputenc+utf8" ready. The active tokens are protected using
\protect
(no e-TeX\protected
here), and to write them to file stream one needs to issue\set@display@protect
. They can be use inside an\edef
, but need the LaTeX\protected@edef
for that, for the same reason. Also macro\UniToUTF
now admits numerical input, so that the"
must be explicitely used in case of hexadecimal input (I do not understand why LaTeX's\DeclareUnicodeCharacter
contrarily toutf8x
's one does not do that either). The advantage here is that it is easy to set-up loops producing such characters in a row using an associated counter or other devices. For the sake of illustrating this, the mwe uses packagexinttools
but this is only for the examples.
v1
Here is quickly jotted down hack which seems to achieve with pdflatex what you want. May need some corrections for working in general, (short on time today).
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\makeatletter
\newcommand\unicodetoutf[1]{{\count@ "#1 \parse@XML@charref}%
\UTFviii@tmp}
\let\@preamblecmds\@empty
\makeatother
\begin{document}
€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo
\end{document}
v2
Here is edit to address issues about avoiding \let\@preamblecmds\@empty
and allowing ascii chars:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[LGR,T1]{fontenc}
%\tracingmacros1
\DeclareUnicodeCharacter{20AC}{Hallo}
%\tracingmacros0
\makeatletter
\let\UF@parse@UTFviii@a\parse@UTFviii@a
\let\UF@parse@UTFviii@b\parse@UTFviii@b
\newcommand*\UF@parse@XML@charref{%
\ifnum\count@<"A0\relax
\xdef\UTFviii@tmp{\char\the\count@\space}% \count@ not globally modified
\else\ifnum\count@<"800\relax
\UF@parse@UTFviii@a,%
\UF@parse@UTFviii@b C\UTFviii@two@octets.,%
\else\ifnum\count@<"10000\relax
\UF@parse@UTFviii@a;%
\UF@parse@UTFviii@a,%
\UF@parse@UTFviii@b E\UTFviii@three@octets.{,;}%
\else
\UF@parse@UTFviii@a;%
\UF@parse@UTFviii@a,%
\UF@parse@UTFviii@a!%
\UF@parse@UTFviii@b F\UTFviii@four@octets.{!,;}%
\fi
\fi
\fi
}
\newcommand\unicodetoutf[1]{%
{\count@ "#1 \UF@parse@XML@charref}%
\UTFviii@tmp
}
\makeatother
\begin{document}
€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo
A ^^41 \unicodetoutf{0041}
e ^^65 \unicodetoutf{0065}
é ^^c3^^a9 \unicodetoutf{00E9}
\fontencoding{LGR}\selectfont
α ^^ce^^b1 \unicodetoutf{03B1}
\end{document}
The idea is to plug into the utf8.def
Unicode to UTF-8 already coded parsing by the LaTeX team.
v3
Here is the expandable method. I long hesitated with the 7bit ascii range and the problems with catcodes, to start with many of them are declared ignored tokens by TeX/LaTeX. In the end I decided that all will be produced with catcode 12. However, when writing out to a file, they will nevertheless be sometimes output in TeX's core ^^
notation, not as genuine UTF-8 7bit ascii bytes.
It is not completely obvious to set-up macros expanding to all those tokens as catcode 12 tokens, I have chosen a rather systematic way rather than use shortcuts.
Please check out the code comments for more details.
\documentclass{article}
\makeatletter
% First, the most complicated, define macros extending to catcode 12
% tokens in the 7bit ascii range. Unfortunately, when writing out
% to a file, some of them will get converted back to ^^ notation, and
% not be written there as UTF-8 bytes.
%% 7bit ASCII RANGE (with all its TeX catcode complications)
%% First we handle the upper part (ascii code 127 handled later)
\begingroup
\catcode`\^ 12
\catcode`\_ 12
\catcode`\{ 12
\catcode`\} 12
\catcode`\~ 12
\catcode`\? 0
\catcode`\< 1
\catcode`\> 2
\catcode`\\ 12
?def?y<?endgroup
?count@ 64
?@tfor?x:=%
@ABCDEFGHIJKLMNO%
PQRSTUVWXYZ[\]^_%
`abcdefghijklmno%
pqrstuvwxyz{|}~?do<%
?expandafter?xdef?csname utf8byte?the?count@?endcsname<?expandafter?string?x>%
?advance?count@?@ne>%
>?y
%% Now we handle the lower part
\begingroup
\catcode127 14
\def\z{utf8byte}%
\def\x{%
\count@ 0
\loop
\catcode\count@ 12
\advance\count@\@ne
\ifnum\count@<64
\repeat}\x^^?
\def\y{^^?
\count@=0\relax^^?
\@tfor\x:=^^?
^^@^^A^^B^^C^^D^^E^^F^^G^^H^^I^^J^^K^^L^^M^^N^^O^^?
^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_^^?
^^`^^a^^b^^c^^d^^e^^f^^g^^h^^i^^j^^k^^l^^m^^n^^o^^?
^^p^^q^^r^^s^^t^^u^^v^^w^^x^^y^^z^^{^^|^^}^^~\do{^^?
\expandafter\xdef\csname\z\the\count@\endcsname{\x}^^?
\advance\count@\@ne}}^^?
\y\endgroup
%% ascii codes 63 and 127 were not handled yet
\@namedef{utf8byte63}{?}%
\begingroup
\catcode127 12
\global\@namedef{utf8byte127}{^^?}%
\endgroup
%% Now, the easier part, we define the UTF-8 continuation bytes
%% (catcode 12) and the UTF-8 leading bytes (as \protect'ed active
%% tokens). This set-up is ready for utf8+inputenc but does not need
%% it for its definitions. (this document does not load inputenc)
%% UPDATE: but LaTeX newer than 2018/04/01 loads inputenc+utf8 per
%% default, so a small change has been made below.
%% CONTINUATION BYTES
\begingroup
% prior to LaTeX 2018/04/01 these 64 bytes were assigned catcode12
% by LaTeX in absence of inputenc; but nowadays these bytes are active
% by default. Hence I modified \x into \expandafter\string\x which
% works whether or not the document incorporating this code is with
% old (with or without utf8+inputenc) LaTeX or with newer LaTeX
\count@ "80
\@tfor\x:=%
^^80^^81^^82^^83^^84^^85^^86^^87^^88^^89^^8a^^8b^^8c^^8d^^8e^^8f%
^^90^^91^^92^^93^^94^^95^^96^^97^^98^^99^^9a^^9b^^9c^^9d^^9e^^9f%
^^a0^^a1^^a2^^a3^^a4^^a5^^a6^^a7^^a8^^a9^^aa^^ab^^ac^^ad^^ae^^af%
^^b0^^b1^^b2^^b3^^b4^^b5^^b6^^b7^^b8^^b9^^ba^^bb^^bc^^bd^^be^^bf%
\do{\expandafter\xdef\csname utf8byte\the\count@\endcsname{\expandafter\string\x}%
\advance\count@\@ne}%
%% LEADING BYTES
%% we will make them TeX active, as inputenc does with utf8
%% We \protect them à la LaTeX, and writing to a file will
%% have to be made in a \set@display@protect context
\count@ "C2
\loop
\catcode\count@\active
\advance\count@ 1
\ifnum\count@ < "F5
\repeat
\count@ "C2
\@tfor\x:=^^c2^^c3^^c4^^c5^^c6^^c7^^c8^^c9^^ca^^cb^^cc^^cd^^ce^^cf%
^^d0^^d1^^d2^^d3^^d4^^d5^^d6^^d7^^d8^^d9^^da^^db^^dc^^dd^^de^^df%
^^e0^^e1^^e2^^e3^^e4^^e5^^e6^^e7^^e8^^e9^^ea^^eb^^ec^^ed^^ee^^ef%
^^f0^^f1^^f2^^f3^^f4%
\do{\expandafter\protected@xdef
\csname utf8byte\the\count@\endcsname{\expandafter\protect\x}%
\advance\count@\@ne}%
\endgroup
%% Time now to define our expandable Unicode to UTF-8 converter
%% a "case-switch" utility
\long\def\xintdothis #1#2\xintorthat #3{\fi #1}%
\let\xintorthat \@firstofone
%% EXPANDABLE UNICODE TO UTF-8 BYTES CONVERTER
%% The macro accepts any numexpr compatible input.
%% This means hexadecimal input must be prefixed by "
%% (and must use uppercase hex digits.)
%% Advantage with such numerical inputs is to allow easy usage
%% over some arithmetic range with counts or counters.
%% NOTE: \numexpr is the only e-TeX extension used.
\newcommand*\UniToUTF[1]{\expandafter\UniToUTF@\the\numexpr#1.}%
\def\UniToUTF@ #1.{%
\ifnum#1<"80 \xintdothis\uni@ascii\fi
\ifnum#1<"800 \xintdothis\uni@twooctets\fi
\ifnum#1<"10000 \xintdothis\uni@threeoctets\fi
%% maybe add here some out of range check ?
\xintorthat\uni@fouroctets {#1}%
}
%% CONVERSION FORMULAS MORALLY BASED ON EXPLANATIONS OF
%% https://en.wikipedia.org/wiki/UTF-8
%% USING OCTAL NOTATION
\def\uni@ascii#1{\csname utf8byte#1\endcsname}%
\def\uni@twooctets#1{\expandafter\uni@twooctets@i
\the\numexpr(#1+32)/64-1.#1.}%
\def\uni@twooctets@i#1.#2.{%
\csname utf8byte\the\numexpr192+#1\expandafter\expandafter\expandafter
\endcsname
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%
\def\uni@threeoctets#1{\expandafter\uni@threeoctets@i
\the\numexpr(#1+32)/64-1.#1.}%
\def\uni@threeoctets@i#1.#2.{\expandafter\uni@threeoctets@ii
\the\numexpr(#1+32)/64-1\expandafter.%
\the\numexpr#1\expandafter\expandafter\expandafter.%
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@threeoctets@ii#1.#2.{%
\csname utf8byte\the\numexpr224+#1\expandafter\expandafter\expandafter
\endcsname
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname
}%
\def\uni@fouroctets#1{\expandafter\uni@fouroctets@i
\the\numexpr(#1+32)/64-1.#1.}%
\def\uni@fouroctets@i#1.#2.{\expandafter\uni@fouroctets@ii
\the\numexpr(#1+32)/64-1\expandafter.%
\the\numexpr#1\expandafter\expandafter\expandafter.%
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@ii#1.#2.{\expandafter\uni@fouroctets@iii
\the\numexpr(#1+32)/64-1\expandafter.%
\the\numexpr#1\expandafter\expandafter\expandafter.%
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\def\uni@fouroctets@iii#1.#2.{%
\csname utf8byte\the\numexpr240+#1\expandafter\expandafter\expandafter
\endcsname
\csname utf8byte\the\numexpr128+#2-#1*64\endcsname}%
\makeatother
% xinttools loaded only for easying up loops in illustrative code
\usepackage{xinttools}
\makeatletter
\newcommand*{\WriteOutSixtyFourUTFchars}[2]{%
% #1 = starting Unicode code point
% #2 = out stream
\begingroup
\set@display@protect %%<<<---- makes \protect=\string
\def\mymacro##1{\UniToUTF{#1+##1}}%
\immediate\write#2{\xintApplyUnbraced\mymacro{\xintSeq{0}{63}}}%
\endgroup
}
% For \edef we need to take into account LaTeX's protection mechanism
% for the UTF-8
% leading bytes, as we defined them to be active for inputenc+utf8 readiness.
\begingroup
\global\let\protectedATedef\protected@edef
\endgroup
\makeatother
\begin{document}
\protectedATedef\x{\UniToUTF{"20AC}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"20AC}}
\typeout{}
\protectedATedef\x{\UniToUTF{"03B1}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"03B1}}
\typeout{}
\protectedATedef\x{\UniToUTF{"0416}}
\typeout{\meaning\x}
\typeout{\x}
\typeout{\UniToUTF{"0416}}
{\catcode127 12
\if\UniToUTF{127}\string^^?\else\ERROR\fi
}
\if\UniToUTF{126}\string~\else\ERROR\fi
{\makeatletter
\if\UniToUTF{37}\expandafter\@gobble\string\%\else\ERROR\fi}
\newwrite\foo
\immediate\openout\foo=\jobname ASCIIMEANINGS.txt
% we don't use \typeout here, but write to another file,
% else Emacs/AUCTeX wrongly
% reports a compilation error due to misinterpreation of log
% containing the ! character
\begingroup
\makeatletter
\set@display@protect %%<<<---- makes \protect=\string
\xintFor* #1 in{\xintSeq{0}{127}}\do
{\protectedATedef\x{\UniToUTF{#1}}%
\immediate\write\foo{\UniToUTF{#1} (\meaning\x)}}
\endgroup
\immediate\closeout\foo
\immediate\openout\foo=\jobname ASCII.txt
\WriteOutSixtyFourUTFchars{"0000}\foo
\WriteOutSixtyFourUTFchars{"0040}\foo
\immediate\closeout\foo
\immediate\openout\foo=\jobname0380.txt
\WriteOutSixtyFourUTFchars{"0380}\foo
\WriteOutSixtyFourUTFchars{"03C0}\foo
\immediate\closeout\foo
\immediate\openout\foo=\jobname0400.txt
\WriteOutSixtyFourUTFchars{"0400}\foo
\WriteOutSixtyFourUTFchars{"0440}\foo
\WriteOutSixtyFourUTFchars{"0480}\foo
\WriteOutSixtyFourUTFchars{"04C0}\foo
\WriteOutSixtyFourUTFchars{"0500}\foo
\immediate\closeout\foo
\immediate\openout\foo=\jobname2000.txt
\WriteOutSixtyFourUTFchars{"2000}\foo
\WriteOutSixtyFourUTFchars{"2040}\foo
\immediate\closeout\foo
\immediate\openout\foo=\jobname4000.txt
\WriteOutSixtyFourUTFchars{"4000}\foo
\WriteOutSixtyFourUTFchars{"4040}\foo
\immediate\closeout\foo
\immediate\openout\foo=\jobname10000.txt
\WriteOutSixtyFourUTFchars{"10000}\foo
\WriteOutSixtyFourUTFchars{"10040}\foo
\immediate\closeout\foo
\end{document}
To illustrate the functioning here is what one finds in log:
macro:->\protect €
€
€
macro:->\protect α
α
α
macro:->\protect Ж
Ж
Ж
then in file \jobname ASCII.txt
^^@^^A^^B^^C^^D^^E^^F^^G^^H
^^L^^M^^N^^O^^P^^Q^^R^^S^^T^^U^^V^^W^^X^^Y^^Z^^[^^\^^]^^^^^_ !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^^?
(the CTRL-I
, CTRL-J
, CTRL-K
are genuine bytes)
then in file \jobname ASCIIMEANINGS.txt
(this is from default situation, one wonders why CTRL-K
which is vertical tabulation is handled as genuine byte... see remark below about pdflatex's -8bit
option ...)
^^@ (macro:->^^@)
^^A (macro:->^^A)
^^B (macro:->^^B)
^^C (macro:->^^C)
^^D (macro:->^^D)
^^E (macro:->^^E)
^^F (macro:->^^F)
^^G (macro:->^^G)
^^H (macro:->^^H)
(macro:-> )
(macro:->
)
(macro:->)
^^L (macro:->^^L)
^^M (macro:->^^M)
^^N (macro:->^^N)
^^O (macro:->^^O)
^^P (macro:->^^P)
^^Q (macro:->^^Q)
^^R (macro:->^^R)
^^S (macro:->^^S)
^^T (macro:->^^T)
^^U (macro:->^^U)
^^V (macro:->^^V)
^^W (macro:->^^W)
^^X (macro:->^^X)
^^Y (macro:->^^Y)
^^Z (macro:->^^Z)
^^[ (macro:->^^[)
^^\ (macro:->^^\)
^^] (macro:->^^])
^^^ (macro:->^^^)
^^_ (macro:->^^_)
(macro:-> )
! (macro:->!)
" (macro:->")
# (macro:->#)
$ (macro:->$)
% (macro:->%)
& (macro:->&)
' (macro:->')
( (macro:->()
) (macro:->))
* (macro:->*)
+ (macro:->+)
, (macro:->,)
- (macro:->-)
. (macro:->.)
/ (macro:->/)
0 (macro:->0)
1 (macro:->1)
2 (macro:->2)
3 (macro:->3)
4 (macro:->4)
5 (macro:->5)
6 (macro:->6)
7 (macro:->7)
8 (macro:->8)
9 (macro:->9)
: (macro:->:)
; (macro:->;)
< (macro:-><)
= (macro:->=)
> (macro:->>)
? (macro:->?)
@ (macro:->@)
A (macro:->A)
B (macro:->B)
C (macro:->C)
D (macro:->D)
E (macro:->E)
F (macro:->F)
G (macro:->G)
H (macro:->H)
I (macro:->I)
J (macro:->J)
K (macro:->K)
L (macro:->L)
M (macro:->M)
N (macro:->N)
O (macro:->O)
P (macro:->P)
Q (macro:->Q)
R (macro:->R)
S (macro:->S)
T (macro:->T)
U (macro:->U)
V (macro:->V)
W (macro:->W)
X (macro:->X)
Y (macro:->Y)
Z (macro:->Z)
[ (macro:->[)
\ (macro:->\)
] (macro:->])
^ (macro:->^)
_ (macro:->_)
` (macro:->`)
a (macro:->a)
b (macro:->b)
c (macro:->c)
d (macro:->d)
e (macro:->e)
f (macro:->f)
g (macro:->g)
h (macro:->h)
i (macro:->i)
j (macro:->j)
k (macro:->k)
l (macro:->l)
m (macro:->m)
n (macro:->n)
o (macro:->o)
p (macro:->p)
q (macro:->q)
r (macro:->r)
s (macro:->s)
t (macro:->t)
u (macro:->u)
v (macro:->v)
w (macro:->w)
x (macro:->x)
y (macro:->y)
z (macro:->z)
{ (macro:->{)
| (macro:->|)
} (macro:->})
~ (macro:->~)
^^? (macro:->^^?)
Important update: to get genuine bytes for the entire 7bit ascii range, use pdflatex -8bit
!
then in file \jobname 0380.txt
΄΅Ά·ΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξο
πρςστυφχψωϊϋόύώϏϐϑϒϓϔϕϖϗϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰϱϲϳϴϵ϶ϷϸϹϺϻϼϽϾϿ
then in file \jobname 0400.txt
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп
рстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿ
Ҁҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿ
ӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
ԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿ
then in file \jobname 4000.txt
䀀䀁䀂䀃䀄䀅䀆䀇䀈䀉䀊䀋䀌䀍䀎䀏䀐䀑䀒䀓䀔䀕䀖䀗䀘䀙䀚䀛䀜䀝䀞䀟䀠䀡䀢䀣䀤䀥䀦䀧䀨䀩䀪䀫䀬䀭䀮䀯䀰䀱䀲䀳䀴䀵䀶䀷䀸䀹䀺䀻䀼䀽䀾䀿
䁀䁁䁂䁃䁄䁅䁆䁇䁈䁉䁊䁋䁌䁍䁎䁏䁐䁑䁒䁓䁔䁕䁖䁗䁘䁙䁚䁛䁜䁝䁞䁟䁠䁡䁢䁣䁤䁥䁦䁧䁨䁩䁪䁫䁬䁭䁮䁯䁰䁱䁲䁳䁴䁵䁶䁷䁸䁹䁺䁻䁼䁽䁾䁿
as examples. Rendering in your browser will depend on its font set-up.
Here is the \jobname 10000.txt
file as it shows in an Emacs buffer:
Update of an image to show how the ASCIIMEANINGS file renders in an Emacs buffer, proving the ^^K
is indeed a one-byte there, in the default usage of pdflatex
:
Important added note
This is the default situation, but using pdflatex -8bit
solves all our
anguishes, here is Emacs buffer screen shot of ASCII output file:
Thus the -8bit
option we can indeed create arbitrary UTF-8 encoded files from within pdflatex, using only Unicode code points as input.
\documentclass{article}
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{12345}{Bye}
\begin{document}
There are two parts to doing what the question desires:
- Converting a Unicode code point to its UTF-8 encoding, e.g. U+20AC to the sequence of bytes
E2 82 AC
. - Injecting those bytes, or the equivalent of typing those bytes, into the input stream.
The first part is a UTF-8 encoder. It's already implemented in utf8ienc.dtx
/ utf8.def
, but a fresh implementation is below. It was fun to write, and is different from the official implementation. (For one thing, it uses the e-TeX extension \numexpr
which could not be assumed when that original implementation was written… but apart from using that new primitive, it does not use any external macros, and works in both plain (e)TeX and LaTeX.)
This is how it works (and how UTF-8 is defined): Given a number n
(which stands for a Unicode codepoint, e.g. "20AC
is the number 8364, representing the codepoint U+20AC Euro Sign),
- Look at the size of
n
, to determine how many bytes will be the UTF-8 encoding - Peel off six bits at a time from the right: these bits, with
10
prefixed, become the rightmost bytes - For the first byte, use the leftover bits with a prefix that depends on the number of bytes
The code which does it:
% Division and remainder. Based on https://tex.stackexchange.com/a/34449/48 but with bugs fixed. Note: assumes #2 positive
\def\modulo#1#2{\ifnum \numexpr(#1 - (#1/#2)*(#2))\relax < 0 (#1 - (#1/#2)*(#2) + #2) \else (#1 - (#1/#2)*(#2)) \fi}
\def\truncdiv#1#2{((#1 - \modulo{#1}{#2})/(#2))}
% The hypothetical continuation bytes: last byte, the one before that, etc.
\def\utfByteLastOne #1{\numexpr(128 + \modulo{#1}{64})\relax}
\def\utfByteLastTwo #1{\numexpr(128 + \modulo{\truncdiv{#1}{64}}{64})\relax}
\def\utfByteLastThree#1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64)}}{64})\relax}
\def\utfByteLastFour #1{\numexpr(128 + \modulo{\truncdiv{#1}{(64*64*64)}}{64})\relax}
% The actual individual bytes in the stream, for lengths 1 to 4.
\def\utfStreamOneByteOne #1{#1}
\def\utfStreamTwoByteOne #1{\numexpr(\utfByteLastTwo{#1} + 64)\relax}
\def\utfStreamTwoByteTwo #1{\utfByteLastOne{#1}}
\def\utfStreamThreeByteOne #1{\numexpr(\utfByteLastThree{#1} + 96)\relax}
\def\utfStreamThreeByteTwo #1{\utfByteLastTwo{#1}}
\def\utfStreamThreeByteThree#1{\utfByteLastOne{#1}}
\def\utfStreamFourByteOne #1{\numexpr(\utfByteLastFour{#1} + 112)\relax}
\def\utfStreamFourByteTwo #1{\utfByteLastThree{#1}}
\def\utfStreamFourByteThree #1{\utfByteLastTwo{#1}}
\def\utfStreamFourByteFour #1{\utfByteLastOne{#1}}
% Expands to \utfCallbackOne{#1} or ... or \utfCallbackFour{#1} depending on whether the code point #1 has 1, 2, 3, or 4 bytes in its UTF-8 encoding.
\def\utfStreamFromNumber#1{%
\ifnum #1 < 128
\utfCallbackOne{#1}%
\else \ifnum #1 < 2048 % 2^11
\utfCallbackTwo{#1}%
\else \ifnum #1 < 65536 % 2^16
\utfCallbackThree{#1}%
\else
\utfCallbackFour{#1}%
\fi
\fi
\fi
}
The interface of this UTF-8 encoder is quite general (and also completely independent of inputenc
) in the sense that you can simply redefine the \utfCallbackOne
to \utfCallbackFour
macros, to do different things with the bytes of the UTF-8 stream. For example, see the first revision of this answer for a version that simply prints out the bytes (and without \usepackage[utf8]{inputenc}
).
The second part (injecting arbitrary bytes/tokens into the input stream, as if the user had typed them) is straightforward in LuaTeX, but nontrivial in other engines. Firstly, if all we want to do is emulate what inputenc
would have done if those tokens were inserted (which is what the question asks for), then we can use the fact that inputenc
stores the \DeclareUnicodeCharacter
definitions in internal macros named u8:<byte1><byte2><byte3>
etc., and simply invoke those macros. The code below uses the \lccode
and \lowercase
trick (inspired by utf8ienc.dtx
) to construct the macro names that have “special” bytes in them.
% These get the full byte stream -- here, do whatever you want with the individual bytes.
% In this case, these callbacks call internal control sequences u8:... defined by \DeclareUnicodeCharacter
\def\utfCallbackOne #1{{\char\utfStreamOneByteOne{#1}}}
\def\utfCallbackTwo #1{{\lccode`A=\utfStreamTwoByteOne{#1}\lccode`B=\utfStreamTwoByteTwo{#1}\lowercase{\csname u8:AB\endcsname}}}
\def\utfCallbackThree#1{{\lccode`A=\utfStreamThreeByteOne{#1}\lccode`B=\utfStreamThreeByteTwo{#1}\lccode`C=\utfStreamThreeByteThree{#1}\lowercase{\csname u8:ABC\endcsname}}}
\def\utfCallbackFour #1{{\lccode`A=\utfStreamFourByteOne{#1}\lccode`B=\utfStreamFourByteTwo{#1}\lccode`C=\utfStreamFourByteThree{#1}\lccode`D=\utfStreamFourByteFour{#1}\lowercase{\csname u8:ABCD\endcsname}}}
\newcommand\unicodetoutf[1]{\utfStreamFromNumber{"#1}}
€ ^^e2^^82^^ac \unicodetoutf{20AC}
© ^^c2^^a9 \unicodetoutf{00A9}
^^f0^^92^^8d^^85 \unicodetoutf{12345}
\end{document}
But this is “cheating” in some sense: we're using the fact that we know the internal names of the macros, to call them. Also, if we use a character that has not been given a definition with \DeclareUnicodeCharacter
, then \csname u8:...
has no definition and TeX treats it as \relax
, so we get no error — so we need a wrapper that first checks whether it has been defined, for which we need to reproduce or borrow more from inputenc
.
Instead, we can extend the \lccode
/\lowercase
trick to get the effect of having typed those characters, without assuming anything about the internals of inputenc
. Let's think about it: why can't we just write \lowercase{AB}
, after setting the \lccode
s of A
and B
? It's because \lowercase
changes the character codes, but not the category codes. So we need to change the category codes of A
and B
as well, to be the same as those of the bytes in the UTF-8 stream. But even after doing that, if we say \lowercase{AB}
inside the macro, then it picks up the catcodes at the time of reading the macro, which is not what we want. One solution I was able to find was \scantokens
(introduced in e-TeX). Using it, we can get a “true” solution, with no “cheating”: replace the previous snippet with:
% The full byte stream -- here, do whatever you want with the individual bytes.
\newcount\charX
\newcount\charY
\newcount\charZ
\newcount\charW
\def\utfCallbackOne #1{{%
\charX=\utfStreamOneByteOne{#1}%
\catcode255=\catcode\charX%
\lccode255=\charX%
\lowercase{\scantokens{^^ff\endlinechar=-1}}}}
\def\utfCallbackTwo #1{{%
\charX=\utfStreamTwoByteOne{#1}%
\charY=\utfStreamTwoByteTwo{#1}%
\catcode`X=\catcode\charX%
\catcode`Y=\catcode\charY%
\lccode`X=\charX%
\lccode`Y=\charY%
\lowercase{\scantokens{XY\endlinechar=-1}}}}
\def\utfCallbackThree#1{{%
\charX=\utfStreamThreeByteOne{#1}%
\charY=\utfStreamThreeByteTwo{#1}%
\charZ=\utfStreamThreeByteThree{#1}%
\catcode`X=\catcode\charX%
\catcode`Y=\catcode\charY%
\catcode`Z=\catcode\charZ%
\lccode`X=\charX%
\lccode`Y=\charY%
\lccode`Z=\charZ%
\lowercase{\scantokens{XYZ\endlinechar=-1}}}}
\def\utfCallbackFour #1{{%
\charX=\utfStreamFourByteOne{#1}%
\charY=\utfStreamFourByteTwo{#1}%
\charZ=\utfStreamFourByteThree{#1}%
\charW=\utfStreamFourByteFour{#1}%
\catcode`X=\catcode\charX%
\catcode`Y=\catcode\charY%
\catcode`Z=\catcode\charZ%
\catcode`W=\catcode\charW%
\lccode`X=\charX%
\lccode`Y=\charY%
\lccode`Z=\charZ%
\lccode`W=\charW%
\lowercase{\scantokens{XYZW\endlinechar=-1}}}}
\newcommand\unicodetoutf[1]{\utfStreamFromNumber{"#1}}
A ^^41 \unicodetoutf{0041}
\def~{what}
~ ^^7e \unicodetoutf{007E}
€ ^^e2^^82^^ac \unicodetoutf{20AC}
© ^^c2^^a9 \unicodetoutf{00A9}
^^f0^^92^^8d^^85 \unicodetoutf{12345}
\end{document}
Now everything works! The same output as before, and in each case we have injected tokens equivalent to having typed out those Unicode characters in the file directly.
This directly implements the Unicode to UTF-8 conversion algorithm, in fully expandable way.
It uses the standard feature that an UTF-8 sequence abcd
(at most four characters) is transformed into the control sequence \u8:abcd
.
If the Unicode point is not defined, pdflatex
stops with an error
! Undefined control sequence.
<argument> \ERROR
BAD UTF (U+AAAA)
l.98 \unicodetoutf{AAAA}
Here's the code
% algorithm from https://home.kpn.nl/vanadovv/uni/utf8conversion.html
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
\NewExpandableDocumentCommand{\unicodetoutf}{m}
{
\int_compare:nTF { "#1 < 128 }
{
\char_generate:nn { "#1 } { 12 }
}
{
\egreg_unicodetoutf:n { #1 }
}
}
\cs_new:Nn \egreg_unicodetoutf:n
{
\__egreg_unicodetoutf:nf { #1 }
{
\int_compare:nTF { "#1 < 2048 }
{
\__egreg_unicodetoutf_two:n { #1 }
}
{
\int_compare:nTF { "#1 < 65536 }
{
\__egreg_unicodetoutf_three:n { #1 }
}
{
\__egreg_unicodetoutf_four:n { #1 }
}
}
}
}
\cs_new:Nn \__egreg_unicodetoutf:nn
{
\cs_if_exist_use:cF { #2 } { \ERROR BAD~UTF~(U+#1) }
}
\cs_generate_variant:Nn \__egreg_unicodetoutf:nn { nf }
\cs_new:Nn \__egreg_unicodetoutf_two:n
{
u8:
\char_generate:nn { 192 + \int_div_truncate:nn { "#1 } { 64 } } { 12 }
\char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
}
\cs_new:Nn \__egreg_unicodetoutf_three:n
{
u8:
\char_generate:nn { 224 + \int_div_truncate:nn { "#1 } { 4096} } { 12 }
\char_generate:nn
{
128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
}
{ 12 }
\char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
}
\cs_new:Nn \__egreg_unicodetoutf_four:n
{
u8:
\char_generate:nn { 240 + \int_div_truncate:nn { "#1 } { 262144 } } { 12 }
\char_generate:nn
{
128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 4096 } } { 64 }
}
{ 12 }
\char_generate:nn
{
128 + \int_mod:nn { \int_div_truncate:nn { "#1 } { 64 } } { 64 }
}
{ 12 }
\char_generate:nn { 128 + \int_mod:nn { "#1 } { 64 } } { 12 }
}
\ExplSyntaxOff
\DeclareUnicodeCharacter{00E9}{b}
\DeclareUnicodeCharacter{20AC}{Hallo}
\DeclareUnicodeCharacter{10006}{World}
\begin{document}
a ^^61 \unicodetoutf{0061}
é ^^c3^^a9 \unicodetoutf{00E9}
€ ^^e2^^82^^ac \unicodetoutf{20AC} %<-- should output Hallo
^^f0^^90^^80^^86 \unicodetoutf{10006} % should output World
\edef\temp{\unicodetoutf{20AC}}
\texttt{\meaning\temp}
\unicodetoutf{AAAA}
\end{document}