How to get string length from multi-byte character string when passed as a \def using pdftex
You just have to expand #1 once in the context of the \zz definition. This allows the first argument to be a macro containing the actual data of interest.
\documentclass[a4paper,11pt]{article}
% Get the length of a UTF-8 multibyte string.
% Works when the string is supplied directly
% and when it is stored in a macro defined with \def.
%
% https://tex.stackexchange.com/questions/419215/multibyte-strlen-strlen-for-chinese-characters
%
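% \zz expands the first token of #1 once, so #1 may be a macro holding the string.
\def\zz#1{\edef\theresult{\expandafter\zzz\expandafter0#1\relax}}
% \zzz{<count so far>}<next byte>: stop at \relax; otherwise add 1 for a
% 7-bit ASCII byte (< "80) or a UTF-8 prefix byte (> "BF) and recurse.
\def\zzz#1#2{%
  \ifx\relax#2 \the\numexpr#1\relax
  \else
    \expandafter\zzz\expandafter{%
      \the\numexpr(#1+\ifnum\expandafter`\string#2<"80 1\else \ifnum\expandafter`\string#2>"BF 1 \else 0 \fi\fi
      \expandafter)\expandafter\relax\expandafter}%
  \fi}%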
\begin{document}
\def\v1{abc}
\v1 \zz{abc} \theresult\\ % string supplied directly
\zz{\v1} \theresult\\ % string supplied through a macro; without the single expansion of #1 this gave "Missing = inserted for \ifnum"
\end{document}
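The effect of the \expandafter chain in \zz can be seen in isolation with a small sketch. The helper \firstrest and the macro \data below are hypothetical names, not part of the code above; \firstrest only prints its two arguments verbatim, so the difference between an unexpanded and a once-expanded macro becomes visible.

```latex
\documentclass{article}
% \firstrest is a hypothetical helper, used only for this illustration:
% it prints its first token and the remainder up to \relax verbatim.
\def\firstrest#1#2\relax{\texttt{[\detokenize{#1}][\detokenize{#2}]}}
\def\data{abc}
\begin{document}
% No expansion: \firstrest receives the macro \data itself.
\firstrest0\data\relax

% Same chain as in \zz: \data is expanded once before \firstrest runs,
% so \firstrest receives the characters abc.
\expandafter\firstrest\expandafter0\data\relax
\end{document}
```

The first \expandafter reaches over \firstrest and triggers the second one, which in turn reaches over the 0 and expands \data. Only then does \firstrest grab its arguments.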
P.S. Using non-alphabetic characters in user macro names, such as \v1, is generally not good practice. Although the macro may appear to be named \v1, it is in fact named \v, and its definition requires it to be followed by the mandatory delimiter 1.
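One way to check how TeX parsed such a definition is to typeset the \meaning of the macro. The following is a minimal sketch; \vone is a hypothetical letters-only name chosen for comparison.

```latex
\documentclass{article}
\begin{document}
\def\v1{abc}% defines \v with the delimiter 1 in its parameter text
\texttt{\meaning\v}\par % typesets macro:1->abc

\def\vone{abc}% hypothetical letters-only name, no delimiter involved
\texttt{\meaning\vone} % typesets macro:->abc
\end{document}
```

The 1 between the colon and the arrow is the parameter text: TeX expects a literal 1 after every use of \v.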
SUPPLEMENT
Here's a different approach that, while not expandable, counts spaces, group tokens, and unexpanded macro tokens as well as characters. It expands the argument only if it consists of a single token, which covers the case of interest to the OP.
It counts the individual tokens inside groups (adding 2 for each pair of braces), rather than treating the group as a single "token".
\documentclass[a4paper,11pt]{article}
\usepackage{tokcycle}
\newcounter{mycount}
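\tokcycleenvironment\countenv
  {\stepcounter{mycount}}% character: count 1
  {\addtocounter{mycount}{2}\processtoks{##1}}% group: count both braces, then process the contents
  {\stepcounter{mycount}}% macro: count 1 (not expanded)
  {\stepcounter{mycount}}% space: count 1
\newcommand\countem[1]{%
  \setcounter{mycount}{0}%
  \countenv#1\endcountenv
  \ifnum\themycount=1\relax
    % a single token: expand it once and count again
    \setcounter{mycount}{0}%
    \expandafter\countenv#1\endcountenv
  \fi
  \themycount
}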
\begin{document}
\def\v{abc}
\countem{abc}
\countem{\v}
\countem{\v2345}
\countem{a b{c{\today}e}fg}
\end{document}
Here's a modification of David Carlisle's \zz macro at https://tex.stackexchange.com/a/419216/4427 that distinguishes whether the argument of \utfstrlen is a single macro or not. Don't try squeezing two macros into the argument: pass either a single macro representing a string of UTF-8 characters or an explicit list of characters.
\documentclass{article}
%\usepackage{xparse} % uncomment if using LaTeX release prior to 2020-10-01
\ExplSyntaxOn
\NewExpandableDocumentCommand{\utfstrlen}{m}
  {
    \egreg_utf_str_len:n { #1 }
  }
\cs_generate_variant:Nn \tl_to_str:n { e }
\cs_new:Nn \egreg_utf_str_len:n
  {
    \bool_lazy_and:nnTF { \tl_if_single_p:n { #1 } } { \token_if_cs_p:N #1 }
      {% #1 is a single control sequence
        \__egreg_utf_str_len:e { \tl_to_str:e { \exp_not:V #1 } }
      }
      {% #1 is a list of characters
        \__egreg_utf_str_len:e { \tl_to_str:n { #1 } }
      }
  }
\cs_new:Nn \__egreg_utf_str_len:n
  {
    \int_eval:n { \tl_map_function:nN { #1 } \__egreg_utf_char:n }
  }
\cs_generate_variant:Nn \__egreg_utf_str_len:n { e }
\cs_new:Nn \__egreg_utf_char:n
  {
    \int_compare:nTF { `#1 < "80 }
      { +1 } % 7-bit ASCII
      { \int_compare:nT { `#1 > "BF } { +1 } } % prefix character
  }
\ExplSyntaxOff
\begin{document}
\utfstrlen{容容}
\utfstrlen{abc}
\utfstrlen{¢Àïα}
\def\test{容容abc¢Àïα}
\utfstrlen{\test}
\end{document}
As in David's answer, the list of characters (possibly obtained by expanding a macro once) is parsed one character at a time. If a 7-bit ASCII character is found, 1 is added. If a character with a code higher than hexadecimal BF is found, it is the prefix of a multibyte character, so 1 is added. Any other character is ignored.
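The classification rule can be tried on individual bytes. This is only a sketch: \byteclass is a hypothetical helper, not part of the code above, and it takes a byte written as two uppercase hexadecimal digits.

```latex
\documentclass{article}
% \byteclass is a hypothetical helper for illustration only: given a byte
% as two uppercase hexadecimal digits, it reports how the rule treats it.
\newcommand\byteclass[1]{%
  \ifnum"#1<"80 counted (7-bit ASCII)%
  \else\ifnum"#1>"BF counted (prefix byte)%
  \else ignored (continuation byte)%
  \fi\fi}
\begin{document}
61: \byteclass{61}\par % the letter a
E5: \byteclass{E5}\par % first byte of U+5BB9, encoded in UTF-8 as E5 AE B9
AE: \byteclass{AE}\par % second byte of U+5BB9
B9: \byteclass{B9}     % third byte of U+5BB9
\end{document}
```

Of the three bytes E5 AE B9, only E5 is counted, so the character U+5BB9 contributes exactly 1 to the length.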
The macro \utfstrlen is fully expandable.