Capitalizing strings ignoring closed class words
\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}{
\CapitalizeFirst#1\Capitalize\unskip
}
\ExplSyntaxOff
\def\Sentinel{\Capitalize}
\def\CapitalizeFirst#1{\MakeUppercase#1 \Capitalize}
\def\Capitalize#1{%
\def\next{#1}%
\ifx\next\Sentinel
\expandafter\unskip
\else
\CheckInList{#1}\space\expandafter\Capitalize
\fi}
\def\CheckInList#1{%
\ifcsname List@\detokenize{#1}\endcsname
#1%
\else
\MakeUppercase#1%
\fi}
\makeatletter
\def\AppendToList#1{%
\@for\next:=#1\do
{\expandafter\let\csname List@\detokenize\expandafter{\next}\endcsname\empty}}
\makeatother
\AppendToList{a,is,of}
\begin{document}
\capitalize{here is a list of words école}
\end{document}
This won't work with UTF-8 input (unless XeLaTeX or LuaLaTeX is used). With pdflatex it fails because \MakeUppercase is applied only to the first byte of what may be a two-, three- or four-byte sequence (for Western languages probably only two). For it to work, the whole block of bytes has to be passed to \MakeUppercase.
To be clearer: \MakeUppercase uppercases its argument, and in general the call is \MakeUppercase{word}. Here we are writing \MakeUppercase#1 instead (without braces), so only the first token (usually a character) is uppercased. This is where it fails with input such as \'ecole: the token passed to \MakeUppercase would be \', which it doesn't know how to handle. With école (and a one-byte encoding such as latin1), \MakeUppercase processes é and gives the correct result.
With UTF-8 this would fail: what we see as é on our screen when writing a LaTeX document is actually two bytes (C3 and A9), and again \MakeUppercase would be passed only the first one. So a more complex routine is necessary.
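The argument-grabbing rule described above can be checked in isolation with plain ASCII input, where no encoding issue gets in the way. This is a minimal sketch, not part of the original code; the test word is arbitrary.

```latex
\documentclass{article}
\begin{document}
% Unbraced: TeX takes only the next token ("w") as the argument,
% so just that one letter is uppercased.
\MakeUppercase word

% Braced: the whole group is the argument, so every letter is uppercased.
\MakeUppercase{word}
\end{document}
```

The first line comes out as "Word" and the second as "WORD". With a multi-byte character in first position under pdflatex, the unbraced form would hand over only part of that character.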
To make this work with pdflatex and UTF-8, the definitions of \CheckInList and \CapitalizeFirst above can be changed into the following (the code uses internal names containing @, so it has to go between \makeatletter and \makeatother):
\def\CapitalizeFirst#1{\expandafter\UC@next#1 \Capitalize}
\def\CheckInList#1{%
\ifcsname List@\detokenize{#1}\endcsname
#1%
\else
\expandafter\UC@next#1%
\fi}
\def\UC@next#1{%
\ifx#1\UTFviii@two@octets
\expandafter\@firstoffour
\else
\ifx#1\UTFviii@three@octets
\expandafter\expandafter\expandafter\@secondoffour
\else
\ifx#1\UTFviii@four@octets
\expandafter\expandafter\expandafter\expandafter\expandafter
\@thirdoffour
\else
\expandafter\expandafter\expandafter\expandafter\expandafter
\expandafter\expandafter\@fourthoffour
\fi
\fi
\fi
{\UC@two}{\UC@three}{\UC@four}{\MakeUppercase}#1}
\def\UC@two#1#2#3{\MakeUppercase{#1#2#3}}
\def\UC@three#1#2#3#4{\MakeUppercase{#1#2#3#4}}
\def\UC@four#1#2#3#4#5{\MakeUppercase{#1#2#3#4#5}}
\providecommand\@firstoffour[4]{#1}
\providecommand\@secondoffour[4]{#2}
\providecommand\@thirdoffour[4]{#3}
\providecommand\@fourthoffour[4]{#4}
However, accent commands such as \'e are not allowed (nor are they in the first version).
UPDATE
After a few years, here's a better implementation, thanks to new expl3 features; it works for all engines.
\documentclass[a4paper]{article}
\usepackage{ifxetex}
\ifxetex
\usepackage{fontspec}
\else
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\fi
\usepackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}
{
\seq_clear:N \l_capitalize_words_seq
\ProcessList{#1}{\CapitalizeFirst}
\seq_use:Nn \l_capitalize_words_seq { ~ }
}
\NewDocumentCommand{\CapitalizeFirst}{m}
{
\capitalize_word:n { #1 }
}
\sys_if_engine_pdftex:TF
{
\cs_set_eq:Nc \capitalize_tl_set:Nn { protected@edef }
}
{
\cs_set_eq:NN \capitalize_tl_set:Nn \tl_set:Nn
}
\cs_new_protected:Nn \capitalize_word:n
{
\capitalize_tl_set:Nn \l_capitalize_word_tl { #1 }
\seq_if_in:NfTF \g_capitalize_exceptions_seq { \tl_to_str:n { #1 } }
% exception word
{ \seq_put_right:Nn \l_capitalize_words_seq { #1 } } % exception word
% to be uppercased
{ \seq_put_right:Nx \l_capitalize_words_seq { \tl_mixed_case:V \l_capitalize_word_tl } }
}
\cs_generate_variant:Nn \tl_mixed_case:n { V }
\NewDocumentCommand{\AppendToList}{m}
{
\clist_map_inline:nn { #1 }
{
\seq_gput_right:Nx \g_capitalize_exceptions_seq { \tl_to_str:n { ##1 } }
}
}
\cs_generate_variant:Nn \seq_if_in:NnTF { Nf }
\seq_new:N \l_capitalize_words_seq
\seq_new:N \g_capitalize_exceptions_seq
\ExplSyntaxOff
\AppendToList{a,is,of,óf}
\begin{document}
X\capitalize{here is a list of words óf école}X
\end{document}
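On current expl3 releases \tl_mixed_case:n appears to have been superseded by the \text_titlecase... family, so the listing above may need adapting there. The following is a minimal sketch of the same idea, not the original author's code. It assumes a LaTeX release recent enough (roughly 2020-10 or later) to provide \NewDocumentCommand, \ExplSyntaxOn and \text_titlecase_first:n without loading extra packages. The command \sketchcapitalize and all names containing "capsketch" are hypothetical, and the exception list is hard-wired for brevity.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names, chosen only for this sketch
\seq_new:N \l_capsketch_words_seq
\seq_new:N \l_capsketch_result_seq
\clist_const:Nn \c_capsketch_exceptions_clist { a, is, of }

\NewDocumentCommand{\sketchcapitalize}{m}
 {
  % Split the input at spaces (in expl3 syntax, ~ stands for a space)
  \seq_set_split:Nnn \l_capsketch_words_seq { ~ } { #1 }
  \seq_clear:N \l_capsketch_result_seq
  \seq_map_inline:Nn \l_capsketch_words_seq
   {
    \clist_if_in:NnTF \c_capsketch_exceptions_clist { ##1 }
     % exception word: keep as typed
     { \seq_put_right:Nn \l_capsketch_result_seq { ##1 } }
     % any other word: titlecase its first letter only
     {
      \seq_put_right:Nn \l_capsketch_result_seq
       { \text_titlecase_first:n { ##1 } }
     }
   }
  % Put the words back together, separated by single spaces
  \seq_use:Nn \l_capsketch_result_seq { ~ }
 }
\ExplSyntaxOff
\begin{document}
\sketchcapitalize{here is a list of words école}
\end{document}
```

The design mirrors the listing above: split at spaces, test each word against the exception list, and rejoin with spaces. The engine-specific branch is left out on the assumption that \text_titlecase_first:n handles UTF-8 input itself.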
A ConTeXt solution:
You can use the command \applytosplitstringwordspaced for this:
\def\IgnoredWords
{a,is,to,of,or,and}
\define[1]\CapitalizeWithIgnoreWord
{\doifinsetelse{#1}\IgnoredWords{#1}{\Words{#1}}}
\def\CapitalizeWithIgnore
{\applytosplitstringwordspaced\CapitalizeWithIgnoreWord}
\starttext
\CapitalizeWithIgnore{This is some of my input or another and to the end.}
\stoptext
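This typesets the sentence with every word capitalized except those listed in \IgnoredWords.
The \applytosplitstringwordspaced command divides the input into words and applies the macro \CapitalizeWithIgnoreWord, which takes one argument, to each of them. The macro then simply tests whether the given word is a member of the word list: if it is, the word is printed unchanged; otherwise it is printed capitalized with \Words.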
The titlecaps package is newly introduced and demonstrated here: Headings in uppercase. It handles titling of diacritical marks (e.g., umlauts) and national symbols (e.g., oe), and is compatible with (i.e., can include in its argument) commands that change the font characteristics, such as \textit{}, \scshape, and \footnotesize. Further, it allows words to be designated as lower-cased, for example prepositions and conjunctions, so that they are screened out and not titled. The presence of punctuation should not affect the package's ability either to capitalize a word or to detect it as a pre-designated lower-cased word.
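As a minimal sketch of how the package is typically used, the example below assumes its \Addlcwords command (which takes a space-separated list of words to keep in lower case) and its \titlecap command. The word list and the test sentence are taken from the examples above and are not part of the package documentation.

```latex
\documentclass{article}
\usepackage{titlecaps}
% Words to be left in lower case (space-separated, not comma-separated)
\Addlcwords{a is of}
\begin{document}
\titlecap{here is a list of words}
\end{document}
```

Unlike the macros above, the exception list here is space-separated, so an existing comma-separated list has to be converted before it is passed to \Addlcwords.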