! Package inputenc Error: Unicode char (U+200B)?
There are four characters U+200B (ZERO WIDTH SPACE) between clouds
and publics
/privés
. A simple fix is to remove the space with the p
and retype them. Maybe, these characters were introduced by the editor or some copy/paste operation.
This is one of the top hits for U+200B and LaTeX, so I’ll post solutions here.
Take the following example:
\tracinglostchars=2
\documentclass{article}
\pagestyle{empty}
\begin{document}fl fl fl\end{document}
In LuaLaTeX, it compiles to:
The first fl has no ligature because I inserted U+200B, a zero-width space. The second has no ligature because I inserted U+200C, a zero-width non-joiner. These might have been in the original source you copied from intentionally: a zero-width space could mean a potential line break, such as after a slash, and a zero-width non-joiner disables a ligature. For example, the fi in Elfin or the fl in Halfling are (according to pedants like me) not supposed to be ligated, since they belong to different pieces of a compound word, but this is more common in some other languages.
If you try to compile it in PDFLaTeX, you will get the error message that brought you here:
! Package inputenc Error: Unicode character (U+200B)
(inputenc) not set up for use with LaTeX.
There are several ways to fix it.
Clean Your Source by Hand
This is what most people on this site recommend. Your editor might have a way to display special characters so you can delete them all. But really, isn’t this a job for a computer?
Clean Your Source with Perl
The following one-line Perl script will create a new source file with all zero-width spaces removed:
perl -CSD -pe "s/\N{U+200B}//gu" < U200B.tex > noU200B.tex
If it’s easier to remember, you could also write this as
perl -CSD -pe "s/\N{ZERO WIDTH SPACE}//gu" < U200B.tex > noU200B.tex
The -CSD
option selects UTF-8 unconditionally, even if you don’t have UTF-8 as your default locale. The -pe
option runs the given Perl script on the input file and prints to the output file. The s
command does substitution, the \N{...}
is a regular expression matching zero-width space, the empty field between //
means replace with nothing, and gu
means replace all instances globally in the unicode string. Then, the <
and >
operators select the input and output files.
Either of these produce a file that compiles to:
It’s also possible to automatically remove all characters outside a given subset. The script
perl -CSD -pe "s/[^\p{Word}\p{Punct}\p{Symbol}\p{Mark}\p{PerlSpace}]//gu"
allows only the following: Unicode “word” characters, punctuation, symbols, accents and a few kinds of spaces. It erases most invisible characters. An even more restrictive version would be
perl -CSD -pe "s/[^\p{ASCII}]//gu"
This cleans out all characters but for the ASCII originally allowed in TeX (including “
instead of double backtick).
And yes, we could replace zero-width space by something instead of nothing. The script
perl -CSD -pe "s/\N{ZERO WIDTH SPACE}/{\\\\hskip 0pt}/gu; s/\N{ZERO WIDTH NON-JOINER}/{}/gu"
given the above MWE as input, produces the following output:
\tracinglostchars=2
\documentclass{article}
\pagestyle{empty}
\begin{document}f{\hskip 0pt}l f{}l fl\end{document}
Teach LaTeX to Understand Zero-Width Space
If the problem is that U+200B is “not set up for use with LaTeX,” but it’s equivalent to a TeX command—\hskip 0pt
or \hspace{0pt}
are zero-width spaces that prevent ligatures and enable a potential line break—we can set the character up to use that command.
\tracinglostchars=2
\documentclass{article}
\usepackage{iftex}
\pagestyle{empty}
\ifTUTeX
\usepackage{fontspec}
\else
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % The default since 2018
\DeclareUnicodeCharacter{200B}{{\hskip 0pt}}
\fi
\begin{document}fl fl fl\end{document}
Although the \DeclareUnicodeCharacter
command is in inputenc
, the LaTeX kernel has loaded it by default since 2018. So, we could have skipped declaring it.
These errors are hard to debug because if you have a long document then add a citation to a bibtex entry which has a non-unicode character, it's pretty hard to know which citation is the problem and where to look.
My solution was to do the following:
Run pdflatex
or whatever, at the command line. This gives you the error in "console" format, like this:
Package inputenc Error: Unicode char ́ (U+301) not set up for use with LaTeX
I could then copy/paste the offending character (in my case, a weird accent on a bit of Spanish text) and paste it into the following command:
rgrep ́
This allowed me to find the offending item in one of my .bib files and I could make the fix and get on with my life.