Is there any reason to use inputenc?
The basic LaTeX/TeX engine expects (or rather, is designed to process) pure ASCII input. Whenever your file uses any other characters, the inputenc package comes to the rescue, telling the engine how to interpret the symbols you type.
So it's quite necessary, whenever you use Unicode (non-ASCII) characters, to load the inputenc package in order to get meaningful output (or sometimes even a successful run of (La)TeX).
The difference comes with the "naturally UTF-8 compliant" engines, such as LuaTeX and XeTeX, which automatically interpret input files as UTF-8 and won't accept different input encodings: in those cases \usepackage[utf8]{inputenc} can be omitted, since it does basically nothing (and is not used internally anyway).
To put it in other terms, the programs do not check whether the characters in the file comply with the ASCII standard; they simply assume that they do.
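For example, with pdfLaTeX a set-up along these lines is typical (a minimal sketch; the accented words are just stand-ins for whatever non-ASCII characters your document uses):
\documentclass{article}
\usepackage[utf8]{inputenc} % declare how the bytes in this file should be interpreted
\usepackage[T1]{fontenc}    % a font encoding that contains the accented glyphs
\begin{document}
Déjà vu, naïve café.
\end{document}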
With the 2018 release of LaTeX the test file below produces the expected output, as UTF-8 is assumed as the default input encoding unless you specify a different encoding via inputenc, and the BOM at the start of the file is handled gracefully (ignored in this case).
Original answer
With inputenc commented out I get garbled output (the © and £ do not come out as typed), despite typing the input in emacs.
\documentclass{article}
\usepackage[T1]{fontenc}
%\usepackage[utf8]{inputenc}
\begin{document}
© David Carlisle and cost £2000.
\end{document}
Since there seems to be some discussion about the BOM...
If the above file is saved with the byte order mark (or any printable character) before the \documentclass
then you get an error
! LaTeX Error: Missing \begin{document}.
but this is not built into the TeX engine; it is just the default setting for those characters, which can be changed depending on how you call LaTeX.
The commandline
pdflatex '\catcode"EF=9\catcode"BB=9\catcode"BF=9 \input' testfile
would declare the BOM bytes safe, and LaTeX would then process the file without error and give the same bad output as shown above. The presence of a BOM in no way implies UTF-8 encoding to the system.
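Equivalently (just a sketch of the same idea, assuming the BOM-prefixed file is called testfile.tex), the catcode assignments could go in a small driver file that you compile instead:
\catcode"EF=9 \catcode"BB=9 \catcode"BF=9 % make TeX ignore the three BOM bytes
\input{testfile}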
Here are some examples to make explicit some details (implicit in the other answers), which may help clear up any remaining confusion.
Consider the following Unicode text: añ©ⱥ, which consists of:
- U+0061 LATIN SMALL LETTER A, encoded in UTF-8 as 61
- U+00F1 LATIN SMALL LETTER N WITH TILDE, encoded in UTF-8 as C3 B1
- U+00A9 COPYRIGHT SIGN, encoded in UTF-8 as C2 A9
- U+2C65 LATIN SMALL LETTER A WITH STROKE, encoded in UTF-8 as E2 B1 A5
A byte is a number from 0 to 255 in decimal, or 00 to FF in hexadecimal. So when encoded with UTF-8, the above "four-character" string corresponds, in the file, to the 8 bytes 61 C3 B1 C2 A9 E2 B1 A5.
tex/pdftex WITHOUT inputenc
The engine sees the input as a stream of bytes (8 bytes in the above example). It treats each of them as a character, and typesets the corresponding glyph from either the T1 (Cork) encoding, the OT1 encoding (the default), or whatever font encoding is set up. With OT1, for example, there are no glyphs in the slots those bytes refer to, so nothing gets typeset.
I hope you can see what's happening: each of the 8 bytes is treated as a character and output; e.g. the byte C3 is “Ã” in T1.
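If you want to reproduce this behaviour on a current LaTeX (where, since 2018, UTF-8 handling is otherwise on by default), a minimal sketch is:
\UseRawInputEncoding        % restore the old byte-by-byte reading of the input file
\documentclass{article}
\usepackage[T1]{fontenc}    % each byte is then typeset as a single T1 glyph
\begin{document}
añ©ⱥ
\end{document}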
tex/pdftex WITH inputenc
With \usepackage[utf8]{inputenc}, TeX correctly sees every sequence of UTF-8 bytes as a Unicode character. For example, when TeX sees the byte sequence C3 B1, it understands that you mean the Unicode character U+00F1. (The way this is done is that bytes larger than 127 (80 to FF in hexadecimal) are set up to be active characters that expect further input — this is possible because of a useful design of UTF-8. See texdoc utf8ienc for details.)
TeX still needs to know what to do with that Unicode character. A big bunch of definitions (such as \DeclareUnicodeCharacter{00F1}{\~n}, saying what to do with the character U+00F1) are included in the TeX distribution (file texmf-dist/tex/latex/base/utf8.def on TeX Live). So using \usepackage[utf8]{inputenc} will help if your characters have such definitions (again, see texdoc utf8ienc for the full list), or if you're willing to define them yourself.
With a Unicode-aware engine (XeTeX or LuaTeX)
You don't need inputenc. The engine expects UTF-8 (by default), understands the input simply as Unicode characters, and for each of them typesets that character from the currently selected font.
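A minimal sketch for those engines (the font name is only an example; any OpenType font that actually contains these glyphs will do):
% compile with lualatex or xelatex
\documentclass{article}
\usepackage{fontspec}        % font selection for the Unicode-aware engines
\setmainfont{DejaVu Serif}   % assumed installed; pick any font covering the glyphs you need
\begin{document}
añ©ⱥ
\end{document}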
What about BOM?
With UTF-8 the BOM (byte order mark) isn't needed (it was meant for non-byte-oriented encodings, like UTF-16 and UTF-32), and is strongly discouraged. Typical “good” editors won't include it. Just forget about it; you aren't likely to encounter it in practice.
But if somehow your file does end up including it, then it's just the sequence of bytes EF BB BF (the UTF-8 encoding of U+FEFF), and I think you have enough information above to work out what would happen if those bytes were present at any given place in the file.
What if my file contains only "normal" characters?
If you mean Latin-script characters without accents, then UTF-8 has the property that it coincides with ASCII on the range 0 to 127 (00 to 7F). So a file containing only those characters, encoded in UTF-8, is indistinguishable from one encoded in ASCII. Naturally, the output is identical too.