Is there any reason to use inputenc?
The basic LaTeX/TeX engine expects (or rather, is designed to process) pure ASCII input. Whenever your file uses any other characters, the inputenc package comes to the rescue, telling the engine how to interpret the symbols you type.
So it's quite necessary, whenever you use Unicode (non-ASCII) characters, to load the inputenc package in order to get meaningful output (or sometimes even a successful run of (La)TeX).
The difference comes with the "naturally UTF-8 compliant" engines, such as LuaTeX and XeTeX, which automatically interpret input files as UTF-8 and won't accept different input encodings: in those cases \usepackage[utf8]{inputenc} can be omitted, since it does basically nothing (and is not used internally anyway).
To put it in other terms, the programs do not check whether the characters in the file comply with the ASCII standard; they simply assume that they do.
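For example, with pdfLaTeX a set-up along these lines is typical (a minimal sketch; the accented words are just stand-ins for whatever non-ASCII characters your document uses):
\documentclass{article}
\usepackage[utf8]{inputenc} % declare how the bytes in this file should be interpreted
\usepackage[T1]{fontenc}    % a font encoding that contains the accented glyphs
\begin{document}
Déjà vu, naïve café.
\end{document}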
With the 2018 release of LaTeX the test file below produces the expected output, as UTF-8 is assumed as the default input encoding unless you specify a different encoding via inputenc, and the BOM at the start of the file is handled gracefully (ignored in this case).
Original answer
With inputenc commented out I get garbled output (the © and £ do not come out as typed), despite typing the input in emacs.
\documentclass{article}
\usepackage[T1]{fontenc}
%\usepackage[utf8]{inputenc}
\begin{document}
© David Carlisle and cost £2000.
\end{document}
Since there seems to be some discussion about the BOM...
If the above file is saved with the byte order mark (or any printable character) before the \documentclass
then you get an error
! LaTeX Error: Missing \begin{document}.
but this is not built into the TeX engine; it is just the default setting for those characters, which can be changed depending on how you call LaTeX.
The commandline
pdflatex '\catcode"EF=9\catcode"BB=9\catcode"BF=9 \input' testfile
would declare the BOM bytes safe, and LaTeX would then process the file without error and give the same bad output as shown above. The presence of a BOM in no way implies UTF-8 encoding to the system.
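Equivalently (just a sketch of the same idea, assuming the BOM-prefixed file is called testfile.tex), the catcode assignments could go in a small driver file that you compile instead:
\catcode"EF=9 \catcode"BB=9 \catcode"BF=9 % make TeX ignore the three BOM bytes
\input{testfile}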
Here are some examples to make explicit some details (implicit in the other answers), which may help clear up any remaining confusion.
Consider the following Unicode text: añ©ⱥ, which consists of:
- U+0061 LATIN SMALL LETTER A, encoded in UTF-8 as 61
- U+00F1 LATIN SMALL LETTER N WITH TILDE, encoded in UTF-8 as C3 B1
- U+00A9 COPYRIGHT SIGN, encoded in UTF-8 as C2 A9
- U+2C65 LATIN SMALL LETTER A WITH STROKE, encoded in UTF-8 as E2 B1 A5
A byte is a number from 0 to 255 in decimal, or 00 to FF in hexadecimal. So when encoded with UTF-8, the above "four-character" string corresponds, in the file, to the 8 bytes 61 C3 B1 C2 A9 E2 B1 A5.
tex/pdftex WITHOUT inputenc
The engine sees the input as a stream of bytes (8 bytes in the above example). It treats each of them as a character, and typesets the corresponding glyph from either the T1 (Cork) encoding, the OT1 encoding (the default), or whatever font encoding is set up. With OT1, for example, there are no glyphs in the slots those bytes refer to, so nothing gets typeset.
I hope you can see what's happening: each of the 8 bytes is treated as a character and output; e.g. the byte C3 is “Ã” in T1.
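If you want to reproduce this behaviour on a current LaTeX (where, since 2018, UTF-8 handling is otherwise on by default), a minimal sketch is:
\UseRawInputEncoding        % restore the old byte-by-byte reading of the input file
\documentclass{article}
\usepackage[T1]{fontenc}    % each byte is then typeset as a single T1 glyph
\begin{document}
añ©ⱥ
\end{document}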
tex/pdftex WITH inputenc
With \usepackage[utf8]{inputenc}, TeX correctly sees every sequence of UTF-8 bytes as a Unicode character. For example, when TeX sees the byte sequence C3 B1, it understands that you mean the Unicode character U+00F1. (The way this is done is that bytes larger than 127 (80 to FF in hexadecimal) are set up to be active characters that expect further input — this is possible because of a useful design of UTF-8. See texdoc utf8ienc for details.)
TeX still needs to know what to do with that Unicode character. A big bunch of definitions (such as \DeclareUnicodeCharacter{00F1}{\~n}, saying what to do with the character U+00F1) are included in the TeX distribution (file texmf-dist/tex/latex/base/utf8.def on TeX Live). So using \usepackage[utf8]{inputenc} will help if your characters have such definitions (again, see texdoc utf8ienc for the full list), or if you're willing to define them yourself.
With a Unicode-aware engine (XeTeX or LuaTeX)
You don't need inputenc. The engine expects UTF-8 (by default), understands the input simply as Unicode characters, and for each of them typesets that character from the currently selected font.
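A minimal sketch for those engines (the font name is only an example; any OpenType font that actually contains these glyphs will do):
% compile with lualatex or xelatex
\documentclass{article}
\usepackage{fontspec}        % font selection for the Unicode-aware engines
\setmainfont{DejaVu Serif}   % assumed installed; pick any font covering the glyphs you need
\begin{document}
añ©ⱥ
\end{document}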
What about BOM?
With UTF-8 the BOM (byte order mark) isn't needed (it was meant for non-byte-oriented encodings, like UTF-16 and UTF-32), and is strongly discouraged. Typical “good” editors won't include it. Just forget about it; you aren't likely to encounter it in practice.
But if somehow your file does end up including it, then it's just the sequence of bytes EF BB BF (the UTF-8 encoding of U+FEFF), and I think you have enough information above to work out what would happen if those bytes were present at any given place in the file.
What if my file contains only "normal" characters?
If you mean Latin-script characters without accents, then UTF-8 has the property that it coincides with ASCII on the range 0 to 127 (00 to 7F). So a file containing only those characters, encoded in UTF-8, is indistinguishable from one encoded in ASCII. Naturally, the output is identical too.