utf8x vs. utf8 (inputenc)
The simple answer is that utf8x is to be avoided if possible. It loads the ucs package, which was unmaintained for a long time (although there is now a new maintainer) and breaks various other things.
See also egreg's answer to this question, which outlines how to get extra characters using the [utf8] option of inputenc.
Generally, however, the best way to deal with Unicode source (especially with non-Latin scripts) is to use XeLaTeX or LuaLaTeX.
There's an extended discussion of this here: Encoding remarks. See especially the comments by Philipp Lehman and Philipp Stephani.
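As a rough illustration of the XeLaTeX/LuaLaTeX route, here is a minimal sketch. It must be compiled with xelatex or lualatex rather than pdflatex. The font name is only an assumption: any installed OpenType or TrueType font that covers the characters you type will do.

```latex
% Compile with xelatex or lualatex (not pdflatex).
% No inputenc or fontenc is needed: these engines read UTF-8 natively.
\documentclass{article}
\usepackage{fontspec}
% Assumption: DejaVu Serif is installed; substitute any font
% that covers the scripts used in the document.
\setmainfont{DejaVu Serif}
\begin{document}
Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï

Τηις ις γρεεκ
\end{document}
```

With this approach, whether a character prints depends on the font's coverage, not on per-character definitions in the preamble.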
In fact, utf8 may not be as restrictive as it seems: it only loads definitions for characters that can be displayed by the font encoding.
When typing
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
the font encoding is still OT1 when inputenc is loaded, and OT1 has very few characters. By using
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
you make all the UTF-8 characters that T1 can display available as input.
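A complete document using this loading order might look like the sketch below. The sample characters are my own choice, not taken from the answer; they are letters that T1 provides and OT1 does not.

```latex
\documentclass{article}
% Load the font encoding first, then the input encoding,
% following the order recommended above.
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
% Thorn, eth and eng are available in T1 but not in OT1.
Þþ Ðð Ŋŋ
\end{document}
```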
Don't use utf8x; with an up-to-date TeX distribution it should prove necessary only for its most obscure features (faking characters with images from the Web, for instance).
The problem with Greek, which was probably the main reason for adopting utf8x instead of utf8, has since been solved, and
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[polutonikogreek,english]{babel}
\begin{document}
This is english
\textgreek{Τηις ις γρεεκ}
This is english again.
\end{document}
will happily print the Greek text between the two English sentences.
The occasional missing definitions can be coped with in a simple way. If you're able to input a Unicode character, such as the Welsh letters
Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï
or the Latin vowels with prosodic marks
Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ Āā Ēē Īī Ōō Ūū Ȳȳ
(y with breve is missing from Unicode, while a with breve is already defined by utf8 because it's a letter in Romanian), you can simply add the unknown ones to the list of known characters:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}
% missing Welsh coverage
\newunicodechar{Ŵ}{\^W}
\newunicodechar{ŵ}{\^w}
\newunicodechar{Ŷ}{\^Y}
\newunicodechar{ŷ}{\^y}
% Latin vowels with prosodic marks
\newunicodechar{Ĕ}{\u{E}}
\newunicodechar{ĕ}{\u{e}}
\newunicodechar{Ĭ}{\u{I}}
\newunicodechar{ĭ}{\u{\i}}
\newunicodechar{Ŏ}{\u{O}}
\newunicodechar{ŏ}{\u{o}}
\newunicodechar{Ŭ}{\u{U}}
\newunicodechar{ŭ}{\u{u}}
\newunicodechar{Ā}{\=A}
\newunicodechar{ā}{\=a}
\newunicodechar{Ē}{\=E}
\newunicodechar{ē}{\=e}
\newunicodechar{Ī}{\=I}
\newunicodechar{ī}{\={\i}}
\newunicodechar{Ō}{\=O}
\newunicodechar{ō}{\=o}
\newunicodechar{Ū}{\=U}
\newunicodechar{ū}{\=u}
\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}
\begin{document}
Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï
Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ
Āā Ēē Īī Ōō Ūū Ȳȳ
\end{document}
Note that, for instance, the line
\newunicodechar{Ŵ}{\^W}
can also be written as
\DeclareUnicodeCharacter{0174}{\^W}
without the need for the newunicodechar package, because U+0174 is the code point of LATIN CAPITAL LETTER W WITH CIRCUMFLEX; but \newunicodechar saves you from looking up the code point in the Unicode tables.
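For comparison, here is a minimal sketch of the package-free variant for two of the letters above. The code points U+0232 and U+0233 (LATIN CAPITAL and SMALL LETTER Y WITH MACRON) do not appear in the answer; I looked them up in the Unicode tables.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% Same effect as \newunicodechar{Ȳ}{\=Y} and \newunicodechar{ȳ}{\=y},
% but the characters are identified by their hexadecimal code points.
\DeclareUnicodeCharacter{0232}{\=Y}% U+0232 LATIN CAPITAL LETTER Y WITH MACRON
\DeclareUnicodeCharacter{0233}{\=y}% U+0233 LATIN SMALL LETTER Y WITH MACRON
\begin{document}
Ȳȳ
\end{document}
```

The trade-off is readability: the preamble no longer shows the character being defined, so a comment with the character name helps.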
Update, April 2016
With a recent LaTeX kernel almost none of the definitions above are necessary, because t1enc.dfu has been updated and enriched. Of the accented letters in the last example, only Ȳ and ȳ need to be defined (and they'll possibly be included in future releases).
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{newunicodechar}
\newunicodechar{Ȳ}{\=Y}
\newunicodechar{ȳ}{\=y}
\begin{document}
Ââ Êê Îî Ôô Ŵŵ Ŷŷ Ïï
Ăă Ĕĕ Ĭĭ Ŏŏ Ŭŭ
Āā Ēē Īī Ōō Ūū Ȳȳ
\end{document}