Searchable and copyable small caps, ligatures and umlaut in PDF created by XeLaTeX

As I wrote in the comment: with lualatex it is probably possible to patch the font. As a proof of concept:

\pdfvariable compresslevel 0
\documentclass[%
 fontsize=11pt,%
  ngerman,
]{scrbook}

\usepackage{luacode}

\begin{luacode}

-- Give selected glyph slots of Cambria a Unicode value,
-- so that they get tounicode entries in the PDF.
local patch_cambriasc = function (fontdata)
 if fontdata.fontname == "Cambria"
 then
   fontdata.descriptions[983054]["unicode"]=109 -- m
   fontdata.descriptions[983055]["unicode"]=110 -- n
   fontdata.descriptions[983056]["unicode"]=111 -- o
   fontdata.descriptions[983213]["unicode"]=776 -- accent 0308
   fontdata.descriptions[983219]["unicode"]=778 -- accent 030A
   fontdata.descriptions[983078]["unicode"]=230 -- æ
   fontdata.descriptions[983084]["unicode"]=231 -- ç
 end
end

luatexbase.add_to_callback
 (
  "luaotfload.patch_font",
  patch_cambriasc,
  "change_cambria"
 )
\end{luacode}

\usepackage{fontspec}

\setmainfont[Ligatures={NoCommon}]{Cambria}
\newcommand{\name}[1]{\textsc{#1}}

\begin{document}
mno \name{omn}
\end{document}

When compiled with lualatex, the PDF contains tounicode entries for all glyphs:

7 beginbfchar
<008F> <006D>
<0090> <006E>
<0091> <006F>
<0112> <006D>
<0113> <006E>
<0114> <006F>
<0373> <0031>
endbfchar

and the text copies and pastes fine.
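To make the listing above easier to read, here is a minimal Python sketch of how a text extractor might interpret such a bfchar block. The function name `parse_bfchar` is made up for this illustration, and a real ToUnicode CMap has more structure (for example bfrange sections) that this does not handle.

```python
import re

# The bfchar block quoted above, copied verbatim.
BFCHAR = """\
7 beginbfchar
<008F> <006D>
<0090> <006E>
<0091> <006F>
<0112> <006D>
<0113> <006E>
<0114> <006F>
<0373> <0031>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Map glyph codes to the text a reader should extract for them."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        # The right-hand side is UTF-16BE, written as hex digits.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

tounicode = parse_bfchar(BFCHAR)
for code, text in sorted(tounicode.items()):
    print(f"glyph <{code:04X}> -> {text!r}")
```

Two different glyph codes map to each of m, n and o, which is consistent with the regular and the small-cap glyphs both resolving to the same letter.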

The following variant of the Lua patch works too:

local patch_cambriasc = function (fontdata)
 if fontdata.fontname == "Cambria"
 then
   fontdata.characters[983054]["tounicode"]="006D"
   fontdata.characters[983055]["tounicode"]="006E"
   fontdata.characters[983056]["tounicode"]="006F"
   fontdata.characters[983213]["tounicode"]="0308" -- accent 0308
   fontdata.characters[983219]["tounicode"]="030A" -- accent 030A
   fontdata.characters[983078]["tounicode"]="00E6" -- æ
   fontdata.characters[983084]["tounicode"]="00E7" -- ç
 end
end

But I have no idea why the unicode field works only in the descriptions table and the tounicode field only in the characters table.
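The two variants carry the same values in different notations: the first stores a decimal code point, the second the same value as hex digits. The following Python sketch shows the correspondence. The helper name `tounicode_hex` is hypothetical, and the use of UTF-16BE for characters beyond U+FFFF is an assumption based on how ToUnicode CMaps in PDF are encoded, not something tested against luaotfload.

```python
def tounicode_hex(codepoint):
    """Return the UTF-16BE hex string for a Unicode code point."""
    return chr(codepoint).encode("utf-16-be").hex().upper()

# The decimal values used in the first variant of the patch.
for cp in (109, 110, 111, 776, 778, 230, 231):
    print(cp, "->", tounicode_hex(cp))
```

For the seven values in the patch this gives exactly the strings used in the second variant, for example 109 becomes 006D and 776 becomes 0308.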

Ulrike already provided a good answer. Let me add some background information.

The following assumes that you are using OpenType fonts designed for Unicode, writing your document in Unicode, and using a Unicode-aware TeX engine with fontspec.

In a well-designed Unicode OpenType font, the character name of a ligature has the form hello_there: the underscore connects a character named hello with one named there. More than two characters may be involved. So, the ligature between f and i is named f_i. A few ligatures (such as fi) have their own names, dating back decades.

Substitution of characters with ligatures is specified within the font, and understood by fontspec. However, that information is already processed by the time you have the PDF.

Ligatures can be assigned to the Private Use Area in Unicode, and some fonts do that. But the standard prefers that ligatures do not have a code point. Instead they are found by reference.
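As a quick check of what counts as private use, the Python sketch below tests the Unicode general category of a few code points. The helper name `is_private_use` is made up for this example. The value 983054 is one of the slot numbers from the Lua patch earlier in this thread; the sketch only shows that it lies in a private use range and says nothing about who assigned it.

```python
import unicodedata

def is_private_use(codepoint):
    """True if the code point lies in one of Unicode's Private Use Areas."""
    return unicodedata.category(chr(codepoint)) == "Co"

# U+E000 starts the BMP private use area; 983054 is U+F000E;
# U+FB01 is the legacy code point of the fi ligature.
for cp in (0xE000, 983054, ord("m"), 0xFB01):
    print(f"U+{cp:04X} private use: {is_private_use(cp)}")
```

A private use code point has no agreed meaning outside the font, which is why a reader cannot turn it back into ordinary letters without extra information.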

When a good PDF reader sees the character f_i, it knows that it is supposed to see the two characters f and i when searching, and it knows that it is supposed to provide those two characters in plain text output. The reasons are that not all fonts have f_i, and those that have it may use different code points.

Unfortunately, some PDF readers (and text extractors) do not see f_i as two characters. They see it as a single character, which cannot be found by search, and cannot be exported as two characters to plain text.
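The decomposition a good reader performs can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular reader: the function names are hypothetical, and only components whose name is a single letter are handled, so names that need a lookup table are out of scope.

```python
def ligature_components(glyph_name):
    """Split a ligature name such as 'f_i' into its component names."""
    return glyph_name.split("_")

def extracted_text(glyph_name):
    """Return the plain text for a ligature name, or None if unknown."""
    parts = ligature_components(glyph_name)
    # Simplification: a one-letter component name stands for that letter.
    if all(len(part) == 1 for part in parts):
        return "".join(parts)
    return None

for name in ("f_i", "f_f_l", "T_h"):
    print(name, "->", extracted_text(name))
```

A reader that skips this step treats f_i as one opaque glyph, which is the failure described above.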

Something similar applies to small caps. The word WHAT in small caps should be searched as ordinary what, and exported as ordinary what. Good PDF readers do that. But others see only the non-standard code points of the small cap letters.

The bottom line: This is not a property of XeTeX or LuaLaTeX. It is a property of the software that views the PDF.

Old-fashioned non-Unicode pdflatex, with fontenc and all that, is a different issue.

EDIT: A nasty font designer could use serdkwul as the name of the ligature between f and i. That is technically allowed, and will work as long as the ligature is defined within the font. It will also display and print properly in PDF. However, since the character name is not f_i, PDF readers will not know that it is supposed to be decomposed to f and i.

EDIT2: As for small caps, or any other variation on a character: In a well-designed OpenType font, variations of a character have the same name as the base character, followed by an extension. So, the small cap version of a might be named a.sc or a.smcp or something similar. The font's own lookup tables will state which character to use when small caps are requested. A good PDF reader knows that a.ext is a variation on a. It will find it as a in search, and export it as a in plain text. A badly designed font will use something obscure such as asmcp (without the dot) for the lookup. That will display correctly in the PDF, but not be searched or exported as a. An inadequate PDF reader will not understand that a.sc is a variation of a. The character name a.foo.bar is also legal, as a variation on a.

Although a small cap letter is likely to have extension .sc or .smcp in the font, that is not a requirement. So, it is not reliable for a PDF reader to find small caps letters merely by looking at the extension. It may be that a high-end PDF reader can do it, if the original font is installed, so that the reader can internally inspect the font. I'm not sure.
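The naming convention from EDIT2 can be sketched in Python as follows. The function name `base_glyph_name` is hypothetical, and the sketch only strips the extension; it does not try to decide whether the variation is a small cap, which, as said above, cannot be read off the extension reliably.

```python
def base_glyph_name(glyph_name):
    """Drop everything from the first period on: 'a.sc' -> 'a'."""
    return glyph_name.split(".", 1)[0]

for name in ("a.sc", "a.smcp", "a.foo.bar", "asmcp"):
    print(name, "->", base_glyph_name(name))
```

The name asmcp has no dot, so it comes back unchanged and cannot be recognized as a variation on a.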
