How to fix missing or incorrect mappings from glyphtounicode.tex

You can add your own definitions. For example, here is how to make an "a" copy as "A":

\documentclass[a4paper,12pt]{article}

\usepackage[ansinew]{inputenc}
\usepackage[T1]{fontenc}
\input{glyphtounicode}

\pdfglyphtounicode{a}{0041} %0041=A
\pdfgentounicode=1
\begin{document}
aaaaa 
\end{document}

The main problem is, of course, finding the names of the glyphs you are using. If you know the font, you can find the names in its afm or pfb file. Alternatively, add \pdfcompresslevel=0 to your document and inspect the PDF: look for lines starting with /CharSet (there will be more than one if you use more than one font). For example, if I add \int to the example, I find /CharSet (/integraltext), so integraltext is the name of the glyph.
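
As a rough sketch of that inspection step (the minimal document below is only an illustration and assumes the default Computer Modern fonts), compile the following with pdflatex and then search the resulting PDF for /CharSet in a text editor:

\documentclass{article}
\input{glyphtounicode}
\pdfgentounicode=1
\pdfcompresslevel=0 % write PDF streams uncompressed so the file is readable as text
\begin{document}
$\int$
\end{document}

Each embedded font gets its own /CharSet entry, so the list grows with every font the document uses.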

If the symbol is not a single glyph, or if its name is not unique or changes from one font family to the next, you will probably need the accsupp package; see the question "Is it possible to provide alternative text to use when copying text from the PDF?".


The following concrete solution is based on Ulrike Fischer's answer:

Solution, part 1 (using \pdfglyphtounicode): The following lines help with the first batch of symbols:

\pdfglyphtounicode{notsubsetdbl}{22D0 0338}
\pdfglyphtounicode{simequal}{2245}
\pdfglyphtounicode{notsimequal}{2247}
\pdfglyphtounicode{uniontext}{22C3}
\pdfglyphtounicode{nelement}{2209}
\pdfglyphtounicode{nequal}{2260}
\pdfglyphtounicode{llbracket}{27E6}
\pdfglyphtounicode{rrbracket}{27E7}
\pdfglyphtounicode{llparenthesis}{0028 007C}
\pdfglyphtounicode{rrparenthesis}{007C 0029}
\pdfglyphtounicode{colonequal}{2254}
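
A minimal sketch of where such lines go in a document. It assumes that \llbracket and \rrbracket come from the stmaryrd package and that its Type 1 fonts use the glyph names llbracket and rrbracket; adjust the names to whatever /CharSet reports for your fonts:

\documentclass{article}
\usepackage{stmaryrd} % assumption: provides \llbracket and \rrbracket
\input{glyphtounicode}
\pdfglyphtounicode{llbracket}{27E6} % added after the standard table is loaded
\pdfglyphtounicode{rrbracket}{27E7}
\pdfgentounicode=1
\begin{document}
$\llbracket x \rrbracket$
\end{document}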

The macros \models, \Rsh, \textlengthmark, \blackdiamond, \sqbullet, \square seem to require the accsupp package. In the PDF file, they are set using the following glyph names, respectively: bar + equal, eacute, colon, ogonek, quotesinglbase, hungarumlaut. This explains their pasting behavior: these glyph names normally denote different characters, namely the ones that end up being pasted.

Solution, part 2 (using the package accsupp): The following code creates new "Unicode-compatible" commands. A user will of course need to replace the old commands with these new ones (\models by \Umodels etc.). The math character classes (\mathrel, \mathord) used here are based on my own needs.

\RequirePackage{accsupp} % Unicode-pastable versions of symbols
  \newcommand*{\Umodels}{\BeginAccSupp{method=hex,unicode,ActualText=22A7}\mathrel{\models}\EndAccSupp{}}
  \newcommand*{\URsh}{\BeginAccSupp{method=hex,unicode,ActualText=21B1}\mathord{\Rsh}\EndAccSupp{}}
  \newcommand*{\Utextlengthmark}{\BeginAccSupp{method=hex,unicode,ActualText=02D0}\textlengthmark\EndAccSupp{}}
  \newcommand*{\Ublackdiamond}{\BeginAccSupp{method=hex,unicode,ActualText=2B29}\mathord{\blackdiamond}\EndAccSupp{}}
  \newcommand*{\Usqbullet}{\BeginAccSupp{method=hex,unicode,ActualText=25AA}\mathord{\sqbullet}\EndAccSupp{}}
  \newcommand*{\Usquare}{\BeginAccSupp{method=hex,unicode,ActualText=25AB}\mathord{\square}\EndAccSupp{}}
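
A short usage sketch for one of these commands (the test formula is only an illustration). After compiling with pdflatex, copying the relation symbol from the PDF should paste U+22A7 in viewers that honor ActualText:

\documentclass{article}
\usepackage{accsupp}
\newcommand*{\Umodels}{\BeginAccSupp{method=hex,unicode,ActualText=22A7}\mathrel{\models}\EndAccSupp{}}
\begin{document}
$M \Umodels \varphi$ % use \Umodels wherever \models was used before
\end{document}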

(For those who are wondering: the value of ActualText can also be a space-separated list of hexadecimal UTF-16 values. Note that these are not Unicode code points but their UTF-16 representation; the two differ for characters outside of Unicode's basic multilingual plane, BMP. For more information on how to paste Unicode characters outside of the BMP, see this question/answer.)
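
For illustration, a sketch of a command for a character outside the BMP. The name \UbbA is hypothetical, and amssymb is assumed as the source of \mathbb. The target is U+1D538, whose UTF-16 representation is the surrogate pair D835 DD38 (written here without the space):

\RequirePackage{amssymb} % assumption: provides \mathbb
\RequirePackage{accsupp}
% U+1D538 is outside the BMP, so ActualText holds its two UTF-16 code units
\newcommand*{\UbbA}{\BeginAccSupp{method=hex,unicode,ActualText=D835DD38}\mathbb{A}\EndAccSupp{}}

Like \mathbb itself, \UbbA has to be used in math mode, e.g. $\UbbA$.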

Bonus addendum: How to fix existing \pdfglyphtounicode assignments: To change an existing assignment, such as U+25C1 for \lhd (glyphtounicode.tex contains the line \pdfglyphtounicode{triangleleft}{25C1}), simply call \pdfglyphtounicode again after the line \input{glyphtounicode}. For example, \pdfglyphtounicode{triangleleft}{22B2} overrides the original definition.
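
As a preamble fragment, the override might look like this. The order matters: the new assignment comes after the standard table has been loaded.

\input{glyphtounicode} % loads the standard table, including triangleleft -> 25C1
\pdfglyphtounicode{triangleleft}{22B2} % later assignment replaces the earlier one
\pdfgentounicode=1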