How to find characters that LaTeX doesn't like?
I often have to perform this task as a production editor when the supplied files have mixed encodings. I wrote a small shell script called findnonascii that just runs grep (the -P option requires GNU grep):
#!/bin/sh
# Print (with line numbers) every line containing a character outside the allowed set.
grep -n -P "[^|a-zA-Z\{\}\s%\./\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]" "$@"
Sample file:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\begin{document}

Sample character: õ

Another one: â

And again: ê

\end{document}
Output of findnonascii test.tex:
7: Sample character: õ
9: Another one: â
11: And again: ê
This gives the line numbers, which narrows the search down a bit.
Edit:
Here's a Perl script that provides a platform-independent alternative:
#!/usr/bin/perl -w
use strict;
use warnings;
use feature 'unicode_strings';
if ($#ARGV == -1)
{
    die "Syntax: $0 <filename>+\n";
}
foreach my $filename (@ARGV)
{
    # Three-argument open, so an odd filename is never read as an open mode.
    open (my $FH, '<', $filename)
        or die "Can't open '$filename' $!\n";
    my $linenum = 0;
    while (<$FH>)
    {
        $linenum++;
        # Report any line containing a character outside the allowed set.
        if (/[^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]/)
        {
            # Prefix the filename only when several files were given.
            print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
        }
    }
    close $FH;
}
1;
Edit 2:
The following is a slight modification that will highlight the characters so they're easier to see (I don't know if it will work on Windows):
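#!/usr/bin/perl -w
use strict;
use warnings;
use feature 'unicode_strings';
use Term::ANSIColor;
if ($#ARGV == -1)
{
    die "Syntax: $0 <filename>+\n";
}
foreach my $filename (@ARGV)
{
    # Three-argument open, so an odd filename is never read as an open mode.
    open (my $FH, '<', $filename)
        or die "Can't open '$filename' $!\n";
    my $linenum = 0;
    while (<$FH>)
    {
        $linenum++;
        # Wrap each run of disallowed characters in a highlight; s///g returns
        # the number of substitutions, so only changed lines are printed.
        if (s/([^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]+)/highlight($1)/eg)
        {
            print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
        }
    }
    close $FH;
}
# Return the text with a bright red background (ANSI escape codes).
sub highlight
{
    my $text = $_[0];
    return colored($text, 'on_bright_red');
}
1;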
The pattern used above is a subset of ASCII since TeX generally doesn't like control characters (although I rarely encounter a LaTeX file with control codes). A simpler pattern is [^ -~], which excludes ([^...]) the range (start-end) from space ( , 0x20) to tilde (~, 0x7E). Note that this range doesn't cover the TAB character (0x09), which (La)TeX usually interprets as a space. If you also want the search to ignore TAB, use [^ -~\t]. Sophisticated text editors often allow regular expression searches and should accept that pattern.
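As a rough sketch of the simpler pattern used from the command line, the function below runs grep with [^ -~\t]. The name find_nonascii is made up for illustration and is not part of the script above. It assumes GNU grep: -P is what makes \t mean TAB (in a plain bracket expression it would just be a backslash and a t), and -a keeps grep from treating a file with stray bytes as binary. Setting LC_ALL=C makes grep compare raw bytes, so the file's encoding shouldn't matter.

```shell
# Sketch: list lines containing anything other than printable ASCII or TAB.
# Assumes GNU grep (-P and -a); find_nonascii is a hypothetical name.
find_nonascii() {
    # LC_ALL=C: match byte by byte, regardless of the file's encoding.
    # -a: treat the input as text; -n: prefix each hit with its line number.
    LC_ALL=C grep -a -n -P '[^ -~\t]' -- "$@"
}

# Usage: find_nonascii test.tex
```

Files with Windows line endings would have every line reported because of the carriage return (0x0D); adding \r to the character class should avoid that.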
I had the same problem when preparing a bibliography and managed to solve it with the text editor Sublime Text. Open the .tex file, press Ctrl+F, make sure the regular expression option (first button) is on, and search for [^\x00-\x7F]. The matching characters are circled.
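The same [^\x00-\x7F] pattern can also be tried outside an editor. The sketch below assumes GNU grep (for -P, -o and -a), and show_nonascii is a made-up name. It prints only the matching runs of non-ASCII bytes, each prefixed by its line number, which can help when a long bibliography line contains a single stray character.

```shell
# Sketch: print each run of non-ASCII bytes with the line number it occurs on.
# Assumes GNU grep; show_nonascii is a hypothetical name.
show_nonascii() {
    # -o prints only the matched part of the line; the trailing + groups
    # consecutive high bytes, so a multi-byte UTF-8 character stays together.
    LC_ALL=C grep -a -n -o -P '[^\x00-\x7F]+' -- "$@"
}

# Usage: show_nonascii refs.bib
```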
VIM approach
I frequently have this problem when copying and pasting text. I also quite often accidentally enter an (invisible) nonbreaking space (ALT-SPACE on a Mac keyboard). To identify such characters, do the following:
Start with :set hls to let VIM highlight all search results. Then search with /[<RANGE>] for characters with codes between 128 and 255. You can enter a character by its code by pressing CTRL-V and then typing three digits for the decimal code:
/[CTRL-V128-CTRL-V255]ENTER
All characters in that range are highlighted, including the nonbreaking space (code 160); characters with codes above 255 fall outside it. You can navigate between the matches with n and N as usual. To stop the highlighting of search results, use :set nohls.