How to find characters that LaTeX doesn't like?

I often have to do this as a production editor, when the supplied files have mixed encodings. I wrote a small shell script called findnonascii that just runs grep (the -P option, which enables Perl-style regular expressions, requires GNU grep):

#!/bin/sh

grep -n -P "[^|a-zA-Z\{\}\s%\./\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]" "$@"

Sample file:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\begin{document}
Sample character: õ

Another one: â

And again: ê

\end{document}

Output of findnonascii test.tex:

7:    Sample character: õ
9:    Another one: â
11:    And again: ê

This gives the line numbers, which narrows the search down a bit.

Edit:

Here's a Perl script that provides a platform-independent alternative:

#!/usr/bin/perl -w

use strict;
use warnings;
use feature 'unicode_strings';

if ($#ARGV == -1)
{
   die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
   open (my $FH, '<', $filename)
      or die "Can't open '$filename' $!\n";

   my $linenum = 0;

   while (<$FH>)
   {
      $linenum++;

      if (/[^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]/)
      {
         print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_; 
      }
   }

   close $FH;
}

1;

Edit 2:

The following is a slight modification that will highlight the characters so they're easier to see (I don't know if it will work on Windows):

#!/usr/bin/perl -w

use strict;
use warnings;
use feature 'unicode_strings';
use Term::ANSIColor;

if ($#ARGV == -1)
{
   die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
   open (my $FH, '<', $filename)
      or die "Can't open '$filename' $!\n";

   my $linenum = 0;

   while (<$FH>)
   {
      $linenum++;

      if (s/([^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]+)/&highlight($1)/eg)
      {
         print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
      }

   }

   close $FH;
}

sub highlight{
  my $text = $_[0];

  colored($text, 'on_bright_red');
}

1;

The pattern used above allows only a subset of ASCII, since TeX generally doesn't like control characters (although I rarely encounter a LaTeX file with control codes). A simpler pattern is [^ -~], which matches anything outside ([^...]) the range (start-end) from space (0x20) to tilde (~, 0x7E). Note that this range doesn't cover the TAB character (0x09), which (La)TeX usually interprets as a space. If you also want the search to ignore TAB, use [^ -~\t]. Sophisticated text editors often allow regular expression searches and should accept that pattern.
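
As a rough sketch of how the simpler pattern could be used from the command line, the following shell function wraps grep around the [^ -~\t] class. The function name find_non_printable is made up for illustration, and forcing LC_ALL=C (so that grep compares raw bytes) is an assumption of this sketch rather than part of the script above.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): print, with line numbers,
# every line that has a byte outside space (0x20) .. tilde (0x7E),
# with TAB excepted.
find_non_printable() {
    # Build a literal TAB portably; POSIX bracket expressions do not
    # understand "\t".
    tab=$(printf '\t')
    # LC_ALL=C (an assumption of this sketch) makes grep work on single
    # bytes, so each byte of a multi-byte UTF-8 character is out of range.
    LC_ALL=C grep -n "[^ -~${tab}]" "$@"
}
```

Calling find_non_printable test.tex on the sample file should report lines 7, 9 and 11. Files with Windows line endings will have every line reported, because the carriage return (0x0D) is also outside the range.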

I had the same problem when preparing a bibliography, and I solved it with the text editor Sublime Text. Open the .tex file, press Ctrl+F, make sure the regular expression option (the first button) is switched on, and search for [^\x00-\x7F]. Every non-ASCII character is then outlined.

VIM approach

I frequently have this problem when copying and pasting text. I also quite often accidentally enter an (invisible) non-breaking space (ALT-SPACE on a Mac keyboard). To identify such characters, do the following:

Start with :set hls to let VIM highlight all search results. Then search with /[<RANGE>] for characters with codes from 128 to 255, that is, just above the 7-bit ASCII range (0 to 127). You can enter a character by its decimal code by pressing CTRL-V and then typing three digits (the spaces below only separate the keystrokes and are not typed):

/[ CTRL-V128 - CTRL-V255 ] ENTER

All characters in that range are highlighted; you can navigate between them with n and N as usual. To stop the highlighting of search results, use :set nohls.
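
Outside of VIM, an invisible character such as the non-breaking space can also be exposed from the shell. The following is a minimal sketch and not part of the answers above: the function name show_hidden_bytes is invented for illustration, and it relies on sed's l command, which prints each byte outside printable ASCII as an octal escape.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): for each line of the given
# file that contains a byte outside space..tilde (TAB excepted), print the
# line number and then the line in sed's unambiguous "l" form.
show_hidden_bytes() {
    tab=$(printf '\t')
    # LC_ALL=C (assumed here) makes sed treat the input as single bytes.
    # "=" prints the line number; "l" prints the line with non-printable
    # bytes as octal escapes and a "$" marking the end of the line.
    LC_ALL=C sed -n "/[^ -~${tab}]/{=;l;}" "$1"
}
```

A UTF-8 non-breaking space then shows up as \302\240. Note that the l form also doubles every backslash, so \cite is shown as \\cite, and that long lines may be folded.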
