
I often have to perform this task as a production editor where the supplied files have mixed encodings. I wrote a small bash script called findnonascii that just runs grep:


grep -n -P "[^|a-zA-Z\{\}\s%\./\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]" "$@"

Sample file:



Sample character: õ

Another one: â

And again: ê


Output of findnonascii test.tex:

7:    Sample character: õ
9:    Another one: â
11:    And again: ê

This gives the line numbers, which narrows the search down a bit.


Here's a Perl script that provides a platform-independent alternative:

#!/usr/bin/perl -w

use strict;
use warnings;
use feature 'unicode_strings';

if ($#ARGV == -1)
{
   die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
   open (my $FH, '<', $filename)
      or die "Can't open '$filename' $!\n";

   my $linenum = 0;

   while (<$FH>)
   {
      $linenum++;

      # Report any line containing a character outside the allowed set
      if (/[^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]/)
      {
         print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
      }
   }

   close $FH;
}


Edit 2:

The following is a slight modification that will highlight the characters so they're easier to see (I don't know if it will work on Windows):

#!/usr/bin/perl -w

use strict;
use warnings;
use feature 'unicode_strings';
use Term::ANSIColor;

if ($#ARGV == -1)
{
   die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
   open (my $FH, '<', $filename)
      or die "Can't open '$filename' $!\n";

   my $linenum = 0;

   while (<$FH>)
   {
      $linenum++;

      # Wrap each run of disallowed characters in a colour escape sequence
      if (s/([^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]+)/highlight($1)/eg)
      {
         print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
      }
   }

   close $FH;
}

sub highlight
{
   my $text = $_[0];

   return colored($text, 'on_bright_red');
}

The pattern used above is a subset of ASCII since TeX generally doesn't like control characters (although I rarely encounter a LaTeX file with control codes). A simpler pattern is [^ -~], which matches anything outside ([^...]) the range (start-end) from space (0x20) to tilde (~, 0x7E). Note that this range doesn't cover the TAB character (0x09), which (La)TeX usually interprets as a space. If you also want the search to ignore TAB, use [^ -~\t]. Sophisticated text editors often allow regular expression searches and should accept that pattern.
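
As a rough sketch, the simpler pattern can also be used from the shell without grep's -P option. In a POSIX bracket expression \t is not an escape, so the TAB has to be inserted literally (here via printf), and LC_ALL=C makes grep compare bytes, so that the range from space to ~ really means 0x20 to 0x7E. The function name find_nonprintable is made up for this example.

```shell
# Hypothetical helper (the name is not taken from the scripts above).
# Prints "N:line" for every line containing a byte outside
# space..tilde (0x20-0x7E) other than TAB.
find_nonprintable() {
    # LC_ALL=C: treat the input as bytes, so the range " -~" is 0x20-0x7E.
    # $(printf '\t') puts a literal TAB inside the bracket expression.
    LC_ALL=C grep -n "[^ -~$(printf '\t')]" "$@"
}

# Usage: find_nonprintable test.tex
```

Files with Windows line endings will match on every line, because the carriage return (0x0D) also lies outside the range.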

I had the same problem when preparing a bibliography and managed to solve it with the text editor Sublime Text. Open the tex file, press Ctrl+F, make sure the regular expression option (the first button) is on, and search for [^\x00-\x7F]. The matching characters are circled.
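
Without Sublime Text, roughly the same search can be approximated on the command line. The sketch below assumes a POSIX grep run in the C locale, where [:cntrl:] and [:print:] together cover exactly the ASCII bytes 0x00 to 0x7F, so negating both leaves only the bytes from 0x80 upwards. Unlike [^ -~], it does not flag control characters. The function name find_non7bit is invented for the example.

```shell
# Hypothetical helper: prints "N:line" for every line that contains
# at least one byte outside the 7-bit ASCII range (0x80 or above).
find_non7bit() {
    # In the C locale, [:cntrl:] is 0x00-0x1F plus 0x7F and
    # [:print:] is 0x20-0x7E; everything else is non-ASCII.
    LC_ALL=C grep -n '[^[:cntrl:][:print:]]' "$@"
}

# Usage: find_non7bit refs.bib
```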


i've used the log file to help in such cases. in emacs two-window mode, with the log file in one window and the tex file in the other, i can mouse-over the unidentified character in the log, then go to the tex window, ^s to search, click the middle button to enter the search argument, then return to launch the search.

this requires a 3-button mouse, and sometimes several tries, but is the best approach i've found so far, since it doesn't require knowing what the unidentified character is, and the ^s search is repeatable.


