I often have to perform this task as a production editor where the supplied files have mixed encodings. I wrote a small shell script called findnonascii that just runs grep (the -P option needs GNU grep):
#!/bin/sh
grep -n -P "[^|a-zA-Z\{\}\s%\./\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]" "$@"
Sample file:
\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\begin{document}
Sample character: õ

Another one: â

And again: ê
\end{document}
Output of findnonascii test.tex:
7:Sample character: õ
9:Another one: â
11:And again: ê
This gives the line numbers, which narrows the search down a bit.
Edit:
Here's a Perl script that provides a platform-independent alternative:
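#!/usr/bin/perl -w
use strict;
use warnings;
use feature 'unicode_strings';

# Require at least one file name on the command line.
if ($#ARGV == -1)
{
    die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
    # Three-argument open, so odd file names are never treated as modes.
    open (my $FH, '<', $filename)
        or die "Can't open '$filename': $!\n";
    my $linenum = 0;
    while (<$FH>)
    {
        $linenum++;
        # Report any line containing a character outside the allowed set.
        if (/[^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]/)
        {
            # Prefix the file name only when several files were given.
            print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
        }
    }
    close $FH;
}
1;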
Edit 2:
The following is a slight modification that will highlight the characters so they're easier to see (I don't know if it will work on Windows):
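#!/usr/bin/perl -w
use strict;
use warnings;
use feature 'unicode_strings';
use Term::ANSIColor;

# Require at least one file name on the command line.
if ($#ARGV == -1)
{
    die "Syntax: $0 <filename>+\n";
}

foreach my $filename (@ARGV)
{
    # Three-argument open, so odd file names are never treated as modes.
    open (my $FH, '<', $filename)
        or die "Can't open '$filename': $!\n";
    my $linenum = 0;
    while (<$FH>)
    {
        $linenum++;
        # Wrap each run of disallowed characters in colour codes; the
        # substitution is true only if at least one run was found.
        if (s/([^|a-zA-Z\{\}\s%\.\/\-:;,0-9@=\\\\\"'\(\)_~\$\!&\`\?+#\^<>\[\]\*]+)/highlight($1)/eg)
        {
            print $#ARGV > 0 ? "$filename " : '', "l.$linenum: ", $_;
        }
    }
    close $FH;
}

# Return the text on a bright red background.
sub highlight
{
    my $text = $_[0];
    return colored($text, 'on_bright_red');
}
1;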
The character class used above allows only a subset of ASCII, since TeX generally doesn't like control characters (although I rarely encounter a LaTeX file with control codes). A simpler pattern is [^ -~], which excludes ([^...]) the range (start-end) from space (0x20) to tilde (~, 0x7E). Note that this range doesn't cover the TAB character (0x09), which (La)TeX usually interprets as a space. If you also want to ignore TAB in the search, use [^ -~\t]. Sophisticated text editors often allow regular expression searches and should accept that pattern.
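The simpler pattern can also be tried outside an editor. The following is a minimal Python sketch, not part of the scripts above: the function name find_suspect_chars and the sample lines are hypothetical, chosen only for illustration. It reports the line, column and code point of every character that is neither printable ASCII nor TAB, which may help when the character itself is invisible.

```python
import re

# Anything that is not printable ASCII (space to tilde) or TAB.
NOT_PRINTABLE_ASCII = re.compile(r"[^ -~\t]")


def find_suspect_chars(lines):
    """Yield (line_number, column, char) for each character outside the range."""
    for number, line in enumerate(lines, start=1):
        # Strip the line ending first so it is never reported itself.
        for match in NOT_PRINTABLE_ASCII.finditer(line.rstrip("\r\n")):
            yield number, match.start() + 1, match.group()


# Hypothetical sample input; any iterable of lines works, including an
# open file object.
sample = [
    "\\begin{document}\n",
    "Sample character: \u00f5\n",
    "\tindented with a TAB\n",
    "\\end{document}\n",
]

for number, column, char in find_suspect_chars(sample):
    # Print the code point rather than the character, so the output
    # stays readable even on a terminal that cannot display it.
    print(f"l.{number} col.{column}: U+{ord(char):04X}")
```

To scan a real file, the same function could be given an open file handle instead of the list, for example one opened with encoding="utf-8" and errors="replace" so that undecodable bytes in a mixed-encoding file are still reported rather than raising an error.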
I had the same problem when preparing a bibliography, and I managed to solve it with the text editor Sublime Text. Open the tex file, press Ctrl+F, make sure the regular expression option (the first button) is on, and search for [^\x00-\x7F]. The special characters are circled.
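One difference is worth keeping in mind: [^\x00-\x7F] matches only characters above 0x7F, so ASCII control characters pass unnoticed, whereas [^ -~\t] flags those as well. The short Python sketch below illustrates this on a made-up string (the sample text and variable names are assumptions for the demonstration, not taken from any real file).

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")   # anything above 0x7F
NOT_PRINTABLE = re.compile(r"[^ -~\t]")   # anything but space..tilde and TAB

# Hypothetical sample: one accented letter (U+00E9) and one form feed (0x0C).
sample_text = "caf\u00e9\x0cend"

# Collect the matches as code points so the result is easy to read.
non_ascii_hits = [f"U+{ord(c):04X}" for c in NON_ASCII.findall(sample_text)]
not_printable_hits = [f"U+{ord(c):04X}" for c in NOT_PRINTABLE.findall(sample_text)]

print("non-ASCII only:      ", non_ascii_hits)
print("non-printable as well:", not_printable_hits)
```

The first list contains only the accented letter, while the second also contains the form feed.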
i've used the log file to help in such cases. in emacs two-window mode, with the
log file in one window and the tex file in the other, i can mouse-over the unidentified
character in the log, then go to the tex window, ^s to search, click the middle button
to enter the search argument, then return to launch the search.
this requires a 3-button mouse, and sometimes several tries, but is the best approach
i've found so far, since it doesn't require knowing what the unidentified character is,
and the ^s search is repeatable.