How to find character locations of a string in a file?
In current versions of Perl, you can use the @- and @+ magic arrays to get the positions of the match of the whole regex and of any capture groups. @- holds the start offsets and @+ the end offsets; element zero of each refers to the whole match, so $-[0] is the one you are interested in.
As a one-liner:
$ echo 'aöæaæaæa' | perl -CSDLA -ne 'BEGIN { $pattern = shift }; printf "%d\n", $-[0] while $_ =~ m/$pattern/g;' æa
2
4
6
Or a full script:
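#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use open ":encoding(utf8)";
undef $/;                           # slurp mode: read all of stdin as one string
my $pattern = decode_utf8(shift);   # decode the pattern given on the command line
binmode STDIN, ":utf8";             # decode stdin as UTF-8
while (<STDIN>) {
    printf "%d\n", $-[0] while $_ =~ m/$pattern/g;
}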
e.g.
$ echo 'aöæaæaæa' | perl match.pl æa -
2
4
6
(The latter script only works for stdin. I seem to have trouble forcing Perl to treat all files as UTF-8.)
With zsh:
set -o extendedglob # for (#m) which in patterns causes the matched portion to be
# made available in $MATCH and the offset (1-based) in $MBEGIN
# (and causes the expansion of the replacement in
# ${var//pattern/replacement} to be deferred to the
# time of replacement)
haystack=aöæaæaæa
needle=æ
offsets=() i=0
: ${haystack//(#m)$needle/$((offsets[++i] = MBEGIN - 1))}
print -l $offsets
With GNU awk or any other POSIX-compliant awk implementation (not mawk), and the correct locale set:
$ LANG='en_US.UTF-8' gawk -v pat='æa' -- '
{
    s = $0;
    pos = 0;
    while (match(s, pat)) {
        pos += RSTART-1;
        print "file", FILENAME ": line", FNR, "position", pos, "matched", substr(s, RSTART, RLENGTH);
        pos += RLENGTH;
        s = substr(s, RSTART+RLENGTH);
    }
}
' <<<'aöæaæaæa'
file -: line 1 position 2 matched æa
file -: line 1 position 4 matched æa
file -: line 1 position 6 matched æa
$
The pattern given in the -v pat argument to gawk can be any valid regular expression.
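As an illustration with a pattern that is not a fixed string, here is a sketch of the same loop wrapped in a shell function. The name awk_offsets and the ASCII sample input are made up for this example, and the early exit on an empty match is an addition, not part of the command above: a pattern such as x* can match the empty string, in which case the remaining text never shrinks and the loop would not terminate.

```shell
# Hypothetical helper (name chosen for this sketch):
#   awk_offsets PATTERN   (reads text on stdin)
# Prints "offset matched-text" for each match on every input line.
awk_offsets() {
    awk -v pat="$1" -- '
    {
        s = $0
        pos = 0
        while (match(s, pat)) {
            if (RLENGTH == 0) break          # empty match: stop, otherwise the loop never ends
            pos += RSTART - 1                # 0-based offset of this match in the full line
            print pos, substr(s, RSTART, RLENGTH)
            pos += RLENGTH
            s = substr(s, RSTART + RLENGTH)  # keep searching after the match
        }
    }'
}

printf 'ab12cd345\n' | awk_offsets '[0-9]+'
```

Since each iteration searches only the text left after the previous match, an anchor such as ^ is applied to that remainder rather than to the start of the original line.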