Why does my non-greedy Perl regex still match too much?
Others have mentioned how to fix this.
I'll answer how you can debug this: you can see what's happening by using more captures:
bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ;
print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
term1 = ""$tom" said blah blah blash. "$dick"" term2 = "said" term3 = ""blah blah blah""
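If you'd rather try this without a story file, here is a minimal standalone sketch of the same capture trick. The input line is an assumption, reconstructed from the output above; the real file may contain something different.

```perl
use strict;
use warnings;

# Assumed input line, reconstructed from the output shown above.
# Single quotes keep Perl from interpolating $tom and $dick.
my $line = '"$tom" said blah blah blash. "$dick" said "blah blah blah"';

# One capture group per piece of the pattern, so each piece can be inspected.
my ($term1, $term2, $term3) = $line =~ /(".+?") (said) (".+?")/;

print "term1 = [$term1]\n";
print "term2 = [$term2]\n";
print "term3 = [$term3]\n";
```

The brackets in the output make it easy to see exactly where each capture starts and ends, which is how you spot that the first group swallowed far more than one quoted string.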
Unfortunately, " is a peculiar enough character that it needs to be treated carefully. Use:
my ($term) = /("[^"]+?" said "[^"]+?")/g;
and it should work fine (it does for me...!). That is, explicitly match sequences of non-double-quote characters rather than sequences of arbitrary characters.
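The problem is that, even though it's not greedy, it still keeps trying: a non-greedy quantifier prefers the shortest match, but it will grow as far as it needs to for the rest of the pattern to succeed. The regex doesn't see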
"$tom" said blah blah blash.
and think, "Oh, the stuff following "said" isn't quoted, so I'll skip that one." It thinks, "Well, the stuff after "said" isn't quoted, so it must still be part of our quote." So the first ".+?" ends up matching
"$tom" said blah blah blash. "$dick"
What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:
("[^"]+" said "[^"]+")
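To see the difference side by side, here is a small sketch that runs both patterns against the same line. The sample line is an assumption based on the strings quoted in this thread, not the asker's actual data.

```perl
use strict;
use warnings;

# Assumed sample line; single quotes keep $tom and $dick literal.
my $line = '"$tom" said blah blah blash. "$dick" said "blah blah blah"';

# Non-greedy dot: the first ".+?" is still allowed to cross quote marks.
my ($lazy) = $line =~ /(".+?" said ".+?")/;

# Negated character class: a quoted piece can never contain a quote mark.
my ($fixed) = $line =~ /("[^"]+" said "[^"]+")/;

print "lazy:  $lazy\n";   # lazy:  "$tom" said blah blah blash. "$dick" said "blah blah blah"
print "fixed: $fixed\n";  # fixed: "$dick" said "blah blah blah"
```

With the negated class, the attempt starting at "$tom" simply fails, because what follows "said" there is not quoted, and the engine moves on to the next quote mark instead of stretching the first quoted piece.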