sed: read whole file into pattern space without failing on single-line input
There are all kinds of reasons why reading a whole file into pattern space can go wrong. The logic problem in the question surrounding the last line is a common one. It is related to sed's line cycle - when there are no more lines and sed encounters EOF it is through - it quits processing. And so if you are on the last line and you instruct sed to get another, it's going to stop right there and do no more.
That said, if you really need to read a whole file into pattern space, then it is probably worth considering another tool anyway. The fact is, sed is eponymously the stream editor - it is designed to work a line - or a logical data block - at a time.
There are many similar tools that are better equipped to handle full file blocks. ed and ex, for example, can do much of what sed can do and with similar syntax - and much else besides - but rather than operating only on an input stream while transforming it to output as sed does, they also maintain temporary backup files in the file-system. Their work is buffered to disk as needed, and they do not quit abruptly at end of file (and tend to implode a lot less often under buffer strain). Moreover they offer many useful functions which sed does not - of the sort that simply do not make sense in a stream context - like line marks, undo, named buffers, join, and more.
sed's primary strength is its ability to process data as soon as it reads it - quickly, efficiently, and in stream. When you slurp a file you throw that away, and you tend to run into edge-case difficulties like the last line problem you mention, and buffer overruns, and abysmal performance - as the data it parses grows in length, a regexp engine's processing time when enumerating matches can increase exponentially.
Regarding that last point, by the way: while I understand the example s/a/A/g case is very likely just a naive example and is probably not the actual script you want to gather in an input for, you might find it worth your while to familiarize yourself with y///. If you often find yourself globally (g) substituting a single character for another, then y could be very useful for you. It is a transformation as opposed to a substitution and is far quicker as it does not imply a regexp. This latter point can also make it useful when attempting to preserve and repeat empty // addresses because it does not affect them but can be affected by them. In any case, y/a/A/ is a simpler means of accomplishing the same - and swaps are possible as well, like y/aA/Aa/, which would interchange every upper- and lowercase a on a line.
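A minimal sketch of both forms, assuming any POSIX sed; the sample strings are made up for illustration:

```shell
# y/// maps characters one-for-one; no regexp and no g flag are involved.
upper=$(printf 'banana\n' | sed 'y/a/A/')
printf '%s\n' "$upper"

# Both directions in a single pass: a becomes A and A becomes a.
swapped=$(printf 'aAbB\n' | sed 'y/aA/Aa/')
printf '%s\n' "$swapped"
```

The first command prints bAnAnA and the second prints AabB.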
You should also note that the behavior you describe is really not what is supposed to happen anyway.
From GNU's info sed, in the COMMONLY REPORTED BUGS section:
N command on the last line
Most versions of sed exit without printing anything when the N command is issued on the last line of a file. GNU sed prints pattern space before exiting unless of course the -n command switch has been specified. This choice is by design.
For example, the behavior of sed N foo bar would depend on whether foo has an even or an odd number of lines. Or, when writing a script to read the next few lines following a pattern match, traditional implementations of sed would force you to write something like /foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N } instead of just /foo/{ N;N;N;N;N;N;N;N;N; }.
In any case, the simplest workaround is to use $d;N in scripts that rely on the traditional behavior, or to set the POSIXLY_CORRECT variable to a non-empty value.
The POSIXLY_CORRECT environment variable is mentioned because POSIX specifies that if sed encounters EOF when attempting an N it should quit without output, but the GNU version intentionally breaks with the standard in this case. Note also that even where the behavior is justified above, the assumption is that the error case is one of stream-editing - not of slurping a whole file into memory.
The standard defines N's behavior thus:
N
Append the next line of input, less its terminating newline, to the pattern space, using an embedded newline to separate the appended material from the original material. Note that the current line number changes.
If no next line of input is available, the N command verb shall branch to the end of the script and quit without starting a new cycle or copying the pattern space to standard output.
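The $!N guard mentioned in the GNU text sidesteps the difference between implementations, because N is then never attempted on the last line. A small sketch with made-up input, which should behave the same under GNU sed and a strictly POSIX one:

```shell
# Join lines in pairs. $!N only runs N when another line is available,
# so an odd final line is printed by the normal end-of-cycle autoprint.
pairs=$(printf 'one\ntwo\nthree\n' | sed '$!N;s/\n/ /')
printf '%s\n' "$pairs"

# Single-line input: N is skipped entirely and the line still comes out.
single=$(printf 'only\n' | sed '$!N;s/\n/ /')
printf '%s\n' "$single"
```

The first command prints "one two" followed by "three"; the second prints "only".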
On that note, there are some other GNU-isms demonstrated in the question - particularly the use of the : label, b (branch), and { function-context brackets }. As a rule of thumb any sed command which accepts an arbitrary parameter is understood to delimit at a newline in the script. So the commands...
:arbitrary_label_name; ...
b to_arbitrary_label_name; ...
//{ do arbitrary list of commands } ...
...are all very likely to perform erratically depending on the sed implementation that reads them. Portably they should be written:
...;:arbitrary_label_name
...;b to_arbitrary_label_name
//{ do arbitrary list of commands
}
The same holds true for r, w, t, a, i, and c (and possibly a few more that I'm forgetting at the moment). In almost every case they might also be written:
sed -e :arbitrary_label_name -e b\ to_arbitrary_label_name -e \
"//{ do arbitrary list of commands" -e \}
...where each new -e statement stands in for the newline delimiter. So where the GNU info text suggests a traditional sed implementation would force you to do:
/foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N }
...it should rather be...
/foo/{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N
}
...of course, that isn't true either. Writing the script in that way is a little silly. There are much simpler means of doing the same, like:
printf %s\\n foo . . . . . . |
sed -ne 'H;/foo/h;x;//s/\n/&/3p;tnd
//!g;x;$!d;:nd' -e 'l;$a\' \
-e 'this is the last line'
...which prints:
foo
.
.
.
foo\n.\n.\n.$
.$
this is the last line
...because the t (test) command - like most sed commands - depends on the line cycle to refresh its return register and here the line cycle is permitted to do most of the work. That's another tradeoff you make when you slurp a file - the line cycle doesn't refresh ever again, and so many tests will behave abnormally.
The above command doesn't risk over-reaching input because it just does some simple tests to verify what it reads as it reads it. With H (hold), every line is appended to the hold space, but if a line matches /foo/ then h overwrites the hold space with it. The buffers are next exchanged (x), and a conditional s/// substitution is attempted if the contents of the buffer match the // last pattern addressed. In other words //s/\n/&/3p attempts to replace the third newline in hold space with itself and print the results if hold space currently matches /foo/. If that test (t) is successful the script branches to the nd (not delete) label - which does an l (look) and wraps up the script.
In the case that both /foo/ and a third newline cannot be matched together in hold space though, then //!g will overwrite the buffer if /foo/ is not matched, or, if it is matched, it will overwrite the buffer if a newline is not matched (thereby replacing /foo/ with itself). This subtle little test keeps the buffer from filling up unnecessarily for long stretches of no /foo/ and ensures the process stays snappy because the input does not pile on. Following on, in a no-/foo/ or //s/\n/&/3p fail case, the buffers are again swapped and every line but the last is deleted there.
That last - the last line $!d - is a simple demonstration of how a top-down sed script can be made to handle multiple cases easily. When your general method is to prune away unwanted cases, starting with the most general and working toward the most specific, then edge cases can be more easily handled because they are simply allowed to fall through to the end of the script with your other wanted data, and when it all wraps you're left with only the data you want. Having to fetch such edge cases out of a closed loop can be far more difficult to do, though.
And so here's the last thing I have to say: if you must really pull in an entire file, then you can stand to do a little less work by relying on the line cycle to do it for you. Typically you would use N (Next) and n (next) for lookahead - because they advance ahead of the line cycle. Rather than redundantly implementing a closed loop within a loop - as the sed line cycle is just a simple read loop anyway - if your purpose is only to gather input indiscriminately, then it is probably easier to do:
sed 'H;1h;$!d;x;...'
...which will gather the entire file or go bust trying.
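As a sketch of how that looks with an actual command in place of the ..., here a multi-line substitution (my own stand-in, not anything from the question) runs once the whole input is in pattern space:

```shell
# H appends every line to hold space; 1h overwrites hold space on line 1 so
# there is no leading newline; $!d ends the cycle for every line but the
# last; x then brings the gathered text into pattern space.
joined=$(printf 'a\nb\nc\n' | sed 'H;1h;$!d;x;s/\n/,/g')
printf '%s\n' "$joined"

# Single-line input goes through the same path without any special casing.
single=$(printf 'only\n' | sed 'H;1h;$!d;x;s/\n/,/g')
printf '%s\n' "$single"
```

The first command prints a,b,c and the second prints only. Because N is never called, the last-line question does not come up at all.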
A side note about N and last-line behavior...
While I do not have the tools available to me to test, consider that N, when reading and in-place editing, behaves differently if the file edited is the script file for the next readthrough.
It fails because the N command comes before the pattern match $! (not last line) and sed quits before doing any work:
N
Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands.
This can be easily fixed to work with single-line input as well (and indeed to be more clear in any case) by simply grouping the N and b commands after the pattern:
sed ':a;$!{N;ba}; [commands...]'
It works as follows:
:a - create a label named 'a'
$! - if not the last line, then
N - append the next line to the pattern space (or quit if there is no next line) and
ba - branch (go to) label 'a'
Unfortunately, it's not portable (as it relies on GNU extensions), but the following alternative (suggested by @mikeserv) is portable:
sed 'H;1h;$!d;x; [commands...]'
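If the loop form is preferred, it can also be spelled so that the label, the branch, and the closing brace each end at a -e boundary, which is the portable spelling described earlier for commands that take an arbitrary parameter. A sketch, with s/\n/,/g as a placeholder of mine for the real commands:

```shell
# Each -e fragment ends where a newline would, so :a, ba and } are
# unambiguous to implementations that do not accept ; after them.
looped=$(printf 'a\nb\nc\n' |
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/\n/,/g')
printf '%s\n' "$looped"

# $! guards N, so single-line input is not lost.
single=$(printf 'only\n' |
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/\n/,/g')
printf '%s\n' "$single"
```

The first command prints a,b,c and the second prints only. I have not tried this on every sed, so treat the portability claim as an expectation rather than a guarantee.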