How to unwrap 80 character text

The answer using fmt seems to wrap the text rather than unwrap it.

In general, this can be a difficult problem. For example, distinguishing between adjacent lines of text which are deliberately finished early (e.g. bullet points) and adjacent lines of free-flowing text can require some context. Telling a genuinely hyphenated word that happens to break at its hyphen apart from an ordinary word that was hyphenated only because it was split across lines is also hard.
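
To illustrate the hyphenation problem, here is a minimal Python sketch. The helper name join_wrapped and its rule (a letter followed by a hyphen at the end of a line, with a lowercase letter starting the next line, means a split word) are assumptions made up for this example, not an established method.

```python
import re


def join_wrapped(first, second):
    """Join two adjacent lines, guessing whether a trailing hyphen marks a split word."""
    # Assumed heuristic: "letter + hyphen" at the end of the first line,
    # followed by a lowercase letter, is treated as one word split in two.
    if re.search(r'[A-Za-z]-$', first) and re.match(r'[a-z]', second):
        return first[:-1] + second
    # Otherwise the lines are joined with a single space.
    return first + ' ' + second


print(join_wrapped('a difficult prob-', 'lem in general'))  # split word, joined correctly
print(join_wrapped('this is a well-', 'known problem'))     # real hyphen, wrongly removed
```

The second call prints "this is a wellknown problem": the rule cannot tell a real hyphen from one added by the wrapping, which is exactly why some context (or a dictionary) is needed.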

A common form for prose, however, is paragraphs made of adjacent wrapped lines, with the paragraphs separated from each other by a single empty line.

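This can be unwrapped using the following, rather involved, sed one-liner (GNU sed syntax; the input needs to end with an empty line, otherwise the last paragraph is not printed):

sed -n '/./ H; /^$/ { x; s/^\n//; s/\n/ /g; s/$/\n/; p; }'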

Alternatively, you might prefer a tiny Python script, particularly if you are going to handle some special cases:

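import sys

paragraph = []  # lines of the paragraph currently being collected

for line in sys.stdin:
    line = line.strip()
    if line:
        paragraph.append(line)
    else:
        # An empty line ends the paragraph: print it as a single line
        print(' '.join(paragraph).replace('  ', ' '))
        paragraph = []

# Print the last paragraph if the input did not end with an empty line
if paragraph:
    print(' '.join(paragraph).replace('  ', ' '))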
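
As an example of such a special case, here is a sketch that keeps bullet points on their own lines. The function name unwrap and the BULLET pattern ("-", "*", "+" or a number followed by a dot at the start of a line) are assumptions for illustration. Unlike the script above, it also keeps the empty line between paragraphs.

```python
import re

# Assumption: these markers at the start of a line begin a list item.
BULLET = re.compile(r'^\s*(?:[-*+]|\d+\.)\s+')


def unwrap(text):
    """Return text with each paragraph or list item joined onto one line."""
    out = []      # finished output lines
    current = []  # pieces of the paragraph or list item being collected

    def flush():
        # Emit whatever has been collected so far as a single line.
        if current:
            out.append(' '.join(current))
            current.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()
            out.append('')  # keep the empty line between paragraphs
        elif BULLET.match(line):
            flush()         # a bullet always starts a new output line
            current.append(stripped)
        else:
            current.append(stripped)
    flush()
    return '\n'.join(out)


sample = "Some wrapped\nprose here.\n\n- first item that\n  continues\n- second item\n"
print(unwrap(sample))
```

A line of prose that happens to start with "*" or "1." would be mistaken for a list item, so treat this as a starting point only.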

If you find yourself adding more and more special cases, you might prefer to track down the origin of your line-wrapped text and obtain it in a non-line-wrapped form.


Special cases, as Att Righ said…

I found this question because I wanted to "unwrap" output from the fortune program, which annoyingly isn't even standardized: some fortune cookies are wrapped at 78 characters, others at 77, 76, or even 75.
My script tries to determine whether a newline was inserted on purpose or because of the length limit by checking whether the line would have violated the limit had it not been broken there (i.e. whether it would be too long if it also included the first word from the next line). As a useful side effect, if the next line starts with whitespace, its first word (as separated by whitespace) is the empty string, so an indented paragraph is not merged onto the line above it unless that line is already 75 characters or longer.

#!/usr/bin/python3

import sys
import fileinput

lines = list(fileinput.input())
lines = [l.strip('\r\n') for l in lines]

for i, l in enumerate(lines):
    # We need to account for 8-char-wide tabulators when calculating our line
    # length, but still want to print the original \t characters verbatim
    sanitized_line = l.replace('\t', ' '*8)

    # Is there a next line?
    if i+1 < len(lines):
        sanitized_next_line = lines[i+1].replace('\t', ' '*8)
    else:
        sanitized_next_line = ''

    next_line_first_word = sanitized_next_line.split(' ', 1)[0]

    if next_line_first_word != '':
        extended_line = sanitized_line + ' ' + next_line_first_word
    else:
        extended_line = sanitized_line

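    # A line of at most 78 characters that would exceed 74 characters with the
    # next word appended was most likely wrapped at one of the 75-78 limits.
    if len(sanitized_line) <= 78 and len(extended_line) > 74:
        # This line was wrapped due to the length limit => unwrap it!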
        sys.stdout.write(l + ' ')
    else:
        sys.stdout.write(l + '\n')
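
To try the heuristic on a small sample without piping a file through the script, the same test can be wrapped in a function. This is only a sketch: the name unwrap_by_length and the parameters min_width and max_width are made up here, with defaults matching the 75 to 78 range mentioned above.

```python
def unwrap_by_length(lines, min_width=75, max_width=78):
    """Join lines that look as if they were wrapped at min_width..max_width."""
    out = []
    for i, line in enumerate(lines):
        # Count each tab as 8 columns, as the script above does.
        measured = line.replace('\t', ' ' * 8)
        if i + 1 < len(lines):
            next_line = lines[i + 1].replace('\t', ' ' * 8)
        else:
            next_line = ''
        # An indented or empty next line yields '' as its first word.
        first_word = next_line.split(' ', 1)[0]
        extended = measured + ' ' + first_word if first_word else measured
        wrapped = len(measured) <= max_width and len(extended) >= min_width
        out.append(line + (' ' if wrapped else '\n'))
    return ''.join(out)


long_line = 'x' * 70
print(unwrap_by_length([long_line, 'longerword rest', 'short']), end='')
print(unwrap_by_length(['Roses are red,', 'violets are blue.']), end='')
```

In the first call the 70-character line is joined with the line after it, because appending "longerword" would have pushed it past the limit. The two short lines in the second call are left alone.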

Tags: Linux, Unix, Script