How to truncate a file to a maximum number of characters (not bytes)
Some systems have a truncate command that truncates files to a number of bytes (not characters). I don't know of any that truncate to a number of characters, though you could resort to perl, which is installed by default on most systems:
perl
perl -Mopen=locale -ne '
BEGIN{$/ = \1234} truncate STDIN, tell STDIN; last' <> "$file"
With -Mopen=locale, we use the locale's notion of what characters are (so in locales using the UTF-8 charset, that's UTF-8 encoded characters). Replace with -CS if you want I/O to be decoded/encoded in UTF-8 regardless of the locale's charset.
$/ = \1234: we set the record separator to a reference to an integer, which is a way to specify records of fixed length (in number of characters).
Then, upon reading the first record, we truncate stdin in place (so at the end of the first record) and exit.
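As a quick sanity check (an illustrative sketch; it assumes a UTF-8 locale, and the file name is made up), note that multi-byte characters make the character count differ from the byte count:
file=sample.txt
printf 'héllo wörld\n' > "$file"   # 12 characters, 14 bytes in UTF-8
perl -Mopen=locale -ne '
 BEGIN{$/ = \5} truncate STDIN, tell STDIN; last' <> "$file"
wc -m < "$file"                    # 5: the file now holds "héllo" (5 characters, 6 bytes)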
GNU sed
With GNU sed, you could do (assuming the file doesn't contain NUL characters or sequences of bytes which don't form valid characters, both of which should be true of text files):
sed -Ez -i -- 's/^(.{1234}).*/\1/' "$file"
But that's far less efficient, as it reads the file in full and stores it whole in memory, and writes a new copy.
GNU awk
Same with GNU awk:
awk -i inplace -v RS='^$' -e '{printf "%s", substr($0, 1, 1234)}' -E /dev/null "$file"
-e code -E /dev/null "$file" is one way to pass arbitrary file names to gawk.
RS='^$': slurp mode (the whole input is read as a single record).
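If in-place editing is not needed, a non-destructive variant along the same lines (a sketch; the output file name is made up) writes the truncated copy to a new file instead, which also sidesteps the arbitrary-file-name issue:
gawk -v RS='^$' '{printf "%s", substr($0, 1, 1234)}' < "$file" > "$file.truncated"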
Shell builtins
With ksh93, bash or zsh (with shells other than zsh, assuming the content doesn't contain NUL bytes):
content=$(cat < "$file" && echo .) &&
content=${content%.} &&
printf %s "${content:0:1234}" > "$file"
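The trailing dot is there because command substitution strips trailing newlines; a minimal illustration of the trick:
v=$(printf 'a\n\n'); printf %s "$v" | wc -c              # 1: trailing newlines are lost
v=$(printf 'a\n\n.'); v=${v%.}; printf %s "$v" | wc -c   # 3: they are preserved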
With zsh:
read -k1234 -u0 s < $file &&
printf %s $s > $file
Or:
zmodload zsh/mapfile
mapfile[$file]=${mapfile[$file][1,1234]}
With ksh93 or bash (beware it's bogus for multi-byte characters in several versions of bash):
IFS= read -rN1234 s < "$file" &&
printf %s "$s" > "$file"
ksh93 can also truncate the file in place, instead of rewriting it, with its <>; redirection operator:
IFS= read -rN1234 0<>; "$file"
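For instance (a sketch for ksh93 only; the file name and character count are made up):
file=sample.txt
printf 'héllo wörld\n' > "$file"   # 12 characters
IFS= read -rN5 0<>; "$file"        # read 5 characters, then truncate the file there
wc -m < "$file"                    # 5 ("héllo")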
iconv + head
To print the first 1234 characters, another option could be to convert to an encoding with a fixed number of bytes per character like UTF32BE/UCS-4:
iconv -t UCS-4 < "$file" | head -c "$((1234 * 4))" | iconv -f UCS-4
head -c is not standard, but fairly common. A standard equivalent would be dd bs=1 count="$((1234 * 4))", but that would be less efficient, as it would read the input and write the output one byte at a time¹. iconv is a standard command, but the encoding names are not standardized, so you might find systems without UCS-4.
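Spelled out, that standard-tools variant would look like this (dd's transfer statistics go to stderr, hence the 2> /dev/null):
iconv -t UCS-4 < "$file" | dd bs=1 count="$((1234 * 4))" 2> /dev/null | iconv -f UCS-4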
Notes
In any case, though the output would have at most 1234 characters, it may end up not being valid text, as it would possibly end in a non-delimited line.
Also note that while those solutions wouldn't cut text in the middle of a character, they could break it in the middle of a grapheme, like a é expressed as U+0065 U+0301 (an e followed by a combining acute accent), or Hangul syllable graphemes in their decomposed forms.
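To illustrate in bash (a sketch; it assumes a UTF-8 locale and a printf that understands \u escapes, such as bash's builtin):
s=$(printf 'e\u0301')      # decomposed é: U+0065 U+0301, two characters
printf '%s\n' "${s:0:1}"   # prints a bare "e": the combining accent is cut off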
¹ And on pipe input, you can't use bs values other than 1 reliably unless you use the iflag=fullblock GNU extension, as dd could do short reads if it reads the pipe quicker than iconv fills it.
If you know that the text file contains Unicode encoded as UTF-8, you have to decode the UTF-8 first to get a sequence of Unicode characters and split those.
I'd choose Python 3.x for the job.
With Python 3.x, the function open() has an extra keyword argument encoding= for reading text files. The description of the method io.TextIOBase.read() looks promising.
So using Python 3 it would look like this:
with open('/path/to/file.txt', 'rt', encoding='utf-8') as f:
    truncated = f.read(1000)
Obviously, a real tool would add command-line arguments, error handling, etc.
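Such a tool might look like this (an illustrative sketch, not part of the original answer; the option names are made up):
#!/usr/bin/env python3
"""Truncate a UTF-8 text file to at most N characters (illustrative sketch)."""
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(description='Truncate a file to N characters.')
    parser.add_argument('file')
    parser.add_argument('-n', '--chars', type=int, default=1000,
                        help='maximum number of characters to keep')
    args = parser.parse_args()
    try:
        # Decode as UTF-8 and keep at most N characters (not bytes).
        with open(args.file, 'rt', encoding='utf-8') as f:
            truncated = f.read(args.chars)
        # Rewrite the file with the truncated content.
        with open(args.file, 'wt', encoding='utf-8') as f:
            f.write(truncated)
    except (OSError, UnicodeDecodeError) as exc:
        sys.exit('error: {}'.format(exc))

if __name__ == '__main__':
    main()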
With Python 2.x, you could implement your own file-like object and decode the input file line by line.