How to truncate a file to a maximum number of characters (not bytes)
Some systems have a truncate command that truncates files to a number of bytes (not characters). I don't know of any that truncate to a number of characters, though you could resort to perl, which is installed by default on most systems:
perl
perl -Mopen=locale -ne '
BEGIN{$/ = \1234} truncate STDIN, tell STDIN; last' <> "$file"
With -Mopen=locale, we use the locale's notion of what characters are (so in locales using the UTF-8 charset, that's UTF-8 encoded characters). Replace with -CS if you want I/O to be decoded/encoded in UTF-8 regardless of the locale's charset.
$/ = \1234: we set the record separator to a reference to an integer, which is a way to specify records of fixed length (in number of characters).
Then, upon reading the first record, we truncate stdin in place (so at the end of the first record) and exit.
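As a quick sanity check (an illustrative sketch; it assumes a UTF-8 locale, and the file name is made up), note that multi-byte characters make the character count differ from the byte count:
file=sample.txt
printf 'héllo wörld\n' > "$file"   # 12 characters, 14 bytes in UTF-8
perl -Mopen=locale -ne '
 BEGIN{$/ = \5} truncate STDIN, tell STDIN; last' <> "$file"
wc -m < "$file"                    # 5: the file now holds "héllo" (5 characters, 6 bytes)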
GNU sed
With GNU sed, you could do (assuming the file doesn't contain NUL characters or sequences of bytes which don't form valid characters, both of which should be true of text files):
sed -Ez -i -- 's/^(.{1234}).*/\1/' "$file"
But that's far less efficient, as it reads the file in full and stores it whole in memory, and writes a new copy.
GNU awk
Same with GNU awk:
awk -i inplace -v RS='^$' -e '{printf "%s", substr($0, 1, 1234)}' -E /dev/null "$file"
-e code -E /dev/null "$file" is one way to pass arbitrary file names to gawk.
RS='^$': slurp mode (the whole input is read as a single record).
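If in-place editing is not needed, a non-destructive variant along the same lines (a sketch; the output file name is made up) writes the truncated copy to a new file instead, which also sidesteps the arbitrary-file-name issue:
gawk -v RS='^$' '{printf "%s", substr($0, 1, 1234)}' < "$file" > "$file.truncated"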
Shell builtins
With ksh93, bash or zsh (with shells other than zsh, assuming the content doesn't contain NUL bytes):
content=$(cat < "$file" && echo .) &&
content=${content%.} &&
printf %s "${content:0:1234}" > "$file"
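The trailing dot is there because command substitution strips trailing newlines; a minimal illustration of the trick:
v=$(printf 'a\n\n'); printf %s "$v" | wc -c              # 1: trailing newlines are lost
v=$(printf 'a\n\n.'); v=${v%.}; printf %s "$v" | wc -c   # 3: they are preserved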
With zsh:
read -k1234 -u0 s < $file &&
printf %s $s > $file
Or:
zmodload zsh/mapfile
mapfile[$file]=${mapfile[$file][1,1234]}
With ksh93 or bash (beware it's bogus for multi-byte characters in several versions of bash):
IFS= read -rN1234 s < "$file" &&
printf %s "$s" > "$file"
ksh93 can also truncate the file in place, instead of rewriting it, with its <>; redirection operator:
IFS= read -rN1234 0<>; "$file"
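For instance (a sketch for ksh93 only; the file name and character count are made up):
file=sample.txt
printf 'héllo wörld\n' > "$file"   # 12 characters
IFS= read -rN5 0<>; "$file"        # read 5 characters, then truncate the file there
wc -m < "$file"                    # 5 ("héllo")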
iconv + head
To print the first 1234 characters, another option could be to convert to an encoding with a fixed number of bytes per character like UTF32BE/UCS-4:
iconv -t UCS-4 < "$file" | head -c "$((1234 * 4))" | iconv -f UCS-4
head -c is not standard, but fairly common. A standard equivalent would be dd bs=1 count="$((1234 * 4))", but that would be less efficient, as it would read the input and write the output one byte at a time¹. iconv is a standard command, but the encoding names are not standardized, so you might find systems without UCS-4.
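Spelled out, that standard-tools variant would look like this (dd's transfer statistics go to stderr, hence the 2> /dev/null):
iconv -t UCS-4 < "$file" | dd bs=1 count="$((1234 * 4))" 2> /dev/null | iconv -f UCS-4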
Notes
In any case, though the output would have at most 1234 characters, it may end up not being valid text, as it would possibly end in a non-delimited line.
Also note that while those solutions wouldn't cut text in the middle of a character, they could break it in the middle of a grapheme, like a é expressed as U+0065 U+0301 (an e followed by a combining acute accent), or Hangul syllable graphemes in their decomposed forms.
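To illustrate in bash (a sketch; it assumes a UTF-8 locale and a printf that understands \u escapes, such as bash's builtin):
s=$(printf 'e\u0301')      # decomposed é: U+0065 U+0301, two characters
printf '%s\n' "${s:0:1}"   # prints a bare "e": the combining accent is cut off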
¹ And on pipe input, you can't use bs values other than 1 reliably unless you use the iflag=fullblock GNU extension, as dd could do short reads if it reads the pipe quicker than iconv fills it.
If you know that the text file contains Unicode encoded as UTF-8, you have to decode the UTF-8 first to get a sequence of Unicode characters and split those.
I'd choose Python 3.x for the job.
With Python 3.x, the function open() has an extra keyword argument encoding= for reading text files. The description of the method io.TextIOBase.read() looks promising.
So using Python 3 it would look like this:
with open('/path/to/file.txt', 'rt', encoding='utf-8') as f:
    truncated = f.read(1000)
Obviously, a real tool would add command-line arguments, error handling, etc.
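Such a tool might look like this (an illustrative sketch, not part of the original answer; the option names are made up):
#!/usr/bin/env python3
"""Truncate a UTF-8 text file to at most N characters (illustrative sketch)."""
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(description='Truncate a file to N characters.')
    parser.add_argument('file')
    parser.add_argument('-n', '--chars', type=int, default=1000,
                        help='maximum number of characters to keep')
    args = parser.parse_args()
    try:
        # Decode as UTF-8 and keep at most N characters (not bytes).
        with open(args.file, 'rt', encoding='utf-8') as f:
            truncated = f.read(args.chars)
        # Rewrite the file with the truncated content.
        with open(args.file, 'wt', encoding='utf-8') as f:
            f.write(truncated)
    except (OSError, UnicodeDecodeError) as exc:
        sys.exit('error: {}'.format(exc))

if __name__ == '__main__':
    main()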
With Python 2.x, you could implement your own file-like object and decode the input file line by line.