Why is printf "shrinking" umlaut?
POSIX requires printf's %-20s to count those 20 in terms of bytes, not characters, even though that makes little sense given that printf is meant to print formatted text (see the discussions on the Austin Group (POSIX) and bash mailing lists).
The printf builtin of bash and most other POSIX shells honour that.
zsh ignores that silly requirement (even in sh emulation), so printf works as you'd expect there. Same for the printf builtin of fish (not a POSIX-like shell).
The ü character (U+00FC), when encoded in UTF-8, is made of two bytes (0xc3 and 0xbc), which explains the discrepancy.
$ printf %s 'Früchte und Gemüse' | wc -mcL
18 20 18
That string is made of 18 characters and is 18 columns wide (-L being a GNU wc extension that reports the display width of the widest line in the input), but it is encoded on 20 bytes.
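The gap between the two counts can also be checked from the shell itself. A minimal sketch, assuming a shell in which assigning LC_ALL takes effect immediately (bash, zsh and ksh93 behave that way); byte_len is a name made up for this illustration:

```shell
# byte_len STRING: print the length of STRING in bytes.
# The function body is a subshell, so forcing the C locale (where every
# byte counts as one character) does not leak into the caller.
byte_len() (
    LC_ALL=C
    printf '%s\n' "${#1}"
)

s='Früchte und Gemüse'
printf 'bytes: %s\n' "$(byte_len "$s")"
# ${#s} counts characters only in a multibyte-aware shell running in a
# UTF-8 locale; elsewhere it falls back to the byte count.
printf 'chars: %s\n' "${#s}"
```

The difference between the two numbers is the number of padding spaces a byte-counting printf loses.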
In zsh or fish, the text would be aligned correctly.
Now, there are also characters that have zero width (like combining characters such as U+0308, the combining diaeresis) or double width, as in many Asian scripts (not to mention control characters like Tab), and even zsh wouldn't align those properly.
Example, in zsh:
$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
 ü|
  ᄀ|
In bash:
$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
 ü|
ü|
ᄀ|
ksh93 has a %Ls format specification that counts the width in terms of display width.
$ printf '%3Ls|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|
That still doesn't work if the text contains control characters like TAB (how could it? printf would have to know how far apart the tab stops are on the output device and at what position it starts printing). It does work, by accident, with backspace characters (as in roff output, where a bold X is written as X\bX), because ksh93 considers all control characters as having a width of -1.
Other options
In zsh, you can use its padding parameter expansion flags (l for left-padding, r for right-padding), which, when combined with the m flag, consider the display width of characters (as opposed to the number of characters in the string):
$ () { printf '%s|\n' "${(ml[3])@}"; } u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|
With expand:
printf '%s\t|\n' u ü $'u\u308' $'\u1100' | expand -t3
That works with some expand implementations (not GNU's though).
On GNU systems, you could use GNU awk, whose printf counts in characters (not bytes, not display widths, so still not OK for the zero-width or double-width characters, but OK for your sample):
gawk 'BEGIN {for (i = 1; i < ARGC; i++) printf "%-3s|\n", ARGV[i]}
' u ü $'u\u308' $'\u1100'
If the output goes to a terminal, you can also use cursor positioning escape sequences. Like:
forward21=$(tput cuf 21)
printf '%s\r%s%s\n' \
    "Früchte und Gemüse" "$forward21" "foo" \
    "Milchprodukte" "$forward21" "bar" \
    "12345678901234567890" "$forward21" "baz"
If I change its encoding to latin-1, the alignment is correct, but the umlauts are rendered wrong:
Fr�chte und Gem�se   foo
Milchprodukte        bar
12345678901234567890 baz
Actually, no: the output is correct, but your terminal doesn't speak latin-1, so you get junk rather than umlauts.
You can fix this by using iconv:
printf foo bar | iconv -f ISO8859-1 -t UTF-8
(or just run the whole shell script piped into iconv)
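The same idea can be applied to a single field: convert the text to latin-1, where one character is one byte, pad there, and convert the result back. A rough sketch, assuming the text is UTF-8 and only contains characters that exist in ISO-8859-1 (iconv fails otherwise); pad_latin1 is a hypothetical helper name:

```shell
# pad_latin1 WIDTH TEXT: left-justify TEXT in a field of WIDTH characters.
# Runs in a subshell so the locale change and the temporary variable
# stay private to the function.
pad_latin1() (
    LC_ALL=C    # make the shell treat every byte as one character
    latin=$(printf '%s' "$2" | iconv -f UTF-8 -t ISO-8859-1) || exit
    # In ISO-8859-1 one character is one byte, so byte-based padding
    # gives the right number of spaces; convert back afterwards.
    printf '%-*s' "$1" "$latin" | iconv -f ISO-8859-1 -t UTF-8
)

printf '%s|\n' "$(pad_latin1 20 'Früchte und Gemüse')"
printf '%s|\n' "$(pad_latin1 20 'Milchprodukte')"
```

This costs two iconv processes per field, so it is only reasonable for small amounts of output.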
The ${#var} character count is correct since bash 3.0.
Try (with any version of bash):
bash -c "a="$'aáíóuúüoözu\u308\u1100'';printf "%s\n" "${a} ${#a}"'
That will give the correct count since bash 3.0.
Note, however, that $'u\u308' requires bash 4.2 or later.
This makes it possible to compute a proper padding:
#!/usr/bin/env bash
strings=(
    'Früchte und Gemüse'
    'Milchprodukte'
    '12345678901234567890'
)
# Initialize column width
cw=20
for str in "${strings[@]}"
do
    # Format column1 with computed padding
    printf -v col1string '%s%*s' "$str" $((cw-${#str})) ''
    # Print column1 with computed padding, followed by column2
    printf "%s %s\n" "$col1string" 'col2string'
done
Output:
Früchte und Gemüse   col2string
Milchprodukte        col2string
12345678901234567890 col2string
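A variation on the same computation keeps printf's own %-*s padding and only corrects the field width. This is a sketch, not a drop-in replacement: it assumes a printf that counts the width in bytes (as bash's does) and a ${#var} that counts characters; in zsh, whose printf already counts characters, it would over-pad. pad_adjusted is a made-up name:

```shell
# pad_adjusted WIDTH TEXT: left-justify TEXT in WIDTH characters using a
# printf that measures the field width in bytes.
pad_adjusted() (
    chars=${#2}                               # length in characters
    bytes=$(LC_ALL=C; printf '%s' "${#2}")    # length in bytes
    # Each byte beyond one per character eats one padding space,
    # so widen the field by that difference.
    printf '%-*s' "$(($1 + bytes - chars))" "$2"
)

printf '%s col2string\n' "$(pad_adjusted 20 'Früchte und Gemüse')"
printf '%s col2string\n' "$(pad_adjusted 20 'Milchprodukte')"
```

Like the loop above, this handles neither zero-width nor double-width characters.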
The same idea, wrapped in reusable alignment functions:
#!/usr/bin/env bash
# Space pad align string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# aligned string
# @return:
# 1: If a string exceeds alignment width
# 2: If missing arguments
align_left ()
{
    (($#==2)) || return 2
    ((${#2}>$1)) && return 1
    printf '%s%*s' "$2" $(($1-${#2})) ''
}
align_right ()
{
    (($#==2)) || return 2
    ((${#2}>$1)) && return 1
    printf '%*s%s' $(($1-${#2})) '' "$2"
}
align_center ()
{
    (($#==2)) || return 2
    ((${#2}>$1)) && return 1
    l=$((($1-${#2})/2))
    printf '%*s%s%*s' $l '' "$2" $(($1-${#2}-l)) ''
}
strings=(
'Früchte und Gemüse'
'Milchprodukte'
'12345678901234567890'
)
echo 'Left-aligned:'
for str in "${strings[@]}"
do
    printf "| %s |\n" "$(align_left 20 "$str")"
done
echo
echo 'Right-aligned:'
for str in "${strings[@]}"
do
    printf "| %s |\n" "$(align_right 20 "$str")"
done
echo
echo 'Center-aligned:'
for str in "${strings[@]}"
do
    printf "| %s |\n" "$(align_center 20 "$str")"
done
Output:
Left-aligned:
| Früchte und Gemüse   |
| Milchprodukte        |
| 12345678901234567890 |
Right-aligned:
|   Früchte und Gemüse |
|        Milchprodukte |
| 12345678901234567890 |
Center-aligned:
|  Früchte und Gemüse  |
|    Milchprodukte     |
| 12345678901234567890 |
EDITS:
- Added a ksh93 / POSIX implementation.
- More POSIXness with expr; now also tested working with:
  - ash (Busybox 1.x)
  - ksh93 Version A 2020.0.0
  - zsh 5.8
- With advice from Stéphane Chazelas: replaced expr length "$2" by expr " $2" : '.*' - 1.
- Updated the introduction with isaac's comment: the ${#var} character count is correct since bash 3.0.
This seems to work as well with ksh or POSIX syntax:
#!/usr/bin/env sh
# Space pad align or truncate string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# The aligned string
# @return:
# 1: If the string was truncated to the alignment width
# 2: If missing arguments
__align_check ()
{
    if [ $# -ne 2 ]; then return 2; fi
    if [ "$(expr " $2" : '.*' - 1)" -gt "$1" ]; then
        printf '%s' "$(expr substr "$2" 1 $1)"
        return 1
    fi
}
align_left ()
{
    __align_check "$@" || return $?
    printf '%s%*s' "$2" $(($1-$(expr " $2" : '.*' - 1))) ''
}
align_right ()
{
    __align_check "$@" || return $?
    printf '%*s%s' $(($1-$(expr " $2" : '.*' - 1))) '' "$2"
}
align_center ()
{
    __align_check "$@" || return $?
    tpl=$(($1-$(expr " $2" : '.*' - 1)))
    lpl=$((tpl/2))
    rpl=$((tpl-lpl))
    printf '%*s%s%*s' $lpl '' "$2" $rpl ''
}
main ()
{
    hr="+----------------------+----------------------+----------------------\
+------+"
    echo "$hr"
    printf '| %s | %s | %s | %s |\n' \
        "$(align_left 20 'Left-aligned')" \
        "$(align_center 20 'Center-aligned')" \
        "$(align_right 20 'Right-aligned')" \
        "$(align_center 4 'RC')"
    echo "$hr"
    for str
    do
        printf '| %s | %s | %s | %s |\n' \
            "$(align_left 20 "$str")" \
            "$(align_center 20 "$str")" \
            "$(align_right 20 "$str")" \
            "$(align_right 4 "$?")"
    done
    echo "$hr"
}
main \
    'Früchte und Gemüse' \
    'Milchprodukte' \
    '12345678901234567890' \
    'This string is much too long'
Output:
+----------------------+----------------------+----------------------+------+
| Left-aligned         |    Center-aligned    |        Right-aligned |  RC  |
+----------------------+----------------------+----------------------+------+
| Früchte und Gemüse   |  Früchte und Gemüse  |   Früchte und Gemüse |    0 |
| Milchprodukte        |    Milchprodukte     |        Milchprodukte |    0 |
| 12345678901234567890 | 12345678901234567890 | 12345678901234567890 |    0 |
| This string is much  | This string is much  | This string is much  |    1 |
+----------------------+----------------------+----------------------+------+
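One caveat on the truncation step: expr substr, used in __align_check, is not among the operators POSIX specifies for expr, so it may be missing from some implementations. A possible replacement that relies only on POSIX parameter expansion is sketched below; truncate_chars is a hypothetical name, and the sketch assumes a locale-aware shell (a shell that works on bytes, such as dash, could cut a multibyte character in half):

```shell
# truncate_chars WIDTH TEXT: print at most the first WIDTH characters of TEXT.
# ${s%?} removes the last character, so the loop trims from the end until
# the string fits. Runs in a subshell to keep the variable s private.
truncate_chars() (
    s=$2
    while [ "${#s}" -gt "$1" ]; do
        s=${s%?}
    done
    printf '%s' "$s"
)

printf '[%s]\n' "$(truncate_chars 20 'This string is much too long')"
```

The loop removes one character per iteration, which is fine for short table cells but slow for very long strings.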