Why is printf "shrinking" umlaut?

POSIX requires printf's %-20s to count those 20 in terms of bytes, not characters, even though that makes little sense given that printf is meant to print formatted text (see the discussions on the Austin Group (POSIX) and bash mailing lists).

The printf builtin of bash and most other POSIX shells honour that.

zsh ignores that silly requirement (even in sh emulation) so printf works as you'd expect there. Same for the printf builtin of fish (not a POSIX-like shell).

The ü character (U+00FC), when encoded in UTF-8 is made of two bytes (0xc3 and 0xbc), which explains the discrepancy.

$ printf %s 'Früchte und Gemüse' | wc -mcL
    18      20      18

That string is made of 18 characters and is 18 columns wide (-L being a GNU wc extension that reports the display width of the widest line in the input), but it is encoded in 20 bytes.
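To check the byte count without depending on the locale or on how your script file is encoded, here is a minimal sketch. byte_len is a hypothetical helper name, and the sample string is built from explicit UTF-8 bytes (ü being octal 303 274):

```shell
# Count the bytes in a string; wc -c counts bytes whatever the locale.
byte_len() {
  printf %s "$1" | wc -c | tr -d ' '
}

# Build the sample text from explicit UTF-8 bytes (ü = octal 303 274),
# so the result does not depend on how this file is encoded.
sample=$(printf 'Fr\303\274chte und Gem\303\274se')

# 20 bytes for 18 characters.
byte_len "$sample"

# In bash the padded field is also 20 bytes long: the string already
# fills the 20 "bytes" of %-20s, so no padding is added at all.
printf '%-20s' "$sample" | wc -c | tr -d ' '
```

Both commands print 20 in bash, which is why the column after that string ends up two positions short.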

In zsh or fish, the text would be aligned correctly.

Now, there are also characters that have zero width (like combining characters such as U+0308, the combining diaeresis) or double width (as in many East Asian scripts), not to mention control characters like Tab, and even zsh wouldn't align those properly.

Example, in zsh:

$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
 ü|
  ᄀ|

In bash:

$ printf '%3s|\n' u ü $'u\u308' $'\u1100'
  u|
 ü|
ü|
ᄀ|

ksh93 has a %Ls format specification to count the width in terms of display width.

$ printf '%3Ls|\n' u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|

That still doesn't work if the text contains control characters like TAB (how could it? printf would have to know how far apart the tab stops are on the output device and at what position it starts printing). It does happen to work with backspace characters (as in roff output, where a bold X is written as X\bX), though, because ksh93 considers all control characters as having a width of -1.

Other options

In zsh, you can use its padding parameter expansion flags (l for left-padding, r for right-padding), which, when combined with the m flag, consider the display width of characters (as opposed to the number of characters in the string):

$ () { printf '%s|\n' "${(ml[3])@}"; } u ü $'u\u308' $'\u1100'
  u|
  ü|
  ü|
 ᄀ|

With expand:

printf '%s\t|\n' u ü $'u\u308' $'\u1100' | expand -t3

That works with some expand implementations (not GNU's though).

On GNU systems, you could use GNU awk whose printf counts in chars (not bytes, not display-widths, so still not OK for the 0-width or 2-width characters, but OK for your sample):

gawk 'BEGIN {for (i = 1; i < ARGC; i++) printf "%-3s|\n", ARGV[i]}
     ' u ü $'u\u308' $'\u1100'

If the output goes to a terminal, you can also use cursor positioning escape sequences. Like:

forward21=$(tput cuf 21)
printf '%s\r%s%s\n' \
  "Früchte und Gemüse"    "$forward21" "foo" \
  "Milchprodukte"         "$forward21" "bar" \
  "12345678901234567890"  "$forward21" "baz"

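If tput is not available, the same idea can be sketched with a hard-coded escape sequence. This assumes the terminal understands the ECMA-48 "cursor forward" sequence (ESC [ n C), which is what tput cuf emits on most terminals; row_at is a hypothetical helper name:

```shell
# Print the left text, return to column 0, move right $1 columns,
# then print the right text.
# \033[<n>C is the ECMA-48 "cursor forward" sequence (assumed supported).
row_at() {
  printf '%s\r\033[%dC%s\n' "$2" "$1" "$3"
}

row_at 21 'Früchte und Gemüse'   foo
row_at 21 'Milchprodukte'        bar
row_at 21 '12345678901234567890' baz
```

Since the terminal does the positioning, the byte or character count of the first column no longer matters, but the output is only meaningful on a terminal, not in a file or a pipe.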
If I change its encoding to latin-1, the alignment is correct, but the umlauts are rendered wrong:
Fr�chte und Gem�se   foo
Milchprodukte        bar
12345678901234567890 baz

Actually, no, they are not rendered wrong: your terminal doesn't speak latin-1, and therefore you get junk rather than umlauts.

You can fix this by using iconv:

printf foo bar | iconv -f ISO8859-1 -t UTF-8

(or just run the whole shell script piped into iconv)
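As a rough sketch of that approach: in latin-1 every character is a single byte, so printf's byte-based padding comes out right, and the finished line is then converted for a UTF-8 terminal. latin1_row is a hypothetical helper, and the sample string is written with octal escapes (\374 is ü in ISO-8859-1) so that it works whatever the encoding of the script file:

```shell
# Format one row in latin-1, where each character is one byte, then
# convert the finished line to UTF-8 for the terminal.
latin1_row() {
  # \374 is ü in ISO-8859-1
  printf '%-20s %s\n' "$(printf 'Fr\374chte und Gem\374se')" "$1" |
    iconv -f ISO-8859-1 -t UTF-8
}

latin1_row foo
```

That only helps for text that can be represented in latin-1 in the first place.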

The ${#var} character count has been correct since bash 3.0.

Try (with any version of bash):

bash -c "a="$'aáíóuúüoözu\u308\u1100'';printf "%s\n" "${a} ${#a}"'

That will give the correct count since bash 3.0.

Note, however, that $'u\u308' requires bash 4.2 or later.

This makes it possible to compute a proper padding:

#!/usr/bin/env bash

strings=(
  'Früchte und Gemüse'
  'Milchprodukte'
  '12345678901234567890'
)

# Initialize column width
cw=20

for str in "${strings[@]}"
do
  # Format column1 with computed padding
  printf -v col1string '%s%*s' "$str" $((cw-${#str})) ''

  # Print column1 with computed padding, followed by column2
  printf "%s %s\n" "$col1string" 'col2string'
done

Output:

Früchte und Gemüse   col2string
Milchprodukte        col2string
12345678901234567890 col2string

A more complete version, with dedicated alignment functions:

#!/usr/bin/env bash

# Space pad align string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# aligned string
# @return:
# 1: If a string exceeds alignment width
# 2: If missing arguments
align_left ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  printf '%s%*s' "$2" $(($1-${#2})) ''
}
align_right ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  printf '%*s%s' $(($1-${#2})) '' "$2"
}
align_center ()
{
  (($#==2)) || return 2
  ((${#2}>$1)) && return 1
  l=$((($1-${#2})/2))
  printf '%*s%s%*s' $l '' "$2" $(($1-${#2}-l)) ''
}

strings=(
  'Früchte und Gemüse'
  'Milchprodukte'
  '12345678901234567890'
)

echo 'Left-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_left 20 "$str")"
done
echo
echo 'Right-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_right 20 "$str")"
done
echo
echo 'Center-aligned:'
for str in "${strings[@]}"
do
  printf "| %s |\n" "$(align_center 20 "$str")"
done

Output:

Left-aligned:
| Früchte und Gemüse   |
| Milchprodukte        |
| 12345678901234567890 |

Right-aligned:
|   Früchte und Gemüse |
|        Milchprodukte |
| 12345678901234567890 |

Center-aligned:
|  Früchte und Gemüse  |
|    Milchprodukte     |
| 12345678901234567890 |
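
Note that ${#str} counts characters, not columns, so zero-width or double-width characters would still be misaligned. As a hedged sketch only: on systems where wc supports -L (a GNU extension, as mentioned above), the padding could be computed from the display width instead. disp_width and align_left_cols are hypothetical names, and the result depends on the locale being a UTF-8 one:

```shell
# Display width of a string, as reported by wc -L (GNU extension, not POSIX).
disp_width() {
  printf '%s\n' "$1" | wc -L
}

# Left-align $2 in a field $1 columns wide; no truncation in this sketch,
# a string that is too wide is printed as is.
align_left_cols() {
  _w=$(disp_width "$2")
  _pad=$(( $1 > $_w ? $1 - $_w : 0 ))
  printf '%s%*s' "$2" "$_pad" ''
}

printf '| %s |\n' "$(align_left_cols 20 'Milchprodukte')"
```

This forks a wc process per string, so it is slower than ${#str}, and like everything else here it cannot cope with TAB characters.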

EDITS:

Add ksh-93 | POSIX implementation
More POSIXness with expr, now also tested working with:

ash (Busybox 1.x)
ksh93 Version A 2020.0.0
zsh 5.8

With advice from Stéphane Chazelas: replaced expr length "$2" by expr " $2" : '.*' - 1.
Updated introduction with isaac's comment.

This seems to work as well with ksh or POSIX syntax:

#!/usr/bin/env sh

# Space pad align or truncate string to width
# @params
# $1: The alignment width
# $2: The string to align
# @stdout
# The aligned string
# @return:
# 1: If the string was truncated to the alignment width
# 2: If missing arguments
__align_check ()
{
  if [ $# -ne 2 ]; then return 2; fi
  if [ "$(expr " $2" : '.*' - 1)" -gt "$1" ]; then
    # Note: expr substr is an extension, not specified by POSIX
    printf '%s' "$(expr substr "$2" 1 "$1")"
    return 1
  fi
}
align_left ()
{
  __align_check "$@" || return $?
  printf '%s%*s' "$2" $(($1-$(expr " $2" : '.*' - 1))) ''
}
align_right ()
{
  __align_check "$@" || return $?
  printf '%*s%s' $(($1-$(expr " $2" : '.*' - 1))) '' "$2"
}
align_center ()
{
  __align_check "$@" || return $?
  tpl=$(($1-$(expr " $2" : '.*' - 1)))
  lpl=$((tpl/2))
  rpl=$((tpl-lpl))
  printf '%*s%s%*s' $lpl '' "$2" $rpl ''
}

main ()
{
  hr="+----------------------+----------------------+----------------------\
+------+"
  echo "$hr"
  printf '| %s | %s | %s | %s |\n' \
    "$(align_left 20 'Left-aligned')" \
    "$(align_center 20 'Center-aligned')" \
    "$(align_right 20 'Right-aligned')" \
    "$(align_center 4 'RC')"
  echo "$hr"

  for str
  do
    printf '| %s | %s | %s | %s |\n' \
      "$(align_left 20 "$str")" \
      "$(align_center 20 "$str")" \
      "$(align_right 20 "$str")" \
      "$(align_right 4 "$?")"
  done
  echo "$hr"
}

main \
  'Früchte und Gemüse' \
  'Milchprodukte' \
  '12345678901234567890' \
  'This string is much too long'

Output:

+----------------------+----------------------+----------------------+------+
| Left-aligned         |    Center-aligned    |        Right-aligned |  RC  |
+----------------------+----------------------+----------------------+------+
| Früchte und Gemüse   |  Früchte und Gemüse  |   Früchte und Gemüse |    0 |
| Milchprodukte        |    Milchprodukte     |        Milchprodukte |    0 |
| 12345678901234567890 | 12345678901234567890 | 12345678901234567890 |    0 |
| This string is much  | This string is much  | This string is much  |    1 |
+----------------------+----------------------+----------------------+------+
