Python: How can I replace full-width characters with half-width characters?

The built-in unicodedata module can do it:

>>> import unicodedata
>>> foo = u'１２３４５６７８９０'
>>> unicodedata.normalize('NFKC', foo)
u'1234567890'

“NFKC” stands for “Normalization Form KC [Compatibility Decomposition, followed by Canonical Composition]”. It replaces full-width characters with their half-width counterparts, which Unicode treats as compatibility equivalents.

Note that it also normalizes many other things at the same time, such as separate accent marks (which are composed with their base letters) and Roman numeral symbols (which become plain Latin letters).
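To make those side effects concrete, here is a small Python 3 sketch. The sample strings and the helper name `nfkc` are my own illustrations, not part of the original answer:

```python
import unicodedata

def nfkc(s):
    # Hypothetical helper: thin wrapper around unicodedata.normalize.
    return unicodedata.normalize('NFKC', s)

samples = {
    'fullwidth digits 123': '\uff11\uff12\uff13',
    'Roman numeral four': '\u2163',
    'e + combining acute accent': 'e\u0301',
    'fi ligature': '\ufb01',
    'halfwidth katakana KA': '\uff76',
}

# ascii() shows the escaped code points, so the change is visible
# even in a terminal that cannot render the characters.
for label, text in samples.items():
    print(label, ascii(text), '->', ascii(nfkc(text)))
```

If any of these conversions is unwanted for your data, NFKC is probably too broad and a targeted translation table is the safer choice.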

In Python 3, you can use the following snippet. It builds a translation table from each fullwidth character to the corresponding printable ASCII character. Best of all, you don't need to hard-code the ASCII sequence, which is error-prone.

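 # Map fullwidth forms U+FF01..U+FF5E to ASCII 0x21..0x7E (offset 0xFEE0).
 FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
 # The ideographic space U+3000 stands in for a fullwidth ASCII space.
 FULL2HALF[0x3000] = 0x20

 def halfen(s):
     '''
     Convert full-width characters to their ASCII counterparts.
     '''
     return str(s).translate(FULL2HALF)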

The same logic works in the other direction, converting halfwidth (ASCII) characters to fullwidth:

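 # Map ASCII 0x21..0x7E to fullwidth forms U+FF01..U+FF5E (offset 0xFEE0).
 HALF2FULL = {i: i + 0xFEE0 for i in range(0x21, 0x7F)}
 # The ASCII space becomes the ideographic space U+3000.
 HALF2FULL[0x20] = 0x3000

 def fullen(s):
     '''
     Convert all ASCII characters to their full-width counterparts.
     '''
     return str(s).translate(HALF2FULL)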
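A quick round-trip check, as a sketch. It restates both tables so that it runs on its own (building `HALF2FULL` by inverting `FULL2HALF` is my shortcut, equivalent to the table above), and the sample string is an assumption for illustration:

```python
# Fullwidth -> ASCII table, as in the snippet above.
FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
FULL2HALF[0x3000] = 0x20

# The reverse table is just the inverse mapping.
HALF2FULL = {half: full for full, half in FULL2HALF.items()}

def halfen(s):
    return str(s).translate(FULL2HALF)

def fullen(s):
    return str(s).translate(HALF2FULL)

text = 'Hello, World! 123'   # illustrative sample
wide = fullen(text)

print(ascii(wide))            # escaped fullwidth code points
print(halfen(wide) == text)   # the round trip should restore the input
```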

Note: These two snippets only handle ASCII characters; they do not convert any Japanese or Korean fullwidth characters.
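The difference from the NFKC approach shows up on mixed input. In this sketch (the sample string is my own, chosen to include halfwidth katakana), the translation table leaves the katakana alone, while NFKC converts it to the standard katakana forms:

```python
import unicodedata

FULL2HALF = {i + 0xFEE0: i for i in range(0x21, 0x7F)}
FULL2HALF[0x3000] = 0x20

# Fullwidth "ABC", an ideographic space, then halfwidth katakana KA and NA.
mixed = '\uff21\uff22\uff23\u3000\uff76\uff85'

by_table = mixed.translate(FULL2HALF)            # only touches the ASCII range
by_nfkc = unicodedata.normalize('NFKC', mixed)   # also rewrites the katakana

print(ascii(by_table))
print(ascii(by_nfkc))
```

Which behaviour is right depends on the data: the table is the narrower tool, NFKC the broader one.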

For completeness, from Wikipedia:

Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."

Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters.

Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
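The quoted ranges can be spot-checked from Python with the `unicodedata` module. This is only a sanity-check sketch; the variable name `mismatches` and the two sample code points are my own choices:

```python
import unicodedata

# Every fullwidth form in U+FF01..U+FF5E should normalize (NFKC) to the
# ASCII character 0xFEE0 below it, which is the offset the tables rely on.
mismatches = [
    hex(cp) for cp in range(0xFF01, 0xFF5F)
    if unicodedata.normalize('NFKC', chr(cp)) != chr(cp - 0xFEE0)
]
print('mismatches:', mismatches)

# U+3000 plays the role of the fullwidth space.
print(unicodedata.name('\u3000'))

# East Asian Width property: 'F' = fullwidth, 'H' = halfwidth.
print(unicodedata.east_asian_width('\uff21'))  # FULLWIDTH LATIN CAPITAL LETTER A
print(unicodedata.east_asian_width('\uff76'))  # HALFWIDTH KATAKANA LETTER KA
```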

A Python 2 solution can be found at gist/jcayzac.


