How can I detect certain Unicode characters in a string in Ruby?

I've written a little gem that packages up the approach in steenslag's answer above:

https://github.com/jpatokal/script_detector

It can also take a stab at differentiating between Japanese, Korean, simplified Chinese and traditional Chinese, although due to the complexities of Han unification it only works reliably with large slabs of text.

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}

#true
#true
#true
#false

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. Ruby Regexp source .

Given my Ruby 1.8.7 constraint, this is the best I could do:

class String
  CJKV_RANGES = [
      (0xe2ba80..0xe2bbbf),
      (0xe2bfb0..0xe2bfbf),
      (0xe38080..0xe380bf),
      (0xe38180..0xe383bf),
      (0xe38480..0xe386bf),
      (0xe38780..0xe387bf),
      (0xe38880..0xe38bbf),
      (0xe38c80..0xe38fbf),
      (0xe39080..0xe4b6bf),
      (0xe4b780..0xe4b7bf),
      (0xe4b880..0xe9bfbf),
      (0xea8080..0xea98bf),
      (0xeaa080..0xeaaebf),
      (0xeaaf80..0xefbfbf),
  ]

  def contains_cjkv?
    each_char do |ch|
      return true if CJKV_RANGES.any? {|range| range.member? ch.unpack('H*').first.hex }
    end
    false
  end
end


strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each {|s| puts s.contains_cjkv? }

#true
#true
#true
#false

Pretty hacktacular, but it works. It actually detects a variety of Indic scripts as well, so it should probably really be called contains_asian?

Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.

How can I detect certain Unicode characters in a string in Ruby?

Tags:

Ruby

Unicode

Encoding

Character Encoding

Cjk

Related

Recent Posts