Algorithm to check for combining characters in Unicode

These are all the ranges of Unicode code points (in hexadecimal) whose names contain the word 'combining' (e.g. U+0301 COMBINING ACUTE ACCENT):

300-36F
483-489
7EB-7F3
135F-135F
1A7F-1A7F
1B6B-1B73
1DC0-1DE6
1DFD-1DFF
20D0-20F0
2CEF-2CF1
2DE0-2DFF
3099-309A
A66F-A672
A67C-A67D
A6F0-A6F1
A8E0-A8F1
FE20-FE26
101FD-101FD
1D165-1D169
1D16D-1D172
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
1D242-1D244

I compiled this list with a Python script using the unicodedata module. I don't know exactly which version of Unicode it corresponds to (unicodedata.unidata_version would report it), but I think it's reasonably up to date.
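For anyone who wants to regenerate such a list on the Java side, here is a rough sketch of the same idea. It is not the Python script mentioned above; it swaps in Java's `Character.getName`, and the class and method names (`CombiningRanges`, `hasCombiningName`, `combiningRanges`) are made up for illustration. The ranges it prints depend on the Unicode version bundled with your JDK, so they may differ from the list above.

```java
import java.util.ArrayList;
import java.util.List;

public class CombiningRanges {
    // True if the code point's Unicode name contains the word COMBINING.
    static boolean hasCombiningName(int codePoint) {
        String name = Character.getName(codePoint); // null for unassigned code points
        return name != null && name.contains("COMBINING");
    }

    // Collects maximal runs of consecutive matching code points as "START-END" (hex).
    static List<String> combiningRanges() {
        List<String> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        // Iterate one past the last code point so that an open run gets closed.
        for (int cp = 0; cp <= Character.MAX_CODE_POINT + 1; cp++) {
            boolean match = cp <= Character.MAX_CODE_POINT && hasCombiningName(cp);
            if (match && start < 0) {
                start = cp;
            } else if (!match && start >= 0) {
                ranges.add(String.format("%X-%X", start, cp - 1));
                start = -1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        combiningRanges().forEach(System.out::println);
    }
}
```

Single code points come out as ranges such as 135F-135F, matching the format of the list above.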

However, I don't know whether covering only the characters that are 'combining' in the strict sense is enough for you, as Unicode also has 'modifier letters' and the like.
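To make that distinction concrete, here is a small sketch that looks at the general category instead of the name. It assumes Java's `Character.getType`; the class and helper names (`MarkCategories`, `isMark`, `isModifierLetter`) are invented for this example. It also shows that a name-based list can miss marks: U+0903 DEVANAGARI SIGN VISARGA is a spacing mark even though its name does not contain 'combining'.

```java
public class MarkCategories {
    // True for general categories Mn, Mc and Me (non-spacing, spacing and enclosing marks).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // True for general category Lm (modifier letters), which are letters, not marks.
    static boolean isModifierLetter(int codePoint) {
        return Character.getType(codePoint) == Character.MODIFIER_LETTER;
    }

    public static void main(String[] args) {
        // Acute accent, Devanagari visarga, enclosing circle, modifier letter small h, plain 'a'.
        int[] samples = {0x0301, 0x0903, 0x20DD, 0x02B0, 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X %s mark=%b modifierLetter=%b%n",
                cp, Character.getName(cp), isMark(cp), isModifierLetter(cp));
        }
    }
}
```

Whether modifier letters should count as 'combining' depends on what you are doing with the text; they do not attach to a base character the way marks do.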


OK I did hack up something similar recently. Enjoy!

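  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // \p{M} is any kind of 'mark', see http://stackoverflow.com/questions/29110887/detect-any-combining-character-in-java/29111105
  // Each match is either a base character followed by its marks, or a run of marks with no base character.
  private static final Pattern SPLIT_WITH_COMBINING_CHARS = Pattern.compile("(\\p{M}+|\\P{M}\\p{M}*)");

  public static List<String> stringToCharacterWithCombiningChars(String fullText) {
    Matcher matcher = SPLIT_WITH_COMBINING_CHARS.matcher(fullText);
    List<String> outGoing = new ArrayList<>();
    while (matcher.find()) {
      outGoing.add(matcher.group());
    }
    return outGoing;
  }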

Here is the accompanying (passing) unit test, in case it's useful to followers: https://gist.github.com/rdp/0014de502f37abd64ffd
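As a quick illustration of what the splitting pattern does, here is a self-contained sketch with a few sample inputs. It is not the linked unit test; the class and method names (`SplitDemo`, `split`) are only placeholders, and the method simply repeats the same regular expression so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {
    // Same pattern as above: a run of marks with no base, or one non-mark plus its trailing marks.
    private static final Pattern SPLIT = Pattern.compile("(\\p{M}+|\\P{M}\\p{M}*)");

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = SPLIT.matcher(text);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT stays together, 'a' is its own element.
        System.out.println(split("e\u0301a").size());       // 2
        // Several marks on one base character still form a single element.
        System.out.println(split("a\u0301\u0323b").size()); // 2
        // A leading mark with no base character becomes an element of its own.
        System.out.println(split("\u0301x").size());        // 2
    }
}
```

Note that this groups a base character with its marks only; it does not try to implement full grapheme cluster rules.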
