Whitespace Matching Regex - Java

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = "";
if (matcher.find()) {
    result = matcher.replaceAll(" ");
}

System.out.println(result);

For Java (not php, not javascript, not anyother):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.

Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-comforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

Whitespace Matching Regex - Java

Tags:

Java

Regex

Whitespace

Related

Recent Posts