split a string in java into equal length substrings while maintaining word boundaries

If I understand your problem correctly then this code should do what you need (but it assumes that maxLenght is equal or greater than longest word)

String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.

Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(lets just remember that in Java \ is not only special in regex, but also in String literals, so to use predefined character sets like \d we need to write it as "\\d" because we needed to escape that \ also in string literal)

  • \G - is anchor representing end of previously founded match, or if there is no match yet (when we just started searching) beginning of string (same as ^ does)
  • \s* - represents zero or more whitespaces (\s represents whitespace, * "zero-or-more" quantifier)
  • (.{1,"+maxLenght+"}) - lets split it in more parts (at runtime :maxLenght will hold some numeric value like 10 so regex will see it as .{1,10})
    • . represents any character (actually by default it may represent any character except line separators like \n or \r, but thanks to Pattern.DOTALL flag it can now represent any character - you may get rid of this method argument if you want to start splitting each sentence separately since its start will be printed in new line anyway)
    • {1,10} - this is quantifier which lets previously described element appear 1 to 10 times (by default will try to find maximal amout of matching repetitions),
    • .{1,10} - so based on what we said just now, it simply represents "1 to 10 of any characters"
    • ( ) - parenthesis create groups, structures which allow us to hold specific parts of match (here we added parenthesis after \\s* because we will want to use only part after whitespaces)
  • (?=\\s|$) - is look-ahead mechanism which will make sure that text matched by .{1,10} will have after it:

    • space (\\s)

      OR (written as |)

    • end of the string $ after it.

So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that last character matched by .{1,10} is not part of unfinished word (there must be space or end of string after it).


Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:

private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        justifiedLine.append(words[i]).append(" ");
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}

Test:

String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));

Output:

Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.

It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version which just stops processing when it finds supercalifragilisticexpialidosus).

PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Tags:

Java

String