How to calculate syllables in text with regex and Java

This question is from a Java Course of UCSD, am I right?

I think you should provide enough information for this question, so that it won't confused people who want to offer some help. And here I have my own solution, which already been tested by the test case from the local program, also the OJ from UCSD.

You missed some important information about the definition of syllable in this question. Actually I think the key point of this problem is how should you deal with the e. For example, let's say there is a combination of te. And if you put te in the middle of a word, of course it should be counted as a syllable; However if it's at the end of a word, the e should be thought as a silent e in English, so it should not be thought as a syllable.

That's it. And I would like to write down my thought with some pseudo code:

  if(last character is e) {
        if(it is silent e at the end of this word) {
           remove the  silent e;
           count the rest part as regular;
        } else {
           count++;
  } else {
        count it as regular;
  }
}

You may find that I am not only using regex to deal with this problem. Actually I have thought about it: can this question really be done only using regex? My answer is: nope, I don't think so. At least now, with the knowledge UCSD gives us, it's too difficult to do that. Regex is a powerful tool, it can map the desired characters very fast. However regex is missing some functionality. Take the te as example again, regex won't be able to think twice when it is facing the word like teate (I made up this word just for example). If our regex pattern would count the first te as syllable, then why the last te not?

Meanwhile, UCSD actually have talked about it on the assignment paper:

If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.

The hint here is that, you should think this problem together with some loop, combining with regex.

OK, I should finally show my code now:

protected int countSyllables(String word)
{
    // TODO: Implement this method so that you can call it from the 
    // getNumSyllables method in BasicDocument (module 1) and 
    // EfficientDocument (module 2).
    int count = 0;
    word = word.toLowerCase();

    if (word.charAt(word.length()-1) == 'e') {
        if (silente(word)){
            String newword = word.substring(0, word.length()-1);
            count = count + countit(newword);
        } else {
            count++;
        }
    } else {
        count = count + countit(word);
    }
    return count;
}

private int countit(String word) {
    int count = 0;
    Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
    Matcher m = splitter.matcher(word);

    while (m.find()) {
        count++;
    }
    return count;
}

private boolean silente(String word) {
    word = word.substring(0, word.length()-1);

    Pattern yup = Pattern.compile("[aeiouy]");
    Matcher m = yup.matcher(word);

    if (m.find()) {
        return true;
    } else
        return false;
}

You may find that besides from the given method countSyllables, I also create two additional methods countit and silente. countit is for counting the syllables inside the word, silente is trying to figure it out that is this word end with a silent e. And it should also be noticed that the definition of not silent e. For example, the should be consider not silent e, while ate is considered silent e.

And here is the status my code has already passed the test, from both local test case and OJ from UCSD:

from local test case

And from OJ the test result:

from Coursera OJ

P.S: It should be fine to use something like [^aeiouy] directly, because the word is parsed before we call this method. Also change to lowercase is necessary, that would save a lot of work dealing with the uppercase. What we want is only the number of syllables. Talking about number, an elegant way is to define count as static, so the private method could directly use count++ inside. But now it's fine.

Feel free to contact me if you still don't get the method of this question :)


Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:

  • consecutive vowels are counted as 1 syllable. eg. "ae" "ou" are 1 syllable

  • Y is considered as a vowel

  • e at the end is counted as syllable if e is the only vowel: eg: "the" is one syllable, since "e" at the end is the only vowel while "there" is also 1 syllable because "e" is at the end and there is another vowel in the word.

     public int countSyllables(String word) {
         ArrayList<String> tokens = new ArrayList<String>();
         String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
         Pattern p = Pattern.compile(regexp);
         Matcher m = p.matcher(word.toLowerCase());
    
    
         while (m.find()) {
             tokens.add(m.group());
         }
    
     //check if e is at last and e is not the only vowel or not
         if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e")  )
             return tokens.size()-1; // e is at last and not the only vowel so total syllable -1 
         return tokens.size(); 
     }
    

This gives you a number of syllables vowels in a word:

public int getNumVowels(String word) {

    String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
    Pattern p = Pattern.compile(regexp);
    Matcher m = p.matcher(word.toLowerCase());

    int count = 0;

    while (m.find()) {
        count++;
    }
    return count;
}

You can call it on every word in your string array:

    String[] words = getText().split("\\s+");
    for (String word : words ) {
      System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
    }

Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One need to take into account combinations like ou, ui, oo, the final silent e and possibly something else. As I am not a native English speaker, I am not sure what the correct algorithm would be.