Converting grammar to Chomsky Normal Form?

Alternative answer: The grammar can only produce a finite number of strings, namely 6.

 S -> aabbbaab | aabbb | bbaab | bb | abbaab | abb.

You can now condense this back to Chomsky Normal Form by hand.


By substitution, we can find the set of all strings produced. Your initial rules:

S -> AB | aB.
A -> aab | lambda.
B -> bbA.

First split up the S rule:

S -> AB.
S -> aB.

Now substitute what A and B expand into:

S -> AB
  -> (aab | lambda) bbA
  -> (aab | lambda) bb (aab | lambda).
S -> aB
  -> abbA
  -> abb (aab | lambda).

Expand these again to get:

S -> aabbbaab.
S -> aabbb.
S -> bbaab.
S -> bb.
S -> abbaab.
S -> abb.

To change this finite set to Chomsky Normal Form, it suffices to do it by brute force without any intelligent factoring. First we introduce two terminal rules:

X -> a.
Y -> b.

Now for each string, we consume the first letter with a terminal variable and the remaining letters with a new variables. For example, like this:

S -> aabbb. (initial rule, not in Chomsky Normal Form)

S -> XC, where X->a and C->abbb.
C -> XD, where X->a and D->bbb.
D -> YE, where Y->b and E->bb.
E -> YY, where Y->b and Y->b.

We just go through this process for all 6 strings, generating a lot of new intermediate variables.


Wikipedia says:

In computer science, a context-free grammar is said to be in Chomsky normal form if all of its production rules are of the form:

  • A -> BC, or
  • A -> α, or
  • S -> ε

where A, B, C are nonterminal symbols, α is a terminal symbol, S is the start symbol, and ε is the empty string. Also, neither B nor C may be the start symbol.

Continuing your work:

S0 -> S
S -> AB | aB | B
A -> aab
B -> bbA | bb

Instead of using | to denote different choices, split a rule into multiple rules.

S0 -> S
S -> AB
S -> aB
S -> B
A -> aab
B -> bbA
B -> bb

Create new rules Y -> a and Z -> b because we will need them soon.

S0 -> S
S -> AB
S -> aB
S -> B
A -> aab
B -> bbA
B -> bb
Y -> a
Z -> b

S -> aB is not of the form S -> BC because a is a terminal. So change a into Y:

S0 -> S
S -> AB
S -> YB
S -> B
A -> aab
B -> bbA
B -> bb
Y -> a
Z -> b

Do the same for the B -> bb rule:

S0 -> S
S -> AB
S -> YB
S -> B
A -> aab
B -> bbA
B -> ZZ
Y -> a
Z -> b

For A -> aab, create C -> YY; for B -> bbA, create D -> ZZ:

S0 -> S
S -> AB
S -> YB
S -> B
A -> CZ
C -> YY
B -> DA
D -> ZZ
B -> ZZ
Y -> a
Z -> b

For S -> B, duplicate the one rule where S occurs on the right hand side and inline the rule:

S0 -> B
S0 -> S
S -> AB
S -> YB
A -> CZ
C -> YY
B -> DA
D -> ZZ
B -> ZZ
Y -> a
Z -> b

Deal with the rules S0 -> B and S0 -> S by joining the right hand side to the left hand sides of other rules. Also, delete the orphaned rules (where the LHS symbol never gets used on RHS):

S0 -> DA
S0 -> ZZ
S0 -> AB
S0 -> YB
A -> CZ
C -> YY
B -> DA
D -> ZZ
B -> ZZ
Y -> a
Z -> b

And we're done. Phew!


Without getting into too much theory and proofs(you could look at this in Wikipedia), there are a few things you must do when converting a Context Free Grammar to Chomsky Normal Form, you generally have to perform four Normal-Form Transformations. First, you need to identify all the variables that can yield the empty string(lambda/epsilon), directly or indirectly - (Lambda-Free form). Second, you need to remove unit productions - (Unit-Free form). Third, you need to find all the variables that are live/useful (Usefulness). Four, you need to find all the reachable symbols (Reachable). At each step you might or might not have a new grammar. So for your problem this is what I came up with...


Context-Free Grammar

G(Variables = { A B S }
Start = S 
Alphabet = { a b lamda}

Production Rules = { 
S  ->  |  AB  |  aB  |  
A  ->  |  aab  |  lamda  |  
B  ->  |  bbA  |   } )

Remove lambda/epsilon

ERRASABLE(G) = { A }

G(Variables = { A S B }
Start = S
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  aB  |  B  | 
B  ->  |  bbA  |  bb  |   } )

Remove unit produtions

UNIT(A) { A }
UNIT(B) { B }
UNIT(S) { B S }
G (Variables = { A B S }
Start = S 
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  aB  |  bb  |  bbA  |  
A  ->  |  aab  |  
B  ->  |  bbA  |  bb  |   })

Determine live symbols

LIVE(G) = { b A B S a }

G(Variables = { A B S }
Start = S
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  aB  |  bb  |  bbA  |  
A  ->  |  aab  |  
B  ->  |  bbA  |  bb  |   })

Remove unreachable

REACHABLE (G) = { b A B S a }
G(Variables = { A B S }
Start = S 
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  aB  |  bb  |  bbA  |  
A  ->  |  aab  |  
B  ->  |  bbA  |  bb  |   })

Replace all mixed strings with solid nonterminals

G( Variables = { A S B R I }
Start = S
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  RB  |  II  |  IIA  |  
A  ->  |  RRI  |  
B  ->  |  IIA  |  II  |  
R  ->  |  a  |  
I  ->  |  b  |   })

Chomsky Normal Form

G( Variables = { V A B S R L I Z }
Start = S 
Alphabet = { a b }

Production Rules = { 
S  ->  |  AB  |  RB  |  II  |  IV  |  
A  ->  |  RL  |  
B  ->  |  IZ  |  II  |  
R  ->  |  a  |  
I  ->  |  b  |  
L  ->  |  RI  |  
Z  ->  |  IA  |  
V  ->  |  IA  |   })