How to implement a regular grammar?

The following code defines the function randomSentence. It uses recursive descent to generate a sentence from a grammar expresssed as rules:

ClearAll[randomSentence, randomCount]
randomCount[] := RandomVariate[GeometricDistribution[0.5]]

SetAttributes[randomSentence, HoldAll]
randomSentence[rules_, expr_] :=
  Module[{generate}
  , SetAttributes[generate, HoldAll]
  ; Replace[#[[1, 1]] :> #& /@ GatherBy[rules, First], (_ :> r_) :> r, {3}] /.
      (a_ :> {b___}) :> (generate[a] := generate[Alternatives[b]])
  ; generate[a_Alternatives] :=
      generate @@ RandomChoice[List @@ Hold /@ Unevaluated @ a]
  ; generate[(r:(Repeated|RepeatedNull))[e_]] :=
      Hold @@ Evaluate @ ConstantArray[0, randomCount[] + Boole[r===Repeated]] /.
        0 :> generate[e] /.
        Hold -> Composition[generate, List]
  ; generate[l_List] := StringJoin @@ generate /@ Unevaluated @ l
  ; generate[x:_[___]] := generate /@ Unevaluated @ x
  ; generate[x_] := x
  ; generate[expr]
  ]

Here is the species grammar, adapted (sic) to the form required by randomSentence:

ClearAll[f, g, h, i]
f[x___] := x <> "er";
g[x___] := x <> "ed";
h[x___] := x <> "ing";
i[x___] := x <> "y";

$rules =
  { "Species" :> "Animal" | "Plant"
  , "Species" :> f@"Action"
  , "Species" :> {"Color" | "Type", "-", "Species"}

  , "Attribute" :> "Type" | "Color"
  , "Attribute" :> {"Animal", "-", f@"Action"}
  , "Attribute" :> {"Color" | "Type", "-", g@"Part"}
  , "Attribute" :> i@"Plant"
  , "Attribute" :> {"Plant" | "Animal", "-", h@"Action"}

  , "Animal" :> "warble"|"shrew"|"whale"|"caiman"|"babuin"|"bat"|"bug"
  , "Plant" :> "bush"|"moss"|"fern"|"grass"|"squash"|"seed"
  , "Part" :> "back"|"head"|"finger"|"tail"|"ear"|"wing"|"thorn"
  , "Color" :> "black"|"red"|"white"|"blue"|"silver"|"crimson"|"dark"
  , "Type" :>  "long"|"cross"|"sharp"|"thick"|"heavy"|"fluffy"|"big"|"wild"
  , "Action" :> "jump"|"kill"|"stalk"|"sting"|"climb"|"crawl"|"eat"
  };

Here is a sample use of the function (note that there is no need to explicitly specify the terminals, non-terminals and functions):

Table[randomSentence[$rules, {{"Attribute", " "}..., "Species"}], {10}] // Column

bug
bat-climbing moss
red-heavy-cross-dark-thick-stalker
babuin
squashy heavy-killer
killer
seed
squashy white-fingered heavy-red-whale
climber
squashy thick-winged stinger

How It Works

The trickiest part of this problem is to make sure that none of the function expressions in the grammar are evaluated until they are needed -- and even then they may require some preprocessing of their arguments. To make this possible, randomSentence must not evaluate the rules or sentence form passed to it:

SetAttributes[randomSentence, HoldAll]
randomSentence[rules_, expr_] :=

We are going to use a helper function called generate. Again, it must not evaluate any expressions prematurely:

  Module[{generate}
  , SetAttributes[generate, HoldAll]

generate will be able to convert any grammar expression into a string. There are many types of grammar expressions. First, we will teach generate how to deal with each of the non-terminal symbols. We group all of the rules into lists, one for each nonterminal. Then we convert each of those groups into definitions for generate as if they had been specified using Alternatives in the original grammar:

  ; Replace[#[[1, 1]] :> #& /@ GatherBy[rules, First], (_ :> r_) :> r, {3}] /.
      (a_ :> {b___}) :> (generate[a] := generate[Alternatives[b]])

Alternatives expressions are processed by selecting a random alternative from the list and then applying generate to that choice:

  ; generate[a_Alternatives] :=
      generate @@ RandomChoice[List @@ Hold /@ Unevaluated @ a]

Repeated expressions are processed by generating the repeated expression a random number of times. RepeatedNull allows zero occurrences whereas Repeated will have at least one. The tortured logic in this operation is due to the need to make sure that each repetition is not evaluated before it has been duly interpreted as a grammar expression. Also, it is important to ensure that each repetition is generated independently:

  ; generate[(r:(Repeated|RepeatedNull))[e_]] :=
      Hold @@ Evaluate @ ConstantArray[0, randomCount[] + Boole[r===Repeated]] /.
        0 :> generate[e] /.
        Hold -> Composition[generate, List]

Each grammar expression in a list is evaluated independently and then the results are joined together:

  ; generate[l_List] := StringJoin @@ generate /@ Unevaluated @ l

Any function call must be made after the arguments have been individually generated:

  ; generate[x:_[___]] := generate /@ Unevaluated @ x

Anything else is passed unchanged (presumably strings):

  ; generate[x_] := x

Now that generate is defined, all that remains is to use it:

  ; generate[expr]
  ]

randomSentence uses the helper function randomCount to generate a random repetition counts. For this example, we are using a geometric distribution where the probability of each successive count is half that of its predecessor. Adjust this distribution to suit your taste.

randomCount[] := RandomVariate[GeometricDistribution[0.5]]

Since no full solution has arrived, I've created the code based on the different sources you have provided.

The function applyRulesRepeated scans through the initial sentence and replaces the first nonterminal symbol it founds while holding all external functions (f, g, ..., for suffixing or alike). Here I used the holding part of Archimedes' answer. This replacement is done one-by-one for each found nonterminal. At the end, when there are only terminals, I substitute these terminals with random choices from the appropriate lexicon variables. I refer to the variables by their symbolname (in $Replacements, solution given by WReach). In the full version, I use a specific function to choose the words by removing them from the original list to prevent repetition (a specific RandomChoice with HoldAll attribute), thus it was necessary to rely on such complicated methods of word-choice (I actually used the Trott-Strzebonski method in the more complicated version). Also I reverted rules to use List instead of StringExpression due to the fact that "a"~~"b" evaluates to "ab" which is unwanted (thanks Archimedes). I was also inspired by Daniel Lichtblau's parser and its more evolved version in Sal Mangano's cookbook.

Here it is, presented as a toy experiment with species names, I hope you will enjoy playing with/extending it:

Regular grammar producing hypothetic animal names

(* Lexicon of different types of words *)
animal = {"warble", "shrew", "whale", "caiman", "babuin", "bat", "bug"};
plant  = {"bush", "moss", "fern", "grass", "squash", "seed"};
part   = {"back", "head", "finger", "tail", "ear", "wing", "thorn"};
color  = {"black", "red", "white", "blue", "silver", "crimson", "dark"};
type   = {"long", "cross", "sharp", "thick", "heavy", "fluffy", "big", "wild"};
action = {"jump", "kill", "stalk", "sting", "climb", "crawl", "eat"};

(* symbols of the grammar *)
$Terminals = {"animal", "plant", "color", "part", "type", "action"};
$NonTerminals = {"Species", "Attribute"};

(* functions not to be evaluated during sentence-generation *)
f[x___] := x <> "er";
g[x___] := x <> "ed";
h[x___] := x <> "ing";
i[x___] := x <> "y";
$Functions = {f, g, h, i};

(* initial sentence specification *)
$Sentence = "S" :> {{"Attribute", " "} .., "Species"};

(* allowed replacements in the grammar *)
$Rules = {
   "Species" :> "animal" | "plant",
   "Species" :> f@"action",
   "Species" :> {"color" | "type", "Species"},

   "Attribute" :> "type" | "color",
   "Attribute" :> {"animal", f@"action"},
   "Attribute" :> {"color" | "type", g@"part"},
   "Attribute" :> i@"plant",
   "Attribute" :> {"plant" | "animal", h@"action"}
   };

(* define replacements for injecting actual words into sentence *)
$Replacements = # -> (ToExpression[#, InputForm, Hold] /. Hold[v_] :> Hold@RandomChoice@v) & /@ $Terminals;

(* parameters *)
$MaxRecursion = 100; (* max number of repeats when resolving Repeated(Null) *)
$MaxIteration = 2; (* max number of recursions when resolving nonterminals *)

(* Aux. function to replace only the first n occasion for each rule, successively *)
replaceEach[expr_, rep_List, args___] := Fold[replaceEach[#1, #2, args] &, expr, rep];
replaceEach[expr_, rule_[lhs_, rhs_], level_: {0, \[Infinity]}, n_: 1,
     opts : OptionsPattern[]] /; MemberQ[{Rule, RuleDelayed}, rule] :=
   ReplacePart[expr, rule[#, rhs] & /@ Position[expr, lhs, level, n, opts]];


(* applyRulesRepeated replaces nonterminals one-by-one in a recursive manner *)
applyRulesRepeated[pattern_] := applyRulesRepeated[pattern, 1];
applyRulesRepeated[pattern_, level_] := 
  Module[{patt, terminals, terminalReplace, nonTerminals, 
    nonTerminalReplace, temp},

   (* resolve : |, .., ... *)
   patt = pattern /. {
      (Verbatim@Alternatives)[x__] :> RandomChoice@List@x,
      (Verbatim@Repeated)[x_] :> Table[x, {RandomInteger@{1, $MaxIteration}}],
      (Verbatim@RepeatedNull)[x_] :> Table[x, {RandomInteger@{0, $MaxIteration}}]
      };

   (* find FIRST nonterminal symbol in sentence *)
   nonTerminals = 
    Cases[patt, _?(MemberQ[$NonTerminals, #] &), \[Infinity], 1];

   (* if no nonterminal was found (or max recursion level is reached) sentence is cosidered ready *)
   If[nonTerminals === {} \[Or] level === $MaxRecursion,
    (* if sentence is ready, replace terminals with actual words and return sentence AND chosen words *)
    terminals = Cases[patt, _?(MemberQ[$Terminals, #] &), \[Infinity]];
    terminalReplace = 
     Thread[terminals -> ReleaseHold[terminals /. $Replacements]];
    ReleaseHold@replaceEach[patt, terminalReplace]
    ,
    (* if there is still any unresolved nonterminal, use a rule on it *)
    nonTerminalReplace = 
     Thread[nonTerminals -> (nonTerminals /. 
         RandomChoice@
          Cases[$Rules, _?(MatchQ[First@nonTerminals, First@#] &)])];
    applyRulesRepeated[
     replaceEach[patt, nonTerminalReplace, {0, \[Infinity]}, 1], 
     level + 1]
    ]
   ];

(* Apply rules repeatedly while the necessary functions are held by Block *)
generateSentence[init_, term_, nonTerm_, rules_, {held___}] := 
  StringJoin@Block[{held}, applyRulesRepeated@Last@init];

Let's generate some random sentences:

Column@Table[
    generateSentence[$Sentence, $Terminals, $NonTerminals, $Rules, $Functions],
{10}]
crimson warble
fluffywinged ferny caiman
bugkiller shrew
whiteeared darkbat
grassclimbing seedy bug
thick whalekilling crimsonmoss
redwinged babuinstinger caiman
whaleeater whale
fluffy warblejumper babuin
caimankiller squash

As you can see, the output is sometimes nonsensical, sometimes quite realistic, but there was no semantic consideration behind the project. Each sentence is well-formed and complies to the (implicit) rules of the grammar. It is nice to realize though that (whale-)cannibalism is simply nature's way of recursion.


Does this help?

First, let's get rid of StringExpression because it acts as StringJoin and messes things up.

fix[expr_] := expr /. StringExpression -> List

initSentence = fix[initSentence]
rules = fix[rules]

I used initSentence = {g["A"], "B"} here.

Then let's gather the rules into Alternatives expressions:

gatherRules[rules_] := #[[1, 1]] -> Alternatives @@ #[[All, 2]] & /@ 
  GatherBy[rules, First]

define Alternatives temporarily to be RandomChoice (which I don't like, but | is indeed a convenient notation), and do a replacement:

repl[initSentence_, rules_, {heldFunctions___}] :=
 Block[{Alternatives = RandomChoice[{##}] &, heldFunctions},
  initSentence /. gatherRules[rules]
 ]

repl[initSentence, rules, {f,g}]

This will prevent f and g from evaluating util the replacement is done.

We can also apply the replacements several times:

Nest[repl[#, rules, {}] &, initSentence, 10]

The reason I didn't use //. is that once the expression gets big, it's very likely it'll never reach a terminal state.

Edit

If you want a function to evaluate in some rules, and not evaluate in other rules, then instead of this solution you can wrap them in some head to prevent evaluation, i.e. write H[f]["a","b"] instead of f["a", "b"]. Then when you are ready to evaluate them, just /. H -> Identity.