Why the pattern match doesn't work as expected
You commented on July 6th:
but I don't understand still why the ? cannot work totally and give wholestring.
As MarcoB already quoted:
In a form such as __?test, every element in the sequence matched by __ must yield True when test is applied.
You can easily see for yourself that this is true.
words = {"is", "a", "problem"};
StringCases["What is the best approach to a problem?", __?(MemberQ[words, #] &)]
{"a", "a", "a", "a"}
More explicitly, we can use Print or Sow as the test function to see exactly which expressions are being tested:
Reap[ StringCases["Mathematica", __?Sow] ][[2, 1]]
{"M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "t", "t", "t", "t", "t", "t", "t", "t", "t", "h", "h", "h", "h", "h", "h", "h", "h", "e", "e", "e", "e", "e", "e", "e", "m", "m", "m", "m", "m", "m", "a", "a", "a", "a", "a", "t", "t", "t", "t", "i", "i", "i", "c", "c", "a"}
Observe that:
- Only single-letter strings are ever tested
- 66 matches are attempted because every test fails (11 + 10 + 9 + ... + 1): Sow returns its argument, a string, and never True
The first point is actually very useful behavior, and I direct you to my own answer, "Using a PatternTest versus a Condition for pattern matching", for additional examples.
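A minimal sketch of why per-character testing is handy, using a made-up input string rather than anything from the linked answer: when the test can succeed, __?test picks out runs in which every single character passes.

```mathematica
(* Each character is handed to LetterQ on its own;
   a run matches only while every character passes the test *)
StringCases["abc123def", __?LetterQ]
(* expected: {"abc", "def"} *)
```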
The second point is the deleterious consequence of the extremely flexible pattern matching used in Mathematica, which allows the test function itself to be stateful. I personally feel that a more efficient matching scheme should be available as an alternative, as many uses do not require this level of generality.
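To make the "stateful test" remark concrete, here is a small sketch with a hypothetical counter named calls. It assumes the matcher behaves as in the Sow experiment above, in which case the count should agree with the closed-form sum 11 + 10 + ... + 1.

```mathematica
(* calls is an illustrative global counter, not part of the original code *)
calls = 0;

(* The test increments the counter and always fails, so no match is ever found *)
StringCases["Mathematica", __?((calls++; False) &)];

calls        (* number of times the test was invoked; 66 if it mirrors the Sow run *)

n = StringLength["Mathematica"];   (* 11 *)
n (n + 1)/2                        (* 11 + 10 + ... + 1 = 66 *)
```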
Contrast this with Condition (short form /;):
Reap[ StringCases["Mathematica", x__ /; Sow[x]] ][[2, 1]]
{"Mathematica", "Mathematic", "Mathemati", "Mathemat", "Mathema", "Mathem",
"Mathe", "Math", "Mat", "Ma", "M", "athematica", "athematic", "athemati",
"athemat", "athema", "athem", "athe", "ath", "at", "a", "thematica", "thematic",
"themati", "themat", "thema", "them", "the", "th", "t", "hematica", "hematic",
"hemati", "hemat", "hema", "hem", "he", "h", "ematica", "ematic", "emati", "emat",
"ema", "em", "e", "matica", "matic", "mati", "mat", "ma", "m", "atica", "atic",
"ati", "at", "a", "tica", "tic", "ti", "t", "ica", "ic", "i", "ca", "c", "a"}
Here we see that every possible alignment is tried, with the entire candidate sequence passed to the test function each time.
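The practical difference shows up as soon as the test depends on the whole candidate. The following sketch uses an arbitrary length-4 test chosen purely for illustration: with Condition the test sees each complete substring, whereas with PatternTest it only ever sees single characters and therefore cannot succeed.

```mathematica
(* Condition: x is bound to the whole candidate substring *)
StringCases["Mathematica", x__ /; StringLength[x] == 4]
(* expected: {"Math", "emat"} *)

(* PatternTest: the test is applied to one character at a time,
   and a single character never has length 4 *)
StringCases["Mathematica", __?(StringLength[#] == 4 &)]
(* expected: {} *)
```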
I suspect that your test doesn't work because, according to the docs for PatternTest, "In a form such as __?test, every element in the sequence matched by __ must yield True when test is applied."
Instead, a conditional pattern using /; will work as I think you intended with your definition of the words wordlist:
StringCases[ToLowerCase@string, word__ /; MemberQ[words, word]]
(* Out: {"what", "is", "thebes", "tap", "pro", "a", "c", "h", "to", "a", "problem", "like",
"this", "in", "math", "em", "at", "ic", "a"} *)
Nevertheless, I'd suggest a bit of cleanup of the word list:
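words = DeleteDuplicates@
  Select[
    DeleteCases[
      DeleteMissing@words,  (* drop Missing[...] entries *)
      string_ /; StringContainsQ[string, "'" | "-"]  (* drop words with apostrophes or hyphens *)
    ],
    StringLength[#] > 1 &  (* drop single-character entries *)
  ];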
StringCases[ToLowerCase@string, word__ /; MemberQ[words, word]]
(* Out: {"what", "is", "thebes", "tap", "pro", "to", "problem", "like",
"this", "in", "math", "em", "at", "ic"} *)
... and an easier way to look for all possible matches:
StringCases[ToLowerCase@string, words, Overlaps -> True]
(* Out:
{"what", "ha", "hat", "at", "ti", "tis", "is", "the", "thebe", "thebes", "he", "be",
"best", "es", "ta", "tap", "appro", "approach", "pro", "roach", "to", "pro", "problem",
"rob", "roble", "em", "ml", "li", "like", "et", "this", "hi", "his", "is", "si", "sin",
"in", "nm", "ma", "mat", "math", "at", "the", "them", "thematic", "he", "hem", "hematic",
"em", "ma", "mat", "at", "ti", "tic", "ic"}
*)
These approaches will still run into trouble, though. String matching is greedy by default, which is not always good: for instance, instead of "the best", the underlying sequence is interpreted greedily as "thebes" + "tap".
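The greedy failure can be reproduced on a small scale. In this sketch, demoString and demoWords are hypothetical stand-ins, not the original string and word list; the matcher takes the longest dictionary entry available at each position, so "thebes" wins over "the" and the rest of the sentence is misparsed.

```mathematica
(* Hypothetical miniature inputs, chosen only to trigger the problem *)
demoString = "thebestapproach";
demoWords = {"the", "thebes", "best", "tap", "approach"};

(* Longest candidate first: "thebes" is taken, leaving "tapproach", which yields "tap" *)
StringCases[demoString, word__ /; MemberQ[demoWords, word]]
(* expected: {"thebes", "tap"} *)
```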
I really don't know whether one could simply switch between greedy and lazy matching as appropriate without writing a full-fledged natural language recognition engine. If you come up with anything of the sort, quite a few people would be very interested...