Separate elements of CamelCase words
If your grep implementation supports -o (and is not the ast-open implementation, which chokes on -o with regexps that match the empty string):
grep -o '[[:upper:]]*[[:lower:]]*'
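As a quick check of the pattern above, here is a minimal sketch that wraps it in a shell function. The name split_camel is made up for illustration, and the sketch assumes GNU grep, which silently skips the empty matches:

```shell
# split_camel: hypothetical helper, prints one CamelCase element per line.
# Assumes a grep whose -o skips empty matches (GNU grep does).
split_camel() {
    grep -o '[[:upper:]]*[[:lower:]]*'
}

printf '%s\n' 'IamHelloTest forYou' | split_camel
# Iam
# Hello
# Test
# for
# You
```

Note that a run of capitals stays glued to the word that follows it: ECHOTest comes out as a single element, because [[:upper:]]* consumes ECHOT before [[:lower:]]* takes est.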
With GNU grep, using Unicode character properties and zero-width assertions:
grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'
$ echo 'IamHelloTest forYou PickTest;' | grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'
Iam
Hello
Test
for
You
Pick
Test
$ echo 'АямГеллоТест форЮ ПикТест' | grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'
Аям
Гелло
Тест
фор
Ю
Пик
Тест
To deal with your second example, I suggest a more "rule-based" approach.
Consider the following Perl script (camelcaseproc):
#!/usr/bin/perl -CSDA -lp
s{ \W+                          # break on non-word characters (-l has already stripped the newline)
 | _                            # break on "_"
 | (?<=\p{Ll})(?=\p{Lu})        # ...aB...   → ...a-B...
 | (?<=\p{Lu})(?=\p{Lu}\p{Ll})  # ...ABCd... → ...AB-Cd...
 | (?<=I)(?=am)                 # exception rule: Iam → I-am
}{-}xg
- Line 1: -CSDA enables Unicode (UTF-8) handling of input, output and arguments (to process accents, Cyrillic)
- Lines 2, 3: replace runs of non-word characters, and "_", with "-"
- Lines 4, 5: intra-word break rules (each defined by a left context and a right context)
- Line 6: exception rule for "Iam"
- Line 7: the x option makes it possible to add whitespace and comments inside regular expressions
After the usual chmod +x camelcaseproc, we can use it as:
$ camelcaseproc <<< "IamTestECHO TEST PickFoo BARFull"
I-am-Test-ECHO-TEST-Pick-Foo-BAR-Full
$ camelcaseproc input-file
$ echo "IamTestECHO TEST PickFoo BARFull" | camelcaseproc