Separate elements of CamelCase words

If your grep implementation supports -o (and is not the ast-open implementation, which chokes on -o with regexps that match the empty string):

grep -o '[[:upper:]]*[[:lower:]]*'
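A quick way to see what this pattern does, and where it falls short, is to wrap it in a small function. This is only a sketch: the name split_camel is made up for illustration, and the comments describe what GNU grep prints; other implementations may handle the empty matches differently.

```shell
# Hypothetical wrapper around the command above (the name is an assumption).
split_camel() {
  grep -o '[[:upper:]]*[[:lower:]]*'
}

echo 'IamHelloTest forYou' | split_camel
# one word per line: Iam, Hello, Test, for, You

echo 'BARFull' | split_camel
# a run of capitals is not split: BARFull comes out as a single word
```

The second call shows the limitation: [[:upper:]]* greedily takes "BARF", so "BAR" and "Full" are never separated.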

With GNU grep, using Unicode character properties and zero-width assertions:

grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'

$ echo 'IamHelloTest forYou PickTest;' | grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'
Iam
Hello
Test
for
You
Pick
Test
$ echo 'АямГеллоТест форЮ ПикТест' | grep -Po '((?<!=\p{Lu})\p{Lu}|(?<!=\pL)\pL)\p{Ll}*'
Аям
Гелло
Тест
фор
Ю
Пик
Тест

To deal with your second example, I suggest a more "rule-based" approach. Consider the following Perl script (camelcaseproc):

#!/usr/bin/perl -CSDA -lp

s{  \W+                                     # break on non-word
 |  _                                       # break on "_"
 |  (?<=\p{Ll})(?=\p{Lu})                   # ...aB... → ...a-B...
 |  (?<=\p{Lu})(?=\p{Lu}\p{Ll})             # ..ABCd.. → ...AB-Cd.
 |  (?<=I)(?=am)                            # exceptions rules
 }{-}xg                                     #

Line 1: use Unicode (to process accents, Cyrillic)
\W+ and _ rules: replace runs of non-word characters, and underscores, with "-"
aB and ABCd rules: intra-word break rules (defined by left context, right context)
(?<=I)(?=am) rule: exception rule for "Iam"
x option: makes it possible to add comments in regular expressions

After the usual chmod +x camelcaseproc (and with the script somewhere in your PATH), we can use it as:

$ camelcaseproc <<< "IamTestECHO TEST PickFoo BARFull"
I-am-Test-ECHO-TEST-Pick-Foo-BAR-Full

$ camelcaseproc input-file

$ echo "IamTestECHO TEST PickFoo BARFull" | camelcaseproc
