Remove the Salutations
Retina, 72.8% (old test battery; was 68%), 77.5% (new test battery; was 74.8%)
i`^h(a[iy]|eya?|i(h?i|ya|)|ello)[ ,]+
T`l`L`^.
Try it online!
Edit: Gained 4.8% (old) and 2.7% (new) coverage with help from @MartinEnder's tips.
GNU sed, 100% (was 78%)
/^\w*[wd]\b/!s/^[dghs][eruaio]\w*\W\+//i
s/./\U&/
(49 bytes)
The test battery is quite limited, so we can count which words appear first on each line:
$ sed -e 's/[ ,].*//' inputs.txt | sort | uniq -ic
40 aight
33 alright
33 dear
33 g'd
41 good
36 greetings
35 guys
31 hai
33 hay
27 hello
33 hey
37 heya
43 hi
34 hihi
29 hii
35 hiya
45 hola
79 how
37 howdy
33 kowabunga
39 salutations
32 speak
34 sweet
40 talk
36 wassup
34 what's
38 yo
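For readers who prefer to see the counting step spelled out, here is a minimal Python sketch of the same idea as the sed | sort | uniq pipeline above. The function name first_word_counts and the sample lines are hypothetical and are not taken from the actual test battery.

```python
import re
from collections import Counter

def first_word_counts(lines):
    """Count the first word of each line, ignoring case (like uniq -ic)."""
    counts = Counter()
    for line in lines:
        # Like s/[ ,].*//: keep only what precedes the first space or comma.
        first = re.split(r"[ ,]", line.rstrip("\n"), maxsplit=1)[0]
        counts[first.lower()] += 1
    return counts

# Hypothetical sample input, not the real inputs.txt.
sample = ["Hello, world", "hello there", "How are you", "Yo"]
counts = first_word_counts(sample)
for word, n in sorted(counts.items()):
    print(n, word)
```

To run it against the real battery, the lines of inputs.txt could be passed in place of the sample list.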
The salutations to be removed begin with d, g, h or s (or uppercase versions thereof); the non-salutations beginning with those letters are:
33 g'd
41 good
79 how
32 speak
34 sweet
Ignoring lines where they appear alone, that's 220 false positives. As a first attempt, let's simply remove initial words beginning with any of those four letters.
When we see an initial word beginning with any of those (/^[dghs]\w*), case-insensitively (/i), and followed by at least one non-word character (\W\+), we replace it with an empty string. Then we replace the first character with its uppercase equivalent (s/./\U&/).
That gives us
s/^[dghs]\w*\W\+//i
s/./\U&/
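The behaviour of this first version, including its false positives, can be sketched in Python. This is an approximation under the assumption that Python's \w and \W with re.ASCII behave like GNU sed's on this input; the function name strip_salutation_naive is hypothetical.

```python
import re

def strip_salutation_naive(line):
    """Approximate the two sed commands on one line (no trailing newline)."""
    # s/^[dghs]\w*\W\+//i : drop a leading word starting with d, g, h or s,
    # plus the non-word characters that follow it.
    line = re.sub(r"^[dghs]\w*\W+", "", line, count=1,
                  flags=re.IGNORECASE | re.ASCII)
    # s/./\U&/ : uppercase the first remaining character.
    return line[:1].upper() + line[1:]

print(strip_salutation_naive("hello, world"))   # salutation removed
print(strip_salutation_naive("How are you?"))   # false positive: "How" is removed too
```

A line consisting of a single word such as "hi" is left alone (apart from capitalisation), because nothing follows it for \W\+ to match.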
We can now refine this a bit:
The largest set of false positives is how, so we make the substitution conditional by prefixing with a negative test: /^[Hh]ow\b/!
We can also filter on the second letter, to eliminate g'd, speak and sweet: s/^[dghs][eruaio]\w*\W\+//i
That leaves only good as a false positive. We can adjust the prefix test to eliminate words ending in either w or d: /^\w*[wd]\b/!
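To check the refined rule against the words listed earlier, here is a hedged Python sketch of the final two-line sed script. It assumes Python's \w, \W and \b with re.ASCII are close enough to GNU sed's for this input; the names SKIP, SALUTATION and strip_salutation are hypothetical.

```python
import re

# /^\w*[wd]\b/! : skip lines whose first word ends in w or d (case-sensitive,
# as in the sed address, which has no I flag).
SKIP = re.compile(r"^\w*[wd]\b", re.ASCII)
# s/^[dghs][eruaio]\w*\W\+//i : first letter d/g/h/s, second letter from the
# listed set, then the rest of the word and the separators after it.
SALUTATION = re.compile(r"^[dghs][eruaio]\w*\W+", re.IGNORECASE | re.ASCII)

def strip_salutation(line):
    """Approximate the final sed script on one line (no trailing newline)."""
    if not SKIP.search(line):
        line = SALUTATION.sub("", line, count=1)
    # s/./\U&/ : uppercase the first remaining character.
    return line[:1].upper() + line[1:]

for text in ["Hello, world", "How are you?", "good morning",
             "howdy partner", "speak up", "g'd day"]:
    print(repr(text), "->", repr(strip_salutation(text)))
```

Note that howdy is still removed: the w and d in it are not followed by a word boundary, so the negative test does not fire.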
Demonstration
$ diff -u <(./123478.sed inputs.txt) replaced.txt | grep ^- | wc -l
0
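The same check can be sketched in Python as a line-by-line comparison. This is only a rough stand-in for the diff pipeline above: it assumes both files have the same number of lines, and the function name count_mismatches and the sample data are hypothetical.

```python
def count_mismatches(actual_lines, expected_lines):
    """Count positions where the produced line differs from the expected one."""
    return sum(a != e for a, e in zip(actual_lines, expected_lines))

# Hypothetical data standing in for the script's output and replaced.txt.
actual = ["World", "How are you?", "Partner"]
expected = ["World", "How are you?", "Partner"]
mismatches = count_mismatches(actual, expected)
print(mismatches)
```

A result of zero corresponds to the empty diff shown in the demonstration.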
PHP, 60.6%
50 Bytes
<?=ucfirst(preg_replace("#^[dh]\w+.#i","",$argn));
Try it online!
PHP, 59.4%
49 Bytes
<?=ucfirst(preg_replace("#^h\w+,? #i","",$argn));
Try it online!
PHP, 58.4%
50 Bytes
<?=ucfirst(preg_replace("#^[gh]\w+.#i","",$argn));
Try it online!