Corpus Analysis for Salesforce?
I won't share the Python script I used here, because it is not a pretty thing, but I can describe my approach. In essence, all I did was count how often sequences of consecutive words occur. I played around with the expressions a fair amount and ended up settling on the following:
expression = '.*'.join([r"[\w'@]+"] * word_count)
one word: [\w'@]+
two words: [\w'@]+.*[\w'@]+
three words: [\w'@]+.*[\w'@]+.*[\w'@]+
etc.
I ran up to 7 words, but only found useful data from 1-5.
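As a minimal sketch of how that join expands, the snippet below rebuilds the pattern for one to three words. The function name `build_expression` and its `separator` parameter are hypothetical and are not taken from the original script.

```python
def build_expression(word_count: int, separator: str = ".*") -> str:
    """Join `word_count` copies of the word token with `separator`."""
    word = r"[\w'@]+"  # word characters, apostrophes and @ signs
    return separator.join([word] * word_count)


# Print the patterns for one, two and three words.
for n in range(1, 4):
    print(n, build_expression(n))
```

Because `.*` is greedy, a multi-word pattern can match far more than adjacent words. The `separator` parameter is there so that a stricter value such as `\s+` can be tried.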
From this expression, I generated a set of all results for each email body. Then I counted how many emails a given set element appears in for each category. This gave me a basic data structure like:
| phrase | support | billing |
|---|---|---|
| from | 53595 | 16514 |
| message | 41649 | 15372 |
| your | 41493 | 16534 |
| this | 37288 | 13067 |
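The counting step can be sketched roughly as follows for the single-word case. The sample emails are made up, `document_frequency` is a hypothetical name, and lowercasing the phrases is an assumption (the later expression uses the case-insensitive flag, so it seems plausible).

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[\w'@]+")  # the one-word expression


def document_frequency(emails):
    """emails: iterable of (category, body) pairs.

    Returns {category: Counter mapping phrase -> number of emails containing it}.
    """
    counts = defaultdict(Counter)
    for category, body in emails:
        # A set, so a phrase counts once per email however often it repeats.
        phrases = {match.lower() for match in WORD.findall(body)}  # lowercasing is an assumption
        counts[category].update(phrases)
    return counts


# Made-up sample data, for illustration only.
sample = [
    ("support", "Message from your device: error"),
    ("support", "Error message, message again"),
    ("billing", "Your invoice from us"),
]
counts = document_frequency(sample)
print(counts["support"]["message"], counts["billing"]["from"])
```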
Not super useful. But, I know that support has 91.2k records and billing has 31.2k records, so I can make this a little more valuable by adding the percentages.
| phrase | support | billing | support % | billing % |
|---|---|---|---|---|
| from | 53595 | 16514 | 58.77% | 52.93% |
| message | 41649 | 15372 | 45.67% | 49.27% |
| your | 41493 | 16534 | 45.50% | 52.99% |
| this | 37288 | 13067 | 40.89% | 41.88% |
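The percentages are just each count divided by the number of records in its category. A short sketch using the counts and the rounded totals given above (91.2k and 31.2k); the names `TOTALS` and `share` are hypothetical:

```python
# Rounded record totals given in the text; the exact totals may differ slightly.
TOTALS = {"support": 91_200, "billing": 31_200}

COUNTS = {
    "from": {"support": 53595, "billing": 16514},
    "message": {"support": 41649, "billing": 15372},
    "your": {"support": 41493, "billing": 16534},
    "this": {"support": 37288, "billing": 13067},
}


def share(count: int, total: int) -> float:
    """Percentage of emails in a category that contain the phrase."""
    return 100.0 * count / total


for phrase, by_category in COUNTS.items():
    cells = "  ".join(
        f"{category} {share(n, TOTALS[category]):.2f}%"
        for category, n in by_category.items()
    )
    print(f"{phrase:<8} {cells}")
```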
From there, I can compute the ratio of support % to billing % and vice versa, and use this metric to estimate predictive power.
| phrase | support % | billing % | support ratio | billing ratio |
|---|---|---|---|---|
| origin | 9.60% | 0.47% | 20.52 | 0.05 |
| persons | 8.07% | 0.70% | 11.55 | 0.09 |
| entities | 8.01% | 0.66% | 12.08 | 0.08 |
| retransmission | 7.99% | 0.58% | 13.69 | 0.07 |
| hesitate | 0.54% | 6.54% | 0.08 | 12.16 |
| postal | 0.03% | 5.28% | 0.01 | 155.39 |
| postallog | 0.00% | 5.17% | 0 | 1000 |
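A sketch of the ratio calculation follows. How the original script handled a zero denominator is not stated. Capping the ratio at 1000 is an assumption that happens to reproduce the postallog row, and the name `ratio` and its `cap` parameter are hypothetical.

```python
def ratio(numerator_pct: float, denominator_pct: float, cap: float = 1000.0) -> float:
    """Ratio of two percentages, capped so a zero denominator stays finite.

    The cap of 1000 is an assumption, not a documented rule of the original script.
    """
    if denominator_pct == 0:
        return cap if numerator_pct > 0 else 0.0
    return min(numerator_pct / denominator_pct, cap)


# postallog: 0.00% of support emails, 5.17% of billing emails
print(ratio(0.00, 5.17))  # support ratio
print(ratio(5.17, 0.00))  # billing ratio
```

Ratios computed from the rounded percentages in the table will differ slightly from the listed values, which were presumably computed from unrounded figures.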
I first kept every phrase whose ratio was over 10 for either category, but that ended up predicting billing quite poorly. So I raised the billing threshold and only used phrases with a billing ratio of at least 100.
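Applied to the rows shown above, that selection could look like this. It is only a sketch: the constant and function names are hypothetical, and the thresholds are the ones quoted in the text.

```python
# (phrase, support ratio, billing ratio), copied from the table above
ROWS = [
    ("origin", 20.52, 0.05),
    ("persons", 11.55, 0.09),
    ("entities", 12.08, 0.08),
    ("retransmission", 13.69, 0.07),
    ("hesitate", 0.08, 12.16),
    ("postal", 0.01, 155.39),
    ("postallog", 0.0, 1000.0),
]

SUPPORT_MIN = 10.0   # "over 10"
BILLING_MIN = 100.0  # "at least 100"


def select_predictors(rows):
    """Split phrases into support and billing predictors by ratio threshold."""
    support = [phrase for phrase, s_ratio, _ in rows if s_ratio > SUPPORT_MIN]
    billing = [phrase for phrase, _, b_ratio in rows if b_ratio >= BILLING_MIN]
    return support, billing


support_predictors, billing_predictors = select_predictors(ROWS)
print(support_predictors)
print(billing_predictors)
```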
I then used these expressions to categorize the existing data. My expression would just be an or join on all of the predictors, e.g. (?si)(origin|persons|entities). The results:
| Category | % Emails Matched | Accuracy |
|---|---|---|
| Support | 50.3% | 89.4% |
| Billing | 7.8% | 95.6% |
| Unmatched | 41.9% | 0% |
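To make the categorization and scoring concrete, here is a minimal sketch on made-up emails. The function names are hypothetical, the predictor lists are shortened, and checking support before billing when both could match is an assumption, since the tie-breaking rule is not described.

```python
import re


def build_matcher(predictors):
    """Or-join the predictors into one case-insensitive, dot-matches-newline pattern."""
    return re.compile("(?si)(" + "|".join(re.escape(p) for p in predictors) + ")")


# Shortened predictor lists; support is checked first (an assumption).
MATCHERS = {
    "support": build_matcher(["origin", "persons", "entities"]),
    "billing": build_matcher(["postal", "postallog"]),
}


def categorize(body: str) -> str:
    for category, matcher in MATCHERS.items():
        if matcher.search(body):
            return category
    return "unmatched"


def score(emails):
    """emails: list of (true_category, body).

    Returns {predicted: (share of all emails, accuracy within that bucket)}.
    """
    buckets = {}
    for truth, body in emails:
        predicted = categorize(body)
        hits, total = buckets.get(predicted, (0, 0))
        buckets[predicted] = (hits + (predicted == truth), total + 1)
    n = len(emails)
    return {cat: (total / n, hits / total) for cat, (hits, total) in buckets.items()}


# Made-up emails, for illustration only.
emails = [
    ("support", "The ORIGIN of this message is unknown"),
    ("support", "Named persons and entities"),
    ("billing", "Postal address on the invoice"),
    ("billing", "Origin of the charge"),
    ("support", "Hello there"),
]
result = score(emails)
for category, (matched, accuracy) in result.items():
    print(f"{category:<10} {matched:.1%} {accuracy:.1%}")
```

An unmatched email is never assigned its true category, which is why the unmatched row scores 0% accuracy.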