Text classification beyond keyword dependency and inferring the actual meaning
If the data you posted is representative of the classes you're trying to distinguish, keyword-based features might not be the most effective. It looks like some terms that are often treated as stop words will be very good cues as to what is Private and what is Public.
You mention pronouns; I think that's still a good avenue forward. If you're using unigram/bag-of-words features, make sure your vectorizer is not removing them.
Doing a count of instances of first-person pronouns (`I`, `my`, `I've`, `mine`) gives 13 for the Private case and 2 for the Public case.
The Public example has second-person pronouns (e.g. `you`) where the Private example doesn't. So features based on counts or smoothed ratios of first- to second-person pronouns may be effective.
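A minimal sketch of such features (the pronoun lists and the smoothing constant are my own illustrative choices; note also that sklearn's vectorizers keep stop words unless you explicitly pass a stop-word list):

```python
import re

# Illustrative pronoun lists; extend them for your own data.
FIRST_PERSON = {"i", "me", "my", "mine", "i've", "i'm", "i'll"}
SECOND_PERSON = {"you", "your", "yours", "you've", "you're"}

def pronoun_features(text, alpha=1.0):
    """Count first/second-person pronouns and a smoothed first-to-second ratio.

    `alpha` is an additive-smoothing constant so the ratio stays defined
    even when the second-person count is zero.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    first = sum(tok in FIRST_PERSON for tok in tokens)
    second = sum(tok in SECOND_PERSON for tok in tokens)
    return {
        "first_person_count": first,
        "second_person_count": second,
        "first_to_second_ratio": (first + alpha) / (second + alpha),
    }

print(pronoun_features("I am now scared and afraid of cancer."))
```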
If you have syntactic structure or are keeping track of positional information through n-grams or a similar representation, then features involving first-person pronouns and your keywords may be effective.
Also, verb-initial sentence structures (`Don't be ...`, `Having an ...`) are characteristic of second-person-directed language and may show up more in the Public than the Private text.
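One rough way to capture that with spaCy, if you're already using it. The POS heuristic here is my own approximation; imperatives like "Don't be scared" tend to parse as verb- or auxiliary-initial:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_initial_ratio(text):
    """Fraction of sentences whose first token is a verb or auxiliary."""
    sents = list(nlp(text).sents)
    if not sents:
        return 0.0
    hits = sum(sent[0].pos_ in ("VERB", "AUX") for sent in sents)
    return hits / len(sents)

print(verb_initial_ratio("Don't be scared. I am worried about it."))
```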
One last speculative thought: the sentiment of the two passages is quite different, so if you have access to sentiment analysis, it might provide additional cues. I would expect the Public class to be more neutral than the Private class.
Plugging your Private example into the Watson Tone Analyzer demo gives this notable result:
```json
{
  "sentence_id": 3,
  "text": "I am now scared and afraid of cancer.",
  "tones": [
    {
      "score": 0.991397,
      "tone_id": "fear",
      "tone_name": "Fear"
    }
  ]
},
```
The Public statement also contains a fear-tagged sentence, but it's scored lower, accompanied by other tone annotations, and the sentence carries an explicit negation. So it might be worthwhile to leverage those as features as well:
"sentences_tone": [
{
"sentence_id": 0,
"text": "Don’t be scared and do not assume anything bad as cancer.",
"tones": [
{
"score": 0.874498,
"tone_id": "fear",
"tone_name": "Fear"
},
{
"score": 0.786991,
"tone_id": "tentative",
"tone_name": "Tentative"
},
{
"score": 0.653099,
"tone_id": "analytical",
"tone_name": "Analytical"
}
]
},
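If you go this route, output like the above is easy to flatten into a feature vector. A minimal sketch; the tone names and the aggregation (max score plus count per tone) are my own assumptions about what is useful:

```python
def tone_features(tone_response, tone_names=("fear", "tentative", "analytical")):
    """Flatten a sentence-level tone response into per-tone features."""
    feats = {f"tone_{name}_max": 0.0 for name in tone_names}
    feats.update({f"tone_{name}_count": 0 for name in tone_names})
    for sent in tone_response.get("sentences_tone", []):
        for tone in sent.get("tones", []):
            name = tone["tone_id"]
            if name in feats_names(tone_names):
                feats[f"tone_{name}_max"] = max(feats[f"tone_{name}_max"], tone["score"])
                feats[f"tone_{name}_count"] += 1
    return feats

def feats_names(tone_names):
    """Helper returning the set of tracked tone ids."""
    return set(tone_names)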
These suggestions are only loosely sketched, since the whole process is task-specific, but you may want to look at them and take some inspiration.
General tips
- Start with simpler models (as you seem to be doing) and gradually increase their complexity if the results are unsatisfactory. You may want to try the well-known Random Forest and XGBoost before jumping to neural networks; a minimal baseline is sketched below.
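A minimal baseline sketch with stand-in data (replace `X` and `y` with your own feature matrix and Private/Public labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; substitute your own features and labels here.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())

# XGBoost exposes a scikit-learn-compatible classifier as well:
# from xgboost import XGBClassifier
# clf = XGBClassifier(n_estimators=300)
```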
Data tips
A few quick points that might help you:
- You don't have many data points. If possible, I would advise you to gather more data from the same (or at least a very similar) source/distribution; in my opinion that would help you the most.
- Improve the representation of your data (more details below); this is the second-best option.
- You could try stemming/lemmatization (from nltk or spaCy), but I don't think it will help in this case, so you might leave it out (see the brief example after this list).
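A lemmatization one-liner with spaCy, should you want to try it anyway (assumes the `en_core_web_sm` model has been downloaded):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was scared and afraid of cancer.")
print([tok.lemma_ for tok in doc])  # each token reduced to its lemma
```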
Data representation
I assume your current representation is Bag of Words or TF-IDF. If you haven't tried the latter, I advise you to do so before delving into more complicated (or is it?) stuff. You could easily do it with sklearn's TfidfVectorizer:
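A minimal sketch (the two documents are placeholders standing in for your corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I am now scared and afraid of cancer.",                       # Private-style
    "Don't be scared and do not assume anything bad as cancer.",   # Public-style
]

# Unigrams plus bigrams; the default keeps pronouns like "I" and "you".
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape, len(vectorizer.get_feature_names_out()))
```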
If the results are unsatisfactory (and you have tried Random Forest/xgboost, or something similar like LightGBM from Microsoft), you should move on to a semantic representation, in my opinion.
Semantic representation
As you mentioned, there are representations created by the word2vec or Doc2Vec algorithms (I would skip the latter; it probably won't help).
You may want to split your examples into sentences and add a token like `<eos>` to mark the end of each sentence; it might help the neural network learn.
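A sketch of training such embeddings with gensim (the naive sentence split, the `<eos>` insertion, and the hyperparameters are illustrative; the API shown is gensim 4.x):

```python
from gensim.models import Word2Vec

texts = [
    "I am now scared and afraid of cancer. I fear it.",
    "Don't be scared and do not assume anything bad as cancer.",
]

# Naive sentence split; use a proper tokenizer in practice.
# <eos> is appended so the model sees sentence boundaries.
sentences = []
for text in texts:
    for sent in text.split(". "):
        sentences.append(sent.lower().split() + ["<eos>"])

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv["<eos>"].shape)
```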
On the other hand, there are other representations that would probably be a better fit for your task, like BERT. This one is context-dependent, meaning a token like `I` would be represented slightly differently based on the words around it (and as this representation is trainable, it should fit your task well).
The Flair library offers a nice and intuitive approach to this problem if you wish to go with PyTorch. If you are on the TensorFlow side, there is TensorFlow Hub, which also has state-of-the-art embeddings for you to use easily.
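A short Flair sketch embedding a whole document with BERT (a minimal example; the model name is an assumption, pick whichever transformer suits you):

```python
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

# Contextual document embedding backed by BERT.
embedding = TransformerDocumentEmbeddings("bert-base-uncased")

sentence = Sentence("I am now scared and afraid of cancer.")
embedding.embed(sentence)
print(sentence.embedding.shape)  # one vector per document, usable as features
```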
Neural Networks
When it comes to neural networks, start with a simple recurrent classifier and use either GRU or LSTM cells (their exact semantics differ a bit depending on the framework of choice).
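A minimal PyTorch sketch of such a classifier (vocabulary size, dimensions, and the two-class output are placeholder assumptions):

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Swap in nn.GRU if preferred; note its forward returns a single
        # hidden state instead of the (hidden, cell) pair unpacked below.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)      # (batch, seq, embed_dim)
        _, (hidden, _) = self.rnn(embedded)   # hidden: (1, batch, hidden_dim)
        return self.out(hidden[-1])           # logits: (batch, n_classes)

model = RecurrentClassifier()
dummy = torch.randint(1, 10_000, (4, 20))  # batch of 4 sequences, length 20
print(model(dummy).shape)                   # torch.Size([4, 2])
```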
If this approach is still unsatisfactory, you should look at attention networks, Hierarchical Attention Networks (one attention level per sentence and another over the whole document), or convolution-based approaches.
These approaches will take you a while and span quite a few topics; some combination of them will probably work nicely for your task.