Tools for text simplification (Java)
You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!
Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:
John is assumed to be the name of a being. The race of John is unknown. John played golf at some point in the past. Golf is assumed to refer to the ball game called golf, but the variant of golf that John played is unknown. At some point in the past John was the CEO of a company. CEO is assumed to mean "Chief Executive Officer" in the context of a company but this is not specified. The company is unknown.
In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you start to open up... it takes human-like levels of judgement and interpretation to know when to stop.
You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library
I think one can design a very simple algorithm for the basic cases of this problem, although real-world cases may be so varied that such an approach becomes unruly :)
Still, I thought I should think aloud, write up my approach, and maybe add some python code. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models, or libraries BEFORE we do one by HAND and from SCRATCH.
Goal: given a sentence, extract subsentences from it.
Example: John, who was the CEO of the company, played Golf.
Expected output: John was the CEO of the company. John played Golf.
Here is my model of what is happening, written out in the form of model assumptions (axioms?):
MA1. Simple sentences can be expanded by inserting subsentences.
MA2. A subsentence is a qualification/modification (additional information) on one or more of the entities.
MA3. To insert a subsentence, we put a comma right next to the entity we want to expand on (provide more information about), attach the subsentence (I am going to call it an extension), and place another comma when the extension ends.
Given this model, the algorithm can be straightforward, at least for the simple cases (a small code sketch follows the list below).
- DETECT: Given a sentence, detect whether it has an extension clause by looking for a pair of commas in the sentence.
- EXTRACT: If you find two commas, generate two sentences:
  - EXTRACT-BASE: delete everything between the two commas; what remains is the base sentence.
  - EXTRACT-EXTENSION: take everything between the two commas and replace 'who' with the word right before the first comma. That is your second sentence.
- PRINT: In fact, you should print the extension sentence first, because the base sentence depends on it.
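Since the question asks about Java, here is a quick, illustrative sketch of exactly this hack in Java (the class and method names are mine, and it only handles a single 'who ...' extension, as above); the same few lines carry over directly to Python.

public class SimpleSplitter {

    // DETECT + EXTRACT + PRINT for a sentence with exactly one ", who ... ," extension
    public static void split(String sentence) {
        int first = sentence.indexOf(',');
        int second = sentence.indexOf(',', first + 1);
        if (first < 0 || second < 0) {               // DETECT: no pair of commas, nothing to do
            System.out.println(sentence);
            return;
        }

        String before = sentence.substring(0, first).trim();          // text before the extension
        String extension = sentence.substring(first + 1, second).trim();
        String after = sentence.substring(second + 1).trim();

        // EXTRACT-EXTENSION: replace the leading 'who' with the word right before the first comma
        String[] beforeWords = before.split("\\s+");
        String entity = beforeWords[beforeWords.length - 1];
        String extensionSentence = extension.replaceFirst("^who\\b", entity) + ".";

        // EXTRACT-BASE: delete everything between the two commas
        String baseSentence = before + " " + after;

        // PRINT: extension sentence first, then the base sentence
        System.out.println(extensionSentence);
        System.out.println(baseSentence);
    }

    public static void main(String[] args) {
        split("John, who was the CEO of the company, played Golf.");
        // prints: "John was the CEO of the company." / "John played Golf."
    }
}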
Well, that is our algorithm. Yes, it sounds like a hack. It is. But something I am learning now is that if you use a trick in one program, it is a hack; if it can handle more cases, it is a technique.
So let us expand and complicate the situation a bit.
Compounding cases: Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.
As I was writing it, I noticed that I had omitted the 'who was' phrase for the CFO! That brings us to a complicating case where our algorithm will fail. Before going there, let me create a simpler version of Example 2 that WILL work.
Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.
Example 4. John, the CEO of the company, played Golf with Ram, the CFO.
Wait, we are not done yet!
Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.
To allow for this I need to extend my model assumptions:
MA4. More than one entity may be expanded in this way, but this should not cause confusion because the extension clause occurs right next to the entity being described. (Accounts for Example 3.)
MA5. The 'who was' phrase may be omitted, since it can be inferred by the listener. (Accounts for Example 4.)
MA6. Some entities are persons; they will be extended using a 'who'. Other entities are things, extended using a 'which'. Either of these extension heads may be omitted.
Now how do we handle these complications in our algorithm?
Try this (a rough code sketch follows the steps below):
- SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS: If the sentence contains a comma, look for the following comma and extract whatever is in between into an extension sentence. Continue until no opening or closing comma is left. At this point you should have a list with one base sentence and one or more extension sentences.
- PROCESS_EXTENSIONS: For each extension, if it has a 'who is/was' or 'which is/was', replace that phrase with the entity name that appears right before the extension. If the extension does not have a 'who is' or 'which is', prepend that entity name and the word 'is'.
- PRINT: all extension sentences first, and then the base sentence.
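Again purely as an illustration (the names are mine, and it makes the big simplifying assumption that commas strictly alternate between base text and extensions, which already breaks on list commas and on Example 5), the extended steps might look roughly like this in Java:

import java.util.ArrayList;
import java.util.List;

public class ExtensionSplitter {

    public static void split(String sentence) {
        sentence = sentence.replaceAll("\\.\\s*$", "");   // drop the final full stop for easier handling
        List<String> extensions = new ArrayList<>();
        StringBuilder base = new StringBuilder();
        String lastEntity = "";

        // SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
        // assume commas strictly alternate base text and extension clauses
        String[] chunks = sentence.split(",");
        for (int i = 0; i < chunks.length; i++) {
            String chunk = chunks[i].trim();
            if (i % 2 == 0) {                              // base text; remember the entity before the next comma
                base.append(chunk).append(" ");
                String[] words = chunk.split("\\s+");
                lastEntity = words[words.length - 1];
            } else {                                       // extension clause
                extensions.add(processExtension(chunk, lastEntity));
            }
        }

        // PRINT: extension sentences first, then the base sentence
        for (String e : extensions) System.out.println(e + ".");
        System.out.println(base.toString().trim() + ".");
    }

    // PROCESS_EXTENSIONS: replace 'who/which is|was' by the entity, or prepend "<entity> is"
    private static String processExtension(String extension, String entity) {
        if (extension.matches("(who|which)\\s+(is|was)\\b.*")) {
            return extension.replaceFirst("^(who|which)", entity);
        }
        return entity + " is " + extension;
    }

    public static void main(String[] args) {
        split("John, who was the CEO of the company, played Golf with Ram, the CFO.");
        // prints: "John was the CEO of the company." / "Ram is the CFO." / "John played Golf with Ram."
    }
}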
Not scary.
When I get some time in the next few days, I will add a Python implementation.
Thank you
Ravi Annaswamy
I see your problem as a task of converting a complex or compound sentence into simple sentences.
Based on the literature (see Sentence Types), a simple sentence is built from one independent clause, while compound and complex sentences are built from at least two clauses. Also, a clause must have a subject and a verb.
So your task is to split the sentence into the clauses that form it.
Dependency parsing from Stanford CoreNLP is a perfect tool for splitting a compound or complex sentence into simple sentences. You can try the demo online.
From your sample sentence, we get the parse result in Stanford typed dependency (SD) notation, as shown below:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
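If you want to produce this output programmatically rather than through the online demo, a minimal CoreNLP pipeline in Java looks roughly like the following; note that recent CoreNLP releases emit Universal Dependencies by default, so the relation names may differ from the rcmod/prep_of style shown above.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class DependencyDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("John, who was the CEO of a company, played golf.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // collapsed dependencies fold prepositions into the relation name, e.g. prep_of
            SemanticGraph deps = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(deps.toList());   // one typed dependency per line
        }
    }
}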
A clause can be identified from any relation (in SD) whose category is subject, e.g. nsubj, nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
- John CEO
- John played
After you get the basic clauses, you can add the other parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.
By the way, your question might be related to Finding meaningful sub-sentences from a sentence.
Answer to 3rd comment:
Once you have got the pair of subject and verb, i.e. nsubj(CEO-6, John-1), get all dependencies that have a link to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
Based on the example nsubj(CEO-6, John-1), if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
The next step is traversing from the CEO-6 part. You'll get:
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From the result above, you get new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 in either head or dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
In this step, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from all heads and dependents, then arrange the words in ascending order based on the number appended to them. This number indicates the word order in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. of. This part is hidden in prep_of(CEO-6, company-9). If you read the Stanford Dependency Manual, there are two kinds of SD, collapsed and non-collapsed. Please read it to understand why this of is hidden and how to get the word order of this hidden part.
With the same approach, you'll get the second sentence:
John played golf
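To make the whole procedure concrete, here is a small, self-contained Java sketch (no CoreNLP needed; the Dep record, which requires Java 16+, and all other names are mine) that hard-codes the dependency list above, finds the subject relations, and performs the traversal described in this answer:

import java.util.*;

public class ClauseExtractor {

    // one typed dependency: rel(head-headIdx, dep-depIdx)
    record Dep(String rel, String head, int headIdx, String dep, int depIdx) {}

    public static void main(String[] args) {
        List<Dep> deps = List.of(
                new Dep("nsubj", "CEO", 6, "John", 1),
                new Dep("nsubj", "played", 11, "John", 1),
                new Dep("cop", "CEO", 6, "was", 4),
                new Dep("det", "CEO", 6, "the", 5),
                new Dep("rcmod", "John", 1, "CEO", 6),
                new Dep("det", "company", 9, "a", 8),
                new Dep("prep_of", "CEO", 6, "company", 9),
                new Dep("root", "ROOT", 0, "played", 11),
                new Dep("dobj", "played", 11, "golf", 12));

        // build one simple sentence per subject relation
        for (Dep subj : deps) {
            if (!subj.rel().startsWith("nsubj")) continue;
            System.out.println(extractClause(subj, deps));
        }
    }

    static String extractClause(Dep subject, List<Dep> all) {
        Map<Integer, String> words = new TreeMap<>();   // word index -> word, kept in sentence order
        words.put(subject.headIdx(), subject.head());
        words.put(subject.depIdx(), subject.dep());

        // traverse outward from the verb/predicate side of the subject relation,
        // ignoring other subject relations (they start new clauses) and the artificial ROOT
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(subject.headIdx());
        Set<Integer> visited = new HashSet<>(queue);
        while (!queue.isEmpty()) {
            int idx = queue.poll();
            for (Dep d : all) {
                if (d.rel().startsWith("nsubj") || d.rel().equals("root")) continue;
                if (d.headIdx() == idx || d.depIdx() == idx) {
                    words.put(d.headIdx(), d.head());
                    words.put(d.depIdx(), d.dep());
                    for (int next : new int[]{d.headIdx(), d.depIdx()}) {
                        if (visited.add(next)) queue.add(next);
                    }
                }
            }
        }
        return String.join(" ", words.values());
    }
}

Running it prints "John was the CEO a company" and "John played golf", i.e. the same two results derived by hand above; recovering the hidden of from the collapsed prep_of relation is left as described in the manual.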
I believe AlchemyAPI is your best option. Still, it will require a lot of work on your side to do exactly what you need, and as most commentators have already told you, you will most probably not get 100% quality results.