Stanford typed dependencies using coreNLP in python
It is to note the python library stanfordnlp is not just a python wrapper for StanfordCoreNLP.
1. Difference StanfordNLP / CoreNLP
As said on the stanfordnlp Github repo:
The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.
Stanfordnlp contains a new set of neural networks models, trained on the CONLL 2018 shared task. The online parser is based on the CoreNLP 3.9.2 java library. Those are two different pipelines and sets of models, as explained here.
Your code only accesses their neural pipeline trained on CONLL 2018 data. This explains the differences you saw compared to the online version. Those are basically two different models.
What adds to the confusion I believe is that both repositories belong to the user named stanfordnlp (which is the team name). Don't be fooled between the java stanfordnlp/CoreNLP and the python stanfordnlp/stanfordnlp.
Concerning your 'neg' issue, it seems that in the python libabry stanfordnlp, they decided to consider the negation with an 'advmod' annotation altogether. At least that is what I ran into for a few example sentences.
2. Using CoreNLP via stanfordnlp package
However, you can still get access to the CoreNLP through the stanfordnlp package. It requires a few more steps, though. Citing the Github repo,
There are a few initial setup steps.
- Download Stanford CoreNLP and models for the language you wish to use. (you can download CoreNLP and the language models here)
- Put the model jars in the distribution folder
- Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
Once that is done, you can start a client, with code that can be found in the demo :
from stanfordnlp.server import CoreNLPClient
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
# submit the request to the server
ann = client.annotate(text)
# get the first sentence
sentence = ann.sentence[0]
# get the dependency parse of the first sentence
print('---')
print('dependency parse of first sentence')
dependency_parse = sentence.basicDependencies
print(dependency_parse)
#get the tokens of the first sentence
#note that 1 token is 1 node in the parse tree, nodes start at 1
print('---')
print('Tokens of first sentence')
for token in sentence.token :
print(token)
Your sentence will therefore be parsed if you specify the 'depparse' annotator (as well as the prerequisite annotators tokenize, ssplit, and pos). Reading the demo, it feels that we can only access basicDependencies. I have not managed to make Enhanced++ dependencies work via stanfordnlp.
But the negations will still appear if you use basicDependencies !
Here is the output I obtained using stanfordnlp and your example sentence. It is a DependencyGraph object, not pretty, but it is unfortunately always the case when we use the very deep CoreNLP tools. You will see that between nodes 4 and 5 ('not' and 'born'), there is and edge 'neg'.
node {
sentenceIndex: 0
index: 1
}
node {
sentenceIndex: 0
index: 2
}
node {
sentenceIndex: 0
index: 3
}
node {
sentenceIndex: 0
index: 4
}
node {
sentenceIndex: 0
index: 5
}
node {
sentenceIndex: 0
index: 6
}
node {
sentenceIndex: 0
index: 7
}
node {
sentenceIndex: 0
index: 8
}
edge {
source: 2
target: 1
dep: "compound"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 2
dep: "nsubjpass"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 3
dep: "auxpass"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 4
dep: "neg"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 7
dep: "nmod"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 5
target: 8
dep: "punct"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
edge {
source: 7
target: 6
dep: "case"
isExtra: false
sourceCopy: 0
targetCopy: 0
language: UniversalEnglish
}
root: 5
---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false
word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false
word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false
word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false
word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false
word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false
word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false
2. Using CoreNLP via NLTK package
I will not go into details on this one, but there is also a solution to access the CoreNLP server via the NLTK library , if all else fails. It does output the negations, but requires a little more work to start the servers. Details on this page
EDIT
I figured I could also share with you the code to get the DependencyGraph into a nice list of 'dependency, argument1, argument2' in a shape similar to what stanfordnlp outputs.
from stanfordnlp.server import CoreNLPClient
text = "Barack Obama was not born in Hawaii."
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
# submit the request to the server
ann = client.annotate(text)
# get the first sentence
sentence = ann.sentence[0]
# get the dependency parse of the first sentence
dependency_parse = sentence.basicDependencies
#print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
#print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
#print(dir(dependency_parse.edge))
#get a dictionary associating each token/node with its label
token_dict = {}
for i in range(0, len(sentence.token)) :
token_dict[sentence.token[i].tokenEndIndex] = sentence.token[i].word
#get a list of the dependencies with the words they connect
list_dep=[]
for i in range(0, len(dependency_parse.edge)):
source_node = dependency_parse.edge[i].source
source_name = token_dict[source_node]
target_node = dependency_parse.edge[i].target
target_name = token_dict[target_node]
dep = dependency_parse.edge[i].dep
list_dep.append((dep,
str(source_node)+'-'+source_name,
str(target_node)+'-'+target_name))
print(list_dep)
It ouputs the following
[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'depparse'], timeout=60000, memory='16G') as client:
# submit the request to the server
ann = client.annotate(text)
offset = 0 # keeps track of token offset for each sentence
for sentence in ann.sentence:
print('___________________')
print('dependency parse:')
# extract dependency parse
dp = sentence.basicDependencies
# build a helper dict to associate token index and label
token_dict = {sentence.token[i].tokenEndIndex-offset : sentence.token[i].word for i in range(0, len(sentence.token))}
offset += len(sentence.token)
# build list of (source, target) pairs
out_parse = [(dp.edge[i].source, dp.edge[i].target) for i in range(0, len(dp.edge))]
for source, target in out_parse:
print(source, token_dict[source], '->', target, token_dict[target])
print('\nTokens \t POS \t NER')
for token in sentence.token:
print (token.word, '\t', token.pos, '\t', token.ner)
This outputs the following for the first sentence:
___________________
dependency parse:
2 Obama -> 1 Barack
4 born -> 2 Obama
4 born -> 3 was
4 born -> 6 Hawaii
4 born -> 7 .
6 Hawaii -> 5 in
Tokens POS NER
Barack NNP PERSON
Obama NNP PERSON
was VBD O
born VBN O
in IN O
Hawaii NNP STATE_OR_PROVINCE
. . O