Extracting a part of a Spacy document as a new document

My personal preference would be to slice by characters. spaCy's sentence segmentation is pretty good for structured text, but for poorly structured text, grabbing a fixed amount of text (i.e. by character) is a little more predictable:

char_end = 200
subdoc = nlp(doc.text[:char_end])
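
One thing to watch with a plain character slice is that it can cut a word in half, and the tokenizer will then see a fragment as the last token. A minimal sketch of a workaround, assuming you are happy to lose a few characters: back up to the last whitespace before the cut. The helper name truncate_at_whitespace is made up for illustration and is not part of spaCy.

```python
def truncate_at_whitespace(text: str, char_end: int) -> str:
    """Return at most char_end characters of text without ending mid-word.

    Hypothetical helper (not a spaCy API): if the cut would land inside a
    word, back up to the last whitespace before the cut.
    """
    if len(text) <= char_end:
        return text
    cut = text[:char_end]
    # The cut already falls exactly on a word boundary: keep it.
    if text[char_end].isspace():
        return cut.rstrip()
    # Otherwise walk backwards to the last whitespace inside the slice.
    for i in range(len(cut) - 1, -1, -1):
        if cut[i].isspace():
            return cut[:i].rstrip()
    # No whitespace at all: fall back to the raw character slice.
    return cut


sample = "This is my sentence. And here's another one."
print(repr(truncate_at_whitespace(sample, 13)))
```

The result can then be passed to the pipeline as before, e.g. subdoc = nlp(truncate_at_whitespace(doc.text, char_end)).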

There's a nicer solution using as_doc() on a Span object (https://spacy.io/api/span#as_doc):

import spacy

nlp = spacy.load('en_core_web_lg')
content = "This is my sentence. And here's another one."
doc = nlp(content)
for i, sent in enumerate(doc.sents):
    print(i, "a", sent, type(sent))
    doc_sent = sent.as_doc()
    print(i, "b", doc_sent, type(doc_sent))

Gives output:

0 a This is my sentence. <class 'spacy.tokens.span.Span'>   
0 b This is my sentence.  <class 'spacy.tokens.doc.Doc'>   
1 a And here's another one.  <class 'spacy.tokens.span.Span'>   
1 b And here's another one.  <class 'spacy.tokens.doc.Doc'>

(The code snippet is written out in full for clarity; it can of course be shortened.)
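
Since as_doc() is a method of Span, it should work on any slice of the document, not only on sentences. A minimal sketch, assuming a blank English pipeline (spacy.blank("en")) so that no model download is needed; with a blank pipeline only tokenization is available, so the slice here is by token index:

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline, which only tokenizes.
nlp = spacy.blank("en")
doc = nlp("This is my sentence. And here's another one.")

# Slicing a Doc by token index gives a Span; as_doc() copies it into a new Doc.
first_tokens = doc[:5]
subdoc = first_tokens.as_doc()

print(type(first_tokens).__name__, type(subdoc).__name__, len(subdoc))
print(repr(subdoc.text))
```

Unlike calling nlp() on a text slice, this does not run the pipeline again on the extracted part.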

A rather ugly way to achieve your purpose is to construct a list of sentences and build a new document from a subset of sentences.

# Span.string was removed in spaCy v3; Span.text works in both v2 and v3
sentences = [sent.text.strip() for sent in doc.sents][:100]
minidoc = nlp(' '.join(sentences))

It feels like there should be a better solution, but I guess this at least works.
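
One possible alternative, sketched here under a few assumptions, is to take a single Span that runs from the start of the first sentence to the end of the last wanted sentence and convert that with as_doc(), which keeps the original whitespace instead of re-joining with spaces. The sketch assumes spaCy v3 (for the nlp.add_pipe("sentencizer") syntax) and a blank pipeline with the rule-based sentencizer; with a trained model the sentence boundaries would come from the parser instead.

```python
from itertools import islice

import spacy

# Assumptions: spaCy v3 and a blank pipeline with the rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is my sentence. And here's another one. A third one follows.")

max_sents = 2  # how many leading sentences to keep
sents = list(islice(doc.sents, max_sents))

# One Span from the first kept sentence to the last, copied into a new Doc.
# (An empty document would have no sentences, so sents[0] would fail there.)
minidoc = doc[sents[0].start:sents[-1].end].as_doc()

print(repr(minidoc.text))
```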
