What is CoNLL data format?
There are many different CoNLL formats since CoNLL is a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word with a series of tab-separated fields. Underscores (_) indicate empty values. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:
ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):
- ID (index in sentence, starting at 1)
- FORM (word form itself)
- LEMMA (word's lemma or stem)
- POS (part of speech)
- FEAT (list of morphological features separated by |)
- HEAD (index of syntactic parent, 0 for ROOT)
- DEPREL (syntactic relationship between HEAD and this word)
Variants of those columns that start with P (e.g., PPOS rather than POS) indicate that the value was automatically predicted rather than taken from the gold standard.
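As a rough illustration of the column layout above, here is a minimal Python sketch that maps one token line to the 12 column names. The function name `parse_conll09_line` and the example line are made up for this answer; they are not part of Mate-Parser or of any official CoNLL tooling.

```python
# Column names as listed above (first 12 columns of CoNLL 2009).
CONLL09_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
    "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL",
]


def parse_conll09_line(line):
    """Map the first 12 tab-separated fields of one token line to column names.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(CONLL09_COLUMNS):
        raise ValueError(f"expected at least 12 fields, got {len(fields)}")
    # zip() stops after 12 names, so any extra columns are ignored.
    token = dict(zip(CONLL09_COLUMNS, fields))
    # "_" marks an empty value; keep FORM as-is in case the word really is "_".
    token = {
        name: (None if value == "_" and name != "FORM" else value)
        for name, value in token.items()
    }
    # FEAT and PFEAT hold morphological features separated by "|".
    for name in ("FEAT", "PFEAT"):
        token[name] = token[name].split("|") if token[name] else []
    # ID, HEAD and PHEAD are integer indices (HEAD 0 means ROOT).
    for name in ("ID", "HEAD", "PHEAD"):
        if token[name] is not None:
            token[name] = int(token[name])
    return token


# Invented example line, not taken from a real corpus.
example = "1\tdogs\tdog\tdog\tNNS\tNNS\tnum=pl\tnum=pl\t2\t2\tSBJ\tSBJ"
token = parse_conll09_line(example)
print(token["FORM"], token["PPOS"], token["HEAD"])
```

Here the gold value is read from POS and the predicted one from PPOS; which of the two a tool actually consumes depends on the tool.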
Update: There is now a CoNLL-U data format as well which extends the CoNLL-X format.
As an update to @dmcc's answer:
- CoNLL is the conventional name for TSV formats in NLP (TSV = tab-separated values, i.e., CSV with <TAB> as separator)
- It originates from a series of shared tasks organized at the Conference on Computational Natural Language Learning (hence the name)
- Not all of these tasks use "CoNLL" formats, some tasks had JSON or XML formats
- There are "CoNLL" formats that developed independently from CoNLL, most notably CoNLL-U
- CoNLL formats differ in the choice and order of columns
In CoNLL formats,
- every word (token) is represented in one line.
- every sentence is separated from the next by an empty line
- every column represents one annotation
- every word in a sentence has the same number of columns (in some formats: every word in the corpus has the same number of columns)
- an annotation is a string value about a particular word
- annotations that span over multiple words sometimes use special notations, e.g., round brackets (indicating begin and end of a phrase) or the IOBES-annotation (e.g., B-NP: begin of NP, I-NP: in the middle of NP, E-NP: end of NP, S-NP: NP begins and ends at the current word, O: no NP annotation)
- some CoNLL formats have one or more columns of numerical identifiers at the start of the line; the next column after these (or the first column if there are no IDs) usually contains the WORD
- the ID of the first word in the sentence is 1. If no ID column is provided, the ID is the number of preceding words within the sentence plus 1.
- in dependency syntax, grammatical relations hold between words, the dependent is marked for the HEAD (= ID of the parent word) and the EDGE/DEP[endency] (= grammatical relation), both in separate columns
- if a word in dependency syntax does not have a parent (i.e., it is the syntactic root), set its HEAD to 0
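To make these conventions concrete, here is a minimal Python sketch that splits a file into sentences at empty lines, splits each word line into columns, and decodes IOBES tags into spans. The two-column layout (WORD, then a chunk tag) and the helper names `read_sentences` and `iobes_spans` are assumptions chosen for illustration; they do not correspond to any particular CoNLL format or library.

```python
def read_sentences(lines):
    """Yield sentences as lists of rows; each row is a list of column strings."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line separates one sentence from the next.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def iobes_spans(tags):
    """Turn IOBES tags into (label, first_id, last_id) spans.

    Word IDs are implicit here: number of preceding words plus 1.
    """
    spans, start = [], None
    for word_id, tag in enumerate(tags, start=1):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((label, word_id, word_id))
        elif prefix == "B":
            start = word_id
        elif prefix == "E" and start is not None:
            spans.append((label, start, word_id))
            start = None
        # "I" and "O" need no action in this simplified decoder.
    return spans


# Invented sample: column 1 is the WORD, column 2 an IOBES chunk tag.
sample = (
    "The\tB-NP\n"
    "old\tI-NP\n"
    "dog\tE-NP\n"
    "barked\tO\n"
    "\n"
    "It\tS-NP\n"
    "slept\tO\n"
)

sentences = list(read_sentences(sample.splitlines()))
for sentence in sentences:
    words = [row[0] for row in sentence]
    spans = iobes_spans([row[1] for row in sentence])
    print(words, spans)
```

The decoder does not validate the tag sequence (for example, a B without a matching E is silently dropped), so treat it as a starting point rather than a robust parser.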
Be careful when working with tools or libraries that claim to support (some) "CoNLL format". Different CoNLL formats order their columns differently, and the developer might not be aware of that. So a tool is likely not to work as expected if it gets data in another (or an unspecified) CoNLL format.
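The following Python sketch shows what can go wrong. It reads the same two-token sentence once with the CoNLL-X column order and once with the CoNLL 2009 order. The helpers `read_tokens` and `head_forms` and the sample sentence are invented for this illustration; the column lists follow the CoNLL-X and CoNLL 2009 shared task descriptions as I understand them.

```python
# Column orders of two dependency formats (assumed from the task descriptions).
CONLLX = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]
CONLL09 = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
           "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"]


def read_tokens(text, columns):
    """Read one sentence, naming the fields according to `columns`."""
    tokens = []
    for line in text.splitlines():
        if line.strip():
            tokens.append(dict(zip(columns, line.split("\t"))))
    return tokens


def head_forms(tokens):
    """Return (FORM, FORM of its HEAD) pairs; HEAD 0 is the syntactic root."""
    by_id = {t["ID"]: t["FORM"] for t in tokens}
    by_id["0"] = "ROOT"
    return [(t["FORM"], by_id[t["HEAD"]]) for t in tokens]


# Invented sentence in CoNLL-X column order.
conllx_text = (
    "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n"
)

# Correct column order: every word finds its parent.
right = head_forms(read_tokens(conllx_text, CONLLX))
print(right)

# Wrong column order: the HEAD index ends up under FEAT, and HEAD is empty.
wrong_tokens = read_tokens(conllx_text, CONLL09)
print(wrong_tokens[0]["FEAT"], wrong_tokens[0]["HEAD"])
```

Note that the misread does not raise an error by itself; the data is just silently shifted into the wrong columns, which is why it pays to state the column order explicitly.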
For converting between different CoNLL formats, you can consider using CoNLL-RDF (https://github.com/acoli-repo/conll-rdf) or CoNLL-Transform (https://github.com/acoli-repo/conll-transform). (Disclaimer: Developed by my lab.)