Logstash parsing xml document containing multiple log entries
Alright, I found a solution that works for me. The biggest problem with it is that the XML plugin is... not exactly unstable, but either buggy and poorly documented, or poorly and incorrectly documented.
TL;DR
Bash command line:
gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf
Logstash config:
input {
  stdin {}
}

filter {
  # Add all lines that have more indentation than double-space to the previous line.
  multiline {
    pattern => "^\s\s(\s\s|\<\/entry\>)"
    what => "previous"
  }

  # The multiline filter adds the tag "multiline" only to events spanning
  # multiple lines. We _only_ want those here.
  if "multiline" in [tags] {
    # Prepend the XML declaration here. Could in theory extract this from
    # the first line with a clever filter; not worth the effort at the moment.
    mutate {
      replace => ["message", '<?xml version="1.0" encoding="UTF-8" ?>%{message}']
    }

    # This filter exports the hierarchy into the field "entry". This will
    # create a very deep structure that Elasticsearch does not really like,
    # which is why add_field is used to flatten it.
    xml {
      target => "entry"
      source => "message"
      add_field => {
        "fieldx" => "%{[entry][fieldx]}"
        "fieldy" => "%{[entry][fieldy]}"
        "fieldz" => "%{[entry][fieldz]}"
        # With deeper nested fields, the xml filter actually creates an
        # array containing hashes, which is why you need the [0]
        # -- took me ages to find out.
        "fielda" => "%{[entry][fieldarray][0][fielda]}"
        "fieldb" => "%{[entry][fieldarray][0][fieldb]}"
        "fieldc" => "%{[entry][fieldarray][0][fieldc]}"
      }
    }

    # Remove the intermediate fields before output. "message" contains the
    # original message (XML); you may or may not want to keep it.
    mutate {
      remove_field => ["message"]
      remove_field => ["entry"]
    }
  }
}

output {
  ...
}
Detailed
My solution works because, at least up to the entry level, my XML input is very uniform and can thus be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the logstash xml plugin essentially works only with fields (read: columns in lines) that contain XML data, I had to get the data into a more useful format first.
Shell: Preparing the file
- gzcat -d file.xml.gz |: Decompress the data. It was just too much data to keep uncompressed -- obviously you can skip this step for an uncompressed file.
- tr -d "\n\r" |: Remove line breaks inside XML elements: some of the elements can contain line breaks as character data. The next step requires that these are removed, or encoded in some way. Even though it is assumed that at this point you have all the XML code in one massive line, it does not matter if this command removes any whitespace between elements. (A quick sanity check of this step is sketched after the formatted example below.)
- xmllint --format - |: Format the XML with xmllint (it comes with libxml).

Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted:

<root>
  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldz>...</fieldz>
    <fieldarray>
      <fielda>...</fielda>
      <fieldb>...</fieldb>
      ...
    </fieldarray>
  </entry>
  <entry>
    ...
  </entry>
  ...
</root>
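To sanity-check the tr and xmllint steps in isolation, here is a minimal sketch on an inline sample. The sample document and the printf call are made up for illustration; only tr -d "\n\r" and xmllint --format are the actual pipeline steps from above:

# A tiny stand-in for the real export: the \n inside fieldx becomes a real
# line break in the character data, which tr strips before xmllint
# re-indents the document with two spaces per level.
printf '<root><entry><fieldx>a\nb</fieldx></entry></root>' \
  | tr -d "\n\r" \
  | xmllint --format -

Without the tr step, xmllint keeps the newline inside the character data, and the two-space indentation the multiline filter relies on (next section) would break for that line.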
Logstash
logstash -f logstash-csv.conf

(See the full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick: it can merge multiple lines into a single log message. And this is why the formatting with xmllint was necessary:
filter {
  # Add all lines that have more indentation than double-space to the previous line.
  multiline {
    pattern => "^\s\s(\s\s|\<\/entry\>)"
    what => "previous"
  }
}
This basically says that every line indented by more than two spaces (or that is </entry> -- xmllint indents with two spaces by default) belongs to the previous line. It also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (with xmllint).
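To double-check which lines this catches, here is a small sketch that replays the pattern with grep on hand-written sample lines. The sample lines are made up to mimic xmllint's two-space indentation, and the Logstash pattern is rewritten as a POSIX ERE with literal spaces (note that \s in the original also matches tabs):

# grep stands in for the multiline filter's pattern here; it prints exactly
# the lines that would be appended to the previous event, namely the
# four-space-indented field line and the closing </entry> tag.
printf '%s\n' '<root>' '  <entry>' '    <fieldx>v</fieldx>' '  </entry>' '</root>' \
  | grep -E '^  (  |</entry>)'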