Extraction of data from a simple XML file

 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, such as entity encoding, line breaks inside elements, and so on.
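To make one of those edge cases concrete, here is a minimal sketch (the file name wrapped.xml and the wrapped sample are made up for illustration): the same job element, with a line break inside the tag, defeats the grep/cut pipeline entirely, even though any conforming XML parser treats it identically:

```shell
# The same element as before, but with a line break inside the tag.
printf '<job\n>programming</job>\n' > wrapped.xml

# The original pipeline now prints an empty line instead of "programming":
# grep matches the "<job" line, cut finds no ">" to split on, and the
# final cut throws everything after "<" away.
grep '<job' wrapped.xml | cut -f2 -d'>' | cut -f1 -d'<'
```

The pipeline isn't wrong about the file it was written against; it's wrong about XML, which permits this layout.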

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.


Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'

Please don't use line- and regex-based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex and line-based parsing simply cannot cope with that.

Things like self-closing tags and variable line wrapping mean these four snippets all 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why writing a regex/line-based parser is difficult. Fortunately, you don't need to: most scripting languages have at least one XML parser available, often several.
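A quick way to convince yourself, sketched with only standard tools (the printf one-liners below just reproduce the first and third snippets): a line-based match that counts sometag elements finds one in the first layout but nothing in the line-wrapped one, even though a real parser sees the same document in both:

```shell
# Layout 1: tag and attributes on one line - the naive match finds it.
printf '<root>\n  <sometag val1="fish" val2="carrot" val3="narf"></sometag>\n</root>\n' \
    | grep -c '<sometag '    # prints 1

# Layout 3: tag name and attributes wrapped across lines - the same
# match finds nothing, because "<sometag " never appears on one line.
printf '<root\n><sometag\nval1="fish"\nval2="carrot"\nval3="narf"\n></sometag></root>\n' \
    | grep -c '<sometag '    # prints 0
```

Every refinement of the pattern just chases the next legal layout; a parser handles all of them at once.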

As a previous poster has alluded to, xml_grep is available. It's actually a tool built on the XML::Twig Perl library. It uses XPath expressions to find elements, and distinguishes between document structure, attributes and content.

E.g.:

xml_grep 'job' jobs.xml --text_only

However, in the interest of making better answers, here are a couple of 'roll your own' examples based on your source data:

First way:

Use twig handlers that catch elements of a particular type and act on them. The advantage of this approach is that it parses the XML 'as you go' and lets you modify it in flight if you need to. That's particularly useful for discarding already-processed XML when you're working with large files, using purge or flush:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# slurp the whole input, so multi-line XML parses correctly
# (a bare <> in the argument list would only pass the first line to parse)
XML::Twig->new(
    twig_handlers => {
        'job' => sub { print $_->text; },
    },
)->parse( do { local $/; <> } );

This uses <> to take input (piped in, or a filename given on the command line: ./myscript somefile.xml) and processes it: for each job element, it extracts and prints any associated text. (You might want print $_->text, "\n" to insert a linefeed after each.)

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

This will match twice, and print some of the output twice too, because the outer element's text includes the inner element's. You can, however, match on /job instead if you prefer, which restricts the handler to the root element. Usefully, handlers also let you, for example, print and delete an element, or copy and paste one, modifying the XML structure in flight.

Alternatively - parse first, and 'print' based on structure:

my $twig = XML::Twig->new()->parse( do { local $/; <> } );
print $twig->root->text;

As job is your root element, all we need to do is print its text.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

my $twig = XML::Twig->new()->parse( do { local $/; <> } );
print $twig->findnodes( '/job', 0 )->text;

You can use XML::Twig's pretty_print option to reformat your XML too:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( do { local $/; <> } )->print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

Tags: xml, bash, grep, awk, sed