Parser for Wikipedia
I don't know exactly what the XML format of a Wikipedia dump looks like. But if part of the text is in Wikipedia markup, I suggest looking at http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html. It is one of the classes in the Wikipedia contrib package for Apache Lucene. I haven't used it myself, but Apache Lucene is a fairly mature project, so its package (experimental, in this case) is worth a try.
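To get the markup out of the dump before tokenizing it, something like the following might work. It is only a minimal sketch using the JDK's StAX reader, and it assumes the dump follows the usual MediaWiki export layout, where each page element holds a title and a revision whose text element contains the raw markup. The class name DumpTextExtractor and the sample input are made up for illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Lucene or any Wikipedia library.
public class DumpTextExtractor {

    // Returns page title -> raw wiki markup.
    // Assumption: every <page> has a <title> followed by a <text> element.
    public static Map<String, String> extract(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed for this kind of input, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        Map<String, String> pages = new LinkedHashMap<>();
        String title = null;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    // Remember the title until the matching <text> shows up.
                    title = reader.getElementText();
                } else if ("text".equals(name) && title != null) {
                    pages.put(title, reader.getElementText());
                    title = null;
                }
            }
        }
        reader.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample =
            "<mediawiki><page><title>Example</title>"
            + "<revision><text>'''Bold''' and [[Link]]</text></revision>"
            + "</page></mediawiki>";
        for (Map.Entry<String, String> page : extract(new StringReader(sample)).entrySet()) {
            System.out.println(page.getKey() + ": " + page.getValue());
        }
    }
}
```

Collecting everything into a map is only sensible for small inputs. For a full dump you would process each page as it is read (for example, hand its markup to the tokenizer) instead of storing it.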
See java-wikipedia-parser. I have never used it, but according to the docs:
The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface.
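To illustrate the general idea of controlling output through a visitor, here is a small self-contained sketch. It does not use java-wikipedia-parser: the MarkupVisitor interface, its two methods, and the toy parse routine are all hypothetical, and the real be.devijver.wikipedia.Visitor interface may well look different. The point is only that the parser reports what it finds and the visitor you pass in decides what to emit.

```java
public class VisitorSketch {

    // Hypothetical callback interface; NOT the real be.devijver.wikipedia.Visitor.
    public interface MarkupVisitor {
        void heading(String text);
        void paragraph(String text);
    }

    // Emits HTML, roughly what a built-in HTML generator would do.
    // Note: no HTML escaping is done in this sketch.
    public static class HtmlVisitor implements MarkupVisitor {
        private final StringBuilder out = new StringBuilder();
        public void heading(String text) { out.append("<h2>").append(text).append("</h2>"); }
        public void paragraph(String text) { out.append("<p>").append(text).append("</p>"); }
        public String result() { return out.toString(); }
    }

    // A custom implementation that emits plain text instead.
    public static class PlainTextVisitor implements MarkupVisitor {
        private final StringBuilder out = new StringBuilder();
        public void heading(String text) { out.append(text).append('\n'); }
        public void paragraph(String text) { out.append(text).append('\n'); }
        public String result() { return out.toString(); }
    }

    // Toy parser: a line of the form "== text ==" is a heading,
    // every other non-blank line is a paragraph.
    public static void parse(String markup, MarkupVisitor visitor) {
        for (String line : markup.split("\n")) {
            String t = line.trim();
            if (t.isEmpty()) {
                continue;
            }
            if (t.length() > 4 && t.startsWith("==") && t.endsWith("==")) {
                visitor.heading(t.substring(2, t.length() - 2).trim());
            } else {
                visitor.paragraph(t);
            }
        }
    }

    public static String toHtml(String markup) {
        HtmlVisitor v = new HtmlVisitor();
        parse(markup, v);
        return v.result();
    }

    public static String toPlainText(String markup) {
        PlainTextVisitor v = new PlainTextVisitor();
        parse(markup, v);
        return v.result();
    }

    public static void main(String[] args) {
        String markup = "== Title ==\nSome body text.";
        System.out.println(toHtml(markup));
        System.out.print(toPlainText(markup));
    }
}
```

The same parse call produces different output depending on which visitor it is given, which is presumably what the docs mean by passing your own implementation.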