Extract links from a sitemap (XML)
If you're on a Linux box or another system with GNU grep (the -P option requires it), you can just run:
grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
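To see what that pattern actually picks up, here is a rough Python sketch of the same match. The sample sitemap and the names URL_PATTERN, SAMPLE and grep_urls are made up for illustration; the pattern is the grep one, minus the capture group and the escapes Python does not need.

```python
import re

# Same idea as the grep pattern: "http" or "https", then "://", then
# everything up to a space, quote, parenthesis or angle bracket.
URL_PATTERN = re.compile(r'https?://[^ "()<>]*')

# Hypothetical sample sitemap, used only for illustration.
SAMPLE = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>'''


def grep_urls(text):
    """Return every substring that looks like a URL, like grep -o does."""
    return URL_PATTERN.findall(text)


if __name__ == '__main__':
    for url in grep_urls(SAMPLE):
        print(url)
```

Note that with a sample like this the namespace URL in the xmlns attribute is matched too, because the pattern looks at the whole file and not only at the <loc> elements.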
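Alternatively, you can use a Python script.
This script prints every link that starts with http:// or https:// and sits between a closing ">" and an opening "<":
import re

# Scan the sitemap line by line and print every URL found between tags
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(https?://[^<]+)<', line):
            print(url)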
In your case, the following script finds all URLs wrapped in <loc> tags:
import re

# Print only the URLs enclosed in <loc>...</loc>
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)
Here is a nice tool for experimenting with regular expressions if you are not familiar with them.
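If the regex approach turns out to be fragile (for example when a <loc> element spans several lines), a different technique is to let an XML parser do the work. The following is only a sketch using Python's standard xml.etree.ElementTree module, assuming the file uses the standard sitemaps.org namespace; the names SITEMAP_NS, SAMPLE and extract_locs are hypothetical and the sample data is made up.

```python
import xml.etree.ElementTree as ET

# Namespace declared by standard sitemap files (assumed here).
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sample sitemap, used only for illustration.
SAMPLE = '''<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url>
    <loc>
      https://example.com/about
    </loc>
  </url>
</urlset>'''


def extract_locs(xml_text):
    """Return the stripped text of every <loc> element in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind('.//sm:loc', SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


if __name__ == '__main__':
    for url in extract_locs(SAMPLE):
        print(url)
```

For a file on disk, ET.parse('sitemap.xml').getroot() could replace the ET.fromstring call.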
If you need to load a remote file, you can use the following code:
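import re
from urllib.request import urlopen

# Download the sitemap and print the URLs enclosed in <loc>...</loc>
with urlopen('http://server.com/sitemap.xml') as f:
    for raw_line in f:
        line = raw_line.decode('utf-8')  # assumes the sitemap is UTF-8 encoded
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)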
This can also be done with a single sed command, which seems more robust than the grep solution because it only keeps lines containing <loc> (it assumes each <loc> element sits on its own line):
sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile
(found at: linuxquestions.org)
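For anyone who finds the sed syntax hard to read, here is an approximate Python rendering of what that command does, step by step. It is a sketch, not an exact reimplementation of sed; the names LOC_PATTERN and sed_like_filter are made up for illustration.

```python
import re

# Counterpart of the sed substitution pattern:
# optional whitespace, <loc>, captured text, </loc>.
LOC_PATTERN = re.compile(r'\s*<loc>(.*)</loc>')


def sed_like_filter(lines):
    """Keep only lines containing <loc> and strip the tags from them."""
    result = []
    for line in lines:
        # /<loc>/!d : delete every line that does not contain <loc>
        if '<loc>' not in line:
            continue
        # s/.../\1/ : replace the first match with the captured group
        result.append(LOC_PATTERN.sub(r'\1', line.rstrip('\n'), count=1))
    return result


if __name__ == '__main__':
    with open('sitemap.xml', 'r') as f:
        for url in sed_like_filter(f):
            print(url)
```

As with the sed command, anything else on the same line is left in place, so a line such as "<url><loc>...</loc></url>" keeps its <url> wrapper.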