How to get an attribute value using BeautifulSoup and Python?
The problem is that find_all('tag') returns the whole HTML block named tag:
>>> results.find_all('tag')
[<tag>
<stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
<stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
<stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>]
Your intention is to collect each of the stat blocks, so you should be using results.find_all('stat'):
>>> stat_blocks = results.find_all('stat')
>>> stat_blocks
[<stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>, <stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>, <stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>]
From there, it is trivial to fix the code to condense 'pass' into a list:
>>> passes = [s['pass'] for s in stat_blocks]
>>> passes
['1', '1', '1']
Or print:
>>> for s in stat_blocks:
...     print(s['pass'])
...
1
1
1
In Python, it's really important to test results, because the typing is too dynamic to rely on memory. I often include a static test function in classes and modules to ensure that the return types and values are what I expect them to be.
Please consider this approach:
from bs4 import BeautifulSoup

with open('test.xml') as raw_results:
    results = BeautifulSoup(raw_results, 'lxml')

# Look inside every <tag> element for <stat> elements and print their 'pass' attribute
for element in results.find_all("tag"):
    for stat in element.find_all("stat"):
        print(stat['pass'])
The problem with your solution is that pass is an attribute of stat, not of tag, where you were looking for it.
This solution finds every tag, searches each one for stat elements, and reads pass from those.
For the XML file
<tag>
<stat fail="0" pass="1">TR=111111 Sandbox=3000613</stat>
<stat fail="0" pass="1">TR=121212 Sandbox=3000618</stat>
<stat fail="0" pass="1">TR=999999 Sandbox=3000617</stat>
</tag>
the script above gets the output
1
1
1
Addition
Since some details still seemed to be unclear (see comments), consider this complete example that uses BeautifulSoup to get everything you want. Storing dictionaries in a list might not be ideal if you face performance issues, but since you seem to have some trouble with Python and BeautifulSoup, I made this example as simple as possible by letting you access all relevant information by name rather than by index.
from bs4 import BeautifulSoup


# Parses a string of the form 'TR=abc123 Sandbox=abc123' and stores it in a dictionary with the following
# structure: {'TR': 'abc123', 'Sandbox': 'abc123'}. Returns this dictionary.
def parseTestID(testid):
    parts = testid.split(" ")
    return {'TR': parts[0].split("=")[1], 'Sandbox': parts[1].split("=")[1]}


# Parses the XML content of 'rawdata' and stores pass value, TR-ID and Sandbox-ID in a dictionary of the
# following form: {'Pass': pass-value, 'TR': TR-ID, 'Sandbox': Sandbox-ID}. This dictionary is appended to
# a list that is returned.
def getTestState(rawdata):
    # initialize parser
    soup = BeautifulSoup(rawdata, 'lxml')
    parsedData = []
    # parse for tags
    for tag in soup.find_all("tag"):
        # parse tags for stat
        for stat in tag.find_all("stat"):
            # split the text content once and store everything in a dictionary
            testid = parseTestID(stat.string)
            entry = {'Pass': stat['pass'], 'TR': testid['TR'], 'Sandbox': testid['Sandbox']}
            # append dictionary to list
            parsedData.append(entry)
    # return list
    return parsedData
You can use the functions above as follows, for example to print the results:
# open file
with open('test.xml') as raw_results:
    # get list of parsed data
    data = getTestState(raw_results)

# print parsed data
for element in data:
    print("TR = {0}\tSandbox = {1}\tPass = {2}".format(element['TR'], element['Sandbox'], element['Pass']))
The output looks like this
TR = 111111 Sandbox = 3000613 Pass = 1
TR = 121212 Sandbox = 3000618 Pass = 1
TR = 222222 Sandbox = 3000612 Pass = 1
TR = 232323 Sandbox = 3000618 Pass = 1
TR = 333333 Sandbox = 3000605 Pass = 1
TR = 343434 Sandbox = ZZZZZZ Pass = 1
TR = 444444 Sandbox = 3000604 Pass = 1
TR = 454545 Sandbox = 3000608 Pass = 1
TR = 545454 Sandbox = XXXXXX Pass = 1
TR = 555555 Sandbox = 3000617 Pass = 1
TR = 565656 Sandbox = 3000615 Pass = 1
TR = 626262 Sandbox = 3000602 Pass = 1
TR = 666666 Sandbox = 3000616 Pass = 1
TR = 676767 Sandbox = 3000599 Pass = 1
TR = 737373 Sandbox = 3000603 Pass = 1
TR = 777777 Sandbox = 3000611 Pass = 1
TR = 787878 Sandbox = 3000614 Pass = 1
TR = 828282 Sandbox = 3000600 Pass = 1
TR = 888888 Sandbox = 3000610 Pass = 1
TR = 999999 Sandbox = 3000617 Pass = 1
Let's summarize the core elements that are used:
Finding XML tags
To find XML tags you use soup.find("tag"), which returns the first matching tag, or soup.find_all("tag"), which finds all matching tags and returns them in a list. The individual tags can easily be accessed by iterating over the list.
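A short sketch of the difference between the two calls, using an inline string instead of a file. The sample markup and variable names are illustrative, and 'html.parser' is assumed here only so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

xml = '<tag><stat pass="1">a</stat><stat pass="0">b</stat></tag>'
soup = BeautifulSoup(xml, 'html.parser')

first = soup.find('stat')       # a single Tag, or None if nothing matches
every = soup.find_all('stat')   # a list-like ResultSet, empty if nothing matches

first_pass = first['pass']
all_passes = [s['pass'] for s in every]
missing = soup.find('nope')     # no such element in the sample
```

Because find() returns None when there is no match, it is worth checking the result before subscripting it.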
Finding nested tags
To find nested tags you can use find() or find_all() again by applying it to each tag returned by the first find_all().
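As a sketch of the nested lookup, the snippet below groups the pass values by their enclosing tag. The root wrapper element and the sample values are made up for illustration, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

xml = ('<root>'
       '<tag><stat pass="1">a</stat></tag>'
       '<tag><stat pass="0">b</stat><stat pass="1">c</stat></tag>'
       '</root>')
soup = BeautifulSoup(xml, 'html.parser')

per_tag = []
for tag in soup.find_all('tag'):
    # Calling find_all on a Tag only searches that tag's descendants.
    per_tag.append([stat['pass'] for stat in tag.find_all('stat')])
```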
Accessing the content of a tag
To access the content of a tag you use its string attribute. For example, if tag = <tag>I love Soup!</tag>, then tag.string == "I love Soup!".
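One caveat worth sketching: string only yields text when the tag has a single child. For a tag with several children it is None, and get_text() is the usual alternative. The box element below is a made-up example, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<tag>I love Soup!</tag><box><a>x</a><b>y</b></box>', 'html.parser')

single = soup.find('tag').string       # the text, because <tag> has exactly one child
mixed = soup.find('box').string        # None, because <box> has two children
joined = soup.find('box').get_text()   # concatenated text of all descendants
```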
Finding values of attributes
To get the values of attributes you can use the subscript notation. For example, if tag = <tag color="red">I love Soup!</tag>, then tag['color'] == "red".
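A brief sketch of attribute access, including what happens when the attribute is absent. The size attribute is hypothetical and deliberately missing from the sample; 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<tag color="red">I love Soup!</tag>', 'html.parser')
tag = soup.find('tag')

color = tag['color']                    # subscripting raises KeyError if the attribute is absent
size = tag.get('size')                  # get() returns None instead of raising
fallback = tag.get('size', 'unknown')   # or a default of your choice
attrs = dict(tag.attrs)                 # all attributes of the tag as a plain dict
```

Using get() is the safer choice when not every element is guaranteed to carry the attribute.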
For parsing strings of the form "TR=abc123 Sandbox=abc123" I used ordinary Python string splitting. You can read more about it here: How can I split and parse a string in Python?
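As an alternative sketch of that splitting, the hypothetical helper parse_test_id below builds the dictionary from whatever key=value pairs are present, so it does not depend on the order of the fields. It assumes the fields are separated by whitespace and that each one contains an equals sign.

```python
def parse_test_id(test_id):
    """Turn a string like 'TR=111111 Sandbox=3000613' into a dict of its key=value pairs."""
    # split() without arguments splits on any run of whitespace;
    # split('=', 1) splits each field at its first '=' only.
    pairs = (field.split('=', 1) for field in test_id.split())
    return {key: value for key, value in pairs}


parsed = parse_test_id('TR=111111 Sandbox=3000613')
```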