BeautifulSoup: extract text from anchor tag
I would suggest going the lxml route and using xpath.
from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
This will help:
from bs4 import BeautifulSoup
data = '''<div class="image">
<a href="http://www.example.com/eg1">Content1<img
src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img
src="http://image.example.com/img2.jpg" /> </a>
</div>'''
soup = BeautifulSoup(data)
for div in soup.findAll('div', attrs={'class':'image'}):
print(div.find('a')['href'])
print(div.find('a').contents[0])
print(div.find('img')['src'])
If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.
In my case, it worked like that:
from BeautifulSoup import BeautifulSoup as bs
url="http://blabla.com"
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
print link.string
Hope it helps!