Extract content of <script> with BeautifulSoup

extract() removes the tag from the DOM, which is why you get an empty list afterwards.
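
A minimal sketch of that effect, using a made-up HTML snippet rather than the Reuters page (the markup and variable names are assumptions for illustration only). Once extract() has been called on every <script>, a later find_all('script') has nothing left to find:

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only to illustrate the behaviour.
html = ('<html><head><script type="application/ld+json">{"a": 1}</script>'
        '</head><body><p>hi</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches each tag from the tree and returns the detached tag.
removed = [tag.extract() for tag in soup.find_all('script')]

# Searching the tree again finds no <script>, because they are gone from it.
remaining = soup.find_all('script')
print(len(removed))
print(len(remaining))
```

The detached tags in removed are still usable objects; they are just no longer part of soup.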

Find the script tag with the type="application/ld+json" attribute and decode its content with json.loads. You can then access the data as an ordinary Python data structure (a dict for the given data).

import json
import urllib2

from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
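# Download the page (Python 2: urllib2 does not exist in Python 3).
html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html, 'html.parser')

# The JSON-LD metadata lives in a <script type="application/ld+json"> tag.
data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']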

From the documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

So basically the accepted answer from falsetru above is all good, but use .string instead of .text with newer versions of Beautiful Soup, or you'll be puzzled, as I was, by .text always coming back empty for <script> tags.
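
To make the .string suggestion concrete, here is a small sketch that needs no network access. The HTML fragment is invented for illustration; only the JSON keys mirror the ones used in this thread:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page fragment, not the real Reuters markup.
html = '''
<html><head>
<script type="application/ld+json">
{"video": {"transcript": "example transcript"}}
</script>
</head><body></body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', type='application/ld+json')

# .string returns the single text node inside the tag, independent of
# what a given Beautiful Soup version counts as "text" for .text.
data = json.loads(script.string)
print(data['video']['transcript'])
```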

Thanks for the inspiration. I spent hours trying to work out how to do this. Note that urllib2 does not exist in Python 3 (it was split into urllib.request and urllib.error), so I used the requests library instead. Here is the updated version. Enjoy ;)

import json
import requests
from bs4 import BeautifulSoup

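url = input('Enter url: ')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Use .string rather than .text: from Beautiful Soup 4.9.0 onward,
# <script> contents may not be counted as text.
data = json.loads(soup.find('script', type='application/ld+json').string)
print(data['articleBody'])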
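
One caveat with the snippets in this thread: find returns None when the page has no matching tag, and a page may contain several JSON-LD blocks. A more defensive variant might look like the sketch below. The helper name extract_json_ld and the sample markup are hypothetical, not part of any library:

```python
import json
from bs4 import BeautifulSoup


def extract_json_ld(html):
    """Return the parsed JSON-LD blocks found in html (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    blocks = []
    for script in soup.find_all('script', type='application/ld+json'):
        if not script.string:
            # Empty tag: nothing to parse.
            continue
        try:
            blocks.append(json.loads(script.string))
        except ValueError:
            # Malformed JSON (json.JSONDecodeError subclasses ValueError).
            continue
    return blocks


# Invented sample: one valid block and one empty block.
sample = ('<script type="application/ld+json">{"articleBody": "example body"}</script>'
          '<script type="application/ld+json"></script>')
print(extract_json_ld(sample))
print(extract_json_ld('<p>no script here</p>'))
```

Returning a list means the caller can check for an empty result instead of catching an AttributeError.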

Tags:

Python

Python 2.7

Beautifulsoup
