Extract content of <script> with BeautifulSoup
extract() removes the tag from the DOM, which is why you get an empty list.
Find the script tag with the type="application/ld+json" attribute and decode its contents with json.loads. You can then access the data as an ordinary Python data structure (a dict for the given data).
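# Python 2 only: urllib2 and the print statement do not exist in Python 3
import json
import urllib2
from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl = urllib2.urlopen(URL).read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(oururl, 'html.parser')
# The JSON-LD payload is the body of the matching <script> tag
data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']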
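To see both points of this answer without hitting the network, here is a minimal Python 3 sketch. The HTML string is a made-up stand-in for the real page (only the video/transcript keys mirror the answer above), so treat it as an illustration rather than the actual Reuters markup.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for the real HTML.
HTML = """
<html><head>
<script type="application/ld+json">
{"video": {"transcript": "Sample transcript text."}}
</script>
</head><body><p>Visible text</p></body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Locate the JSON-LD script by its type attribute and decode its body.
tag = soup.find("script", type="application/ld+json")
data = json.loads(tag.string)
transcript = data["video"]["transcript"]
print(transcript)

# extract() detaches the tag from the tree, so a later search finds nothing.
tag.extract()
remaining = soup.find_all("script")
print(remaining)
```

Reading the tag before calling extract() (or not calling extract() at all) avoids the empty result.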
From the documentation:
As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.
So the accepted answer from falsetru above is fine, but with newer versions of Beautiful Soup use .string instead of .text, or you'll be puzzled, as I was, by .text returning an empty string for <script> tags.
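A small sketch of the difference, using a made-up HTML fragment. What get_text() returns for the enclosing element depends on the installed Beautiful Soup version and parser, so the code prints it rather than assuming a value; .string on the script tag itself is the part to rely on.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one JSON-LD script next to some visible text.
HTML = ('<div><script type="application/ld+json">{"a": 1}</script>'
        '<p>Hello</p></div>')
soup = BeautifulSoup(HTML, "html.parser")

# .string returns the single string child of the <script> tag.
script = soup.find("script", type="application/ld+json")
raw = script.string
print(repr(raw))

# get_text() on the surrounding element collects "human-visible" text;
# on 4.9.0 and later it is documented to leave script contents out.
visible = soup.div.get_text()
print(repr(visible))
```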
Thanks for the inspiration. I spent hours trying to work out how to do this. Note that urllib2 does not exist in Python 3 (its functionality moved to urllib.request), so this updated version uses the requests library instead. Enjoy ;)
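import json
import requests
from bs4 import BeautifulSoup

url = input('Enter url:')
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# .string returns the raw contents of the <script> tag
data = json.loads(soup.find('script', type='application/ld+json').string)
# 'articleBody' is present only if the page's JSON-LD defines it
print(data['articleBody'])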
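If you would rather avoid the extra dependency, the standard library's urllib.request also works in Python 3. The sketch below additionally guards against pages with no JSON-LD tag, several of them, or invalid JSON. The helper names extract_json_ld and fetch_json_ld and the sample HTML are my own, not part of any library.

```python
import json
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_json_ld(html):
    """Return every decodable JSON-LD payload found in html as a list."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        if not tag.string:  # empty tag, nothing to decode
            continue
        try:
            results.append(json.loads(tag.string))
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
    return results


def fetch_json_ld(url):
    """Download url with the standard library and parse its JSON-LD."""
    with urlopen(url) as response:
        return extract_json_ld(response.read())


# Hypothetical sample: one valid payload and one broken one.
SAMPLE = ('<script type="application/ld+json">'
          '{"articleBody": "Example body."}</script>'
          '<script type="application/ld+json">not json</script>')
payloads = extract_json_ld(SAMPLE)
print(payloads[0]["articleBody"])
```

Call fetch_json_ld(url) with a real address to run it against a live page; an empty list means the page had no usable JSON-LD.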