How to get the JavaScript-rendered HTML source code using Selenium

You will need to get the document via JavaScript. You can use Selenium's execute_script function:

from time import sleep  # this should go at the top of the file

sleep(5)  # give the page's JavaScript time to run
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(html)

That returns everything inside the <html> tag.

That workaround isn't necessary. You can use this instead:

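from selenium import webdriver

# Recent Selenium 4 releases dropped PhantomJS and find_element_by_*; there, use e.g. webdriver.Chrome() and find_element(By.TAG_NAME, 'html').
driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')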

I suspect you are getting the source code before the JavaScript has rendered the dynamic HTML.

As a first test, try adding a sleep of a few seconds between navigating to the page and getting the page source.

If this works, you can then switch to a more reliable wait strategy.
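
One possible "different wait strategy" is to poll until a specific element exists instead of sleeping for a fixed time. The sketch below is only an illustration: the helper name wait_for_rendered_html, its parameters, and the choice of a CSS selector as the readiness signal are assumptions, not part of the answers here. It relies only on execute_script, which Selenium drivers provide.

```python
import time


def wait_for_rendered_html(driver, css_selector, timeout=10.0, poll_interval=0.5):
    """Poll the page until css_selector matches, then return the rendered HTML.

    `driver` is any object exposing Selenium's execute_script(script, *args).
    Raises TimeoutError if the element never appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Ask the browser whether the element we are waiting for exists yet.
        found = driver.execute_script(
            "return document.querySelector(arguments[0]) !== null;", css_selector
        )
        if found:
            # The element is present, so read the DOM as it stands now.
            return driver.execute_script("return document.documentElement.outerHTML;")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{css_selector!r} did not appear within {timeout}s")
        time.sleep(poll_interval)
```

After driver.get(url), you would call something like wait_for_rendered_html(driver, "span.short-view-count"). Selenium also ships its own WebDriverWait class with expected_conditions for the same purpose, which is probably the better choice in real code.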

I had the same problem getting JavaScript-rendered source code from a web page, and I solved it using Victory's suggestion above.

*First: execute_script

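from time import sleep
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(urls)  # urls holds the page address (see the full code below)
sleep(1)  # wait one second so the JavaScript can render before reading the DOM
innerHTML = driver.execute_script("return document.body.innerHTML")
# print(driver.page_source)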

*Second: parse the HTML using BeautifulSoup (you can install it with pip: pip install beautifulsoup4 lxml)

import bs4  # BeautifulSoup

# The "lxml" parser requires the lxml package to be installed.
root = bs4.BeautifulSoup(innerHTML, "lxml")  # parse the HTML with BeautifulSoup
# Find the elements you need.
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

*Third: print out the value you need

for span in viewcount:
    print(span.string)
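
The parsing step can be tried without a browser. The sketch below uses made-up markup standing in for the rendered innerHTML (the sample HTML and the helper name extract_view_counts are assumptions for illustration). It also uses the built-in "html.parser" instead of "lxml" so that no extra parser is needed. Passing the full three-class string to find_all matches only elements whose class attribute is exactly that string, whereas a CSS selector on a single class also matches elements that list the classes in another order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered innerHTML.
inner_html = """
<div>
  <span class="short-view-count style-scope yt-view-count-renderer">1.2M views</span>
  <span class="style-scope short-view-count">845K views</span>
  <span class="other">ignored</span>
</div>
"""


def extract_view_counts(html):
    """Return the text of every <span> that carries the short-view-count class."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.select("span.short-view-count")]


view_counts = extract_view_counts(inner_html)

# Exact-string matching on the whole class attribute finds only the first span.
exact_matches = BeautifulSoup(inner_html, "html.parser").find_all(
    "span", attrs={"class": "short-view-count style-scope yt-view-count-renderer"}
)

for text in view_counts:
    print(text)
```

If the site changes or reorders its class list, the selector-based version is less likely to silently return nothing.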

*Full code

from time import sleep

import bs4
from selenium import webdriver

urls = "http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"

# PhantomJS support was removed in Selenium 4; use webdriver.Chrome() there instead.
driver = webdriver.PhantomJS()
## driver = webdriver.Chrome()
driver.get(urls)

sleep(1)  # give the JavaScript time to render before reading the DOM
innerHTML = driver.execute_script("return document.body.innerHTML")
## print(driver.page_source)

# Parse the rendered HTML (the "lxml" parser requires the lxml package to be installed).
root = bs4.BeautifulSoup(innerHTML, "lxml")
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

for span in viewcount:
    print(span.string)

driver.quit()

Tags: Python, JavaScript, Selenium
