Getting information with web scraping from a multi-screen web page
RSelenium Code Example
Here is a self-contained code example, using the web-site referenced in the question.
Observation: Please do not run this code.
Why? Having 1k Stack users hit the web-site would amount to a DDoS attack.
## Prerequisites
The code below will install RSelenium. Before running it, you need to:
- Install Firefox
- Add the Selenium IDE Plugin
- https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
- Install RStudio [Recommendation]
- Create a project and open the code file below
The code below will take you from the second page [http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul] through to the final page that holds the information you are interested in.
Useful References:
If you are interested in using RSelenium, I strongly recommend reading the following references. Thanks go to John Harrison for developing the RSelenium package.
- RSelenium Basics
http://rpubs.com/johndharrison/12843
- RSelenium Headless Browsing
http://rpubs.com/johndharrison/RSelenium-headless
- RSelenium Vignette
https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
Code Example
# We want to make this as easy as possible to use
# So we need to install required packages for the user...
#
if (!require(RSelenium)) install.packages("RSelenium")
if (!require(XML)) install.packages("XML")
if (!require(RJSONIO)) install.packages("RJSONIO")
if (!require(stringr)) install.packages("stringr")
# Data
#
mainPage <- "http://appscvs.supercias.gob.ec/portalInformacion/sector_societario.zul"
businessPage <- "http://appscvs.supercias.gob.ec/portaldeinformacion/consulta_cia_param.zul"
# StartServer
# We assume RSelenium is not set up, so we check if the RSelenium
# server is available; if not, we install the RSelenium server.
checkForServer()
# OK. now we start the server
RSelenium::startServer()
remDr <- RSelenium::remoteDriver$new()
# We assume the user has installed Firefox and the Selenium IDE
# https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
#
# OK, we open Firefox
remDr$open(silent = T) # Open up a firefox window...
# Now we navigate the browser to the required URL...
# This is the page that matters...
remDr$navigate(businessPage)
# First things first: on the first page, let's get the elements for the radio button,
# the name element, and the search button. We need all three.
#
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
nameElement <- remDr$findElements(using = 'css selector', ".z-combobox-inp")
searchButton <- remDr$findElements(using = 'css selector', ".z-button-cm")
# Optional: we can highlight the radio elements returned
# lapply(radioButton, function(x){x$highlightElement()})
# Optional: we can highlight the nameElement returned
# lapply(nameElement, function(x){x$highlightElement()})
# Optional: we can highlight the searchButton returned
# lapply(searchButton, function(x){x$highlightElement()})
# Now we can select and press the third radio button
radioButton[[3]]$clickElement()
# We fill in the required name...
nameElement[[1]]$sendKeysToElement(list("PROAÑO & ASOCIADOS CIA. LTDA."))
# This is subtle but required: the page triggers a drop-down list, so rather than
# hitting the searchButton straight away, we first select and click the entry
# in the drop-down menu...
selectElement <- remDr$findElements(using = 'css selector', ".z-comboitem-text")
selectElement[[1]]$clickElement()
# OK, now we can click the search button, which will cause the next page to open
searchButton[[1]]$clickElement()
# New Page opens...
#
# Ok, so now we first pull the list of buttons...
finPageButton <- remDr$findElements(using = 'class name', "m_iconos")
# Now we can press the required button to open the page we want to get to...
finPageButton[[9]]$clickElement()
# We are now on the required page.
We are now on the target page [see image].
Extracting the table values
The next step is to extract the table values. To do this, we pull the elements matching the .z-listitem CSS selector and check whether we can see the lines of data. We can, so we extract the returned values and populate either a list or a data frame.
# Ok, now we need to extract the table, we identify and pull out the
# '.z-listitem' and assign to modalWindow
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
# Now we can extract the lines from modalWindow... Each row is returned as a
# single block of text, so we split it into three fields based on the
# line marker '\n'
lineText <- str_split(modalWindow[[1]]$getElementText()[1], '\n')
lineText
Here is the result:
> lineText <- stringr::str_split(modalWindow[[1]]$getElementText()[1], '\n')
> lineText
[[1]]
[1] "10"
[2] "OPERACIONES DE INGRESO CON PARTES RELACIONADAS EN PARAÍSOS FISCALES, JURISDICCIONES DE MENOR IMPOSICIÓN Y REGÍMENES FISCALES PREFERENTES"
[3] "0.00"
Dealing with Hidden Data.
The Selenium WebDriver, and thus RSelenium, only interacts with the visible elements of a web page. If we try to read the entire table, we only get back the table items that are currently visible (unhidden).
We can work around this by scrolling to the bottom of the table: the scroll action forces the table to populate all of its rows, and we can then extract the complete table.
# Select the .z-listbox-body
modalWindow <- remDr$findElements(using = 'css selector', ".z-listbox-body")
# Now we tell the window we want to scroll to the bottom of the table
# This triggers the table to populate all the rows
modalWindow[[1]]$executeScript("window.scrollTo(0, document.body.scrollHeight)")
# Now we can extract the complete table
modalWindow <- remDr$findElements(using = 'css selector', ".z-listitem")
lineText <- stringr::str_split(modalWindow[[9]]$getElementText(), '\n')
lineText
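If you then want to populate a list or data frame, as mentioned above, a minimal sketch along the following lines should work. Note that the column names are my own assumed labels based on the three fields shown earlier, not names taken from the page.
# Sketch: collect every visible row, split each on the newline marker '\n',
# and bind the pieces into a data frame (column names are assumed labels)
rows <- lapply(modalWindow, function(x) {
  stringr::str_split(x$getElementText()[[1]], '\n')[[1]]
})
rows <- Filter(function(x) length(x) == 3, rows)  # keep complete three-field rows only
tableDf <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(tableDf) <- c("codigo", "descripcion", "valor")  # assumed column names
tableDf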
### What the code does
The code example above is meant to be self-contained. By that I mean it should install everything you need, including the required packages. Once the dependent R packages are installed, the R code calls checkForServer(); if the Selenium server is not installed, this call will install it, which may take some time.
My recommendation is to step through the code, as I have not incorporated any delays (in production you would want to). Note also that I have not optimised for speed but rather for a modicum of clarity [from my perspective]...
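If you do run it end-to-end, a minimal sketch of the kind of delays you might add is shown below. The 2-second pauses and the 10-second implicit-wait timeout are my own placeholder values, not part of the original example; tune them for your connection.
# Sketch: adding waits between steps (the timeout and sleep values are assumptions)
remDr$setImplicitWaitTimeout(milliseconds = 10000)  # wait up to 10s when locating elements
remDr$navigate(businessPage)
Sys.sleep(2)  # give the ZK page time to render
radioButton <- remDr$findElements(using = 'css selector', ".z-radio-cnt")
radioButton[[3]]$clickElement()
Sys.sleep(2)  # pause again before the next interaction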
The code was shown to work on:
- Mac OS X 10.11.5
- RStudio 0.99.893
- R version 3.2.4 (2016-03-10) -- "Very Secure Dishes"
Check out RSelenium
First, install RSelenium and use the vignette linked above to get familiar with the basics.
Then see this webinar on using RSelenium, which goes through some detailed scraping step-by-step and is quite easy to follow: http://johndharrison.blogspot.hk/2014/05/orange-county-r-users-group-oc-rug.html