Screen scraping pages that use CSS for layout and formatting: how do I scrape the CSS that applies to the HTML?
Today I needed to scrape Facebook share dialogs for use as dynamic preview samples in our app builder for Facebook apps. I took the Firebug 1.5 codebase and added a new context menu option, "Copy HTML with inlined styles". I copied its getElementHTML function from lib.js and modified it to do the following:
- remove class, id and style attributes
- remove onclick and similar javascript handlers
- remove all data-something attributes
- remove explicit hrefs and replace them with "#"
- replace all block-level elements with div and inline elements with span (to prevent them from inheriting styles on the target page)
- absolutize relative urls
- inline all applied non-default CSS attributes into a brand-new style attribute
- reduce inline style bloat by accounting for parent/child style inheritance while traversing up the DOM tree
- indent output
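To make the attribute-level steps above concrete, here is a minimal sketch in Python rather than the Firebug JavaScript the answer actually used. It only covers stripping attributes, neutralizing hrefs, renaming tags, and absolutizing URLs; style inlining and indentation need a live DOM and are left out. The names MarkupCleaner, clean_markup, INLINE_TAGS and KEPT_TAGS are hypothetical, and keeping a, img and br unrenamed is an assumption of this sketch, not something the answer states.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a deliberately short list of inline tags; anything else is treated as block-level.
INLINE_TAGS = {"abbr", "b", "em", "i", "label", "small", "span", "strong"}
VOID_TAGS = {"img", "br"}
# Assumption: these tags are kept as-is so that href="#" and src still mean something.
KEPT_TAGS = {"a"} | VOID_TAGS
DROPPED_ATTRS = {"class", "id", "style"}


def map_tag(tag):
    """Rename block-level tags to div and inline tags to span."""
    if tag in KEPT_TAGS:
        return tag
    return "span" if tag in INLINE_TAGS else "div"


class MarkupCleaner(HTMLParser):
    """Re-emit HTML with the attribute cleanup steps applied."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [map_tag(tag)]
        for name, value in attrs:
            # Drop class/id/style, on* event handlers, and data-* attributes.
            if name in DROPPED_ATTRS or name.startswith(("on", "data-")):
                continue
            if name == "href":
                value = "#"  # neutralize explicit links
            elif name == "src" and value is not None:
                value = urljoin(self.base_url, value)  # absolutize relative URLs
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.out.append(f"</{map_tag(tag)}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape the text.
        self.out.append(escape(data, quote=False))


def clean_markup(html, base_url):
    cleaner = MarkupCleaner(base_url)
    cleaner.feed(html)
    cleaner.close()  # flush any buffered trailing text
    return "".join(cleaner.out)


sample = '<p class="x" onclick="go()" data-id="7"><a href="/profile">Me</a> <img src="pic.png"></p>'
print(clean_markup(sample, "https://example.com/app/"))
```

Because class, id and style are dropped here, this cleanup only makes sense after the computed styles have been captured from the live page.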
It works well for simpler pages, but the solution is not 100% robust because of bugs in Firebug (or Firefox?). But it is definitely usable when operated by a web developer who can debug and fix all quirks.
Problems I've found so far:
- sometimes the CSS clear property is not emitted (which breaks the layout pretty badly)
- :hover and other pseudo-classes cannot be captured this way
- Firefox keeps only the CSS properties and values it understands in its model, so for example you lose -webkit-border-radius, because the CSS parser skipped it
Anyway, this solution saved me a lot of time. Originally I was picking pieces out of their stylesheets and post-processing them by hand. It was slow, boring, and polluted our class namespace. Now I'm able to scrape Facebook markup in minutes instead of hours, and the exported markup does not interfere with the rest of the page.
A good start would be the following: make a pass through the patch of HTML you plan to extract, collecting each element (and its ID, classes, and inline styles) into an array. Grab the styles for those element IDs and classes from the page's stylesheets immediately.
Then, from the outermost element(s) in the target patch, work your way up through the rest of the elements in the DOM in a similar fashion, eventually all the way up to the body and HTML elements, comparing against your initial array and collecting any styles that weren't declared within the target patch or its applied styles.
You'll also want to check for any universal-selector (*) declarations and grab those as well. Then, when you reapply the styles to your eventual output, make sure you do so in the right order: you collected them low-to-high in the DOM hierarchy, so they need to be reapplied high-to-low.
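The ordering point can be sketched as follows. This is an illustration only: it assumes the rules have already been collected as (selector, declarations) pairs, target first and root last, and the function name build_stylesheet and the sample selectors are hypothetical. Emitting the rules in reverse means that, between rules of equal specificity, the one closest to the target comes last and wins; specificity itself is not handled here.

```python
def build_stylesheet(collected):
    """Emit CSS rules root-first from a list collected target-first.

    collected: list of (selector, {property: value}) pairs, ordered
    from the target patch outward to body/html and "*".
    """
    rules = []
    for selector, declarations in reversed(collected):  # high-to-low in the DOM
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        rules.append(f"{selector} {{ {body} }}")
    return "\n".join(rules)


# Hypothetical data, in the low-to-high order it would have been collected.
collected = [
    (".title", {"color": "red"}),
    ("body", {"color": "black", "font-size": "12px"}),
    ("*", {"margin": "0"}),
]
print(build_stylesheet(collected))
```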
A quick hack would be to pull down their CSS file and apply it to the page you are using to display the data. To avoid any interference, you could load the page into an iframe wherever you need to display it. Of course, I have to question the intention of this code. Are you allowed to republish the information you are scraping?
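One way to picture the iframe isolation is the sketch below, which wraps a scraped fragment and the downloaded CSS in a standalone document and embeds it through the iframe srcdoc attribute. The function name isolate_in_iframe is hypothetical, and using srcdoc with sandbox (rather than pointing the iframe at a separate URL) is an assumption of this sketch. It also assumes the CSS text does not itself contain a closing style tag.

```python
from html import escape


def isolate_in_iframe(fragment_html, css_text):
    """Return an iframe element whose srcdoc holds the fragment plus its CSS."""
    document = (
        "<!DOCTYPE html><html><head><style>"
        + css_text
        + "</style></head><body>"
        + fragment_html
        + "</body></html>"
    )
    # The whole document is attribute-escaped so it can sit inside srcdoc="...".
    return f'<iframe sandbox srcdoc="{escape(document, quote=True)}"></iframe>'


print(isolate_in_iframe('<div class="box">Hi</div>', ".box { color: red; }"))
```

Styles inside the iframe cannot leak into the host page, and the host page's styles do not reach the fragment.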