Open webpage and parse it using JavaScript

Whatever Origin is an open source library that lets you scrape pages using pure JavaScript. It also works around the same-origin policy. http://www.whateverorigin.org/

$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('http://google.com') + '&callback=?', function(data){
    alert(data.contents);
});

You can use an XMLHttpRequest object to do this. Here's a simple example:

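var req = new XMLHttpRequest();
req.open('GET', 'http://www.mydomain.com/', false); // false = synchronous request
req.send(null);
if (req.status == 200)
   console.log(req.responseText); // dump() is Mozilla-specific; console.log is widely available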
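
The synchronous call above blocks the page until the response arrives. An asynchronous variant might look like the sketch below. The name fetchPage and the error-first callback shape are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: fetch a URL asynchronously and hand the result to a callback.
// The callback receives (error, responseText); error is null on success.
function fetchPage(url, callback) {
  var req = new XMLHttpRequest();
  req.open('GET', url, true); // true = asynchronous request
  req.onreadystatechange = function () {
    if (req.readyState !== 4) return; // 4 = DONE, ignore intermediate states
    if (req.status === 200) {
      callback(null, req.responseText);
    } else {
      callback(new Error('Request failed with status ' + req.status));
    }
  };
  req.send(null);
}
```

It could then be called as fetchPage('http://www.mydomain.com/', function (err, html) { ... }), with the parsing done inside the callback.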

Once loaded, you can parse or scrape the page by running JavaScript regular expressions against req.responseText.
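
As a rough illustration of regex-based scraping, the sketch below pulls link targets and the page title out of an HTML string such as req.responseText. The function names extractLinks and extractTitle are hypothetical, and the patterns only cover simple, well-formed markup with quoted attributes.

```javascript
// Hypothetical helper: collect the href values of anchor tags in an HTML string.
function extractLinks(html) {
  var links = [];
  var pattern = /<a\s[^>]*href\s*=\s*["']([^"']*)["']/gi;
  var match;
  // With the g flag, exec() resumes after the previous match on each call.
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Hypothetical helper: return the text of the first <title> element, or null if absent.
function extractTitle(html) {
  var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}
```

Regular expressions are fine for quick extraction like this, but they get brittle on messy real-world HTML, so treat the patterns as a starting point.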

More detail...

In practice you need to do a little more to get the XMLHttpRequest object in a cross-browser manner, e.g.:

var req; // declare explicitly to avoid creating an implicit global
var ua = navigator.userAgent.toLowerCase();
if (!window.ActiveXObject)
  req = new XMLHttpRequest();
else if (ua.indexOf('msie 5') == -1)
  req = new ActiveXObject("Msxml2.XMLHTTP");
else
  req = new ActiveXObject("Microsoft.XMLHTTP");

Or use a library...

Alternatively, you can save yourself all the bother and just use a library like jQuery or Prototype to take care of this for you.

Same-origin policy may bite you though...

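Note that due to the same-origin policy, the page you request must come from the same origin (scheme, host and port) as the page making the request. If you want to request a remote page, you will have to proxy it through a server-side script, unless the remote server explicitly allows cross-origin requests by sending CORS headers.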
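
To make the rule concrete: two URLs share an origin only when their scheme, host and port all match. The sketch below checks that with the standard URL class. The function name isSameOrigin is hypothetical and only approximates what the browser enforces.

```javascript
// Hypothetical helper: report whether requestUrl has the same origin as pageUrl.
function isSameOrigin(pageUrl, requestUrl) {
  var page = new URL(pageUrl);
  var target = new URL(requestUrl, pageUrl); // relative URLs resolve against the page
  return page.protocol === target.protocol && // scheme, e.g. "http:"
         page.hostname === target.hostname && // host, e.g. "www.mydomain.com"
         page.port === target.port;           // empty string when the default port is used
}
```

For example, a page at http://www.mydomain.com/ can request /data.html, but not https://www.mydomain.com/ (different scheme) or http://google.com/ (different host).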

Another possible workaround is to use Flash to make the request, which does allow cross-domain requests if the target site grants permission with a suitably configured crossdomain.xml file.

Here's a nice article on the subject of the same-origin policy:

  • Same-Origin Policy Part 1: Why we’re stuck with things like XSS and XSRF/CSRF