Android Web Scraping with a Headless Browser
Ok after 2 weeks I admit defeat and are using a workaround which works great for me at the moment.
The problem:
It is too difficult to port HTMLUnit to Android (or at least with my level of expertise). I am sure its a worthwhile project (and not that time consuming for experienced java programmer) . I emailed the guys at HTMLUnit and they commented that they are not looking into a port or what effort will be involved but suggested anyone who wants to start with such a project should send an message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).
The workaround:
I used android's built in WebView and overrided the onPageFinished method of Webview class to inject Javascript that grabs all the html after the page has fully loaded. Webview can also be used to called futher javascript actions, clicking buttons, filling in forms etc.
Code:
webView.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
webView.addJavascriptInterface(jInterface, "HtmlViewer");
webView.setWebViewClient(new WebViewClient() {
@Override
public void onPageFinished(WebView view, String url) {
//Load HTML
webView.loadUrl("javascript:window.HtmlViewer.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
}
}
webView.loadUrl(StartURL);
ParseHtml(jInterface.html);
public class MyJavaScriptInterface {
public String html;
@JavascriptInterface
public void showHTML(String _html) {
html = _html;
}
}