Spider | Crawling JS-rendered pages with HtmlUnit

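Requirement:
Some websites render their pages with JavaScript, so the crawler must be able to collect JS-rendered pages: the raw HTML returned by a plain HTTP request does not yet contain the content.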
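To make the requirement concrete, here is a minimal sketch of a heuristic for deciding whether a page needs JS rendering at all. The class name `JsRenderCheck`, the method `needsJsRendering`, and the sample HTML strings are all hypothetical and not part of the original post. The heuristic is an assumption: if the raw HTML contains a script tag but not a piece of text you expect to see on the page, the content is probably filled in by JavaScript.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class JsRenderCheck {
    private JsRenderCheck() {}

    /**
     * Heuristic: returns true when the raw (un-rendered) HTML contains a
     * script tag but does not contain the text expected on the final page.
     */
    static boolean needsJsRendering(String rawHtml, String expectedText) {
        boolean hasScript = rawHtml.toLowerCase(Locale.ROOT).contains("<script");
        boolean hasText = rawHtml.contains(expectedText);
        return hasScript && !hasText;
    }

    public static void main(String[] args) {
        // An empty "shell" page whose content is filled in by a script.
        String shell = "<html><body><div id=\"app\"></div>"
                + "<script src=\"app.js\"></script></body></html>";
        // A page whose content is already present in the raw HTML.
        String staticPage = "<html><body><p>Hello</p></body></html>";

        System.out.println(needsJsRendering(shell, "Hello"));      // true
        System.out.println(needsJsRendering(staticPage, "Hello")); // false
    }
}
```

Pages that pass this check can be fetched with an ordinary HTTP client; only the rest need a JS-capable fetcher, which is slower.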
Implementation:
Based on HtmlUnit:

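import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void getAjaxPage() throws Exception {
    WebClient webClient = new WebClient();
    // Execute the page's JavaScript; CSS is not needed for scraping.
    webClient.setJavaScriptEnabled(true);
    webClient.setCssEnabled(false);
    // Re-synchronize AJAX calls so their results are in the page when it is read.
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    // Integer.MAX_VALUE effectively disables the timeout; use a smaller value in production.
    webClient.setTimeout(Integer.MAX_VALUE);
    // Do not abort on script errors in the page.
    webClient.setThrowExceptionOnScriptError(false);
    HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");
    // Print the DOM after JavaScript has run.
    System.out.println(rootPage.asXml());
    // Release the windows and resources held by the client.
    webClient.closeAllWindows();
}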

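Maven dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-core-js</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>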

Notes:
Nutch plugin: nutch-htmlunit can be used to replace Nutch's built-in HTTP fetch component.
