Using the Spring Boot + WebMagic + MyBatis Crawler Framework
Contents
- 1. Add Maven dependencies
- 2. Project configuration file: application.properties
- 3. Database table structure
- 4. Entity class
- 5. Mapper interface
- 6. The CrawlerMapper.xml file
- 7. The Zhihu page processor: ZhihuPageProcessor
- 8. The Zhihu data pipeline: ZhihuPipeline
- 9. The Zhihu crawler task: ZhihuTask
- 10. The Spring Boot application entry point
This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database. The source code provided here can serve as scaffolding for a Java crawler project.
1. Add Maven dependencies
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <skipTests>true</skipTests>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>
```
2. Project configuration file: application.properties
Configures the MySQL data source, the Druid connection pool, and the location of the MyBatis mapper XML files.
```properties
# MySQL data source
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis
mybatis.mapperLocations=classpath:mapper/**/*.xml
```
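A misconfigured JDBC URL or pool setting often surfaces only when the first statement runs, so a startup smoke check can save debugging time. Below is a minimal sketch of such a check; it is not part of the original project and the class name is illustrative. It relies on the JdbcTemplate that Spring Boot auto-configures once mybatis-spring-boot-starter puts spring-jdbc on the classpath.

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

// Hypothetical startup check, not part of the original project:
// logs the MySQL version once the Druid pool hands out a connection.
@Component
public class DataSourceSmokeCheck implements CommandLineRunner {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Override
    public void run(String... args) {
        String version = jdbcTemplate.queryForObject("SELECT VERSION()", String.class);
        System.out.println("Connected to MySQL " + version);
    }
}
```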
3. Database table structure
```sql
CREATE TABLE `cms_content` (
  `contentId`   varchar(40)  NOT NULL COMMENT 'Content ID',
  `title`       varchar(150) NOT NULL COMMENT 'Title',
  `content`     longtext              COMMENT 'Article body',
  `releaseDate` datetime     NOT NULL COMMENT 'Release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';
```
4. Entity class
```java
import java.util.Date;

public class CmsContentPO {

    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}
```
5. Mapper interface
```java
public interface CrawlerMapper {

    int addCmsContent(CmsContentPO record);
}
```
6. The CrawlerMapper.xml file
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<!-- The namespace must match the fully qualified name of the CrawlerMapper interface -->
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent">
        insert into cms_content (contentId, title, releaseDate, content)
        values (#{contentId,jdbcType=VARCHAR}, #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP}, #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>
```
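As an aside, for a single statement like this one, MyBatis's annotation support could replace the XML file entirely. A hedged sketch of the equivalent annotated interface (a design alternative, not what this project uses):

```java
import org.apache.ibatis.annotations.Insert;

public interface CrawlerMapper {

    // Same statement as the XML mapping; the jdbcType hints are optional here
    @Insert("insert into cms_content (contentId, title, releaseDate, content) "
            + "values (#{contentId}, #{title}, #{releaseDate}, #{content})")
    int addCmsContent(CmsContentPO record);
}
```

The XML variant keeps SQL out of the Java source, which pays off once statements grow past one line.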
7. The Zhihu page processor: ZhihuPageProcessor
Parses the Zhihu HTML pages fetched by the crawler.
```java
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class ZhihuPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every answer page linked from the current page
        page.addTargetRequests(page.getHtml().links()
                .regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        page.putField("title", page.getHtml()
                .xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer", page.getHtml()
                .xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // This is a listing page: skip it so the pipeline does no further work
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```
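Both XPath expressions are bound to Zhihu's DOM at the time of writing and will start returning null once the markup changes, so it helps to be able to exercise the processor outside Spring. A minimal sketch using WebMagic's built-in ConsolePipeline (the runner class is illustrative, not part of the project):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

// Hypothetical debugging runner: prints extracted fields instead of saving them
public class ProcessorSmokeRun {

    public static void main(String[] args) {
        Spider.create(new ZhihuPageProcessor())
                .addUrl("https://www.zhihu.com/explore")
                // ConsolePipeline writes each ResultItems to stdout
                .addPipeline(new ConsolePipeline())
                .thread(1)
                // run() blocks until the spider stops, unlike start()
                .run();
    }
}
```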
8. The Zhihu data pipeline: ZhihuPipeline
Persists the data extracted from the Zhihu pages into the MySQL database.
```java
import java.util.Date;
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class ZhihuPipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuPipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");

        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);

        try {
            // Only report success when the insert actually affected a row
            if (crawlerMapper.addCmsContent(contentPO) > 0) {
                LOGGER.info("Saved Zhihu article: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save Zhihu article", ex);
        }
    }
}
```
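Because the pipeline is an ordinary Spring bean, it can also be exercised without a live crawl by handing it a hand-built ResultItems. A sketch of such a test (hypothetical; it needs a reachable MySQL instance, and passing null for the Task is acceptable only because this pipeline never reads it):

```java
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

import us.codecraft.webmagic.ResultItems;

@RunWith(SpringRunner.class)
@SpringBootTest
public class ZhihuPipelineTest {

    @Autowired
    private ZhihuPipeline zhihuPipeline;

    @Test
    public void insertsOneRow() {
        ResultItems items = new ResultItems();
        items.put("title", "Sample question");
        items.put("answer", "Sample answer body");
        // ZhihuPipeline ignores the Task argument, so null is safe here
        zhihuPipeline.process(items, null);
        // cms_content should now contain one new row
    }
}
```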
9. The Zhihu crawler task: ZhihuTask
Starts a crawl every ten minutes.
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import us.codecraft.webmagic.Spider;

@Component
public class ZhihuTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuTask.class);

    @Autowired
    private ZhihuPipeline zhihuPipeline;

    @Autowired
    private ZhihuPageProcessor zhihuPageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: start a crawl every 10 minutes
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("zhihuCrawlerThread");
            try {
                Spider.create(zhihuPageProcessor)
                        // start crawling from https://www.zhihu.com/explore
                        .addUrl("https://www.zhihu.com/explore")
                        // persist extracted data through the pipeline
                        .addPipeline(zhihuPipeline)
                        // crawl with two threads
                        .thread(2)
                        // start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled Zhihu crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}
```
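One gap worth noting: the executor is never shut down, so the scheduler keeps running until the JVM exits. A hedged sketch of a cleanup hook that could be added to ZhihuTask (an assumption, not in the original code):

```java
import javax.annotation.PreDestroy;

// Could be added to ZhihuTask: stop scheduling new crawls when the Spring context closes
@PreDestroy
public void shutdown() {
    timer.shutdown();
}
```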
10. The Spring Boot application entry point
```java
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private ZhihuTask zhihuTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Kick off the scheduled Zhihu crawl once the context is up
        zhihuTask.crawl();
    }
}
```
This concludes the article on using the Spring Boot + WebMagic + MyBatis crawler framework. For more on crawling with Spring Boot, WebMagic, and MyBatis, search 脚本之家's earlier articles, and we hope you will continue to support 脚本之家!