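Method 1: crawling with Spider
Start by picking an initial link for the novel, for example the link to its first chapter: https://www.zwdu.com/book/11029/2297440.html
That is the site I crawled, and the novel is the one at that link.
Next, create a project (the settings.py shown later uses the project name xiaoshuo, so use whichever name you choose consistently):
scrapy startproject novel
Then create the spider: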
scrapy genspider spider https://www.zwdu.com/book/11029/2297440.html
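Now analyze the site. The chapter URLs follow no regular pattern, so I crawl each following chapter through the next-page link of the current chapter. On the last chapter, the next-page link no longer contains .html, so an if statement on that condition can end the crawl.
The next-chapter link sits in the navigation block of each chapter page (div.bottem2 in the source below) and can be extracted with XPath.
On the last chapter the link does not contain .html, which serves as the stop condition for the crawl.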
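The stop condition can be sketched as a small standalone function. This is only an illustration using the standard library: the name next_chapter_url and the sample hrefs are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin


def next_chapter_url(current_url, next_href):
    """Return the absolute URL of the next chapter, or None to stop crawling."""
    # A missing link, or one that does not point to an .html page,
    # is treated as "this was the last chapter".
    if not next_href or ".html" not in next_href:
        return None
    # Resolve relative hrefs against the page we are currently on.
    return urljoin(current_url, next_href)


if __name__ == "__main__":
    page = "https://www.zwdu.com/book/11029/2297440.html"
    # Hypothetical hrefs: a regular chapter link and a link back to the index.
    print(next_chapter_url(page, "/book/11029/2297441.html"))
    print(next_chapter_url(page, "/book/11029/"))
```

Resolving the href against the current page avoids hard-coding a host name, which matters if the site is reachable under more than one domain.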
Here is the source code I used:
spider.py
# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class SpiderSpider(Spider):
    name = 'spider'
    #allowed_domains = ['www.zwdu.com']
    start_urls = ['https://www.zwdu.com/book/26215/8210673.html']

    def parse(self, response):
        title = response.xpath("//h1/text()").extract_first()
        # Join the text nodes of the chapter body, one paragraph per line
        paragraphs = response.xpath('//div[@id="content"]/text()').extract()
        content = '\n'.join(p.strip() for p in paragraphs if p.strip())
        yield {
            'title': title,
            'content': content
        }
        # The third link of the navigation block points to the next chapter
        next_url = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        # On the last chapter the link has no .html, which ends the crawl
        if next_url and '.html' in next_url:
            # Resolve the href against the current page rather than a fixed host
            yield Request(response.urljoin(next_url), callback=self.parse)
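The content line in parse() collects the text nodes under div#content and joins them into one string. A minimal sketch of that clean-up step as a standalone function, assuming the site indents paragraphs with non-breaking spaces; the name join_paragraphs and the sample nodes are made up for illustration.

```python
def join_paragraphs(text_nodes):
    """Turn raw text nodes into newline-separated paragraphs."""
    # str.strip() also removes non-breaking spaces (\xa0), which are
    # assumed here to be the paragraph indentation.
    cleaned = (node.strip() for node in text_nodes)
    # Drop nodes that were whitespace only.
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    # Hypothetical sample of what .extract() might return for one chapter.
    nodes = [
        "\xa0\xa0\xa0\xa0First paragraph.",
        "\r\n",
        "\xa0\xa0\xa0\xa0Second paragraph.\r\n",
    ]
    print(join_paragraphs(nodes))
```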
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class XiaoshuoPipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('xiuzhen.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Append one chapter (title, then body) per item
        title = item['title']
        content = item['content']
        info = title + '\n' + content + '\n'
        self.file.write(info)
        return item

    def close_spider(self, spider):
        self.file.close()
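To see what the pipeline writes without running a crawl, the three hooks can be called by hand in the order Scrapy calls them: open_spider once, process_item for each item, close_spider once. The sketch below is an assumption-laden stand-in: TextFilePipeline and run_pipeline are hypothetical names, and the output path is passed in as a parameter instead of being fixed.

```python
import os
import tempfile


class TextFilePipeline:
    """Same three hooks as the pipeline above; the output path is a parameter."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(item["title"] + "\n" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


def run_pipeline(items, path):
    """Call the hooks by hand, in the order used for a single crawl."""
    pipeline = TextFilePipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "novel.txt")
    run_pipeline([{"title": "Chapter 1", "content": "Some text."}], out)
    with open(out, encoding="utf-8") as f:
        print(f.read())
```

Because chapters are written in the order the items arrive, this layout relies on the spider yielding one chapter at a time, as the next-link approach does.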
Then set USER_AGENT and ITEM_PIPELINES in settings.py and run scrapy crawl spider to crawl the novel.
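Method 2: CrawlSpider
Create the spider file from the crawl template: scrapy genspider -t crawl example example.com
CrawlSpider uses Rule objects to extract links automatically according to the rules you define, and then crawls them.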
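LinkExtractor() has two commonly used parameters:
allow: a regular expression that the link URL must match
restrict_xpaths: an XPath that selects the page regions from which links are extracted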
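To make the effect of allow concrete, here is a standard-library sketch that mimics the filtering idea: a URL is kept when the regular expression matches somewhere in it. This is not Scrapy's implementation; filter_links and the sample links are hypothetical.

```python
import re


def filter_links(urls, allow):
    """Keep only the URLs in which the regular expression finds a match."""
    pattern = re.compile(allow)
    return [url for url in urls if pattern.search(url)]


if __name__ == "__main__":
    # Hypothetical links, as they might appear on a chapter index page.
    links = [
        "https://www.zwdu.com/book/11029/",
        "https://www.zwdu.com/book/11029/2297440.html",
        "https://www.zwdu.com/book/11029/2297441.html",
    ]
    # Only chapter pages (numeric file name ending in .html) pass the filter.
    print(filter_links(links, r"/book/11029/\d+\.html$"))
```

restrict_xpaths works on a different axis: it narrows where on the page links are looked for, so the two parameters can be combined.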
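Because CrawlSpider does not send the start_urls responses to the rule callback (it only extracts links from them), the crawl has to start from the chapter index page. The first Rule picks up the first chapter, and the second Rule follows each subsequent chapter. Parsing is the same as in the first method, and pipelines.py is unchanged.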
spider2.py
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZwwSpider(CrawlSpider):
    name = 'spider2'
    #allowed_domains = ['https://www.zwdu.com']
    start_urls = ['https://www.zwdu.com/book/11029/']

    rules = (
        # First chapter: the first entry of the chapter list on the index page
        Rule(LinkExtractor(restrict_xpaths=r'//*[@id="list"]/dl/dd[1]/a'), callback='parse_item', follow=True),
        # Every later chapter: the next-chapter link of the page just crawled
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath("//h1/text()").extract_first()
        # Join the text nodes of the chapter body, one paragraph per line
        paragraphs = response.xpath('//div[@id="content"]/text()').extract()
        content = '\n'.join(p.strip() for p in paragraphs if p.strip())
        yield {
            'title': title,
            'content': content
        }
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for xiaoshuo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xiaoshuo'

SPIDER_MODULES = ['xiaoshuo.spiders']
NEWSPIDER_MODULE = 'xiaoshuo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xiaoshuo.middlewares.XiaoshuoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xiaoshuo.middlewares.XiaoshuoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Those are two ways to crawl a novel from a website. If anything here is wrong, corrections are welcome.