Python Scrapy: Crawling Multi-Level Pages and Saving the Results as JSON

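In the previous Scrapy post, "Python Scrapy: a spider that crawls web pages automatically" (https://www.jianshu.com/p/9b07e556216e), we implemented pagination, but that version is hard to modify. This time we restructure the spider so that each level of the crawl has its own function.
Approach:
Step 1: generate the link for each listing page
Step 2: extract the link of every product on each listing page
Step 3: extract the details of each product from its own page
The tricky part is this loop: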

    for i in range(1, 3):
        url = "http://category.dangdang.com/pg" + str(i) + "-cp01.05.16.00.00.00.html"
        # Return a Request via yield, giving the URL to crawl and the callback function
        yield Request(url, callback=self.parse)

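The callback argument of Request decides what happens next: when the response for that URL arrives, Scrapy passes it to the function named in callback.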
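
To make the role of callback concrete, here is a minimal sketch of the three-level chain using only the standard library. It does not use Scrapy: FakeRequest, parse_listing, parse_product and crawl are hypothetical names invented for this illustration, the "URLs" are made-up strings, and nothing is downloaded. It only shows how each callback decides the next step by yielding either new requests or finished items.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse_listing(url):
    # Level 2: a listing page yields one request per product (links are made up).
    for n in (1, 2):
        yield FakeRequest(url + "/product" + str(n), callback=parse_product)


def parse_product(url):
    # Level 3: a product page yields a finished item instead of a request.
    yield {"name": url.rsplit("/", 1)[-1]}


def crawl(start_requests):
    """Toy scheduler: run each request's callback and queue the requests it yields."""
    queue, items = deque(start_requests), []
    while queue:
        request = queue.popleft()
        # A real engine would download request.url and pass a Response object;
        # here the callback simply receives the URL string.
        for result in request.callback(request.url):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


# Level 1: one request per listing page, each handled by parse_listing.
start = [FakeRequest("page" + str(i), callback=parse_listing) for i in range(1, 3)]
items = crawl(start)
print(items)
```

Changing which function is passed as callback is the only thing needed to redirect a request to a different level of the crawl, which is why this structure is easier to modify than a single large parse function.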
A spider's initial requests come from calling start_requests(), which by default reads the URLs in start_urls. So we generate the listing-page links inside start_requests():

    def start_requests(self):
        # start_urls is not read once start_requests() is overridden, so it is omitted here
        start, end = 1, 3
        for i in range(start, end):
            print(str(i))
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.05.16.00.00.00.html"
            print(url)
            yield scrapy.http.Request(url, self.parse)

Meaning of yield scrapy.http.Request(url, self.parse):

Request the URL, then call parse() back with the response.
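
The URL construction can be checked on its own, without running a crawl. The sketch below factors it into a small generator; listing_urls is a hypothetical helper name, not part of the original spider. It shows that start, end = 1, 3 covers pages 1 and 2 only, because range() excludes its end value.

```python
def listing_urls(start, end):
    """Yield listing-page URLs for pages start .. end - 1, mirroring range()."""
    template = "http://category.dangdang.com/pg{}-cp01.05.16.00.00.00.html"
    for page in range(start, end):
        yield template.format(page)


urls = list(listing_urls(1, 3))
for url in urls:
    print(url)
```

To crawl more pages, only the end value needs to change.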

Next we define parse(), which extracts the product links from each listing page. First, inspect the page source.

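[Figure: image.png, screenshot of the page source]
As the figure shows, all product information sits under <li> tags, and these <li> tags are children of the element with id component_0__0__6612 (the id used in the XPath below).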

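    def parse(self, response):
        # Each <li> under the product list holds one product
        urls = response.xpath("//*[@id='component_0__0__6612']/li")
        for url in urls:
            # The link to the product page is on the <a class="pic"> element
            href = url.xpath("a[@class='pic']/@href").extract_first()
            print(href)
            # Skip entries without a link; extract_first() returns None for them
            if href:
                request = scrapy.http.Request(href, callback=self.parseArticle)
                yield request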
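
The XPath logic in parse() can be tried offline on a small sample. The sketch below is an approximation under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors (ElementTree supports only a subset of XPath and needs well-formed markup), and the sample HTML, including the <ul> element type and the product URLs, is invented for illustration rather than copied from the real page.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mimics the structure described above.
SAMPLE = """
<html><body>
  <ul id="component_0__0__6612">
    <li><a class="pic" href="http://product.dangdang.com/1.html">book 1</a></li>
    <li><a class="pic" href="http://product.dangdang.com/2.html">book 2</a></li>
    <li><span>no picture link here</span></li>
  </ul>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
hrefs = []
# Same first step as parse(): every <li> directly under the element with that id.
for li in root.findall(".//*[@id='component_0__0__6612']/li"):
    # Same second step: the <a class="pic"> child, then its href attribute.
    link = li.find("a[@class='pic']")
    if link is not None:
        hrefs.append(link.get("href"))

print(hrefs)
```

The third <li> has no matching link and is skipped, which is the case where extract_first() would return None in the Scrapy version.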

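parseArticle() then extracts the product details (here only the name) from each product page:

    def parseArticle(self, response):
        item = AutopjtItem()
        # The product name is stored in the title attribute of the first <span> in the heading
        item['name'] = response.xpath('//*[@id="product_info"]/div[1]/h2/span[1]/@title').extract()
        print(item['name'])
        yield item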


The complete spider:

    # -*- coding: utf-8 -*-
    import scrapy
    from autopjt.items import AutopjtItem


    # Spider class AutospdSpider, derived from the scrapy.Spider base class
    class AutospdSpider(scrapy.Spider):
        # name is the spider's name
        name = "autospd"
        # allowed_domains lists the domains the spider is allowed to crawl
        allowed_domains = ["dangdang.com"]
        # Start URLs are not needed because start_requests() is overridden below
        # start_urls = ['http://category.dangdang.com/pg1-cp01.05.16.00.00.00.html',]

        def start_requests(self):
            start, end = 1, 3
            for i in range(start, end):
                print(str(i))
                url = "http://category.dangdang.com/pg" + str(i) + "-cp01.05.16.00.00.00.html"
                print(url)
                yield scrapy.http.Request(url, self.parse)

        # For each listing page, find the link to every product
        def parse(self, response):
            urls = response.xpath("//*[@id='component_0__0__6612']/li")
            for url in urls:
                href = url.xpath("a[@class='pic']/@href").extract_first()
                print(href)
                # Skip entries without a link
                if href:
                    request = scrapy.http.Request(href, callback=self.parseArticle)
                    yield request

        # For each product page, extract the product name
        def parseArticle(self, response):
            item = AutopjtItem()
            item['name'] = response.xpath('//*[@id="product_info"]/div[1]/h2/span[1]/@title').extract()
            yield item

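Finally, the item pipeline writes each item to mydata.json, one JSON object per line:

    import codecs
    import json


    class AutopjtPipeline(object):
        def __init__(self):
            # Open the output file; written text is encoded as UTF-8
            self.file = codecs.open("mydata.json", "wb", encoding="utf-8")

        # This method must return an Item object.
        # item: the scraped item; spider: the spider that scraped it
        def process_item(self, item, spider):
            # Serialize with json.dumps. ensure_ascii=False keeps non-ASCII text readable;
            # without it, such characters are written as \uXXXX escape sequences
            i = json.dumps(dict(item), ensure_ascii=False)
            # Append a newline after each record
            line = i + '\n'
            # Write the record to mydata.json
            self.file.write(line)
            return item

        def close_spider(self, spider):
            self.file.close()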
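
The effect of ensure_ascii=False, and the shape of the resulting file, can be seen with the json module alone. This is a small sketch with an invented sample item; the name value is made up, and the list wrapper mirrors the fact that extract() returns a list.

```python
import json

# Invented sample item; extract() returns a list, so name is a list of strings.
item = {"name": ["书"]}

escaped = json.dumps(item)                       # default: non-ASCII is escaped
readable = json.dumps(item, ensure_ascii=False)  # what the pipeline writes
print(escaped)
print(readable)

# The pipeline writes one JSON object per line, so the file is read back
# line by line rather than with a single json.load() call.
file_content = readable + "\n" + readable + "\n"
loaded = [json.loads(line) for line in file_content.splitlines()]
print(loaded)
```

Both forms decode to the same data; the difference is only whether the file is readable by eye. Note that the pipeline runs only if it is enabled through the ITEM_PIPELINES setting in the project's settings.py.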