Python Hands-On Project: Learning Scrapy Pipelines by Scraping Zaihang (在行) Mentor Data

Target site analysis
Coding time
Results

Target site analysis

The target site for this scrape is https://www.zaih.com/falcon/mentors, and the target data is the mentor listings on Zaihang (在行).

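The scraped data is saved to a MySQL database. The table structure is designed around the target data: as the INSERT statement in pipelines.py later in this article shows, the table is named users and has the columns name, city, industry, price, chat_nums, and score.

[Image: table structure]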

With the table structure as a reference, the items.py file of the Scrapy project can be written directly.

import scrapy


class ZaihangItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # name
    city = scrapy.Field()  # city
    industry = scrapy.Field()  # industry
    price = scrapy.Field()  # price
    chat_nums = scrapy.Field()  # number of chats
    score = scrapy.Field()  # rating

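Coding time

For creating the project, refer to the previous case study. This article starts directly with the spider file, zh.py.
The paginated URLs of the target data have to be assembled by hand, so a class attribute named page is declared up front. After each response, the spider checks whether the page contains data; if it does, page is incremented by 1 and the next page is requested.
The request URL template is as follows: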

https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=心理&page={}
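
The template above contains the non-ASCII tag name 心理, while the spider code uses its percent-encoded form. As a small sketch using only the standard library, the snippet below shows one way the encoded URL could be derived and a page number filled in. build_page_url is a hypothetical helper name and is not part of the original project.

```python
from urllib.parse import quote

# Base listing URL and query values taken from the template above.
BASE_URL = "https://www.zaih.com/falcon/mentors"
TAG_ID = 479
TAG_NAME = "心理"


def build_page_url(page):
    """Return the listing URL for the given page (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the tag name.
    return "{}?first_tag_id={}&first_tag_name={}&page={}".format(
        BASE_URL, TAG_ID, quote(TAG_NAME), page
    )


print(build_page_url(1))
```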

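When the page number exceeds the last page, the site returns an empty-state page. That page contains a section element with class=empty, so checking for that element is enough to detect that there is no more data.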

For data parsing and cleaning, refer to the code below.

import scrapy

from zaihang_spider.items import ZaihangItem


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['www.zaih.com']
    page = 1  # starting page number
    # URL template; the page number is filled in per request
    url_format = 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={}'
    start_urls = [url_format.format(page)]

    def parse(self, response):
        empty = response.css("section.empty")  # check whether the page is empty
        if len(empty) > 0:
            return  # empty-state section found, stop paginating
        mentors = response.css(".mentor-board a")  # links for all mentors
        for m in mentors:
            item = ZaihangItem()  # instantiate an item
            name = m.css(".mentor-card__name::text").extract_first()
            city = m.css(".mentor-card__location::text").extract_first()
            industry = m.css(".mentor-card__title::text").extract_first()
            price = self.replace_space(m.css(".mentor-card__price::text").extract_first())
            chat_nums = self.replace_space(m.css(".mentor-card__number::text").extract()[0])
            score = self.replace_space(m.css(".mentor-card__number::text").extract()[1])
            # populate the item
            item["name"] = name
            item["city"] = city
            item["industry"] = industry
            item["price"] = price
            item["chat_nums"] = chat_nums
            item["score"] = score
            yield item
        # generate the request for the next page
        self.page += 1
        next_url = self.url_format.format(self.page)
        yield scrapy.Request(url=next_url, callback=self.parse)

    def replace_space(self, in_str):
        if in_str is None:  # extract_first() returns None when nothing matches
            return ""
        in_str = in_str.replace("\n", "").replace("\r", "").replace("￥", "")
        return in_str.strip()
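
The cleaning step in replace_space can be tried in isolation. The sketch below is a standalone version with a None guard added as an assumption (extract_first() returns None when a selector matches nothing). clean_text is a hypothetical name, and the sample strings are made up for illustration rather than taken from the site.

```python
def clean_text(raw):
    """Strip line breaks, the ￥ sign and surrounding whitespace."""
    if raw is None:  # assumption: treat a missing node as an empty string
        return ""
    return raw.replace("\n", "").replace("\r", "").replace("￥", "").strip()


# Made-up sample values shaped like the text nodes the spider extracts.
print(repr(clean_text("\n  ￥300 \r\n")))
print(repr(clean_text(None)))
```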

Enable ITEM_PIPELINES in settings.py. Note that the pipeline class name has been changed.

ITEM_PIPELINES = {
    'zaihang_spider.pipelines.ZaihangMySQLPipeline': 300,
}

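Modify pipelines.py so that it saves the data to the MySQL database.
Before reading the code below, it helps to understand the class method from_crawler. It acts as an alternative constructor: if a pipeline class defines it, Scrapy calls it to create the pipeline instance and passes in the crawler object. Every option in settings.py is then available through crawler.settings.
There is also a from_settings method, which some official components use as well. An example is shown below.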

@classmethod
def from_settings(cls, settings):
    host = settings.get('HOST')
    return cls(host)

@classmethod
def from_crawler(cls, crawler):
    # FIXME: for now, stats are only supported from this constructor
    return cls.from_settings(crawler.settings)
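
To see the delegation between the two class methods without running a crawl, the sketch below uses stand-ins instead of real Scrapy objects: a plain dict plays the role of the settings object and SimpleNamespace plays the role of the crawler. HostPipeline is a hypothetical class written only for this illustration.

```python
from types import SimpleNamespace


class HostPipeline:
    """Hypothetical component that only needs the HOST setting."""

    def __init__(self, host):
        self.host = host

    @classmethod
    def from_settings(cls, settings):
        # Read the option and hand it to the regular constructor.
        return cls(settings.get("HOST"))

    @classmethod
    def from_crawler(cls, crawler):
        # Delegate to from_settings, using the settings held by the crawler.
        return cls.from_settings(crawler.settings)


# Stand-ins for Scrapy objects (assumption): a dict replaces Settings,
# and SimpleNamespace replaces the crawler.
fake_crawler = SimpleNamespace(settings={"HOST": "127.0.0.1"})
pipeline = HostPipeline.from_crawler(fake_crawler)
print(pipeline.host)
```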

Before writing the pipeline code, add the following configuration entries to settings.py.
settings.py code

HOST = "127.0.0.1"
PORT = 3306
USER = "root"
PASSWORD = "123456"
DB = "zaihang"

pipelines.py code

import pymysql


class ZaihangMySQLPipeline:
    def __init__(self, host, port, user, password, db):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.db = db
        self.conn = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('HOST'),
            port=crawler.settings.getint('PORT'),
            user=crawler.settings.get('USER'),
            password=crawler.settings.get('PASSWORD'),
            db=crawler.settings.get('DB')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, database=self.db)
        self.cursor = self.conn.cursor()  # create the cursor once and reuse it

    def process_item(self, item, spider):
        # Store the item in MySQL. Values are passed as parameters so the
        # driver escapes them, instead of being formatted into the SQL string.
        sql = ("insert into users(name,city,industry,price,chat_nums,score) "
               "values (%s,%s,%s,%s,%s,%s)")
        try:
            params = (item["name"], item["city"], item["industry"],
                      float(item["price"]), int(item["chat_nums"]), float(item["score"]))
            self.cursor.execute(sql, params)  # execute the SQL
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Guard against the case where the connection was never opened
        if self.cursor is not None:
            self.cursor.close()
        if self.conn is not None:
            self.conn.close()
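
The price, chat count and score arrive from the spider as strings and have to be converted before they are stored. Below is a minimal sketch of that conversion, kept separate from the database call so it can be checked without a MySQL server. item_to_row is a hypothetical helper, and the sample item is made up.

```python
def item_to_row(item):
    """Convert a scraped item (dict-like) into a tuple of typed column values."""
    return (
        item["name"],
        item["city"],
        item["industry"],
        float(item["price"]),
        int(item["chat_nums"]),
        float(item["score"]),
    )


# Column order matches the users table used by the pipeline's INSERT statement.
INSERT_SQL = (
    "insert into users(name,city,industry,price,chat_nums,score) "
    "values (%s,%s,%s,%s,%s,%s)"
)

# Made-up sample item for illustration.
sample = {"name": "Alice", "city": "Beijing", "industry": "Psychology",
          "price": "300", "chat_nums": "25", "score": "4.9"}
row = item_to_row(sample)
print(row)
```

With PyMySQL, the pair would be passed as cursor.execute(INSERT_SQL, row), which leaves escaping of values such as names containing quotes to the driver.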

The pipeline file has three important methods: open_spider, process_item, and close_spider.

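# Runs when the spider is opened; executed only once
def open_spider(self, spider):
    # spider.name = "橡皮擦"
    # Instance attributes can be added to the spider object dynamically and then read
    # in the spider module, for example through self inside parse(self, response)
    # Put any initialization here
    pass

# Handles each extracted item; the code that saves the data goes here
def process_item(self, item, spider):
    return item  # return the item so later pipelines can process it

# Runs when the spider is closed; executed only once. If the spider crashes with an
# exception while running, close_spider is not executed
def close_spider(self, spider):
    # Close the database connection and release resources
    pass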
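
To make the order of the three hooks concrete, the sketch below drives a pipeline by hand. It is a simulation under stated assumptions: RecordingPipeline and run_pipeline are hypothetical names, and the function only imitates the order in which the hooks are called for a normal run. It is not Scrapy's engine code.

```python
class RecordingPipeline:
    """Hypothetical pipeline that records which hooks were called."""

    def __init__(self):
        self.calls = []

    def open_spider(self, spider):
        self.calls.append("open_spider")

    def process_item(self, item, spider):
        self.calls.append("process_item")
        return item  # hand the item on unchanged

    def close_spider(self, spider):
        self.calls.append("close_spider")


def run_pipeline(pipeline, items, spider=None):
    """Imitate the call order: open once, process each item, close once."""
    pipeline.open_spider(spider)
    processed = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return processed


p = RecordingPipeline()
out = run_pipeline(p, [{"name": "a"}, {"name": "b"}])
print(p.calls)
```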

Results