Scrapy Crawler Series: Collecting Book Data from a "Dang" Bookstore Site
Contents
Project introduction
Project demo
spider.py
items.py
pipelines.py
End
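Project introduction
We need to collect book data (title, author, price, comment count, publisher) from a certain "Dang" bookstore site and store it in a database.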
Project demo
spider.py
'''
coding:utf-8
@Software:PyCharm
@Time:2022/6/25 18:21
@Author:小久
'''
import re

import scrapy

from ..items import DangdangItem


class DangbookSpider(scrapy.Spider):
    name = 'dangbook'
    allowed_domains = ['dangdang.com']
    # start_urls = ['http://category.dangdang.com/cp01.03.00.00.00.00.html']

    def start_requests(self):
        # Build each listing-page URL from the base pattern and request it.
        for i in range(9):
            url = 'http://category.dangdang.com/pg{}-cp01.03.00.00.00.00.html'.format(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each XPath returns one list per field, in page order.
        titles = response.xpath('//p[@class="name"]/a/@title').extract()
        prices = response.xpath('//p[@class="price"]/span[@class="search_now_price"]/text()').extract()
        authors = response.xpath('//p[@class="search_book_author"]/span[1]/a[1]/@title').extract()
        comments = response.xpath('//p[@class="search_star_line"]/a/text()').extract()
        publishes = response.xpath('//p[5]/span[3]/a/text()').extract()
        for title, price, author, comment, publish in zip(titles, prices, authors, comments, publishes):
            # Create a fresh item for every book so yielded items do not share state.
            item = DangdangItem()
            item['title'] = title
            # Drop the currency sign in front of the price.
            item['price'] = price.split('¥')[1]
            item['author'] = author
            # Keep only the part before "条评论" (the comment count).
            item['comment'] = re.findall('(.*?)条评论', comment)[0]
            item['publish'] = publish
            yield item
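XPath expressions locate the title, author, price, comment count, and publisher. The listing URLs follow a regular pattern, so we write one base URL and loop over the page number to paginate.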
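The pagination pattern and the two clean-up steps in `parse` can be tried without running a crawl. The following is a minimal sketch using only the standard library. The helper names `build_page_urls`, `clean_price`, and `clean_comment_count` are hypothetical and not part of the project. It also assumes that page numbering starts at 1 and that the comment text starts with digits, neither of which the original code confirms.

```python
import re

# Base URL pattern taken from the spider; {} is the page number.
BASE_URL = 'http://category.dangdang.com/pg{}-cp01.03.00.00.00.00.html'


def build_page_urls(first_page, last_page):
    """Return one listing URL per page number, inclusive of both ends."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


def clean_price(raw):
    """Strip the currency sign from a price string such as '¥39.80'."""
    return raw.replace('¥', '').strip()


def clean_comment_count(raw):
    """Extract the digits before '条评论'; return '0' when the text does not match."""
    match = re.search(r'(\d+)条评论', raw)
    return match.group(1) if match else '0'


if __name__ == '__main__':
    for url in build_page_urls(1, 3):
        print(url)
    print(clean_price('¥39.80'))
    print(clean_comment_count('1024条评论'))
```

Returning a default instead of indexing `[0]` or `[1]` directly means one oddly formatted row does not raise an `IndexError` and stop the whole page from being parsed.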
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    picture = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    author = scrapy.Field()
    comment = scrapy.Field()
    publish = scrapy.Field()
The scraped data is stored in an item container, which makes it easier to write to the database later.
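Recent Scrapy versions also accept dataclass objects as items, which can be handy for trying out the record layout without installing Scrapy. The following is a minimal sketch with the standard library only. The class name `BookRecord` is hypothetical, and the field values are made-up sample data, not scraped results.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class BookRecord:
    """Hypothetical stand-in that mirrors the fields of DangdangItem."""
    title: str
    price: str
    author: str
    comment: str
    publish: str
    picture: str = ''  # declared in items.py but never filled by the spider


if __name__ == '__main__':
    record = BookRecord(title='Sample Book', price='39.80', author='Sample Author',
                        comment='1024', publish='Sample Press')
    # List the declared field names, then show the record as a plain dict.
    print([f.name for f in fields(record)])
    print(asdict(record))
```

As with `scrapy.Item`, a fixed set of fields means a misspelled field name fails immediately instead of silently creating a new key.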
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.utils.project import get_project_settings
import pymysql


class DangdangMysql:

    def open_spider(self, spider):
        # Read the connection parameters from the project settings.
        setting = get_project_settings()
        self.host = setting['HOST']
        self.user = setting['USER']
        self.password = setting['PASSWORD']
        self.charset = setting['CHARSET']
        self.db = setting['DATABASE']
        self.port = setting['PORT']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(host=self.host, user=self.user,
                                    password=self.password, charset=self.charset,
                                    database=self.db, port=self.port)
        # Create the cursor once so close_spider can always close it.
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Pass the values as parameters so quotes inside a title cannot break the SQL.
        sql = ('insert into dangbooks(id,title,author,price,publish,comment) '
               'values(%s,%s,%s,%s,%s,%s)')
        params = (0, item['title'], item['author'], item['price'],
                  item['publish'], item['comment'])
        try:
            self.cursor.execute(sql, params)
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
            # Log the failure instead of silently discarding it.
            spider.logger.error('Insert failed: %s', e)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; close_process is never called.
        self.cursor.close()
        self.conn.close()
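The pipeline reads its connection parameters from the project settings, and Scrapy only runs a pipeline that is listed in `ITEM_PIPELINES`. A possible `settings.py` fragment is sketched here. Every value is a placeholder, and the module path `dangdang.pipelines` assumes the project is named `dangdang`, which the original does not state.

```python
# settings.py (fragment): placeholder values, adjust to your own environment.
HOST = 'localhost'
USER = 'root'
PASSWORD = 'change-me'
CHARSET = 'utf8mb4'
DATABASE = 'spider_db'   # hypothetical database name
PORT = 3306              # must be an int, not a string

# Register the pipeline; the module path is an assumption about the project name.
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangMysql': 300,
}
```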
The pipeline writes each item to a MySQL database.
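Building SQL with string formatting breaks as soon as a title contains a quote character, so passing the values as query parameters is safer. The sketch below illustrates the idea with the standard library's `sqlite3` as a stand-in for MySQL, so it runs without a database server. The function names and the sample row are hypothetical. Note that `sqlite3` uses `?` placeholders, whereas `pymysql` uses `%s`.

```python
import sqlite3


def create_table(conn):
    """Create a table with the same columns as the dangbooks insert statement."""
    conn.execute(
        'create table if not exists dangbooks('
        'id integer primary key autoincrement, '
        'title text, author text, price text, publish text, comment text)'
    )


def save_book(conn, book):
    """Insert one book; values are bound as parameters, so quotes need no escaping."""
    sql = ('insert into dangbooks(title, author, price, publish, comment) '
           'values(?, ?, ?, ?, ?)')
    params = (book['title'], book['author'], book['price'],
              book['publish'], book['comment'])
    try:
        conn.execute(sql, params)
        conn.commit()
    except sqlite3.Error:
        # Undo the failed statement and let the caller see the error.
        conn.rollback()
        raise


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    create_table(conn)
    save_book(conn, {'title': 'A "quoted" title', 'author': 'Sample Author',
                     'price': '39.80', 'publish': 'Sample Press', 'comment': '1024'})
    print(conn.execute('select title, price from dangbooks').fetchall())
    conn.close()
```

The sample title contains double quotes on purpose: with a formatted SQL string that row would produce a syntax error, while the parameterized version stores it unchanged.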
End