Python Scrapy Crawler Framework: Understanding How Media Pipelines Work
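Contents
- 1. Media Pipelines
- 1.1 Features of Media Pipelines
- 1.2 Media Pipeline Settings
- 2. Overview of the ImagesPipeline Class
- 3. Small Example: Crawling Baidu Images with the Image Pipeline
- 3.1 The spider File
- 3.2 The items File
- 3.3 The settings File
- 3.4 The pipelines File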
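1. Media Pipelines
1.1 Features of Media Pipelines
Media pipelines implement the following features:
- Avoid re-downloading media that was downloaded recently
- Let you specify where media is stored (a file-system directory, an Amazon S3 bucket, or a Google Cloud Storage bucket)
The image pipeline adds some extra image-processing features:
- Convert all downloaded images to a common format (JPG) and mode (RGB)
- Generate thumbnails
- Check image width/height and filter out images below a minimum size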
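The minimum-size filter in the last bullet can be pictured with a small sketch. This is not Scrapy's source code, only an illustration of the rule: an image is dropped when its width or its height falls below the configured minimum. The function name `passes_min_size` and the sample sizes are hypothetical, and 110 is just an example threshold.

```python
def passes_min_size(width, height, min_width=110, min_height=110):
    """Return True when an image is large enough to keep."""
    return width >= min_width and height >= min_height


# (width, height) pairs standing in for downloaded images
sizes = [(50, 50), (110, 300), (640, 100), (800, 600)]
kept = [size for size in sizes if passes_min_size(*size)]
print(kept)
```

Only images that satisfy both limits survive, so a very wide but short image is filtered out just like a tiny one.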
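1.2 Media Pipeline Settings

    # Enable the image pipeline
    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 120}
    # Storage location for the files pipeline
    FILES_STORE = '/path/to/valid/dir'
    # Storage location for the images pipeline
    IMAGES_STORE = '/path/to/valid/dir'
    # Custom item field that holds the file URLs
    FILES_URLS_FIELD = 'field_name_for_your_files_urls'
    # Custom item field that receives the file results
    FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
    # Custom item field that holds the image URLs
    IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
    # Custom item field that receives the image results
    IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
    # File expiration in days (default 90)
    FILES_EXPIRES = 90
    # Image expiration in days (default 90)
    IMAGES_EXPIRES = 90
    # Thumbnail sizes
    IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
    # Minimum height; smaller images are filtered out
    IMAGES_MIN_HEIGHT = 110
    # Minimum width; smaller images are filtered out
    IMAGES_MIN_WIDTH = 110
    # Whether media requests follow redirects
    MEDIA_ALLOW_REDIRECTS = True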
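As a rough illustration of what the expiration settings mean: a stored file counts as fresh while its age in days stays within the limit, and only older (or missing) files are downloaded again. The sketch below is not Scrapy code; `needs_download` is a hypothetical helper, and the timestamps are made up so the example is reproducible.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def needs_download(last_modified, expires_days=90, now=None):
    """Return True when the stored copy is missing or older than expires_days."""
    if last_modified is None:  # nothing has been stored yet
        return True
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


now = 1_700_000_000  # fixed timestamp (assumption, for reproducibility)
fresh = needs_download(now - 10 * SECONDS_PER_DAY, now=now)
stale = needs_download(now - 120 * SECONDS_PER_DAY, now=now)
missing = needs_download(None, now=now)
print(fresh, stale, missing)
```

With a 90-day limit, the 10-day-old copy is reused, while the 120-day-old copy and the missing file are fetched again.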
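2. Overview of the ImagesPipeline Class

    class ImagesPipeline(FilesPipeline):
        # Reads the configuration fields from settings
        def __init__(self, store_uri, download_func=None, settings=None): ...

        # Handles a downloaded image
        def image_downloaded(self, response, request, info): ...

        # Loads the image, filters it by size, and generates thumbnails
        def get_images(self, response, request, info): ...

        # Converts the image format
        def convert_image(self, image, size=None): ...

        # Builds the media requests; can be overridden
        def get_media_requests(self, item, info):
            # Turn each image URL into a Request that is handed to the engine
            return [Request(x) for x in item.get(self.images_urls_field, [])]

        # Called once the downloads for an item have finished;
        # override it to get at the file names and change them
        def item_completed(self, results, item, info): ...

        # Storage path of the downloaded file
        def file_path(self, request, response=None, info=None): ...

        # Storage path of a thumbnail
        def thumb_path(self, request, thumb_id, response=None, info=None): ...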
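The file names produced by `file_path` and `thumb_path` look like random strings because, in recent Scrapy releases as far as we know, they are derived from a SHA-1 hash of the request URL. A minimal sketch that reproduces this naming with only `hashlib` follows; `default_image_path` and `default_thumb_path` are hypothetical helper names, the URL is a placeholder, and the exact layout may differ between Scrapy versions.

```python
import hashlib


def default_image_path(url):
    """Mimic the default layout: full/<sha1 of url>.jpg"""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


def default_thumb_path(url, thumb_id):
    """Mimic the default thumbnail layout: thumbs/<thumb_id>/<sha1 of url>.jpg"""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"thumbs/{thumb_id}/{digest}.jpg"


url = "https://example.com/dog.jpg"  # placeholder URL
image_path = default_image_path(url)
thumb_path = default_thumb_path(url, "small")
print(image_path)
print(thumb_path)
```

Because the name depends only on the URL, the same image always maps to the same file, which is what makes it possible to skip files that were already downloaded.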
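3. Small Example: Crawling Baidu Images with the Image Pipeline (Baidu images can of course be crawled without the image pipeline, but then we would have to handle downloading and saving each image ourselves, which is more work; the image pipeline takes care of that step)
3.1 The spider File
Note: because we need to send the full set of request headers, we override the start_requests method.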

    import re

    import scrapy

    from ..items import DbimgItem


    class DbSpider(scrapy.Spider):
        name = 'db'
        # allowed_domains = ['xxx.com']
        start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111110&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%8B%97&oq=%E7%8B%97&rsp=-1']

        def start_requests(self):
            # Override start_requests because we need to attach the full set of request headers
            headers = {
                "Accept": "text/html,application/xhtml+xml,application/xml; q=0.9,image/avif,image/webp,image/apng,*/*; q=0.8,application/signed-exchange; v=b3; q=0.9",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "zh-CN,zh; q=0.9",
                "Cache-Control": "max-age=0",
                "Connection": "keep-alive",
                "Cookie": "BIDUPSID=4B61D634D704A324E3C7E274BF11F280; PSTM=1624157516; BAIDUID=4B61D634D704A324C7EA5BA47BA5886E:FG=1; __yjs_duid=1_f7116f04cddf75093b9236654a2d70931624173362209; BAIDUID_BFESS=101022AEE931E08A9B9A3BA623709CFE:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; cleanHistoryStatus=0; H_PS_PSSID=34099_33969_34222_31660_34226_33848_34113_34073_33607_34107_34134_34118_26350_22159; delPer=0; PSINO=6; BA_HECTOR=24ak842ka421210koq1gdtj070r; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; userFrom=www.baidu.com; firstShowTip=1; indexPageSugList=%5B%22%E7%8B%97%22%2C%22%E7%8C%AB%E5%92%AA%22%2C%22%E5%B0%8F%E9%80%8F%E6%98%8E%22%5D; ab_sr=1.0.1_OGYwMTZiMjg5ZTNiYmUxODIxOTgyYTllZGMyMzhjODE2ZWE5OGY4YmEyZWVjOGZhOWIxM2NlM2FhZTQxMmFjODY0OWZiNzQxMjVlMWIyODVlZWFiZjY2NTQyMTZhY2NjNTM5NDNmYTFmZjgxMTlkOGYxYTUzYTIzMzA0NDE3MGNmZDhkYTBkZmJiMmJhZmFkZDNmZTM1ZmI2MWZkNzYyYQ==",
                "Host": "image.baidu.com",
                "Referer": "https://image.baidu.com/",
                "sec-ch-ua": '" Not; A Brand"; v="99", "Google Chrome"; v="91", "Chromium"; v="91"',
                "sec-ch-ua-mobile": "?0",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "same-origin",
                "Sec-Fetch-User": "?1",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36",
            }
            for url in self.start_urls:
                yield scrapy.Request(url, headers=headers, callback=self.parse, dont_filter=True)

        def parse(self, response):
            # Pull every thumbnail URL out of the page source
            img_urls = re.findall('"thumbURL":"(.*?)"', response.text)
            # print(img_urls)
            item = DbimgItem()
            item['image_urls'] = img_urls
            yield item
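To see what the regular expression in `parse` does, here is a small standalone sketch. The `sample_text` string is a made-up fragment that only imitates the `"thumbURL":"..."` pairs the spider looks for; the real page content is not reproduced here.

```python
import re

# Hypothetical fragment imitating the data embedded in the results page
sample_text = (
    '{"thumbURL":"https://img.example.com/a.jpg","width":100},'
    '{"thumbURL":"https://img.example.com/b.jpg","width":200}'
)

# Non-greedy group: capture everything up to the next double quote
img_urls = re.findall('"thumbURL":"(.*?)"', sample_text)
print(img_urls)
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` would run on to the last double quote in the text and return one long, useless match.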
3.2 The items File

    import scrapy


    class DbimgItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
3.3 The settings File

    ROBOTSTXT_OBEY = False

    # Enable the pipeline we wrote
    ITEM_PIPELINES = {
        # 'dbimg.pipelines.DbimgPipeline': 300,
        'dbimg.pipelines.ImgPipe': 300,
    }

    # Where the images are stored
    IMAGES_STORE = 'D:/python test/爬虫/scrapy6/dbimg/imgs'
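3.4 The pipelines File

    import os

    from itemadapter import ItemAdapter
    from scrapy.pipelines.images import ImagesPipeline

    # Relative import, assuming settings.py sits in the same package as pipelines.py
    from . import settings

    # For reference, the default ImagesPipeline.item_completed:
    #
    # def item_completed(self, results, item, info):
    #     with suppress(KeyError):
    #         ItemAdapter(item)[self.images_result_field] = [x for ok, x in results if ok]
    #     return item


    class ImgPipe(ImagesPipeline):
        num = 0

        # Override this method to rename the downloaded images;
        # otherwise each file name is just a string of digits and letters
        def item_completed(self, results, item, info):
            # Print results first to see its structure, then pick out the values we need
            # print('results: ', results)
            images_path = [x['path'] for ok, x in results if ok]
            for image_path in images_path:
                os.rename(settings.IMAGES_STORE + "/" + image_path,
                          settings.IMAGES_STORE + "/" + str(self.num) + ".jpg")
                self.num += 1
            # Return the item so that any later pipeline still receives it
            return item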
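The renaming step relies on the shape of `results`: a list of `(success, info)` tuples in which `info` is a dict containing keys such as `url` and `path` for a successful download, and an error object otherwise. The sketch below replays that logic outside Scrapy against a temporary directory. `rename_downloaded`, the file names, and the URLs are all hypothetical, and a plain `Exception` stands in for the failure object Scrapy would pass.

```python
import os
import tempfile


def rename_downloaded(store, results, start=0):
    """Rename successfully downloaded files to 0.jpg, 1.jpg, ... inside store."""
    new_names = []
    counter = start
    for ok, info in results:
        if not ok:  # failed downloads carry an error object, not a dict
            continue
        new_name = f"{counter}.jpg"
        os.rename(os.path.join(store, info["path"]), os.path.join(store, new_name))
        new_names.append(new_name)
        counter += 1
    return new_names


with tempfile.TemporaryDirectory() as store:
    # Create two fake "downloaded" files under full/, as the pipeline would
    os.makedirs(os.path.join(store, "full"))
    for name in ("aaa.jpg", "bbb.jpg"):
        with open(os.path.join(store, "full", name), "wb") as fh:
            fh.write(b"fake image bytes")

    results = [
        (True, {"url": "https://example.com/a.jpg", "path": "full/aaa.jpg"}),
        (False, Exception("download failed")),  # stand-in for a failed download
        (True, {"url": "https://example.com/b.jpg", "path": "full/bbb.jpg"}),
    ]
    renamed = rename_downloaded(store, results)
    remaining = sorted(f for f in os.listdir(store) if f.endswith(".jpg"))
    print(renamed, remaining)
```

Note that renaming files after the fact means the stored names no longer match what the pipeline expects for a given URL, so the "skip recently downloaded media" check may not recognise them on a later run; overriding `file_path` is an alternative worth considering.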
Result:
![Python Scrapy Crawler Framework: Understanding How Media Pipelines Work](https://img.it610.com/image/info11/370354fe3a44455f8ec17a6a4c66b133.jpg)