Python Anti-Anti-Scraping Techniques: Handling Limits on Consecutive Request Timing

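Preface
Using a hook function to set the wait time based on cache behavior
Scraping-related libraries

1. A common test site for scrapers: httpbin.org
2. requests-cache

Adding caching to existing code with minimal changes

Clearing and identifying the cache

Customizing how the cache is configured

Custom cache example 1: setting the cache storage type
Custom cache example 2: setting what gets cached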

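Preface
A common way to cope with anti-scraping measures is to insert a random interval between consecutive requests, that is, to add a delay. If a response is already in the cache, however, the delay can be skipped, which shortens the crawler's running time to some extent.
The code below uses requests_cache to mimic a browser's caching behavior when visiting a site. The logic is simple: if a cached response exists, continue immediately; if not, pause briefly before continuing.
Example code

Using a hook function to set the wait time based on cache behavior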

import time

import requests_cache


def make_throttle_hook(timeout=0.1):
    """Build a response hook that sleeps only when the response was not cached."""
    def hook(response, *args, **kwargs):
        print(response.text)
        # Add a delay only when the response did not come from the cache
        if not getattr(response, 'from_cache', False):
            print(f'Wait {timeout} s!')
            time.sleep(timeout)
        else:
            print(f'exists cache: {response.from_cache}')
        return response
    return hook


if __name__ == '__main__':
    session = requests_cache.CachedSession()  # create a cached session
    session.cache.clear()  # start from an empty cache
    session.hooks = {'response': make_throttle_hook(2)}  # register the hook
    print('first requests'.center(50, '*'))
    session.get('http://httpbin.org/get')
    print('second requests'.center(50, '*'))
    session.get('http://httpbin.org/get')
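The preface mentions random intervals, while the hook above sleeps for a fixed time. The following is a minimal sketch of a variant that draws a random delay for non-cached responses. The names make_random_throttle_hook, min_delay, max_delay and the injectable sleep argument are assumptions made for illustration; they are not part of requests or requests_cache.

```python
import random
import time


def make_random_throttle_hook(min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Build a response hook that pauses a random time after non-cached responses.

    min_delay, max_delay and sleep are illustrative parameters (assumptions),
    not part of the requests or requests_cache APIs.
    """
    def hook(response, *args, **kwargs):
        # Cached responses carry from_cache=True; plain responses may lack
        # the attribute entirely, hence getattr with a False default.
        if not getattr(response, 'from_cache', False):
            sleep(random.uniform(min_delay, max_delay))
        return response
    return hook
```

It can be registered in the same way as the fixed-delay hook, for example session.hooks = {'response': make_random_throttle_hook(1, 3)}. Passing sleep as an argument makes the hook easy to exercise without actually waiting.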

For more on how to use requests_cache, see the requests_cache notes below.

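Scraping-related libraries
1. A common test site for scrapers: httpbin.org
httpbin.org lets you inspect many aspects of HTTP requests and responses, such as cookies, IP address, headers, and login authentication. It supports GET, POST, and other methods, which makes it useful for web development and testing. It is an open-source project written in Python + Flask.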


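2. requests-cache
requests-cache is an extension package for the requests library. It makes it very easy to cache requests, so that a repeated request returns the stored result directly.
Purpose and use cases
1. During crawling, it can be configured to decide what to cache according to the same caching rules a browser follows. The resulting request pattern looks more like a browser's, which helps against anti-scraping checks.
2. The caching rules can also be customized to improve performance in a crawler project.
requests-cache only caches requests made through the requests library, and the caching is done at the session level. Either make requests through a CachedSession, or call requests_cache.install_cache() first, which patches requests so that plain requests.get and requests.post calls go through a cached session as well. Without one of these, requests.get and requests.post are not cached.
Usage
Installation: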

$ pip install requests-cache

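Comparison with ordinary code
When crawling several URLs under the same domain, making the calls through a requests.Session (session.get or session.post) is more efficient than calling requests.get or requests.post directly, because a single session is created and reused for all the requests. A session also carries login information such as cookies from one request to the next.
The following compares code with and without caching. First, the code without caching:
Crawling with a plain requests session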

import time

import requests

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

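This code requests httpbin.org; the site interprets the delay/1 path and responds after 1 second, so the ten requests take at least 10 seconds in total.
The code with caching:
Crawling with a cached requests session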

import time

import requests_cache  # pip install requests-cache

start = time.time()
session = requests_cache.CachedSession('demo_cache')
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)
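To see why the cached loop finishes so much faster, here is a toy sketch of the idea behind a cached session: only the first request for a URL does real work, and later requests are answered from a stored copy. TinyCachedFetcher and slow_fetch are hypothetical names invented for this illustration, and this is not how requests-cache is implemented. No network access takes place; the URL is only used as a lookup key.

```python
import time


class TinyCachedFetcher:
    """Toy stand-in for a cached session: one stored result per (method, URL)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable(method, url) that does the slow work
        self._store = {}     # (method, url) -> stored result

    def get(self, url):
        """Return (result, from_cache) for a GET of the given URL."""
        key = ('GET', url)
        if key in self._store:
            return self._store[key], True   # answered from the cache
        result = self._fetch('GET', url)
        self._store[key] = result
        return result, False                # fetched for the first time


def slow_fetch(method, url):
    """Hypothetical slow backend imitating a delayed response (no network)."""
    time.sleep(0.05)
    return f'{method} {url}'


fetcher = TinyCachedFetcher(slow_fetch)
start = time.time()
hits = [fetcher.get('http://httpbin.org/delay/1')[1] for _ in range(10)]
print('cache hits:', hits.count(True), 'elapsed:', round(time.time() - start, 2))
```

Only the first of the ten calls pays the delay; the remaining nine are cache hits, which mirrors what happens in the CachedSession loop above.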

Adding caching to existing code with minimal changes
Only one extra statement is needed: requests_cache.install_cache('demo_cache').
Adding caching with minimal changes

import time

import requests
import requests_cache  # pip install requests-cache

requests_cache.install_cache('demo_cache')  # uses demo_cache.sqlite as the cache

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

Clearing and identifying the cache
To clear the cache, call:
requests_cache.clear()  # clear the cache
The res.from_cache attribute tells whether a response came from the cache:

import requests
import requests_cache

requests_cache.install_cache()  # enable caching
requests_cache.clear()  # clear the cache

url = 'http://httpbin.org/get'
res = requests.get(url)
print(f'cache exists: {res.from_cache}')
# cache exists: False  # not in the cache yet
res = requests.get(url)
print(f'exists cache: {res.from_cache}')
# exists cache: True  # served from the cache

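Customizing how the cache is configured
Called without extra arguments, requests_cache.install_cache uses its default settings (as noted in the parameter list, cached responses never expire by default). To customize the behavior, first look at the function's parameters:
Definition of requests_cache.install_cache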

requests_cache.install_cache(
    cache_name='cache',
    backend=None,
    expire_after=None,
    allowable_codes=(200,),
    allowable_methods=('GET',),
    filter_fn=<function <lambda> at 0x11c927f80>,
    session_factory=<class 'requests_cache.core.CachedSession'>,
    **backend_options,
)

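The parameters are as follows:
- cache_name: the name of the cache file.
- backend: the storage mechanism for the cache; sqlite is used by default. Four storage mechanisms are described here: memory, sqlite, mongoDB, and redis. Before using mongoDB or redis, install the corresponding module: pip install pymongo; pip install redis.
  - memory: stores the cache in memory as a dict; the cache is destroyed when the program exits
  - sqlite: stores the cache in a sqlite database
  - mongoDB: stores the cache in a mongoDB database
  - redis: stores the cache in redis
- expire_after: how long a cached response stays valid; by default it never expires.
- allowable_codes: the status codes whose responses may be cached; the default is (200,).
- allowable_methods: the request methods that may be cached; the default is GET, meaning only GET requests produce cache entries.
- session_factory: the session class that performs the caching; it must be CachedSession or a subclass of it.
- **backend_options: when the cache is stored in a sqlite, mongo, or redis database, these options configure the database connection.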
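As a rough illustration of how these parameters combine, the sketch below collects a few of them into a dict and passes them to requests_cache.CachedSession, which accepts the same caching options. The specific values (5 minutes, the 200 and 404 status codes, GET and POST) and the names CACHE_OPTIONS and make_session are assumptions chosen for the example, not recommendations.

```python
from datetime import timedelta

# Illustrative values only (assumptions); adjust them to the target site.
CACHE_OPTIONS = {
    'backend': 'memory',                   # keep the cache in a dict, lost on exit
    'expire_after': timedelta(minutes=5),  # cached responses expire after 5 minutes
    'allowable_codes': (200, 404),         # cache responses with these status codes
    'allowable_methods': ('GET', 'POST'),  # cache both GET and POST requests
}


def make_session(cache_name='demo_cache'):
    """Create a CachedSession from CACHE_OPTIONS (requires requests-cache)."""
    # Imported here so the options above can be inspected even where
    # requests-cache is not installed.
    import requests_cache
    return requests_cache.CachedSession(cache_name, **CACHE_OPTIONS)
```

Calling make_session() returns a session whose get and post methods are used exactly like those of an ordinary requests session.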

Custom cache example 1: setting the cache storage type
The code for choosing where the cache is stored is as follows:

import requests_cache

# Set up the cache: pick ONE of the following

# Use demo_cache.sqlite as the cache
requests_cache.install_cache('demo_cache')

# Use a demo_cache folder as the cache; deleting the folder clears the cache
requests_cache.install_cache('demo_cache', backend='filesystem')

# Put the cache folder in the system temp directory instead of next to the code
requests_cache.install_cache('demo_cache', backend='filesystem', use_temp=True)

# Put the cache folder in the system's dedicated cache directory instead of next to the code
requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)

# Redis: requires redis-py (pip install redis)
backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_cache', backend=backend)

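Other storage types:
- MongoDB: install pymongo (pip install pymongo); use requests_cache.MongoCache, backend name 'mongodb'
- GridFS: install pymongo; use requests_cache.GridFSCache, backend name 'gridfs'
- DynamoDB: install boto3; use requests_cache.DynamoDbCache, backend name 'dynamodb'
- Memory: stores the cache in memory as a dict, destroyed when the program exits; use requests_cache.BaseCache, backend name 'memory'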

Custom cache example 2: setting what gets cached
Example code:

import time

import requests
import requests_cache

# Cache only POST requests
requests_cache.install_cache('demo_cache2', allowable_methods=['POST'])

# Cache only responses with status code 200
requests_cache.install_cache('demo_cache2', allowable_codes=(200,))

Setting the cache expiration time:

# Content from site1.com is cached for 30 seconds; content from site2.com/static never expires
urls_expire_after = {'*.site1.com': 30, 'site2.com/static': -1}
requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)

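A browser decides whether to keep a response in its cache based on the Cache-Control header of that response. requests_cache can be configured to behave the same way: setting the cache_control parameter to True makes it take Cache-Control headers into account, so content that a browser would not store is not cached either.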

import time

import requests
import requests_cache

# Take Cache-Control headers into account when deciding what to cache
requests_cache.install_cache('demo_cache3', cache_control=True)

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for get', end - start)

start = time.time()
for i in range(10):
    session.post('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time for post', end - start)

If Cache-Control: no-store is added to the Request Headers, the cache does not take effect even though caching has been enabled.
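The no-store rule can be illustrated with a small, simplified sketch: a response should not be stored if the Cache-Control header on either the request or the response contains no-store. The helper names parse_cache_control and may_store are hypothetical, and this is only an illustration of the rule, not the header parsing that requests-cache itself performs.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument-or-True} dict."""
    directives = {}
    for part in value.split(','):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition('=')
        # Directives without an argument (such as no-store) map to True.
        directives[name.strip()] = arg.strip() or True
    return directives


def may_store(request_headers, response_headers):
    """Return False if either side forbids storing the response via no-store."""
    for headers in (request_headers, response_headers):
        # Header names are case-insensitive, so normalise them before the lookup.
        lowered = {key.lower(): val for key, val in headers.items()}
        if 'no-store' in parse_cache_control(lowered.get('cache-control', '')):
            return False
    return True
```

For example, may_store({'Cache-Control': 'no-store'}, {}) returns False, matching the case described above where the directive is sent in the request headers.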