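Python Web Scraping Day 5
The Requests Module (Part 2)
Handling sites with untrusted certificates
1. Requirement: send a request to a site whose SSL certificate is not trusted, and scrape its data
2. Target url: https://inv-veri.chinatax.gov...
3. What is SSL?
(1) Definition: an SSL certificate is a type of digital certificate, installed on the server
https = http + ssl
(2) Characteristics: an SSL certificate follows the SSL protocol and is issued by a trusted certificate authority (CA) after it has verified the applicant's identity. A certificate that a company creates itself is still untrusted, even though the address shows https
(3) Function: an SSL certificate provides both server identity verification and encryption of the data in transit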
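To make "verified" versus "unverified" concrete, here is a minimal sketch using only the standard-library `ssl` module. It illustrates what skipping certificate checks means at the TLS level; it is not a description of how Requests is implemented, and the variable names `strict_ctx` and `relaxed_ctx` are made up for this example.

```python
import ssl

# Default context: the certificate chain and the host name are both checked.
strict_ctx = ssl.create_default_context()

# Relaxed context: both checks are switched off. Only reasonable for a site
# you have already decided to trust despite its certificate.
relaxed_ctx = ssl.create_default_context()
# check_hostname has to be disabled before verify_mode can be set to CERT_NONE.
relaxed_ctx.check_hostname = False
relaxed_ctx.verify_mode = ssl.CERT_NONE

print(strict_ctx.verify_mode == ssl.CERT_REQUIRED)
print(relaxed_ctx.verify_mode == ssl.CERT_NONE)
```

A connection made with the relaxed context is still encrypted, but the identity of the server is no longer confirmed, which is the trade-off to keep in mind whenever verification is turned off.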
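cookie
1. Definition
A cookie identifies a user by means of information recorded on the client
HTTP is a stateless protocol: the interaction between client and server is limited to one request and its response, and the connection is dropped once that is finished. On the next request the server treats the client as a new one. To keep continuity between them, so that the server knows the request comes from the same user as before, the client's information has to be stored somewhere
2. Uses
(1) Getting past anti-scraping measures
(2) Simulating a login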
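A cookie copied from the browser's developer tools is a single string of `name=value` pairs separated by `; `. As a minimal sketch, the hypothetical helper `parse_cookie_header` below turns such a string into a dict; the sample string is invented for illustration. A dict like this could then be passed to Requests through its `cookies=` parameter instead of placing the raw string in the headers.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because a value may itself contain '='.
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies


# Made-up sample, not a real session cookie.
sample = 'sessionid=abc123; theme=dark; token=x=y'
print(parse_cookie_header(sample))
```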
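Supplement: requests and responses
1. Server-side rendering: the data is visible in the page source
2. Client-side rendering: the data is not visible in the page source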
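One quick way to tell the two cases apart is to check whether text you can see in the browser also appears in the raw source returned by the server. The sketch below uses two invented HTML snippets and a hypothetical helper `looks_server_rendered`; with a real site, the source string would come from the response body instead.

```python
def looks_server_rendered(page_source, expected_text):
    """Return True if text seen in the browser is present in the raw source."""
    return expected_text in page_source


# Made-up pages: the first ships its data in the HTML, the second only ships
# an empty container plus a script that would fetch the data later.
ssr_html = '<html><body><h1>Train G1 has tickets</h1></body></html>'
csr_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(looks_server_rendered(ssr_html, 'Train G1'))  # True
print(looks_server_rendered(csr_html, 'Train G1'))  # False
```

When the check fails, the data usually arrives through a separate request, which can be found in the Network panel of the browser's developer tools.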
Code
(1) Code: website_ssl
import requests

# Target url
url = 'https://inv-veri.chinatax.gov.cn/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
# verify=False skips certificate verification for this untrusted site
res = requests.get(url, headers=header, verify=False)
print(res.content.decode('utf-8'))

"""
html = res.content.decode('utf-8')
filename = 'gov' + '.html'
with open(filename, 'w', encoding='utf-8') as g:
    g.write(html)
# just playing around
"""
(2) Code: qzone - simulated login
import requests

# Target url
url = 'https://user.qzone.qq.com/xxxxxxxxxx'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    # Cookie copied from a logged-in browser session; adjacent string
    # literals are joined into one header value
    'cookie': (
        'RK=5lj83q5PSd; '
        'ptcz=1fa7f93cc2c18f43147a7189627ae6080740f72c3df2d08eb012e46db14a477b; '
        'pgv_pvid=2420772099; '
        'fqm_pvqid=73e03e61-6f91-4107-bc29-e1ac0449e88f; '
        'tmeLoginType=2; '
        'psrf_qqrefresh_token=53D67C7167BD9A1CAB39187D4C792C97; '
        'psrf_qqunionid=; '
        'wxunionid=; '
        'psrf_qqaccess_token=9CBCE82B15E8671851EFD4B1290DB131; '
        'wxopenid=; '
        'psrf_access_token_expiresAt=1630838824; '
        'psrf_qqopenid=24FAC6F91E334941373749413AE8BBB0; '
        'wxrefresh_token=; '
        'euin=oK45NK6FNeCq7n**; '
        'pac_uid=1_519188694; '
        'iip=0; '
        'pgv_info=ssid=s4897751060; '
        'o_cookie=1519188694; '
        'eas_sid=O1o6S2P8b5s65402b7L2L9p980; '
        'pvpqqcomrouteLine=wallpaper_wallpaper_wallpaper; '
        '_qpsvr_localtk=0.12057115483060432; '
        'welcomeflash=1519188694_96544; '
        'qz_screen=1536x864; '
        '1519188694_todaycount=0; '
        '1519188694_totalcount=34134; '
        'QZ_FE_WEBP_SUPPORT=1; '
        'cpu_performance_v8=6; '
        '__Q_w_s__QZN_TodoMsgCnt=1; '
        'zzpaneluin=; '
        'zzpanelkey=; '
        '_qz_referrer=i.qq.com; '
        'uin=o1519188694; '
        'skey=@SJO1fKDSM; '
        'p_uin=o1519188694; '
        'pt4_token=Aqy-OiLi1f7tedwzlg1wUy*laYC9M8AwdlPLQ-BIcHw_; '
        'p_skey=LeobMj5DrbZe75MiQKQXphzo6O-d3OqX25A8MIXGvNo_; '
        'qzone_check=1519188694_1628837790'
    ),
}
res = requests.get(url, headers=header)
html = res.content.decode('utf-8')
# Save the logged-in page so it can be opened and checked in a browser
with open('qzone1.html', 'w', encoding='utf-8') as f:
    f.write(html)
print(html)
(3) Code: 12306 - getting past anti-scraping measures
import json

import requests

# Target url
url = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2021-08-18&leftTicketDTO.from_station=BJP&leftTicketDTO.to_station=CQW&purpose_codes=ADULT'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    # Without this cookie the request is rejected by the site's anti-scraping check
    'Cookie': (
        '_uab_collina=162859214484913672330819; '
        'JSESSIONID=3C9DB3104DC02AC8A7C9F0D62730C089; '
        '_jc_save_wfdc_flag=dc; '
        'RAIL_EXPIRATION=1629178843189; '
        'RAIL_DEVICEID=YVHwYLGf-RmKXq__VM-j4-kE5zSKDK92l_0zwONP7fBhWgqOrFnEF7hIUWCvkHM9NvfBzP_ske3ujsIS24pju38UI2R12sKWoTA7fQC7tvGPBrsKdW0hgcgVi_0aLunJ9RTtDmY02BcH4ZCmj8l7h44hhmfDHsd8; '
        'BIGipServerotn=3671523594.50210.0000; '
        'BIGipServerpassport=971505930.50215.0000; '
        'route=6f50b51faa11b987e576cdb301e545c4; '
        '_jc_save_fromStation=%u5317%u4EAC%2CBJP; '
        '_jc_save_toDate=2021-08-16; '
        '_jc_save_toStation=%u91CD%u5E86%2CCQW; '
        '_jc_save_fromDate=2021-08-18'
    ),
}

# Fetch the page source
res = requests.get(url, headers=header)
# html_str = res.content.decode('utf-8')
# html_dict = res.json()
# print(type(html_str), type(html_dict))
# print(html_dict)
# Inspect the printed data above before trying to parse it
html_str = res.content.decode('utf-8')
html_dict = json.loads(html_str)
# print(html_dict)
# Inspect the printed data above before trying to parse it; apart from the key numbers, some of the symbols change between requests

# Parse the data
results = html_dict['data']['result']
# print(results)
for result in results:
    # print(result)
    # print('*' * 100)
    data_lst = result.split('|')
    # flag = 0
    # for d in data_lst:
    #     print(flag, d)
    #     flag += 1
    # print('*' * 50)
    # Our guess: the premium-class (特等座) seat info is at index 32, and the train number is at index 3
    t_name = data_lst[3]
    t_number = data_lst[32]
    # print(t_name, t_number)

    # Check availability ('无' is the value the site returns for "none")
    if t_number != '' and t_number != '无':
        print(t_name, 'tickets available')
    else:
        print(t_name, 'no tickets')
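The parsing step can be tried without touching the network. The sketch below wraps the same logic in a hypothetical function `ticket_status` and feeds it a fabricated record; the train number `G309`, the field count of 36, and the seat values are invented for illustration, while indexes 3 and 32 are the guesses made in the code above.

```python
def ticket_status(record, name_index=3, seat_index=32):
    """Return (train_name, has_ticket) for one '|'-separated result string."""
    fields = record.split('|')
    name = fields[name_index]
    seat = fields[seat_index]
    # An empty field or '无' ("none") both mean there is nothing to book.
    has_ticket = seat not in ('', '无')
    return name, has_ticket


def make_record(name, seat, size=36):
    """Build a fake record with only the two fields of interest filled in."""
    fields = [''] * size
    fields[3] = name
    fields[32] = seat
    return '|'.join(fields)


print(ticket_status(make_record('G309', '5')))   # ('G309', True)
print(ticket_status(make_record('G309', '无')))  # ('G309', False)
print(ticket_status(make_record('G309', '')))    # ('G309', False)
```

Keeping the indexes as parameters makes it easy to adjust them if the field layout turns out to differ from the guess.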