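...With nothing better to do, I scraped my beloved Bilibili (B站)~~~ RIP
First, open Bilibili's anime (bangumi) index page.
ps: I used to browse this index page all the time looking for anime to watch, so I know my way around it~ (smirk)
Turning the page does not change the URL. The Network tab in Chrome DevTools shows that each page turn fires an XHR request, and the response comes back as JSON.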
Paste the response into Atom to see what the data looks like.
To handle paging, look at the pattern in the query string: of all those parameters, only page changes from one request to the next.
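The paging rule can be sketched as a small URL builder. This is only an illustration: the parameter values are copied from the spider's start URL, while the names `BASE_URL`, `FIXED_PARAMS` and `build_page_url` are made up for this example and do not appear in the project.

```python
from urllib.parse import urlencode

# Endpoint seen in the Network panel.
BASE_URL = "https://bangumi.bilibili.com/media/web_api/search/result"

# Every parameter except `page` stays the same across requests.
FIXED_PARAMS = {
    "season_version": -1,
    "area": -1,
    "is_finish": -1,
    "copyright": -1,
    "season_status": -1,
    "season_month": -1,
    "pub_date": -1,
    "style_id": -1,
    "order": 3,
    "st": 1,
    "sort": 0,
    "season_type": 1,
    "pagesize": 20,
}


def build_page_url(page):
    """Return the index API URL for the given 1-based page number."""
    params = dict(FIXED_PARAMS, page=page)  # `page` is appended last
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the first three page URLs to check the pattern by eye.
    for page in range(1, 4):
        print(build_page_url(page))
```

The order of the parameters differs from the URL in the browser, which should not matter for a query string.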
So everything from here on is easy~ hehe
items.py
import scrapy
from scrapy import Field


class BilibiliItem(scrapy.Item):
    title = Field()
    cover = Field()
    sum_index = Field()
    is_finish = Field()
    link = Field()
    follow = Field()
    plays = Field()
    score = Field()
    _id = Field()
bzhan.py
import scrapy
import demjson  # third-party library, install it with pip first

from bilibili.items import BilibiliItem

# Only the page number changes between requests.
API_URL = ('https://bangumi.bilibili.com/media/web_api/search/result'
           '?season_version=-1&area=-1&is_finish=-1&copyright=-1'
           '&season_status=-1&season_month=-1&pub_date=-1&style_id=-1'
           '&order=3&st=1&sort=0&page={0}&season_type=1&pagesize=20')


class BzhanSpider(scrapy.Spider):
    name = 'bzhan'
    allowed_domains = ['bilibili.com']
    start_urls = [API_URL.format(1)]

    def parse(self, response):
        json_content = demjson.decode(response.body)
        datas = json_content["result"]["data"]
        for data in datas:
            # Create a fresh item per record so yielded items are not overwritten.
            item = BilibiliItem()
            try:
                score = data['order']['score']
            except KeyError:
                score = 'unknown'
            title = data['title']

            item['_id'] = title
            item['cover'] = data['cover']
            item['sum_index'] = data['index_show']
            item['is_finish'] = 'finished' if data['is_finish'] == 1 else 'ongoing'
            item['link'] = data['link']
            item['follow'] = data['order']['follow']
            item['plays'] = data['order']['play']
            item['score'] = score
            item['title'] = title
            yield item

        # Queue the remaining pages; Scrapy's duplicate filter drops repeats.
        for k in range(2, 156):
            yield scrapy.Request(API_URL.format(k), callback=self.parse)
The decoded JSON is just nested Python dicts and lists, so parsing it is plain key lookups.. nothing hard.
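As a rough sketch of that dict-style parsing outside Scrapy, the helper below flattens one entry of `result["data"]`. The function name `parse_record` and the sample record are invented for illustration; the sample only mimics the keys the spider reads and is not real API output.

```python
def parse_record(data):
    """Flatten one entry of result["data"] into a plain dict."""
    order = data.get("order", {})
    return {
        "_id": data["title"],
        "title": data["title"],
        "cover": data["cover"],
        "sum_index": data["index_show"],
        "is_finish": "finished" if data["is_finish"] == 1 else "ongoing",
        "link": data["link"],
        "follow": order.get("follow"),
        "plays": order.get("play"),
        # Some entries carry no score; fall back to a placeholder.
        "score": order.get("score", "unknown"),
    }


# Made-up record with the same keys as the API entries (values are placeholders).
sample = {
    "title": "Example Title",
    "cover": "https://example.com/cover.jpg",
    "index_show": "12 episodes",
    "is_finish": 1,
    "link": "https://example.com/bangumi/1",
    "order": {"follow": "example follow count", "play": "example play count"},
}

if __name__ == "__main__":
    print(parse_record(sample))
```

Using `dict.get` with a default does the same job as the try/except around the score lookup, just in one line.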
pipelines.py
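import pymongo


class BilibiliPipeline(object):
    def open_spider(self, spider):
        # Connect once per crawl instead of once per item.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.bilibili = self.client['mydb']['bilibili']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy of the item.
        self.bilibili.insert_one(dict(item))
        print(item)
        return item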
settings.py is omitted.
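Since the post does not show settings.py, the fragment below is only a guess at the minimum it needs. The module path assumes the project package is named `bilibili` (as the spider's import suggests) and that the pipeline lives in `pipelines.py`; the delay value is an arbitrary assumption, not something taken from the original project.

```python
# settings.py (sketch; the original file is not shown in the post)

BOT_NAME = 'bilibili'
SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

# Register the pipeline, otherwise items never reach MongoDB.
# The number is the pipeline's order (lower runs first).
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
}

# Assumed value: wait a little between requests to go easy on the site.
DOWNLOAD_DELAY = 0.5
```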
The crawl ends up with a little over three thousand records.
Feeling sorry for my Bilibili for a second..