Crawler | Scrapy Example: Scraping All of Bilibili's Anime Series Info (Ajax API + JSON Parsing)

...Had nothing better to do, so I scraped my beloved B站~~~ RIP
First, open B站's bangumi (anime series) index page.
PS: I used to browse this index page all the time to find anime to watch, so the moves come naturally ~ heh

Paging through the index, the URL never changes. Opening the Network tab in Chrome DevTools shows that each page load fires an XHR request whose response comes back as JSON.
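Before writing the spider, you can hit that endpoint directly to confirm it really is plain JSON. A quick sketch with requests (the URL is copied straight from the Network tab, so treat the parameter values as whatever your own capture shows):

import requests

# URL taken from the DevTools Network tab (page 1 of the index)
url = ('https://bangumi.bilibili.com/media/web_api/search/result'
       '?season_version=-1&area=-1&is_finish=-1&copyright=-1'
       '&season_status=-1&season_month=-1&pub_date=-1&style_id=-1'
       '&order=3&st=1&sort=0&page=1&season_type=1&pagesize=20')
resp = requests.get(url)
print(resp.json()['result']['data'][0])  # first entry of page 1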

Dropping the response into Atom to see what the data looks like:

To handle pagination, look for a pattern in the query string: among all those parameters, only page changes from request to request.
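Since only page varies, there is no need to hand-edit a long URL string for every page; the query can be rebuilt from a parameter dict. A small sketch with urllib.parse.urlencode, using the same parameters as above:

from urllib.parse import urlencode

BASE = 'https://bangumi.bilibili.com/media/web_api/search/result'
PARAMS = {
    'season_version': -1, 'area': -1, 'is_finish': -1, 'copyright': -1,
    'season_status': -1, 'season_month': -1, 'pub_date': -1, 'style_id': -1,
    'order': 3, 'st': 1, 'sort': 0, 'season_type': 1, 'pagesize': 20,
}

def page_url(page):
    # merge the fixed params with the one value that changes
    return '{0}?{1}'.format(BASE, urlencode(dict(PARAMS, page=page)))

print(page_url(2))  # ...&page=2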

With that figured out, the rest is easy ~ hehe
items.py

import scrapy
from scrapy import Field


class BilibiliItem(scrapy.Item):
    title = Field()        # series title
    cover = Field()        # cover image URL
    sum_index = Field()    # episode-count display text
    is_finish = Field()    # finished / still airing
    link = Field()         # detail page URL
    follow = Field()       # follower count
    plays = Field()        # play count
    score = Field()        # user rating (may be absent)
    _id = Field()          # reused as the MongoDB _id

bzhan.py
import scrapy
import demjson  # not in the stdlib -- pip install demjson first
from bilibili.items import BilibiliItem

# only the page parameter ever changes, so template it out
BASE_URL = ('https://bangumi.bilibili.com/media/web_api/search/result'
            '?season_version=-1&area=-1&is_finish=-1&copyright=-1'
            '&season_status=-1&season_month=-1&pub_date=-1&style_id=-1'
            '&order=3&st=1&sort=0&page={0}&season_type=1&pagesize=20')


class BzhanSpider(scrapy.Spider):
    name = 'bzhan'
    allowed_domains = ['bilibili.com']
    start_urls = [BASE_URL.format(1)]

    def parse(self, response):
        json_content = demjson.decode(response.text)
        datas = json_content['result']['data']
        for data in datas:
            item = BilibiliItem()  # a fresh item per entry, not one shared instance
            item['title'] = data['title']
            item['_id'] = data['title']  # the title doubles as the Mongo _id
            item['cover'] = data['cover']
            item['sum_index'] = data['index_show']
            # is_finish is 1 for finished series, 0 for ones still airing
            item['is_finish'] = '已完结' if data['is_finish'] == 1 else '未完结'
            item['link'] = data['link']
            item['follow'] = data['order']['follow']
            item['plays'] = data['order']['play']
            try:
                item['score'] = data['order']['score']
            except KeyError:  # unrated series carry no score field
                item['score'] = '未知'
            yield item

        # queue the remaining pages; Scrapy's dupefilter drops re-submissions
        urls = [BASE_URL.format(k) for k in range(2, 156)]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
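A side note on demjson: the response here is standard JSON, so the standard library would do just as well; demjson only earns its keep on malformed or "relaxed" JSON. Either line below could replace the demjson.decode call (the second assumes Scrapy 2.2+, which added Response.json()):

import json
json_content = json.loads(response.text)  # stdlib equivalent
# json_content = response.json()          # Scrapy >= 2.2 shortcut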

The JSON is parsed through plain Python dict access; nothing tricky.
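For fields that may be missing, dict.get with a default is a tidier alternative to the try/except used in the spider; a sketch against the same structure:

order = data.get('order', {})
follow = order.get('follow')
plays = order.get('play')
score = order.get('score', '未知')  # unrated series have no score key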
pipelines.py
import pymongo


class BilibiliPipeline(object):
    def open_spider(self, spider):
        # one client for the whole crawl instead of reconnecting per item
        self.client = pymongo.MongoClient('localhost', 27017)
        self.bilibili = self.client['mydb']['bilibili']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.bilibili.insert_one(dict(item))  # insert_one expects a plain dict
        return item

settings.py is omitted here.
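One thing settings.py does have to contain, though, is the pipeline registration, or nothing reaches Mongo. A plausible minimal version, assuming the project is named bilibili as the import paths above suggest:

BOT_NAME = 'bilibili'
SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'

ROBOTSTXT_OBEY = False   # the API path may be disallowed by robots.txt
DOWNLOAD_DELAY = 0.5     # be gentle: 155 pages go by fast

ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,  # enable the Mongo pipeline
}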
The crawl ends up collecting a bit over three thousand entries.
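To sanity-check the haul, count the stored documents with pymongo (assuming the same localhost instance the pipeline wrote to):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
print(client['mydb']['bilibili'].count_documents({}))  # should report 3000+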

Spare a second of sympathy for my poor B站...
