Today we're going to scrape the manga Kaguya-sama: Love Is War (《辉夜大小姐想让我告白》) from this site (the poor rely on technology, the rich rely on coins; you know what I mean, enough said).
It boils down to two steps: 1. find the links to every chapter on the index page; 2. find all the images on each chapter page.
If you just want the source code, skip straight to the end. First, we find the link to each chapter:
# Get the chapter links and chapter names. The exact pattern depends on the
# page markup; it just needs to capture (link, title) pairs, roughly like this:
hrefs = re.findall(r'<a href="(.*?)" title="(.*?)"', r.text)
for href in hrefs:
    # Build the full chapter URL
    chapter_url = 'http://www.90mh.com' + href[0]
    name = href[1]
    chapter_path = root_path + '\\' + name
    print(chapter_path)
    # e.g. 辉夜大小姐想让我告白\周刊13话
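The snippet above assumes r already holds the response for the manga's index page and that root_path is the local folder everything is saved under. A minimal sketch of that setup with plain requests (the URL, headers, and root_path are the ones used in the full source at the end of the post; substitute the index page of whatever manga you want):

import requests

# Index page of the manga (taken from the full source below)
url = 'http://www.90mh.com/manhua/zongzhijiushifeichangkeai/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
root_path = '总之就是非常可爱'   # local folder the chapters are saved under

r = requests.get(url, headers=headers, timeout=30)
r.encoding = r.apparent_encoding   # chapter titles are Chinese; fix the encoding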
Then we go into one of the chapters and find all of its images:
# Get the chapter's image list and the path they live under
chapter_imges = re.search(r'chapterImages = (\[.*?\])', chapter_page.text, re.S)
chapter_src = re.search(r'chapterPath = "(.*?)"', chapter_page.text).group(1)
''' ...... '''
pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
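One detail the elided part covers: chapterImages in the page is a JavaScript array literal, so the re.search match is still a string and has to be turned into a Python list before it can be indexed. The full source does this with ast.literal_eval; a short sketch of that step, with made-up file names for illustration:

from ast import literal_eval

# Continuing from the variables above (example values are made up):
chapter_imges = chapter_imges.group(1)       # -> '["0001.jpg", "0002.jpg"]'
chapter_imges = literal_eval(chapter_imges)  # -> ['0001.jpg', '0002.jpg'], a real Python list

for i in range(len(chapter_imges)):
    # chapter_src is the chapterPath captured above; the images are served from this CDN host
    pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]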
Final result: success!
Of course, different sites are structured differently, so the scraping approach varies a bit; 动漫之家, for example, works differently, and I followed a separate reference for that one.
But there are really only a handful of patterns, and they're easy enough to figure out. I've scraped four or five sites so far and they all worked; feel free to try it yourself.
Source code:
The script uses multiple coroutines, which is dozens of times faster than the straightforward sequential approach, but it's a bit more work to write, and some URLs occasionally time out, so you may need to run it a few times. I also use a proxy here; set up your own and change the proxy IP address accordingly.
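Before the full listing, here is a stripped-down sketch of that coroutine-pool idea, so the structure of the real script is easier to follow: start group_size workers, and whenever one finishes a chapter it pops the next pending chapter off a shared list. The fetch_chapter function and the hrefs entries below are hypothetical placeholders, not the real download logic:

import asyncio
import aiohttp

group_size = 5                      # number of chapters downloaded at the same time
proxy = 'http://127.0.0.1:7890'     # swap in your own proxy, or remove the proxy argument

async def fetch_chapter(session, link, title):
    # Hypothetical stand-in for get_image(): grab one chapter page,
    # retrying once with a longer timeout, then chain onto the next chapter.
    try:
        async with session.get(link, proxy=proxy, timeout=30) as resp:
            html = await resp.text()
    except Exception:
        async with session.get(link, proxy=proxy, timeout=50) as resp:
            html = await resp.text()
    # ... parse html and download the images here ...
    if hrefs:                       # more chapters waiting? take over the next one
        nxt = hrefs.pop()
        await fetch_chapter(session, nxt[0], nxt[1])

async def main():
    global hrefs
    # (link, title) pairs scraped from the index page; fake entries for illustration
    hrefs = [('http://www.90mh.com/manhua/.../1.html', '第1话'),
             ('http://www.90mh.com/manhua/.../2.html', '第2话')]
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(min(group_size, len(hrefs))):
            link, title = hrefs.pop()
            tasks.append(asyncio.create_task(fetch_chapter(session, link, title)))
        await asyncio.wait(tasks)

asyncio.run(main())

The full script below follows this same shape, with get_image() doing the actual parsing and image downloads.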
import requests
import re
import time
import os
from ast import literal_eval
import asyncio
import aiohttp
import aiofiles


async def get_image(session, href_url, name):
    # Build the chapter URL
    chapter_url = 'http://www.90mh.com' + href_url
    chapter_path = root_path + '\\' + name
    print(chapter_path)
    # Create the chapter folder
    if not os.path.exists(chapter_path):
        os.mkdir(chapter_path)
    try:
        async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
            r = await response.text()
    except:
        # Retry once if the first request fails or times out
        async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
            r = await response.text()
    # Get the chapter's image list and the path they live under
    chapter_imges = re.search(r'chapterImages = (\[.*?\])', r, re.S)
    chapter_src = re.search(r'chapterPath = "(.*?)"', r).group(1)
    chapter_imges = chapter_imges.group(1)
    # Convert the string form of the list into a real Python list
    chapter_imges = literal_eval(chapter_imges)
    tasks = []
    for i in range(len(chapter_imges)):
        # Zero-pad single-digit page numbers so the files sort correctly
        if i < 10:
            pic_path = chapter_path + '\\' + str(0) + str(i) + '.jpg'
        else:
            pic_path = chapter_path + '\\' + str(i) + '.jpg'
        print(pic_path)
        if not os.path.exists(pic_path):
            pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
            tasks.append(asyncio.create_task(get_photo(session, pic_url, pic_path)))
    if tasks:
        await asyncio.wait(tasks)
    # When this chapter is done, chain onto the next chapter still waiting
    if hrefs:
        href = hrefs.pop()
        task = [asyncio.create_task(get_image(session, href[0], href[1]))]
        await asyncio.wait(task)


async def get_photo(session, pic_url, pic_path):
    try:
        async with session.get(pic_url, headers=pic_headers, timeout=30) as p:
            pic = await p.content.read()
    except:
        # Retry once with a longer timeout
        async with session.get(pic_url, headers=pic_headers, timeout=50) as p:
            pic = await p.content.read()
    # Write the image to disk asynchronously
    async with aiofiles.open(pic_path, 'wb') as f:
        await f.write(pic)


# Number of chapters downloaded concurrently
group_size = 5
# Local proxy; change the IP and port to your own setup
ip = '127.0.0.1:7890'
proxy = 'http://' + ip
proxies = {
    'http': 'http://' + ip,
    'https': 'https://' + ip
}
# Manga index page
url = 'http://www.90mh.com/manhua/zongzhijiushifeichangkeai/'
host = 'www.90mh.com'
headers = {
    'Host': 'www.90mh.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
pic_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}
# Root folder the manga is saved under
root_path = '总之就是非常可爱'


async def main():
    # Create the root folder
    if not os.path.exists(root_path):
        os.mkdir(root_path)
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url, headers=headers, proxy=proxy, timeout=30) as response:
                r = await response.text()
        except:
            # Retry once with a longer timeout
            async with session.get(url, headers=headers, proxy=proxy, timeout=50) as response:
                r = await response.text()
        # Get the chapter links and chapter names
        global hrefs
        # The exact pattern depends on the page markup; it just needs to capture
        # (chapter link, chapter title) pairs, roughly like this:
        hrefs = re.findall(r'<a href="(.*?)" title="(.*?)"', r)
        tasks = []
        # Start at most group_size chapter downloads at once
        if len(hrefs) < group_size:
            num = len(hrefs)
        else:
            num = group_size
        for i in range(num):
            href = hrefs.pop()
            tasks.append(asyncio.create_task(get_image(session, href[0], href[1])))
        await asyncio.wait(tasks)


if __name__ == '__main__':
    asyncio.run(main())