Pagination - 锐客网

Let's first look at the page we want to scrape, http://quotes.toscrape.com/. At the bottom of the page there is a "Next" pagination button.

The HTML for the button looks like this:

<li class="next"><a href="/page/2/">Next →</a></li>

We need to extract the href value of the <a> tag to build the link to the next page. Let's try it in the Scrapy shell first:

>>> response.css('li.next > a::attr(href)').extract_first()
'/page/2/'

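The ::attr(name) pseudo-element, a Scrapy extension to CSS selectors, selects the value of the named attribute of the matched tag, here href.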
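To make it clearer what the selector 'li.next > a::attr(href)' matches, here is a rough sketch that does the same job with only Python's standard library. It is not how Scrapy works internally. The names NextLinkParser and first_next_href are hypothetical, and the HTML snippet is an assumed, simplified version of the pager markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be kept on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NextLinkParser(HTMLParser):
    """Collect the href of every <a> whose direct parent is <li class="next">."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, class list) for each currently open element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.stack and "href" in attrs:
            parent_tag, parent_classes = self.stack[-1]
            # "li.next > a": the parent must be an <li> carrying the class "next".
            if parent_tag == "li" and "next" in parent_classes:
                self.hrefs.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, (attrs.get("class") or "").split()))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def first_next_href(html):
    """Rough equivalent of .css('li.next > a::attr(href)').extract_first()."""
    parser = NextLinkParser()
    parser.feed(html)
    return parser.hrefs[0] if parser.hrefs else None


pager = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
print(first_next_href(pager))
```

As with extract_first(), the helper returns None when no matching link exists. That is the case the spider has to handle on the last page.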
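The extracted value '/page/2/' is only a relative path, not the full URL we ultimately want. We can use response.urljoin() to build the absolute URL: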

>>> next_page = response.css('li.next > a::attr(href)').extract_first()
>>> next_page = response.urljoin(next_page)
>>> next_page
'http://quotes.toscrape.com/page/2/'
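Scrapy's documentation describes response.urljoin() as a wrapper around the standard library's urllib.parse.urljoin, with the response's own URL as the base. The sketch below calls the standard-library function directly to show how different kinds of href are resolved. The base URL is assumed to be the first page of the site, and the third href is a made-up example.

```python
from urllib.parse import urljoin

# Assumed base: the URL of the page the link was found on.
base = "http://quotes.toscrape.com/page/1/"

# A root-relative href (leading slash) replaces the whole path of the base.
root_relative = urljoin(base, "/page/2/")

# An href without a leading slash is resolved against the base's directory.
dir_relative = urljoin(base, "page/2/")

# An absolute href is returned unchanged.
absolute = urljoin(base, "http://example.com/other/")

print(root_relative)
print(dir_relative)
print(absolute)
```

The href on this site starts with a slash, so joining it with any page of the site gives the same absolute URL.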

Now let's apply this pagination approach in our spider to scrape the data:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Get the href attribute of the "next page" link
        next_page = response.css('li.next > a::attr(href)').extract_first()
        # Check whether a next-page URL exists
        if next_page is not None:
            # Build the full URL with urljoin()
            next_page = response.urljoin(next_page)
            # Request the next page and handle it with parse() again
            yield scrapy.Request(next_page, callback=self.parse)
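The control flow of parse() can be hard to see because Scrapy schedules the requests for us. The following is a minimal sketch of the same idea without Scrapy: emit the items on a page, follow the "next" link, and stop once there is none. FAKE_SITE, the quotes.example URLs and crawl() are hypothetical stand-ins. Unlike Scrapy, this sketch runs sequentially and does no request scheduling.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for a site:
# page URL -> (quotes on that page, href of the "Next" link or None).
FAKE_SITE = {
    "http://quotes.example/page/1/": (["q1", "q2"], "/page/2/"),
    "http://quotes.example/page/2/": (["q3"], "/page/3/"),
    "http://quotes.example/page/3/": (["q4"], None),  # last page: no "Next" link
}


def crawl(start_url):
    """Yield every quote, following "Next" links until none is left."""
    url = start_url
    while url is not None:
        quotes, next_href = FAKE_SITE[url]
        yield from quotes
        # Same check as in parse(): only continue if a next link exists,
        # and turn the relative href into an absolute URL first.
        url = urljoin(url, next_href) if next_href is not None else None


items = list(crawl("http://quotes.example/page/1/"))
print(items)
```

In the spider, the check for None plays the same role as the loop condition here. Without it, the last page would pass None to response.urljoin() and request a page that has already been visited.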

Run the spider with the following command: