spider-request/response

1.Request

  • Partial source code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

  • The most commonly used parameters:
url: the URL to request and then process.
callback: the function that will handle the Response returned by this request.
method: usually not specified; defaults to GET. It can be set to "GET", "POST", "PUT", etc., and the string must be uppercase.
headers: headers sent with the request. Usually not needed. Typical content looks like this:
    # familiar to anyone who has written a crawler
    Host: media.readthedocs.org
    User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
    Accept: text/css,*/*; q=0.1
    Accept-Language: zh-cn,zh; q=0.8,en-us; q=0.5,en; q=0.3
    Accept-Encoding: gzip, deflate
    Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
    Cookie: _ga=GA1.2.1612165614.1415584110;
    Connection: keep-alive
    If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
    Cache-Control: max-age=0
meta: very commonly used; a dict for passing data between requests (a short sketch follows this list). For example:
    request_with_cookies = Request(
        url="http://www.example.com",
        cookies={'currency': 'USD', 'country': 'UY'},
        meta={'dont_merge_cookies': True}
    )
encoding: the default 'utf-8' is fine.
dont_filter: tells the scheduler not to filter this request. Use it when you want to issue the same request several times and ignore the duplicates filter. Defaults to False.
errback: the function called when the request fails.
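Since meta is the parameter people reach for most, here is a minimal sketch of passing data from one callback to the next through it. The spider name, URLs, and meta keys below are made up for illustration only:

import scrapy

class MetaDemoSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate request.meta
    name = "meta_demo"
    start_urls = ["http://www.example.com/list"]

    def parse(self, response):
        # attach data to the request; it travels along to the next callback
        yield scrapy.Request(
            url="http://www.example.com/detail",      # made-up URL
            callback=self.parse_detail,
            meta={"category": "demo", "page": 1},
        )

    def parse_detail(self, response):
        # response.meta proxies request.meta, so the same dict is available here
        category = response.meta["category"]
        page = response.meta["page"]
        self.logger.info("category=%s page=%s url=%s", category, page, response.url)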

2.Response
  • Partial source code
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")

  • Most of the parameters are similar to those of Request above:
status: the HTTP status code of the response
_set_body(body): sets the response body
_set_url(url): sets the response URL
request: the Request object that produced this response
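As a concrete illustration, here is a small sketch of what a callback inside a spider typically reads from the Response it receives; the callback name and the meta key are hypothetical:

def parse_page(self, response):
    # HTTP status code of the response, e.g. 200
    self.logger.info("status: %s", response.status)
    # the URL this response actually came from
    self.logger.info("url: %s", response.url)
    # response headers (a Headers object)
    self.logger.info("content-type: %s", response.headers.get("Content-Type"))
    # raw body bytes; HtmlResponse/TextResponse subclasses also expose response.text
    body = response.body
    # meta is proxied from the Request that produced this response
    passed_along = response.meta.get("category")   # hypothetical key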

3.Sending POST requests
  • You can send a POST request with yield scrapy.FormRequest(url, formdata, callback).
  • If you want the spider to issue a POST request right at startup, override the Spider class's start_requests(self) method so the URLs in start_urls are no longer used.
import scrapy

class mySpider(scrapy.Spider):
    # start_urls = ["http://www.example.com/"]

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        # FormRequest is how Scrapy sends POST requests
        yield scrapy.FormRequest(
            url=url,
            formdata={"email": "mr_mao_hacker@163.com", "password": "axxxxxxxe"},
            callback=self.parse_page
        )

    def parse_page(self, response):
        # do something
        pass
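FormRequest sends form-encoded data. If the target endpoint expects a raw JSON body instead, a plain scrapy.Request with method='POST' and an explicit body works as well; the sketch below assumes a made-up endpoint and payload:

import json

import scrapy

class JsonPostSpider(scrapy.Spider):
    # hypothetical spider posting a JSON body instead of form data
    name = "json_post"

    def start_requests(self):
        yield scrapy.Request(
            url="http://www.example.com/api/login",    # made-up endpoint
            method="POST",
            body=json.dumps({"email": "user@example.com", "password": "secret"}),
            headers={"Content-Type": "application/json"},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        self.logger.info("got %s from %s", response.status, response.url)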

4.Simulating a login
Use the FormRequest.from_response() method to [simulate a user login](http://docs.pythontab.com/scrapy/scrapy0.24/topics/request-response.html#topics-request-response-ref-request-userlogin).
Websites usually pre-populate certain form fields (such as session data or the authentication token on a login page) through <input type="hidden"> elements.
When scraping with Scrapy, if you want to pre-populate or override form fields such as the username and password, you can do so with the FormRequest.from_response() method.
  • Below is an example spider that uses this approach:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
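One thing the example leaves implicit: Scrapy keeps the session cookies set during the login, so follow-up requests are sent as the logged-in user. Here is a sketch of how after_login might continue once the check passes; the profile URL and the parse_profile callback are hypothetical:

    def after_login(self, response):
        # check that the login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # the login cookies are reused automatically on any request yielded from here
        yield scrapy.Request(
            url="http://www.example.com/users/profile.php",   # made-up, login-protected page
            callback=self.parse_profile,
        )

    def parse_profile(self, response):
        self.logger.info("fetched as logged-in user: %s", response.url)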

5.Hidden bugs on some sites
Some sites return responses whose content does not correspond to the URL you originally requested. For example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SunSpider(CrawlSpider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)

Sometimes the URLs the rule keeps following are not what you expect: the site returns a fake URL that cannot be opened directly, so the URL has to be corrected. The Rule's process_links argument lets you rewrite the extracted links:

    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), process_links='deal_links', callback='parse_item', follow=True),
    )

process_links receives the list of links the LinkExtractor extracted from the current page and must return the reprocessed list:

    def deal_links(self, links):
        # links is the list of Link objects the LinkExtractor extracted from the current page
        for link in links:
            link.url = link.url.replace("?", "&").replace("Type&", "Type?")
        return links
