Scrapy 1.0.5 (the latest official release as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware; see this issue. From Scrapy 1.1.0 on (1.1.0rc1 is available), the issue is fixed.
Even if you disable redirects, you can still mimic the middleware's behavior in your callback by checking the Location header and returning a Request to the redirect target.
Example spider:
$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            yield scrapy.Request(
                response.urljoin(response.headers['Location']),
                callback=self.parse_page)
Console log:
$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines:
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)
Note that you will need handle_httpstatus_list with 302 in it; otherwise the 302 response never reaches your callback and you will see this kind of log (coming from HttpErrorMiddleware):
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
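The check that parse_page performs (is this a redirect status, is there a Location header, resolve it against the current URL) can be sketched on its own, without Scrapy, using only the standard library. This is a rough illustration, not Scrapy's implementation: the names redirect_target and REDIRECT_STATUSES are hypothetical, the list of status codes is an assumption (the spider above only handles 302), and the bytes handling assumes header values may arrive as bytes, as they do in Scrapy on Python 3.

```python
from urllib.parse import urljoin

# Hypothetical list of statuses to treat as redirects (assumption; the
# spider in the answer only checks for 302).
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_target(status, url, headers):
    """Return the absolute URL to follow, or None if this is not a redirect.

    `headers` is any mapping with a `.get` method; values may be str or bytes.
    """
    location = headers.get('Location')
    if status not in REDIRECT_STATUSES or location is None:
        return None
    if isinstance(location, bytes):
        # Header values may be raw bytes; latin-1 decoding never fails.
        location = location.decode('latin-1')
    # Resolve relative Location values against the URL that was requested.
    return urljoin(url, location)


if __name__ == "__main__":
    print(redirect_target(
        302,
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip',
        {'Location': b'http://httpbin.org/ip'},
    ))
```

In a spider callback, the returned URL (when not None) would be what you pass to scrapy.Request, in place of the inline response.urljoin call.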
Can you provide a URL that gives you an HTTP 302 status? – Rahul