
How to save crawled web pages in memory with Scrapy

I am able to crawl the web with the following Scrapy script:

import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import CrawlSpider, Rule 
from lxml import html 

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 

from scrapy.spider import BaseSpider 
from scrapy import log 

#from tutorial.items import TutorialItem 
from tutorial.items import DmozItem 


class StayuncleCrawlerSpider(CrawlSpider): 

    name = 'stayuncle_crawler' 

    allowed_domains = ['stayuncle.com'] 
    start_urls = ['http://www.stayuncle.com/'] 
    CrawlSpider.DOWNLOAD_DELAY=.25; 



    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)  ] 

def parse_item(self,response,spider): 

      doc = html.fromstring(response.body) 
      item = DmozItem() 
      item['title'] = doc.xpath('//meta[@property="og:title"]/@content') 
      item['link'] = response.url 
      item['desc'] = doc.xpath('//meta[@name="description"]/@content') 
      yield self.parse_save(self,response) 
      yield item 



    # self.log('A response from %s just arrived!' % response.url) 

def parse_save(self, response): 
     filename = response.url.split("/")[-2] + '.html' 
     with open(filename, 'wb') as f: 
      f.write(response.body) 

Here is the log:

/Users/Nand/crawledData/tutorial/tutorial/spiders/stack_crawler.py:16: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor 
    Rule(SgmlLinkExtractor(allow=('pages/')), callback='parse_item', follow=True), 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:7: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead 
    from scrapy.contrib.spiders import CrawlSpider, Rule 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead 
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:8: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead 
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:11: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead 
    from scrapy.spider import BaseSpider 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:12: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more. 
    from scrapy import log 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:28: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor 
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True), 
/Users/Nand/crawledData/tutorial/tutorial/spiders/stayuncle_crawler.py:29: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor 
    Rule(SgmlLinkExtractor(), callback='parse_save', follow=True) 
2016-06-09 17:13:28 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial) 
2016-06-09 17:13:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'} 
2016-06-09 17:13:28 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2016-06-09 17:13:28 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-06-09 17:13:28 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-06-09 17:13:28 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-06-09 17:13:28 [scrapy] INFO: Spider opened 
2016-06-09 17:13:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-06-09 17:13:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-06-09 17:13:28 [py.warnings] WARNING: /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/__init__.py:65: UserWarning: StayuncleCrawlerSpider.DOWNLOAD_DELAY attribute is deprecated, use StayuncleCrawlerSpider.download_delay instead 
    (type(spider).__name__, type(spider).__name__)) 

2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/robots.txt> (referer: None) 
2016-06-09 17:13:29 [scrapy] DEBUG: Redirecting (302) to <GET http://www.stayuncle.com/home> from <GET http://www.stayuncle.com/> 
2016-06-09 17:13:29 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: None) 
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'stayuncle.tumblr.com': <GET http://stayuncle.tumblr.com/> 
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'facebook.com': <GET http://facebook.com/stayuncle> 
2016-06-09 17:13:29 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET http://twitter.com/stayuncle> 
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/home> (referer: http://www.stayuncle.com/home) 
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.stayuncle.com/home> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/cdn-cgi/l/email-protection> (referer: http://www.stayuncle.com/home) 
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.cloudflare.com': <GET https://www.cloudflare.com/sign-up?utm_source=email_protection> 
2016-06-09 17:13:30 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/career> (referer: http://www.stayuncle.com/home) 
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/StayUncle?ref=hl> 
2016-06-09 17:13:30 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET https://www.twitter.com/stayuncle> 
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/howwechose> (referer: http://www.stayuncle.com/home) 
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (404) <GET http://www.stayuncle.com/index.html> (referer: http://www.stayuncle.com/career) 
2016-06-09 17:13:31 [scrapy] DEBUG: Crawled (200) <GET http://www.stayuncle.com/about> (referer: http://www.stayuncle.com/home) 
2016-06-09 17:13:31 [scrapy] DEBUG: Ignoring response <404 http://www.stayuncle.com/index.html>: HTTP status code is not handled or not allowed 
2016-06-09 17:13:31 [scrapy] DEBUG: Filtered offsite request to 'in.linkedin.com': <GET https://in.linkedin.com/pub/nand-singh/1b/31b/464> 
2016-06-09 17:13:31 [scrapy] INFO: Closing spider (finished) 
2016-06-09 17:13:31 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 2748, 
'downloader/request_count': 9, 
'downloader/request_method_count/GET': 9, 
'downloader/response_bytes': 32186, 
'downloader/response_count': 9, 
'downloader/response_status_count/200': 6, 
'downloader/response_status_count/302': 1, 
'downloader/response_status_count/404': 2, 
'dupefilter/filtered': 23, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 6, 9, 11, 43, 31, 709558), 
'log_count/DEBUG': 19, 
'log_count/INFO': 7, 
'log_count/WARNING': 1, 
'offsite/domains': 7, 
'offsite/filtered': 22, 
'request_depth_max': 2, 
'response_received_count': 8, 
'scheduler/dequeued': 8, 
'scheduler/dequeued/memory': 8, 
'scheduler/enqueued': 8, 
'scheduler/enqueued/memory': 8, 
'start_time': datetime.datetime(2016, 6, 9, 11, 43, 28, 793762)} 
2016-06-09 17:13:31 [scrapy] INFO: Spider closed (finished) 

The crawl works, but I want to save every crawled web page as an HTML file. I tried saving the crawled pages as described in http://doc.scrapy.org/en/latest/intro/tutorial.html, but it does not work for me. Can anyone help me with some code so that I can achieve this?


What do you mean by "in memory"? The Scrapy tutorial shows [an example of writing the raw HTML to disk](http://doc.scrapy.org/en/latest/intro/tutorial.html#our-first-spider). That should get you started. –


@paultrmbrth I have updated the question. I tried the same thing, but it didn't work for me. Can you help me figure out what I am doing wrong? – nand


Please add details about what exactly does not work (e.g. nothing is written to disk, an exception is raised, share the console logs, etc.). –

Answer


After correctly indenting `def parse_item(self, response, spider):` so that it sits inside the spider class, this code segment worked.
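For reference, here is a minimal sketch of what the indented spider could look like. It assumes the `DmozItem` class from the tutorial project; besides the indentation, it swaps the deprecated `SgmlLinkExtractor` for `scrapy.linkextractors.LinkExtractor`, uses the lower-case `download_delay` attribute, calls `parse_save` directly instead of yielding it, and uses an illustrative file-naming scheme based on the last URL path segment:

from lxml import html
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tutorial.items import DmozItem  # item class defined in the tutorial project


class StayuncleCrawlerSpider(CrawlSpider):
    name = 'stayuncle_crawler'
    allowed_domains = ['stayuncle.com']
    start_urls = ['http://www.stayuncle.com/']
    download_delay = 0.25  # lower-case attribute; DOWNLOAD_DELAY on the spider is deprecated

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    # Indented inside the class so CrawlSpider can actually find the callback.
    def parse_item(self, response):
        self.parse_save(response)  # write the raw HTML to disk

        doc = html.fromstring(response.body)
        item = DmozItem()
        item['title'] = doc.xpath('//meta[@property="og:title"]/@content')
        item['link'] = response.url
        item['desc'] = doc.xpath('//meta[@name="description"]/@content')
        yield item

    def parse_save(self, response):
        # Illustrative naming: last URL path segment, e.g.
        # http://www.stayuncle.com/career -> career.html
        filename = (response.url.split('/')[-1] or 'index') + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Each crawled page then ends up as a .html file in the directory from which you run `scrapy crawl stayuncle_crawler`.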
