I'm new to Python and Scrapy. I was following the tutorial and tried to crawl a couple of web pages. I used the code from the tutorial and replaced the URLs with
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0
and
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1
respectively.
When the HTML file is generated, not all of the data is in it; only the data for this URL shows up: http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=0&sd=0&states=ALL&near=&ps=20&p=0
Also, while the command was running, the second URL was discarded as a duplicate, so only one HTML file is created.
I would like to know whether the web page forbids access to this particular data, or whether I have to change my code to get the exact data.
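(For context on the duplicate: everything after the # in those URLs is a fragment, which an HTTP client keeps to itself and never sends to the server, so on the wire both addresses point to the same resource. A minimal standalone check with the Python 2 standard library, nothing Scrapy-specific, illustrates this:
from urlparse import urlparse  # Python 2, matching the C:\Python27 paths in the traceback below

url_p0 = 'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0'
url_p1 = url_p0[:-1] + '1'  # the p=1 variant differs only inside the fragment

a, b = urlparse(url_p0), urlparse(url_p1)
# only scheme, host, path and query are ever sent to the server
print (a.scheme, a.netloc, a.path, a.query) == (b.scheme, b.netloc, b.path, b.query)  # prints: True
print repr(a.query)   # prints: '' -- the "?" comes after the "#", so there is no real query string
print a.fragment      # prints: body?fips=0&...&p=0 (kept locally, never transmitted)
Scrapy's default duplicate filter fingerprints requests on exactly those wire-level parts, which would explain why the p=1 request is dropped as a duplicate of p=0, and why the saved page looks like the plain search page rather than the filtered results.)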
When I run the shell command, I get an error. Here is the console output of the crawl command, followed by the shell command:
C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline crawl citydata
2016-10-19 12:00:27 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:00:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-10-19 12:00:27 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:00:27 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:00:27 [scrapy] INFO: Spider opened
2016-10-19 12:00:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-19 12:00:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:00:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-10-19 12:00:28 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/robots.txt> (referer: None)
2016-10-19 12:00:29 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0> (referer: None)
2016-10-19 12:00:29 [citydata] DEBUG: Saved file citydata-advanced.html
2016-10-19 12:00:29 [scrapy] INFO: Closing spider (finished)
2016-10-19 12:00:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 459,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 44649,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 19, 6, 30, 29, 751000),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 19, 6, 30, 27, 910000)}
2016-10-19 12:00:29 [scrapy] INFO: Spider closed (finished)
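(Two standard Scrapy switches are relevant to the "Filtered duplicate request" line above, in case more detail is wanted; a minimal sketch:
# settings.py: log every filtered duplicate instead of only the first one
DUPEFILTER_DEBUG = True

# or, per request inside the spider, bypass the duplicate filter entirely:
yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
Neither changes what the server returns; they only affect how Scrapy treats requests it considers identical.)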
C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline shell 'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0'
2016-10-19 12:21:51 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:21:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0}
2016-10-19 12:21:51 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:21:51 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:21:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:21:51 [scrapy] INFO: Spider opened
2016-10-19 12:21:53 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:56 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] DEBUG: Gave up retrying <GET http://'http:/robots.txt> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] ERROR: Error downloading <GET http://'http:/robots.txt>: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:00 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:03 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:05 [scrapy] DEBUG: Gave up retrying <GET http://'http://www.city-data.com/advanced/search.php#body?fips=0> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo failed.
Traceback (most recent call last):
File "C:\Python27\lib\runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "C:\Python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 161, in <module>
execute()
File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 88, in _run_print
_help
func(*a, **kw)
File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 149, in _run_comm
and
cmd.run(args, opts)
File "C:\Python27\lib\site-packages\scrapy\commands\shell.py", line 71, in run
shell.start(url=url)
File "C:\Python27\lib\site-packages\scrapy\shell.py", line 47, in start
self.fetch(url, spider)
File "C:\Python27\lib\site-packages\scrapy\shell.py", line 112, in fetch
reactor, self._schedule, request, spider)
File "C:\Python27\lib\site-packages\twisted\internet\threads.py", line 122, in
blockingCallFromThread
result.raiseException()
File "<string>", line 2, in raiseException
twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not f
ound: [Errno 11004] getaddrinfo failed.
'csize' is not recognized as an internal or external command,
operable program or batch file.
'sc' is not recognized as an internal or external command,
operable program or batch file.
'sd' is not recognized as an internal or external command,
operable program or batch file.
'states' is not recognized as an internal or external command,
operable program or batch file.
'near' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit1' is not recognized as an internal or external command,
operable program or batch file.
'b6914' is not recognized as an internal or external command,
operable program or batch file.
'e6914' is not recognized as an internal or external command,
operable program or batch file.
'i6914' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit2' is not recognized as an internal or external command,
operable program or batch file.
'b6819' is not recognized as an internal or external command,
operable program or batch file.
'e6819' is not recognized as an internal or external command,
operable program or batch file.
'i6819' is not recognized as an internal or external command,
operable program or batch file.
'ps' is not recognized as an internal or external command,
operable program or batch file.
'p' is not recognized as an internal or external command,
operable program or batch file.
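(A guess about the shell failure, judging from the output above: on Windows, cmd.exe does not treat single quotes as quoting characters, so the leading ' travels along as part of the host name, which is why the DNS lookup is for "'http:", and each unquoted & is read as a command separator, which is why csize, sc, sd and the rest are reported as unrecognized commands. With double quotes the whole URL should reach Scrapy in one piece:
C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline shell "http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0"
Even then, the fragment issue described earlier still applies: the request that actually goes out cannot include anything after the #.)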
My code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
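(For comparison, a minimal sketch of the same spider that moves the parameters into a real query string, assuming the server actually honors them server-side; that assumption is untested, since the original URLs keep everything after the #, which a browser would normally hand to JavaScript on the page. The sketch also saves each page under a distinct name, since response.url.split("/")[-2] is "advanced" for both URLs and so both pages were written to the same file:
import scrapy
from urllib import urlencode  # Python 2, matching the traceback above


class QuotesSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        # same parameters as before, but sent as a real query string;
        # whether search.php accepts them this way is an assumption
        params = {
            'fips': 0, 'csize': 'a', 'sc': 2, 'sd': 0, 'states': 'ALL',
            'near': '', 'nam_crit1': 6914, 'b6914': 'MIN', 'e6914': 'MAX',
            'i6914': 1, 'nam_crit2': 6819, 'b6819': 15500, 'e6819': 'MAX',
            'i6819': 1, 'ps': 20,
        }
        for page in (0, 1):
            params['p'] = page
            url = 'http://www.city-data.com/advanced/search.php?' + urlencode(params)
            # distinct query strings, so the duplicate filter keeps both requests
            yield scrapy.Request(url=url, callback=self.parse, meta={'page': page})

    def parse(self, response):
        # name the file after the page number instead of the path segment
        filename = 'citydata-p%d.html' % response.meta['page']
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
If the page turns out to build its results with JavaScript, no plain GET will return them, and a browser-driven tool would be needed instead.)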
Could someone please walk me through this?
It would be good to see your actual spider. The error message looks as if Python tries to use your URLs as variables; perhaps some quotes are missing. – GHajba
@GHajba I have pasted the code into the question. Could you tell me whether there is any error in the code? – clued
It looks like the problem is with your project, not with your spider, because when I run the shell command I get no error. But of course I am not using your bot, as you can see from the console output. – GHajba