hadoop python job on snappy files produces 0-size output

When I run wordcount.py (python mrjob, http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) over a text file using Hadoop streaming, it gives me output, but when the same job is run over .snappy files, I get zero-size output.
Options I have tried:
[testgen word_count]# cat mrjob.conf
runners:
hadoop: # this will work for both hadoop and emr
jobconf:
mapreduce.task.timeout: 3600000
#mapreduce.max.split.size: 20971520
#mapreduce.input.fileinputformat.split.maxsize: 102400
#mapreduce.map.memory.mb: 8192
mapred.map.child.java.opts: -Xmx4294967296
mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
# "true" must be a string argument, not a boolean! (#323)
#mapreduce.output.compress: "true"
#mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec
[testgen word_count]#
Command:
[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/
PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 0%
HADOOP: map 100% reduce 11%
HADOOP: map 100% reduce 97%
HADOOP: map 100% reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
(no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]#
No errors are thrown, the job reports success, and I verified in the job stats that the job configuration I passed in was actually picked up.
Is there any other way to troubleshoot this?