2013-04-24 2 views
1

java.lang.OutOfMemoryError: Java heap space while running seq2sparse in Mahout

I am trying to cluster some hand-made data using k-means in Mahout. I created 6 files with barely 1 or 2 words of text in each. I built a sequence file from them using ./mahout seqdirectory. When I try to convert the sequence file to vectors with the ./mahout seq2sparse command, I get java.lang.OutOfMemoryError: Java heap space. The sequence file is .215 KB in size.

Command: ./mahout seq2sparse -i mokha/output -o mokha/vector -ow

Error Log:

SLF4J: Class path contains multiple SLF4J bindings. 
SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/m 
ahout-examples-0.5-job.jar!/org/slf4j/impl/StaticLoggerBinder.class] 
SLF4J: Found binding in [jar:file:/home/bitnami/mahout/mahout-distribution-0.5/l 
ib/slf4j-jcl-1.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] 
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
Apr 24, 2013 2:25:11 AM org.slf4j.impl.JCLLoggerAdapter warn 
WARNING: No seq2sparse.props found on classpath, will use command-line arguments 
only 
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info 
INFO: Maximum n-gram size is: 1 
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info 
INFO: Deleting mokha/vector 
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info 
INFO: Minimum LLR value: 1.0 
Apr 24, 2013 2:25:12 AM org.slf4j.impl.JCLLoggerAdapter info 
INFO: Number of reduce tasks: 1 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.metrics.jvm.JvmMetrics init 
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId= 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Running job: job_local_0001 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task done 
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commi 
ting 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 
INFO: 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task commit 
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitt 
er commitTask 
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to mokha/vector/token 
ized-documents 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 
INFO: 
Apr 24, 2013 2:25:12 AM org.apache.hadoop.mapred.Task sendDone 
INFO: Task 'attempt_local_0001_m_000000_0' done. 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: map 100% reduce 0% 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Job complete: job_local_0001 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO: Counters: 5 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO: FileSystemCounters 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO:  FILE_BYTES_READ=1471400 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO:  FILE_BYTES_WRITTEN=1496783 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO: Map-Reduce Framework 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO:  Map input records=6 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO:  Spilled Records=0 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.Counters log 
INFO:  Map output records=6 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.metrics.jvm.JvmMetrics init 
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - al 
ready initialized 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Running job: job_local_0002 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:13 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init> 
INFO: io.sort.mb = 100 
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.LocalJobRunner$Job run 
WARNING: job_local_0002 
java.lang.OutOfMemoryError: Java heap space 
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java: 
781) 
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.ja 
va:524) 
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) 
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) 
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:1 
77) 
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: map 0% reduce 0% 
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Job complete: job_local_0002 
Apr 24, 2013 2:25:14 AM org.apache.hadoop.mapred.Counters log 
INFO: Counters: 0 
Apr 24, 2013 2:25:14 AM org.apache.hadoop.metrics.jvm.JvmMetrics init 
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - al 
ready initialized 
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Running job: job_local_0003 
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 1 
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init> 
INFO: io.sort.mb = 100 
Apr 24, 2013 2:25:15 AM org.apache.hadoop.mapred.LocalJobRunner$Job run 
WARNING: job_local_0003 
java.lang.OutOfMemoryError: Java heap space 
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java: 
781) 
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.ja 
va:524) 
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) 
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) 
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:1 
77) 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: map 0% reduce 0% 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Job complete: job_local_0003 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.Counters log 
INFO: Counters: 0 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.metrics.jvm.JvmMetrics init 
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - al 
ready initialized 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 0 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Running job: job_local_0004 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapreduce.lib.input.FileInputFormat li 
stStatus 
INFO: Total input paths to process : 0 
Apr 24, 2013 2:25:16 AM org.apache.hadoop.mapred.LocalJobRunner$Job run 
WARNING: job_local_0004 
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 
     at java.util.ArrayList.RangeCheck(ArrayList.java:547) 
     at java.util.ArrayList.get(ArrayList.java:322) 
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:1 
24) 
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: map 0% reduce 0% 
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob 
INFO: Job complete: job_local_0004 
Apr 24, 2013 2:25:17 AM org.apache.hadoop.mapred.Counters log 
INFO: Counters: 0 
Apr 24, 2013 2:25:17 AM org.slf4j.impl.JCLLoggerAdapter info 
INFO: Deleting mokha/vector/partial-vectors-0 
Apr 24, 2013 2:25:17 AM org.apache.hadoop.metrics.jvm.JvmMetrics init 
INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - al 
ready initialized 
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputExc 
eption: Input path does not exist: file:/home/bitnami/mahout/mahout-distribution 
-0.5/bin/mokha/vector/tf-vectors 
     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(File 
InputFormat.java:224) 
     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listSta 
tus(SequenceFileInputFormat.java:55) 
     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileI 
nputFormat.java:241) 
     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885) 
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7 
79) 
     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) 
     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) 
     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFI 
DFConverter.java:350) 
     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.processTfIdf(TFIDFC 
onverter.java:151) 
     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(Spars 
eVectorsFromSequenceFiles.java:262) 
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) 
     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(Spar 
seVectorsFromSequenceFiles.java:52) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl. 
java:39) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces 
sorImpl.java:25) 
     at java.lang.reflect.Method.invoke(Method.java:597) 
     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(Progra 
mDriver.java:68) 
     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) 
     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187) 

Answers

0

I don't know whether you have already tried this, but I am posting it in case you missed it.

"Set the environment variable 'MAVEN_OPTS' to allow for more memory via 'export MAVEN_OPTS=-Xmx1024m'"

See here (under the common problems section).

+0

We are not using Maven.

1

The bin/mahout script reads the environment variable 'MAHOUT_HEAPSIZE' (in megabytes) and, if it is set, derives the 'JAVA_HEAP_MAX' variable from it. The version of Mahout I am using (0.8) has JAVA_HEAP_MAX set to 3G. Running

export MAHOUT_HEAPSIZE=10000m 

before the canopy clustering step seems to have helped my runs stay alive longer on a single machine. However, I suspect a better solution would be to move the job to a cluster.

For reference, there is another related post: Mahout runs out of heap space

+0

I don't think units are allowed in the variable; it should be 'export MAHOUT_HEAPSIZE=10000' – tokland
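
The behaviour described in the answer and the comment above can be illustrated with a small shell sketch. It is not the actual bin/mahout code: `heap_flag` is a hypothetical helper, and both the exact default spelling `-Xmx3g` and the assumption that the launcher appends the megabyte suffix itself are inferred from the answer and the comment. Under those assumptions, it shows why a value that already carries a unit would produce a malformed JVM flag.

```shell
#!/bin/sh
# Sketch only: mimics how a launcher script might turn MAHOUT_HEAPSIZE
# into a JVM heap flag. heap_flag is a hypothetical helper, not part of Mahout.
heap_flag() {
  heapsize="$1"            # value of MAHOUT_HEAPSIZE, possibly empty
  java_heap_max="-Xmx3g"   # assumed default, based on the "3G" quoted for Mahout 0.8
  if [ -n "$heapsize" ]; then
    # Assumption: the value is a plain number of megabytes,
    # so the script appends the "m" suffix itself.
    java_heap_max="-Xmx${heapsize}m"
  fi
  echo "$java_heap_max"
}

heap_flag ""        # variable unset: falls back to the default
heap_flag 10000     # plain number of megabytes
heap_flag 10000m    # value with a unit: the suffix ends up doubled
```

If the real script behaves this way, `export MAHOUT_HEAPSIZE=10000` is the safer form, since the unit is supplied by the script rather than by the caller.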