2016-05-17

Using the "Bitfusion Ubuntu 14 TensorFlow" AMI, any attempt to perform an operation on a large tensor fails with an OOM error. For example,

sess.run(tf.argmax(y, 1), feed_dict={x: use_x})

where use_x is a 28000-row tf.Tensor of floats, results in a

"Resource exhausted: OOM"

error. This makes the AMI unusable.

Is there a setting I am missing that would prevent this?

----------

I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768):  Total Chunks: 1, Chunks in use: 0 56.8KiB allocated for chunks. 3.1KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536):  Total Chunks: 1, Chunks in use: 0 111.2KiB allocated for chunks. 4B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608): Total Chunks: 2, Chunks in use: 0 23.73MiB allocated for chunks. 440.3KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 83.74MiB was 64.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0000 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0100 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0200 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0300 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a0400 of size 8192 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a2400 of size 6144 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3c00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a3d00 of size 3328 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4a00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023a4b00 of size 204800 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6b00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7023d6c00 of size 25088000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc3c00 of size 8192 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703bc5c00 of size 12000000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704737700 of size 6144 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704738f00 of size 60160 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747a00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704747b00 of size 8192 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749b00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749c00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749d00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749e00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704749f00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a000 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a100 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70474a200 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704758600 of size 60160 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704767100 of size 76288 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779b00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779c00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779d00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779e00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x704779f00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a000 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a100 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a200 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a300 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a400 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a500 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a600 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a700 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a800 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477a900 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477aa00 of size 3328 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b700 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70477b800 of size 204800 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7047ad800 of size 12000000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f67a00 of size 8192 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x705f69a00 of size 25088000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x707756a00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c8600 of size 6144 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9e00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082c9f00 of size 6144 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7400 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x7082e7500 of size 25088000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x709ad4500 of size 12000000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646000 of size 3328 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a646d00 of size 204800 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70a678d00 of size 87808000 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x70fa36500 of size 3703905024 
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70474a300 of size 58112 
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x70531f300 of size 12879616 
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x707756b00 of size 12000000 
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x7082cb700 of size 113920 
I tensorflow/core/common_runtime/bfc_allocator.cc:689]  Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 35 Chunks of size 256 totalling 8.8KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 3328 totalling 9.8KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 6144 totalling 24.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 8192 totalling 32.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 60160 totalling 117.5KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 76288 totalling 74.5KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 204800 totalling 600.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 12000000 totalling 34.33MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 3 Chunks of size 25088000 totalling 71.78MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 87808000 totalling 83.74MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 3703905024 totalling 3.45GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 3.64GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats: 
Limit:     3928915968 
InUse:     3903864320 
MaxInUse:    3903864320 
NumAllocs:     418794 
MaxAllocSize:   3703905024 

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ******************************************************************************xxxxxxxxxxxxxxxxxxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 83.74MiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:907] Resource exhausted: OOM when allocating tensor with shape[28000,1,28,28] 

Traceback (most recent call last): 
    File "tf_simple.py", line 173, in <module> 
    evals = sess.run(tf.argmax(y, 1), feed_dict={x: use_x}) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 343, in run 
    run_metadata_ptr) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 567, in _run 
    feed_dict_string, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 640, in _do_run 
    target_list, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 662, in _do_call 
    e.code) 
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[28000,1,28,28] 
    [[Node: 1_conv_layer/kernel_logits/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](as_grid, 1_conv_layer/kernel_weights/W1/read)]] 
    [[Node: ArgMax/_2316 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1481_ArgMax", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 
Caused by op u'1_conv_layer/kernel_logits/Conv2D', defined at: 
    File "tf_simple.py", line 47, in <module> 
    final_dropout=final_dropout) 
    File "/home/ubuntu/mlcode/tf_utils.py", line 150, in make_ff_network 
    layer_name) 
    File "/home/ubuntu/mlcode/tf_utils.py", line 86, in _add_conv_layer 
    kernel_logits = tf.nn.conv2d(input_tensor, weights, strides=[1, 1, 1, 1], padding='SAME') + biases 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 295, in conv2d 
    data_format=data_format, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 694, in apply_op 
    op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__ 
    self._traceback = _extract_stack() 

Answer

The problem is the memory limit of about 4 GB on the AWS GPUs; it is not a problem with the AMI:

Limit: 3928915968

InUse: 3903864320

MaxInUse: 3903864320

NumAllocs: 418794

MaxAllocSize: 3703905024
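
The figures above can be checked with a few lines of arithmetic. The sketch below (Python 3) copies the limit and in-use values from the allocator stats and the tensor shape from the error message; the 4 bytes per element is an assumption based on the DT_FLOAT type shown in the traceback.

```python
# Values copied from the allocator "Stats:" block in the log.
limit_bytes = 3928915968
in_use_bytes = 3903864320

# Shape taken from "OOM when allocating tensor with shape[28000,1,28,28]".
shape = (28000, 1, 28, 28)
bytes_per_element = 4  # assumption: DT_FLOAT is a 32-bit float

# Size of the tensor the allocator was asked for.
request_bytes = bytes_per_element
for dim in shape:
    request_bytes *= dim

# Memory still available under the limit at the time of the request.
free_bytes = limit_bytes - in_use_bytes

MIB = 1024 * 1024
print("requested: %d bytes (%.2f MiB)" % (request_bytes, request_bytes / MIB))
print("free:      %d bytes (%.2f MiB)" % (free_bytes, free_bytes / MIB))
print("fits:      %s" % (request_bytes <= free_bytes))
```

The requested size works out to the same 83.74MiB that the allocator reports in its "Ran out of memory" warning, while the remaining headroom is far smaller than that.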

The memory limit is 3.929 GB and 3.904 GB is already in use, which leaves only about 25 MB free, so the allocation request for 83.74 MiB (about 0.088 GB) exceeds the limit. On AWS, your options are to rewrite the code so that it runs within 4 GB, to run this section of the code in CPU-only mode using system memory (which, of course, defeats the purpose of using a GPU), or to wait until AWS introduces new GPU instances with more memory. Alternatively, you could look for another cloud provider, such as Nimbix, that offers more modern GPUs.
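
One way to make the code fit within the limit is to evaluate the 28000 rows in smaller batches instead of in a single sess.run call. The following is only a minimal sketch of that idea, not the asker's code: run_in_batches and fake_argmax are hypothetical names, and fake_argmax stands in for the real model so that the example runs without TensorFlow or a GPU.

```python
def run_in_batches(run_batch, inputs, batch_size):
    """Call run_batch on consecutive slices of inputs and join the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results = []
    for start in range(0, len(inputs), batch_size):
        chunk = inputs[start:start + batch_size]
        # Only one chunk is handed to the model at a time.
        results.extend(run_batch(chunk))
    return results


def fake_argmax(rows):
    """Stand-in for the model: index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


rows = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.5, 0.55]]
predictions = run_in_batches(fake_argmax, rows, batch_size=2)
print(predictions)  # [1, 0, 1, 0, 1]
```

With the code from the question, run_batch would presumably be something like lambda chunk: sess.run(pred_op, feed_dict={x: chunk}), where pred_op = tf.argmax(y, 1) is a hypothetical name for an op created once, outside the loop, so that new nodes are not added to the graph on every call. A batch size that fits in the available memory would have to be found by experiment, since the intermediate tensors of the network also scale with the batch.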

AWS now has p2 instances with 12 GB of memory. This should let you work with larger tensors without running out of memory on those GPUs. – mbajkowski