failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED #41

Open
sumanthprabhu opened this issue May 30, 2017 · 0 comments

Comments

@sumanthprabhu

I get the following error when trying to run the model in the "train" phase:

2017-05-30 05:39:23,518 root  INFO     max_gradient_norm: 5.000000
2017-05-30 05:39:23,518 root  INFO     clip_gradients: True
2017-05-30 05:39:23,518 root  INFO     valid_target_length inf
2017-05-30 05:39:23,518 root  INFO     target_vocab_size: 39
2017-05-30 05:39:23,518 root  INFO     target_embedding_size: 10.000000
2017-05-30 05:39:23,518 root  INFO     attn_num_hidden: 128
2017-05-30 05:39:23,518 root  INFO     attn_num_layers: 2
2017-05-30 05:39:23,519 root  INFO     visualize: True
2017-05-30 05:39:23,519 root  INFO     buckets
2017-05-30 05:39:23,519 root  INFO     [(16, 11), (27, 17), (35, 19), (64, 22), (80, 32)]
2017-05-30 05:41:51,137 root  INFO     Created model with fresh parameters.
Train: :   0%|          | 0/156 [00:00<?, ?it/s]2017-05-30 05:46:19,134 root  INFO     Generating first batch)
E tensorflow/stream_executor/cuda/cuda_blas.cc:472] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED

input_tensor dim: (?, 1, 32, ?)
CNN outdim before squeeze: (?, 1, ?, 512)
CNN outdim: (?, ?, 512)
Traceback (most recent call last):
  File "src/launcher.py", line 148, in <module>
    main(sys.argv[1:], exp_config.ExpConfig)
  File "src/launcher.py", line 145, in main
    model.launch()
  File "/home/sprabh6/Attention-OCR/src/model/model.py", line 300, in launch
    summaries, step_loss, step_logits, _ = self.step(encoder_masks, img_data, zero_paddings, decoder_inputs, target_weights, bucket_id, self.forward_only)
  File "/home/sprabh6/Attention-OCR/src/model/model.py", line 411, in step
    outputs = self.sess.run(output_feed, input_feed)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : a.shape=(64, 522), b.shape=(522, 128), m=64, n=128, k=522
	 [[Node: model_with_buckets/embedding_attention_decoder_1/attention_decoder/attention_decoder/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](model_with_buckets/embedding_attention_decoder_1/attention_decoder/attention_decoder/concat, embedding_attention_decoder/attention_decoder/weights/read)]]
	 [[Node: conv_conv5/BatchNorm/AssignMovingAvg/_270 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_28061_conv_conv5/BatchNorm/AssignMovingAvg", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'model_with_buckets/embedding_attention_decoder_1/attention_decoder/attention_decoder/MatMul', defined at:
  File "src/launcher.py", line 148, in <module>
    main(sys.argv[1:], exp_config.ExpConfig)
  File "src/launcher.py", line 144, in main
    session = sess)
  File "/home/sprabh6/Attention-OCR/src/model/model.py", line 151, in __init__
    use_gru = use_gru)
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq_model.py", line 141, in __init__
    softmax_loss_function=softmax_loss_function)
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq.py", line 993, in model_with_buckets
    decoder_inputs[:int(bucket[1])], int(bucket[0]))
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq_model.py", line 140, in <lambda>
    self.target_weights, buckets, lambda x, y, z: seq2seq_f(x, y, z, False),
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq_model.py", line 122, in seq2seq_f
    attn_num_hidden = attn_num_hidden)
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq.py", line 675, in embedding_attention_decoder
    initial_state_attention=initial_state_attention, attn_num_hidden=attn_num_hidden)
  File "/home/sprabh6/Attention-OCR/src/model/seq2seq.py", line 575, in attention_decoder
    x = linear([inp] + attns, input_size, True)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 751, in _linear
    res = math_ops.matmul(array_ops.concat(args, 1), weights)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1765, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1454, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/sprabh6/anaconda/envs/tf_1.0_keras_1/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Blas SGEMM launch failed : a.shape=(64, 522), b.shape=(522, 128), m=64, n=128, k=522
	 [[Node: model_with_buckets/embedding_attention_decoder_1/attention_decoder/attention_decoder/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](model_with_buckets/embedding_attention_decoder_1/attention_decoder/attention_decoder/concat, embedding_attention_decoder/attention_decoder/weights/read)]]
	 [[Node: conv_conv5/BatchNorm/AssignMovingAvg/_270 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_28061_conv_conv5/BatchNorm/AssignMovingAvg", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
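
For scale, the shapes reported in the error can be turned into a rough memory figure. This is only a sketch: it assumes 4-byte float32 elements (the node reports T=DT_FLOAT) and counts just the two inputs and the result, not any workspace cuBLAS may allocate. `sgemm_bytes` is a hypothetical helper, not part of TensorFlow or this repository.

```python
def sgemm_bytes(m, n, k, bytes_per_element=4):
    """Bytes held by A (m x k), B (k x n) and the result C (m x n)."""
    a = m * k * bytes_per_element
    b = k * n * bytes_per_element
    c = m * n * bytes_per_element
    return a, b, c


# Values taken from the error message: m=64, n=128, k=522.
a, b, c = sgemm_bytes(m=64, n=128, k=522)
total_kib = (a + b + c) / 1024.0
print(a, b, c, total_kib)
```

The three matrices together come to well under 1 MiB, so this particular multiplication is small. That does not explain why the launch failed, but it suggests the size of this one MatMul is unlikely to be the whole story.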

I first suspected that my LD_LIBRARY_PATH wasn't set properly, so I added an entry pointing to the directory that contains libcublas. That didn't help. I then suspected a memory problem and set the GPU options in launcher.py as follows:

# Cap this process at roughly one third of the GPU's memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)) as sess:
    model = Model(
        phase=parameters.phase,
        visualize=parameters.visualize,
        data_path=parameters.data_path,
        data_base_dir=parameters.data_base_dir,
        output_dir=parameters.output_dir,
        batch_size=parameters.batch_size,
        initial_learning_rate=parameters.initial_learning_rate,
        num_epoch=parameters.num_epoch,
        steps_per_checkpoint=parameters.steps_per_checkpoint,
        target_vocab_size=parameters.target_vocab_size,
        model_dir=parameters.model_dir,
        target_embedding_size=parameters.target_embedding_size,
        attn_num_hidden=parameters.attn_num_hidden,
        attn_num_layers=parameters.attn_num_layers,
        clip_gradients=parameters.clip_gradients,
        max_gradient_norm=parameters.max_gradient_norm,
        load_model=parameters.load_model,
        valid_target_length=float('inf'),
        gpu_id=parameters.gpu_id,
        use_gru=parameters.use_gru,
        session=sess)
    model.launch()

That still doesn't work. Can anyone please tell me if I'm missing anything?
TensorFlow version: 1.1.0
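
One way to double-check the LD_LIBRARY_PATH step described above is to list which of its directories actually contain a libcublas shared object. The following is a minimal sketch using only the standard library. The function name `dirs_with_cublas` is hypothetical, and matching on the `libcublas.so` filename prefix is an assumption about how the library is named on a typical Linux CUDA install.

```python
import os


def dirs_with_cublas(ld_library_path):
    """Return the directories from an LD_LIBRARY_PATH-style string that
    contain at least one file whose name starts with 'libcublas.so'."""
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and paths that do not exist
        names = os.listdir(directory)
        if any(name.startswith("libcublas.so") for name in names):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Inspect the value seen by the current process.
    print(dirs_with_cublas(os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty list would mean that no directory on the path exposes libcublas to this process. More than one entry could point to several CUDA installs being visible at once, which may be worth ruling out.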
