Estoy tratando de ejecutar mi código Keras CuDNNGRU en tensorflow usando gpu pero siempre aparece el error "Error al encontrar la implementación de dnn" aunque ya instalé CUDA y CuDNN.
Ya reinstalé CUDA y CuDNN varias veces y actualicé la versión de CuDNN de 7.2.1 a 7.5.0 pero no soluciona nada. También trato de ejecutar mi código en Jupyter Notebook y en el compilador de python (en la terminal) y ambos resultados son iguales. Aquí están los detalles de hardware y software míos.
Aquí está mi código.
encoder_LSTM = tf.keras.layers.CuDNNGRU(hidden_unit,return_sequences=True,return_state=True)
encoder_LSTM_rev=tf.keras.layers.CuDNNGRU(hidden_unit,return_state=True,return_sequences=True,go_backwards=True)
encoder_outputs, state_h = encoder_LSTM(x)
encoder_outputsR, state_hR = encoder_LSTM_rev(x)
Y este es el mensaje de error.
2019-05-27 19:08:06.814896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-27 19:08:06.814956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-27 19:08:06.814971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-27 19:08:06.814978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-27 19:08:06.815279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14678 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)
2019-05-27 19:08:08.050226: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050350: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050378: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214: Unknown: Fail to find the dnn implementation.
2019-05-27 19:08:08.050483: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-05-27 19:08:08.050523: E tensorflow/stream_executor/cuda/cuda_dnn.cc:381] Possibly insufficient driver version: 384.183.0
2019-05-27 19:08:08.050541: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214: Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[{{node cu_dnngru/CudnnRNN}} = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ta_skenario1.py", line 271, in <module>
losss, op = sess.run([loss, optimizer], feed_dict={x:data,y_label:label,initial_input:begin_sentence})
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205) = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'cu_dnngru/CudnnRNN', defined at:
File "ta_skenario1.py", line 205, in <module>
encoder_outputs, state_h = encoder_LSTM(x)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 619, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 109, in call
output, states = self._process_batch(inputs, initial_state)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/cudnn_recurrent.py", line 299, in _process_batch
rnn_mode='gru')
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 116, in cudnn_rnn
is_training=is_training, name=name)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/paperspace/.conda/envs/gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Fail to find the dnn implementation.
[[node cu_dnngru/CudnnRNN (defined at ta_skenario1.py:205) = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](cu_dnngru/transpose, cu_dnngru/ExpandDims, gradients/while/Shape/Enter_grad/zeros/Const, cu_dnngru/concat)]]
[[{{node mean_squared_error/value/_37}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1756_mean_squared_error/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
¿Alguna idea? Gracias
ACTUALIZACIÓN: Traté de degradar la versión de CuDNN de 7.5.0 a 7.1.4 pero el resultado sigue siendo el mismo.
Solución del problema
No estoy seguro de si puede ayudar, pero en mi caso, el problema se debió al uso de varios archivos de cuaderno jupyter.
Estaba escribiendo un código simple para una red neuronal y decidí dividirlo en 2 cuadernos, uno para el entrenamiento y otro para la predicción (si no tenía recursos/tiempo para entrenar su red, proporcioné mi modelo guardado en un archivo).
Si ejecuté los dos cuadernos "juntos", así que básicamente primero el entrenamiento y luego el de predicción sin desconectar el núcleo del primer código, habría recibido este error.
Desconectar el núcleo del primer cuaderno jupyter antes de usar el segundo resolvió mi problema.
No hay comentarios.:
Publicar un comentario