I am using TensorFlow 2.13 via the Dockerfile available in the TensorFlow models/research/object_detection/dockerfiles folder.
The Dockerfile contents are:
    FROM tensorflow/tensorflow:latest-gpu

    ARG DEBIAN_FRONTEND=noninteractive

    # Install apt dependencies
    RUN apt-get update && apt-get install -y \
        git \
        gpg-agent \
        python3-cairocffi \
        protobuf-compiler \
        python3-pil \
        python3-lxml \
        python3-tk \
        python3-opencv \
        libssl-dev \
        software-properties-common \
        wget

    WORKDIR /home/tensorflow

    ## Copy this code (make sure you are under the ../models/research directory)
    COPY models/research/. /home/tensorflow/models

    # Compile protobuf configs
    RUN (cd /home/tensorflow/models/ && protoc object_detection/protos/*.proto --python_out=.)
    WORKDIR /home/tensorflow/models/

    RUN cp object_detection/packages/tf2/setup.py ./
    ENV PATH="/home/tensorflow/.local/bin:${PATH}"

    RUN python -m pip install -U pip
    RUN python -m pip install .

    COPY scripts /home/tensorflow/
    COPY workspace /home/tensorflow/

    #ENTRYPOINT ["python", "object_detection/model_main_tf2.py"]
After the build steps, the Docker image runs. My host machine has a GTX 1650 GPU. To check whether my TensorFlow installation is using the GPU, I ran test_tf.py, shown below:
    import tensorflow as tf

    if tf.test.gpu_device_name():
        print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
    else:
        print("Please install GPU version of TF")
Here is the output from test_tf.py
    root@5433479cb167:/home/tensorflow# python3 test_tf.py
    2023-09-07 14:43:50.620651: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-09-07 14:43:53.994502: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: UNKNOWN ERROR (34)
    Please install GPU version of TF
Using Docker is critical for my work, so I need to find a solution. From the cuInit error, my understanding is that TF detects the GPU but is unable to use it. The last message ("Please install GPU version of TF") is confusing, since the base image in use is tensorflow/tensorflow:latest-gpu.
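To separate "CPU-only build" from "CUDA build that cannot reach the driver", a slightly more detailed check than test_tf.py might look like the sketch below. The function name `describe_gpu_support` is made up for illustration. It only uses `tf.test.is_built_with_cuda()` and `tf.config.list_physical_devices("GPU")`, and it assumes nothing about why cuInit fails.

```python
def describe_gpu_support():
    """Return a small report on TensorFlow's GPU build and visible GPUs.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow_installed": False}

    return {
        "tensorflow_installed": True,
        "version": tf.__version__,
        # True when the binary was compiled with CUDA support,
        # regardless of whether a GPU is usable at runtime.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Empty when TensorFlow cannot see any GPU, for example
        # when CUDA driver initialisation fails.
        "visible_gpus": [
            d.name for d in tf.config.list_physical_devices("GPU")
        ],
    }


if __name__ == "__main__":
    for key, value in describe_gpu_support().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is True while `visible_gpus` is empty, the image itself is probably a GPU build and the problem more likely lies between the container and the host driver.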
I started training on a small dataset (50 images), and it seems to be using my CPU to its full extent. My training loop is stuck with the following messages on the console:
    I0907 14:31:03.622151 140609981511424 api.py:460] feature_map_spatial_dims: [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
    I0907 14:31:10.580329 140609981511424 api.py:460] feature_map_spatial_dims: [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
    I0907 14:31:16.743497 140609981511424 api.py:460] feature_map_spatial_dims: [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
    I0907 14:31:23.568284 140609981511424 api.py:460] feature_map_spatial_dims: [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
Prior to this run, my training loop had crashed after printing the same 4 messages, and I had to reduce my batch size, which now allows the training to continue. But I have no loss messages on the console yet.
Please suggest any steps I can take to address / resolve these issues.
Update: after about an hour, my training loop produced a loss statement, which confirms that CPU-based training works.
    INFO:tensorflow:Step 100 per-step time 30.199s
    I0907 15:21:21.773339 140617841817408 model_lib_v2.py:705] Step 100 per-step time 30.199s
    INFO:tensorflow:{'Loss/classification_loss': 0.16712503,
     'Loss/localization_loss': 0.101843126,
     'Loss/regularization_loss': 0.29896417,
     'Loss/total_loss': 0.56793237,
     'learning_rate': 0.0141663505}
    I0907 15:21:21.820583 140617841817408 model_lib_v2.py:708] {'Loss/classification_loss': 0.16712503,
     'Loss/localization_loss': 0.101843126,
     'Loss/regularization_loss': 0.29896417,
     'Loss/total_loss': 0.56793237,
     'learning_rate': 0.0141663505}
Still, I am looking for a solution to my GPU woes.
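One thing that may be worth checking is whether the container can see the GPU at all, independent of TensorFlow. The sketch below uses only the standard library. It assumes, and this is an assumption rather than something confirmed for this setup, that when a GPU is passed through to a container (for example with `docker run --gpus all` and the NVIDIA Container Toolkit on the host), device nodes matching /dev/nvidia* and the `nvidia-smi` binary become visible inside it. The function name `container_gpu_access` is made up for illustration.

```python
import glob
import shutil


def container_gpu_access():
    """Report whether NVIDIA device nodes and nvidia-smi are visible.

    Hypothetical helper for illustration. An empty result suggests the
    GPU was not exposed to this environment, but it does not prove it.
    """
    return {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if any.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Full path to nvidia-smi if it is on PATH, otherwise None.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    for key, value in container_gpu_access().items():
        print(f"{key}: {value}")
```

If both entries come back empty inside the container but not on the host, the way the container is started would be the first place to look, before the image or the TensorFlow version.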