I have a more conceptual question about running llama-cpp-python in a Docker container. After following a lot of different tutorials, I am more confused than when I started.
I have a Debian 12 server with an Intel Core i7-7700 CPU and a GeForce GTX 1080 GPU.
On the host I installed the following components from the default Debian APT repository and the NVIDIA APT repository:
- linux-headers-amd64
- nvidia-detect
- nvidia-driver
- nvidia-smi
- linux-image-amd64
- cuda
The NVIDIA driver is installed correctly, which I can verify with `nvidia-smi`:
```
# nvidia-smi
Sun Mar 31 10:46:20 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        Off |   00000000:01:00.0 Off |                  N/A |
| 36%   42C    P0             39W / 180W  |      0MiB /   8192MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Next I built a Docker image in which I installed the following libraries:
- jupyterlab
- cuda-toolkit-12-3
- llama-cpp-python
Then I run my container with my llama_cpp application:
```
$ docker run --gpus all my-docker-image
```

It works, but the GPU has no effect, even though my log output shows that llama-cpp detected the GPU and CUDA:
```
....
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes
llama_kv_cache_init:  CUDA_Host KV buffer size =   381.00 MiB
llama_new_context_with_model: KV self size  =  381.00 MiB, K (f16):  190.50 MiB, V (f16):  190.50 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   227.41 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    13.96 MiB
llama_new_context_with_model: graph nodes  = 1060
llama_new_context_with_model: graph splits = 356
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
.....
```

There is no performance difference between running the container with or without the GPU.
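One detail I noticed while comparing logs: llama.cpp reports where each buffer is allocated, and in my output the KV cache lands in a `CUDA_Host` buffer (pinned host memory) rather than a `CUDA0` buffer (GPU memory). A small sketch of the string check I use to spot this (the log lines are taken from the output above; the helper function name is my own):

```python
def kv_cache_on_gpu(log: str) -> bool:
    """Heuristic: return True if the llama.cpp startup log shows the
    KV cache allocated on a CUDA device rather than in host memory."""
    for line in log.splitlines():
        if "KV buffer size" in line:
            # "CUDA_Host" means pinned CPU memory, i.e. no offload;
            # an offloaded KV cache shows up as e.g. "CUDA0".
            return "CUDA_Host" not in line
    return False

my_log = "llama_kv_cache_init:  CUDA_Host KV buffer size =   381.00 MiB"
print(kv_cache_on_gpu(my_log))  # False: KV cache is in host memory
```

In my runs this always prints `False`, which matches the lack of any speedup.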
My first question is: Is my environment set up correctly, or are there components missing on the host or container side?

And my second question is: What is necessary to run llama-cpp-python inside a container using the GPU?
The llama-cpp-python installation step in my Dockerfile looks like this:

```
...
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
...
```
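For reference, here is a condensed sketch of the kind of image build described above (the base image tag is illustrative, not my exact one; a CUDA `devel` base already ships the toolkit, so it stands in for my manual `cuda-toolkit-12-3` install):

```dockerfile
# Illustrative base; any CUDA "devel" tag compatible with the host driver
# provides nvcc and the CUDA libraries needed to compile llama.cpp.
FROM nvidia/cuda:12.3.2-devel-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN pip install jupyterlab

# Build llama-cpp-python from source against cuBLAS
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python \
    --force-reinstall --upgrade --no-cache-dir
```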