I'm working on a project that uses FastAPI with a Segment Anything Model (SAM) for image processing. I have a CUDA setup with a GPU (NVIDIA RTX A4000) available for acceleration, and I'm running FastAPI with 15 workers. However, when I start the FastAPI server with the following command:
uvicorn server:app --port 5008 --host 0.0.0.0 --workers 15

only 3 to 5 workers load properly. The remaining workers fail with a CUDA out-of-memory error indicating there is not enough GPU memory. The specific error message I receive is:
"RuntimeError: CUDA out of memory. Tried to allocate X MiB"
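For context, the numbers line up with a rough VRAM budget. The figures below are assumptions for illustration (the A4000 has 16 GiB of VRAM; the SAM ViT-H checkpoint is roughly 2.4 GB in fp32, and each worker process also pays a CUDA context overhead), not measurements from my machine:

```python
# Rough per-worker VRAM budget (illustrative numbers, not measurements).
GPU_MEMORY_GIB = 16.0      # NVIDIA RTX A4000 total VRAM
CUDA_CONTEXT_GIB = 0.6     # per-process CUDA context overhead (assumed)
MODEL_FOOTPRINT_GIB = 2.4  # SAM ViT-H weights in fp32 (assumed)

per_worker_gib = CUDA_CONTEXT_GIB + MODEL_FOOTPRINT_GIB
max_workers = int(GPU_MEMORY_GIB // per_worker_gib)
print(max_workers)  # about 5 workers before the GPU is exhausted
```

With each uvicorn worker being a separate OS process that loads its own copy of the model, a budget like this would explain why only the first handful of workers come up before the rest hit OOM.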
I've tried loading the SAM model at worker startup, but the workers still run out of CUDA memory. Is there a way to configure FastAPI or the workers to use the available CUDA memory more efficiently, so that all 15 workers can run without hitting memory errors?
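One pragmatic workaround I'm considering, sketched below as an assumption to tune rather than a known fix: cap the worker count near what the GPU can actually hold (about 5 here, per the budget above), and let PyTorch's caching allocator use expandable segments to reduce fragmentation:

```shell
# Reduce fragmentation in PyTorch's CUDA caching allocator (PyTorch >= 2.0).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Run only as many workers as the GPU can hold (assumed ~5 for 16 GiB).
uvicorn server:app --port 5008 --host 0.0.0.0 --workers 5
```

If the CPU-bound parts of the request handling are the bottleneck, the remaining concurrency could come from async request handling within each worker instead of more GPU-resident processes.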
I'd appreciate any insights or suggestions on optimizing the CUDA memory usage for all workers in FastAPI. Thank you!