I recently fine-tuned LLaMA 3.1 (70B model) and deployed it using FastAPI. However, during inference, it only handles up to 2 prompts at once. I'm running it on an A100 80GB GPU, but it only utilizes around 40GB of memory, leaving the rest unused. Is there a way to fully utilize the GPU memory to support more prompts during inference? Any advice on improving the throughput or maximizing GPU usage would be appreciated.
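For context, here is a minimal sketch of the kind of batched generation I'm asking about, not my actual serving code. The model ID is a placeholder for my fine-tuned checkpoint, and the 4-bit quantization config is an assumption (it would roughly match the ~40GB footprint I'm seeing for a 70B model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder: substitute the path to the fine-tuned checkpoint.
MODEL_ID = "meta-llama/Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token      # Llama tokenizers don't define a pad token
tokenizer.padding_side = "left"                # left-pad for decoder-only batched generation

# Assumption: 4-bit quantization so the 70B weights fit on a single 80GB A100.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

def generate_batch(prompts: list[str], max_new_tokens: int = 256) -> list[str]:
    """Run several prompts through a single batched generate call."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens (drop the echoed prompt).
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Calling something like `generate_batch(["prompt 1", "prompt 2", "prompt 3"])` would process all prompts in one forward pass, so GPU utilization should grow with batch size until memory or compute becomes the ceiling. What I'd like to know is how to size and schedule these batches (or whether a different serving approach entirely is the better route) to make full use of the remaining ~40GB.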
@Dovud_Asadov Fast.ai has a Discord; you might try asking there.
There is also this, but I'm not sure who is hosting it: