Faster LLM Inference

I recently fine-tuned LLaMA 3.1 (the 70B model) and deployed it with FastAPI. During inference, however, it only handles up to 2 prompts at once. I'm running it on an A100 80GB GPU, but it only uses around 40GB of memory, leaving the rest idle. Is there a way to make fuller use of the GPU so it can serve more prompts concurrently? Any advice on improving throughput or maximizing GPU utilization would be appreciated.
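For context, a common approach to this kind of throughput problem is a serving engine with continuous batching, such as vLLM, rather than calling the model one request at a time behind FastAPI. Below is a minimal sketch of that idea; the checkpoint name, quantization choice, and memory settings are illustrative assumptions, not the poster's actual setup (a 70B model would need quantization or multiple GPUs to fit on a single 80GB A100). vLLM also ships an OpenAI-compatible HTTP server if you prefer that over the Python API.

```python
# Sketch: high-throughput inference with vLLM's continuous batching.
# Assumptions (not from the original post): an AWQ-quantized Llama 3.1 70B
# checkpoint, 90% GPU memory reserved for weights + KV cache, 4k context.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed model id
    quantization="awq",            # assumes an AWQ-quantized variant is available
    gpu_memory_utilization=0.90,   # pre-allocate most of the 80GB for the KV cache
    max_model_len=4096,            # cap context length to fit more sequences in memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Many prompts can be submitted together; the engine schedules them
# concurrently instead of processing a couple at a time.
prompts = [f"Summarize document {i}." for i in range(64)]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```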

@Dovud_Asadov Fast.ai has a Discord; you might try asking there.

There is also this, but I'm not sure who is hosting it:
