I recently fine-tuned LLaMA 3.1 (70B model) and deployed it using FastAPI. However, during inference, it only handles up to 2 prompts at once. I'm running it on an A100 80GB GPU, but it only utilizes around 40GB of memory, leaving the rest unused. Is there a way to fully utilize the GPU memory to support more prompts during inference? Any advice on improving the throughput or maximizing GPU usage would be appreciated.
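For context, here is a minimal sketch of the kind of batched generation I'm asking about, not my actual serving code. The model ID is a placeholder for my fine-tuned checkpoint, and the 4-bit quantization config is an assumption (it would roughly match the ~40GB footprint I'm seeing for a 70B model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder: substitute the path to the fine-tuned checkpoint.
MODEL_ID = "meta-llama/Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token      # Llama tokenizers don't define a pad token
tokenizer.padding_side = "left"                # left-pad for decoder-only batched generation

# Assumption: 4-bit quantization so the 70B weights fit on a single 80GB A100.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

def generate_batch(prompts: list[str], max_new_tokens: int = 256) -> list[str]:
    """Run several prompts through a single batched generate call."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens (drop the echoed prompt).
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Calling something like `generate_batch(["prompt 1", "prompt 2", "prompt 3"])` would process all prompts in one forward pass, so GPU utilization should grow with batch size until memory or compute becomes the ceiling. What I'd like to know is how to size and schedule these batches (or whether a different serving approach entirely is the better route) to make full use of the remaining ~40GB.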
@Dovud_Asadov Fast.ai has a Discord; you might try asking there.
There is also this, but I'm not sure who is hosting it: