If we use Flan-T5 or Llama 2, how can we get a response in a shorter time? Right now I am getting a response in 2 minutes, but I want it in 1 minute.
How can I also make the model cost-effective in terms of server resources?
More bandwidth for your connection, or a bigger processing unit (GPU, TPU, or CPU) in the server itself!
Are you running the model locally? You could try the vLLM library; it also supports Llama 2 models, so give it a try.
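
A minimal sketch of what that might look like with vLLM's offline batched inference, assuming you have access to a Llama 2 chat checkpoint on the Hugging Face Hub (the model name, prompt, and sampling settings below are illustrative only):

```python
from vllm import LLM, SamplingParams

# Example checkpoint; swap in whichever Llama 2 (or Flan-T5-compatible) model you use.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Sampling settings are placeholders; tune them to your workload.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["Summarize our deployment options in one paragraph."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the prompt plus one or more generated completions.
    print(output.outputs[0].text)
```

vLLM's continuous batching and paged attention are what typically bring latency down compared with a plain `transformers` generate loop, so this is worth benchmarking against your current 2-minute response time.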
We deploy using FastAPI.
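
In that case, one option (a rough sketch, not a production recipe) is to keep the FastAPI app but let vLLM do the generation behind the endpoint. The route name and model below are assumptions; for heavier traffic you would likely prefer vLLM's built-in OpenAI-compatible server or its async engine instead of the blocking call shown here:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup, not per request.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # illustrative checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")  # hypothetical route name
def generate(req: GenerateRequest):
    # Note: llm.generate is blocking; fine for a sketch, but a real service
    # would use vLLM's async engine or its OpenAI-compatible API server.
    outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}
```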