How to reduce the response time of a Llama 2 model?

If we use Flan-T5 or Llama 2, how can we get responses in a shorter time? Right now I am getting a response in 2 minutes, but I want it in 1 minute.
How can I also make my model cost-effective in terms of server costs?

More bandwidth for your connection, and a bigger processing unit (GPU, TPU, or CPU) in the server itself!

Are you running the model locally? You could try the vLLM library; it also supports Llama 2 models, so give it a try. A rough sketch is below.
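
For reference, here is a minimal offline-inference sketch with vLLM. The checkpoint name is an assumption (swap in whichever Llama 2 variant you actually use), and you need Hub access to the gated Llama 2 weights:

```python
from vllm import LLM, SamplingParams

# Load Llama 2 into vLLM's engine (PagedAttention + continuous batching),
# which usually cuts latency compared to plain transformers generation.
# The checkpoint below is an assumption; replace it with your own model.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = ["Explain what vLLM does in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

Loading the model is the slow part, so keep the `LLM` object alive and reuse it across requests rather than constructing it per call.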

We deploy using FastAPI.
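
In that case, one option is to keep FastAPI as the HTTP layer and let vLLM handle generation behind it. A minimal sketch, assuming the same Llama 2 chat checkpoint as above (the endpoint path and request schema are just placeholders, not your actual setup):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at process startup, not per request.
# The checkpoint name is an assumption; use your own model.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)


class GenerateRequest(BaseModel):
    prompt: str


@app.post("/generate")
def generate(req: GenerateRequest):
    # Synchronous generation; fine as a sketch, but it blocks the worker
    # for the duration of the request.
    outputs = llm.generate([req.prompt], sampling_params)
    return {"text": outputs[0].outputs[0].text}
```

Run it with something like `uvicorn app:app --host 0.0.0.0 --port 8000`. For higher throughput, vLLM also ships its own OpenAI-compatible API server, which may be simpler than wrapping the engine yourself.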