If we use Flan-T5 or Llama 2, how can we get a response in a shorter time? Right now I am getting a response in 2 minutes, but I want it in 1 minute.
How can I also make the model cost-effective in terms of server resources?
More bandwidth for your connection, or a bigger processing unit (GPU, TPU, or CPU) in the server itself!
Are you running the model locally? You could try the vLLM library; it also supports Llama 2 models, so give it a try.
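
A minimal sketch of what that might look like with vLLM's offline batched inference, assuming you have access to a Llama 2 chat checkpoint on the Hugging Face Hub (the model name, prompt, and sampling settings below are illustrative only):

```python
from vllm import LLM, SamplingParams

# Example checkpoint; swap in whichever Llama 2 (or Flan-T5-compatible) model you use.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Sampling settings are placeholders; tune them to your workload.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["Summarize our deployment options in one paragraph."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the prompt plus one or more generated completions.
    print(output.outputs[0].text)
```

vLLM's continuous batching and paged attention are what typically bring latency down compared with a plain `transformers` generate loop, so this is worth benchmarking against your current 2-minute response time.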
We deploy using FastAPI.
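
In that case, one option (a rough sketch, not a production recipe) is to keep the FastAPI app but let vLLM do the generation behind the endpoint. The route name and model below are assumptions; for heavier traffic you would likely prefer vLLM's built-in OpenAI-compatible server or its async engine instead of the blocking call shown here:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup, not per request.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # illustrative checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")  # hypothetical route name
def generate(req: GenerateRequest):
    # Note: llm.generate is blocking; fine for a sketch, but a real service
    # would use vLLM's async engine or its OpenAI-compatible API server.
    outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}
```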