KV Caching for Instruction-Tuned Models


I was wondering how we could implement KV Caching to speed up token generation.

Can anyone help me with the necessary modifications for the code shared below?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hey @manju, the process works the same as the one demonstrated in lesson 1.

If you use the HF transformers model.generate function, KV caching is employed automatically: use_cache=True is the default in the generation config, so the code above already benefits from it without any changes. You can pass use_cache=False to disable it and compare.
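As a quick sanity check, here is a sketch that runs generation once with the cache and once without it. It swaps in sshleifer/tiny-gpt2 (a small, ungated stand-in chosen so the snippet runs without access to the gated Gemma checkpoint — substitute "google/gemma-2b-it" if you have it); the flags are the same either way:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small, ungated stand-in so the example runs anywhere; swap in
# "google/gemma-2b-it" if you have access to the gated checkpoint.
model_name = "sshleifer/tiny-gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt")

# use_cache=True (the default): past key/value tensors are stored, so each
# new token only attends over cached states instead of re-encoding the prefix.
with torch.no_grad():
    out_cached = model.generate(
        **inputs,
        max_new_tokens=20,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# use_cache=False: every decoding step re-runs attention over the full
# sequence — slower, but with greedy decoding it should produce the same tokens.
with torch.no_grad():
    out_uncached = model.generate(
        **inputs,
        max_new_tokens=20,
        use_cache=False,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out_cached[0], skip_special_tokens=True))
```

On a tiny model the speed difference is negligible, but on a 2B model the uncached run slows down noticeably as the sequence grows, since each step re-processes the whole prefix.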