KV Caching for Instruction-Tuned Models

Hi,

I was wondering how we could implement KV Caching to speed up token generation.

Can anyone help me with the necessary modifications to the code shared below?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Hey @manju, the process works the same as the one demonstrated in lesson 1.

If you use the HF transformers model.generate function, it already employs KV caching by default: use_cache defaults to True in the generation config, so your snippet doesn't need any modification. You can pass use_cache=False if you want to see how generation behaves without the cache.
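
For example, here is a minimal sketch (reusing the same google/gemma-2b-it checkpoint and prompt from your snippet) that times generation with and without the KV cache so you can observe the effect. The timing approach is just an illustration; the exact numbers will depend on your hardware and sequence length.

import time

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
# The tokenizer returns a dict with input_ids and attention_mask
inputs = tokenizer(input_text, return_tensors="pt")

for use_cache in (True, False):
    start = time.time()
    # use_cache=True reuses the cached key/value tensors of previous tokens,
    # use_cache=False recomputes attention over the full prefix at every step
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=use_cache)
    elapsed = time.time() - start
    print(f"use_cache={use_cache}: {elapsed:.2f}s for {outputs.shape[-1]} tokens")

The gap grows with longer generations, since without the cache each new token has to recompute the keys and values for the entire prefix instead of reusing them.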