KV Caching for Instruction-Tuned Models

Hi,

I was wondering how we could implement KV Caching to speed up token generation.

Can anyone help me with the necessary modifications to the code shared below?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Hey @manju, the process works the same as the one demonstrated in lesson 1.

If you use the HF transformers model.generate function, it already employs KV caching by default: use_cache defaults to True in the generation config, so your snippet doesn't need any modification. You can pass use_cache=False if you want to see how generation behaves without the cache.
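
For example, here is a minimal sketch (reusing the same google/gemma-2b-it checkpoint and prompt from your snippet) that times generation with and without the KV cache so you can observe the effect. The timing approach is just an illustration; the exact numbers will depend on your hardware and sequence length.

import time

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
# The tokenizer returns a dict with input_ids and attention_mask
inputs = tokenizer(input_text, return_tensors="pt")

for use_cache in (True, False):
    start = time.time()
    # use_cache=True reuses the cached key/value tensors of previous tokens,
    # use_cache=False recomputes attention over the full prefix at every step
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=use_cache)
    elapsed = time.time() - start
    print(f"use_cache={use_cache}: {elapsed:.2f}s for {outputs.shape[-1]} tokens")

The gap grows with longer generations, since without the cache each new token has to recompute the keys and values for the entire prefix instead of reusing them.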