Thanks! Quick note on X = torch.randn(…) vs using the model’s actual hidden states
Hi everyone,
Thank you for putting these lessons together. I'm working through the LoRA section and wanted to share a small observation (and I might be misunderstanding the intent, so please feel free to correct me).
What confused me
In the example, after generating a token from input_ids, the code later introduces a new random tensor:
```python
# dummy input tensor
# shape: (batch_size, sequence_length, hidden_size)
X = torch.randn(1, 8, 1024)
```
If we then apply a LoRA-style adapter using this X, we’re no longer operating on the same activations produced by the model for the given input_ids. So it seems expected that downstream outputs/logits (and any argmax token) could differ — not necessarily because LoRA “changed the model” in a controlled way, but because the input to the layer changed.
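To make that concrete, here is a tiny standalone sketch (toy shapes, not the lesson's model): the same layer produces different outputs for a random X than for the X the model actually computed from input_ids, so any before/after token comparison conflates "LoRA changed the layer" with "the input changed".

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 8)
emb = torch.nn.Embedding(10, 8)
input_ids = torch.LongTensor([[1, 2, 3]])

X_real = emb(input_ids)        # activations the model computes for these ids
X_rand = torch.randn(1, 3, 8)  # unrelated random tensor of the same shape

# Same layer, different inputs -> different outputs, adapter or no adapter.
print(torch.allclose(layer(X_real), layer(X_rand)))  # False
```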
What I think is clearer (reusing the “real” X)
When the goal is to illustrate LoRA as an additive low-rank update for the same hidden states, it helps to reuse the hidden states that already exist for the same input_ids, e.g.:
```python
X = model.embedding(input_ids)
```
(or equivalently, capture the embedding output with a forward hook).
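For reference, the hook variant can look like this (a minimal sketch using a bare `Embedding` as a stand-in for the model's embedding submodule):

```python
import torch

emb = torch.nn.Embedding(10, 4)
captured = {}

def save_output(module, inputs, output):
    # stash the submodule's output the next time it runs
    captured["X"] = output.detach()

handle = emb.register_forward_hook(save_output)
emb(torch.LongTensor([[0, 1, 2]]))  # forward pass fills captured["X"]
handle.remove()

print(captured["X"].shape)  # torch.Size([1, 3, 4])
```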
Below is a minimal reproducible snippet that:
- Generates a token from input_ids
- Reuses X = model.embedding(input_ids) (not torch.randn)
- Wraps model.linear with a LoRA-style module
Repro code
```python
import torch
import math

# ----- toy model -----
class TestModel(torch.nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(10, hidden_size)
        self.linear = torch.nn.Linear(hidden_size, hidden_size)
        self.lm_head = torch.nn.Linear(hidden_size, 10)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.linear(x)
        x = self.lm_head(x)
        return x

detokenizer = [
    "red", "orange", "yellow", "green", "blue",
    "indigo", "violet", "magenta", "marigold", "chartreuse",
]

def generate_token(model, input_ids):
    with torch.no_grad():
        logits = model(input_ids)
    last_logits = logits[:, -1, :]
    next_token_id = last_logits.argmax(dim=1).item()
    return detokenizer[next_token_id]

# ----- set seed BEFORE creating the model for reproducible weights -----
torch.manual_seed(0)
hidden_size = 1024
model = TestModel(hidden_size)
input_ids = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])
print("Before LoRA:", generate_token(model, input_ids))

# ----- IMPORTANT: use the model's actual hidden states, not torch.randn -----
with torch.no_grad():
    X = model.embedding(input_ids)  # (batch, seq, hidden)
print("X shape:", X.shape)

# ----- LoRA wrapper -----
class LoraLayer(torch.nn.Module):
    def __init__(self, base_layer: torch.nn.Linear, r: int):
        super().__init__()
        self.base_layer = base_layer
        in_features = base_layer.in_features
        out_features = base_layer.out_features
        # Trainable LoRA params: A gets a standard init, B starts at zero,
        # so the adapter contributes exactly zero before any training.
        self.lora_a = torch.nn.Parameter(torch.empty(in_features, r))
        torch.nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.lora_b = torch.nn.Parameter(torch.zeros(r, out_features))

    def forward(self, x):
        y_base = self.base_layer(x)              # x @ W.T + b
        y_lora = x @ self.lora_a @ self.lora_b   # low-rank additive update
        return y_base + y_lora

# Replace the linear layer with the LoRA-wrapped linear
model.linear = LoraLayer(model.linear, r=2)

# sanity check: the LoRA layer works on the same X shape
with torch.no_grad():
    print("LoRA linear(X) shape:", model.linear(X).shape)

print("After LoRA (no training):", generate_token(model, input_ids))
```

I got:

```
Before LoRA: green
X shape: torch.Size([1, 8, 1024])
LoRA linear(X) shape: torch.Size([1, 8, 1024])
After LoRA (no training): green
```

Since lora_b is zero-initialized, the adapter is an exact no-op before training, so the generated token matches the pre-LoRA one. One gotcha to watch for: torch.empty returns uninitialized memory, so if lora_a is left without an explicit init (e.g. the kaiming_uniform_ call above), it can contain NaN/inf and change the output even while lora_b is all zeros.
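For completeness, the "B = 0 implies no-op" property can be checked in isolation (a quick standalone sketch with toy shapes, independent of the model above): with B all zeros, x @ A @ B is exactly zero, so the wrapped layer's output is bitwise identical to the base layer's.

```python
import torch

torch.manual_seed(0)
base = torch.nn.Linear(16, 16)
A = torch.randn(16, 2)   # any finite init works here
B = torch.zeros(2, 16)   # zero init -> adapter contributes exactly 0
x = torch.randn(1, 8, 16)

y_base = base(x)
y_lora = y_base + x @ A @ B  # LoRA-style additive update

print(torch.equal(y_base, y_lora))  # True
```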