Lesson 5: Low-Rank Adaptation, Quick note on X = torch.randn(...)

Thanks! Quick note on X = torch.randn(…) vs using the model’s actual hidden states

Hi everyone,
Thank you for putting these lessons together. I'm working through the LoRA section and wanted to share a small observation (I might be misunderstanding the intent, so please feel free to correct me).

What confused me

In the example, after generating a token from input_ids, the code later introduces a new random tensor:

# dummy input tensor
# shape: (batch_size, sequence_length, hidden_size)
X = torch.randn(1, 8, 1024)

If we then apply a LoRA-style adapter to this X, we're no longer operating on the activations the model actually produced for the given input_ids. So it's expected that downstream outputs/logits (and any argmax token) could differ: not necessarily because LoRA "changed the model" in a controlled way, but simply because the input to the layer changed.
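As a tiny illustration of that point (a hypothetical toy head, not the lesson's model): the same linear layer with the same weights produces different logits as soon as the input tensor changes, so an argmax taken downstream can change too.

```python
import torch

torch.manual_seed(0)
lm_head = torch.nn.Linear(1024, 10)   # toy output head

real_X = torch.randn(1, 8, 1024)      # stands in for the model's actual hidden states
other_X = torch.randn(1, 8, 1024)     # a fresh random tensor, like the lesson's X

with torch.no_grad():
    logits_real = lm_head(real_X)
    logits_other = lm_head(other_X)

# same layer, same weights -- but different inputs give different logits
print(torch.allclose(logits_real, logits_other))  # False
```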

What I think is clearer (reusing the “real” X)

When the goal is to illustrate LoRA as an additive low-rank update applied to the same hidden states, it helps to reuse the activations the model already produces for input_ids, e.g.:

X = model.embedding(input_ids)

(or equivalently, capture the embedding output with a forward hook).
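In case it's useful, here's roughly what the hook-based version could look like (a minimal standalone sketch with a small stand-in model, not the lesson's code):

```python
import torch

# Stand-in model (assumption: a small embedding + linear, like the toy model below)
embedding = torch.nn.Embedding(10, 4)
model = torch.nn.Sequential(embedding, torch.nn.Linear(4, 4))

input_ids = torch.LongTensor([[0, 1, 2, 3]])

captured = {}

def save_output(module, inputs, output):
    # store the activations this layer produced during the forward pass
    captured["X"] = output.detach()

handle = embedding.register_forward_hook(save_output)
with torch.no_grad():
    model(input_ids)   # one ordinary forward pass
handle.remove()        # clean up so the hook doesn't fire again

X = captured["X"]
# X matches calling the embedding directly on the same input_ids
assert torch.equal(X, embedding(input_ids))
print(X.shape)  # torch.Size([1, 4, 4])
```

The advantage of the hook over calling `model.embedding(input_ids)` directly is that it also works when the layer of interest is buried deep inside a model you'd rather not take apart.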

Below is a minimal reproducible snippet that:

  1. Generates a token from input_ids

  2. Reuses X = model.embedding(input_ids) (not torch.randn)

  3. Wraps model.linear with a LoRA-style module

Repro code

import torch

# ----- toy model -----
class TestModel(torch.nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(10, hidden_size)
        self.linear = torch.nn.Linear(hidden_size, hidden_size)
        self.lm_head = torch.nn.Linear(hidden_size, 10)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.linear(x)
        x = self.lm_head(x)
        return x

detokenizer = [
    "red","orange","yellow","green","blue",
    "indigo","violet","magenta","marigold","chartreuse",
]

def generate_token(model, input_ids):
    with torch.no_grad():
        logits = model(input_ids)
    last_logits = logits[:, -1, :]
    next_token_id = last_logits.argmax(dim=1).item()
    return detokenizer[next_token_id]

# ----- set seed BEFORE creating the model for reproducible weights -----
torch.manual_seed(0)
hidden_size = 1024
model = TestModel(hidden_size)

input_ids = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])

print("Before LoRA:", generate_token(model, input_ids))

# ----- IMPORTANT: use the model's actual hidden states, not torch.randn -----
with torch.no_grad():
    X = model.embedding(input_ids)  # (batch, seq, hidden)

print("X shape:", X.shape)

# ----- LoRA wrapper -----
class LoraLayer(torch.nn.Module):
    def __init__(self, base_layer: torch.nn.Linear, r: int):
        super().__init__()
        self.base_layer = base_layer
        in_features = base_layer.in_features
        out_features = base_layer.out_features

        # Trainable LoRA params
        # NOTE: torch.empty leaves lora_a uninitialized (stale memory, which
        # can even contain NaN/inf); the LoRA paper initializes A with
        # Kaiming-uniform values and B with zeros.
        self.lora_a = torch.nn.Parameter(torch.empty(in_features, r))
        self.lora_b = torch.nn.Parameter(torch.zeros(r, out_features))

        # lora_b starts at 0 -> with a finite lora_a, output initially unchanged

    def forward(self, x):
        y_base = self.base_layer(x)  # x @ W.T + b
        y_lora = (x @ self.lora_a @ self.lora_b)
        return y_base + y_lora

# Replace the linear layer with LoRA-wrapped linear
model.linear = LoraLayer(model.linear, r=2)

# sanity check: LoRA layer works on the same X shape
with torch.no_grad():
    print("LoRA linear(X) shape:", model.linear(X).shape)

print("After LoRA (no training):", generate_token(model, input_ids))



I got:

Before LoRA: green
X shape: torch.Size([1, 8, 1024])
LoRA linear(X) shape: torch.Size([1, 8, 1024])
After LoRA (no training): red
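One possible explanation for the token changing even though lora_b is zero: torch.empty returns uninitialized memory, so lora_a may hold NaN or inf, and NaN multiplied by zero is still NaN (an all-NaN logit row then argmaxes to index 0, i.e. "red"). With any finite initialization for lora_a, the zero lora_b really does make the untrained adapter an exact no-op, as this standalone sketch checks (plain randn standing in for the usual Kaiming init):

```python
import torch

torch.manual_seed(0)
base = torch.nn.Linear(16, 16)
lora_a = torch.nn.Parameter(torch.randn(16, 2))   # finite init (stand-in for Kaiming uniform)
lora_b = torch.nn.Parameter(torch.zeros(2, 16))   # zeros, as in the LoRA paper

x = torch.randn(1, 8, 16)
with torch.no_grad():
    y_base = base(x)
    y_wrapped = base(x) + x @ lora_a @ lora_b     # LoRA-style forward

# with B = 0 and finite A, the adapter contributes exactly zero
print("no-op:", torch.equal(y_wrapped, y_base))  # no-op: True
```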

Thank you @taless474 for taking the course and sharing this comment!

It is an interesting finding and I'll share it with the team :slight_smile:
