Ungraded lab: Attention - Issue with function DotProductAttention()

I’m unsure of this function’s correctness:

import numpy as np
import scipy.special

def DotProductAttention(query, key, value, mask, scale=True):
    """Dot product self-attention.

    Args:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_k by d) where L_v = L_k
        mask (numpy.ndarray): attention-mask, gates attention with shape (L_q by L_k)
        scale (bool): whether to scale the dot product of the query and transposed key

    Returns:
        numpy.ndarray: Self-attention array for q, k, v arrays. (L_q by L_k)
    """

    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    if scale:
        depth = query.shape[-1]
    else:
        depth = 1

    # Calculate scaled query key dot product according to formula above
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth) 
    # Apply the mask
    if mask is not None:
        dots = np.where(mask, dots, np.full_like(dots, -1e9)) 
    # Softmax formula implementation
    # Use scipy.special.logsumexp on the masked scores for numerical stability
    # Note: softmax = exp(dots - logsumexp(dots)) = exp(dots) / sum(exp(dots))
    logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)

    # Take exponential of dots minus logsumexp to get softmax
    # Use np.exp()
    dots = np.exp(dots - logsumexp)

    # Multiply dots by value to get self-attention
    # Use np.matmul()
    attention = np.matmul(dots, value)
    return attention
  1. Firstly, why does the docstring say the output will be of shape (L_q by L_k)? Should it not be (L_q by d)?
  2. If the function is implementing self-attention, do we not expect L_q = L_k? This is further reinforced by the function that follows it (causal attention), where the mask is a square matrix.
  3. Why does the mask have an additional dimension (a batch dimension) in the causal attention special-case implementation:
def dot_product_self_attention(q, k, v, scale=True):
    """ Masked dot product self attention.

    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.

    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    # Creates a matrix with ones on and below the diagonal and zeros above. It should have shape (1, mask_size, mask_size)
    # Use np.tril() - Lower triangle of an array and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype=np.bool_), k=0)  
    return DotProductAttention(q, k, v, mask, scale=scale)
  4. Lastly, do we expect Q = K for self-attention?


Hi @Mohammad_Atif_Khan

These are good questions, and since nobody has answered them for a while, let me offer my take:

On question 1: yes, I think the docstring is wrong. The function returns np.matmul(dots, value), so the output has shape (L_q by d). I will submit it for correction.
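
As a quick sanity check on the shapes, here is a minimal NumPy sketch. The `attention` helper and the sizes are made up for illustration (this is not the lab's code, and it skips the mask), but it follows the same softmax(q kᵀ / sqrt(d)) v recipe:

```python
import numpy as np

def attention(query, key, value):
    """Unmasked scaled dot-product attention for 2-D inputs (illustrative helper)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])      # (L_q, L_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights, weights @ value                        # (L_q, L_k), (L_q, d)

rng = np.random.default_rng(0)
L_q, L_k, d = 3, 5, 4                                      # arbitrary sizes
weights, out = attention(rng.normal(size=(L_q, d)),
                         rng.normal(size=(L_k, d)),
                         rng.normal(size=(L_k, d)))

print(weights.shape)  # (3, 5): the attention weights are L_q by L_k
print(out.shape)      # (3, 4): the returned attention is L_q by d
```

So (L_q by L_k) describes the intermediate weight matrix, not the returned array.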

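On question 2: I’m not sure I understand the question. By definition, self-attention draws the queries, keys, and values from the same input sequence, so L_q = L_k = L_v.

That said, the function DotProductAttention is not limited to self-attention. It also works when the queries come from a different sequence than the keys and values, in which case L_q and L_k can differ.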

Here, if the mask is such that each position “pays attention” only to itself and previous inputs, the result is (causal) self-attention. If you pass mask=None, the result is just dot-product attention, and with scale=True it is scaled dot-product attention. Btw, I hate this kind of terminology in ML :slight_smile:
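
To make the difference concrete, here is a small sketch comparing the attention weights with no mask and with a lower-triangular mask. The `softmax_rows` helper and the sizes are my own illustrative choices, not part of the lab:

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax (illustrative helper)."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d = 4, 3                                   # arbitrary sizes
x = rng.normal(size=(L, d))                   # q = k = x here, purely for brevity
scores = x @ x.T / np.sqrt(d)                 # scale=True; drop the division for scale=False

# mask=None: every position attends to every position
plain = softmax_rows(scores)

# Lower-triangular mask: position i attends only to positions <= i
causal_mask = np.tril(np.ones((L, L), dtype=np.bool_))
causal = softmax_rows(np.where(causal_mask, scores, -1e9))

print(np.all(plain > 0))                       # True
print(np.allclose(np.triu(causal, k=1), 0.0))  # True: no weight on later positions
```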

On question 3: if you are asking about the first dimension (the “1” in (1, mask_size, mask_size)), it is not strictly necessary here, because the same causal mask applies to every batch element and NumPy broadcasts it over the batch either way. I guess an explicit batch dimension becomes relevant when the batch elements need different masks.
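
Here is a small check of that broadcasting behaviour, using the same np.where call as DotProductAttention. The batch size, sequence length, and the all-zero `dots` array are stand-ins I picked for illustration:

```python
import numpy as np

batch, seq_len = 2, 4                          # arbitrary sizes
dots = np.zeros((batch, seq_len, seq_len))     # stand-in for the batched q @ k.T scores

# The mask as built in the lab, with a leading dimension of 1 ...
mask_3d = np.tril(np.ones((1, seq_len, seq_len), dtype=np.bool_), k=0)
# ... and the same mask without it
mask_2d = np.tril(np.ones((seq_len, seq_len), dtype=np.bool_), k=0)

masked_3d = np.where(mask_3d, dots, np.full_like(dots, -1e9))
masked_2d = np.where(mask_2d, dots, np.full_like(dots, -1e9))

print(masked_3d.shape)                         # (2, 4, 4)
print(np.array_equal(masked_3d, masked_2d))    # True
```

Both masks broadcast against the batched scores and give the same result; a mask of shape (batch, seq_len, seq_len) would only be needed if each batch element had its own pattern (padding, for example).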

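On question 4: again, the inputs are the same for self-attention, but the values of the Q and K tensors are most definitely different, since they are normally produced from that input by separate learned projections.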
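
A rough sketch of what I mean, assuming the usual Transformer setup where Q and K are linear projections of one input. The names `W_q` and `W_k` and the random weights are hypothetical placeholders; in a real model they are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                        # arbitrary sizes
x = rng.normal(size=(seq_len, d_model))        # one shared input sequence

# Hypothetical projection matrices (random here, learned in a trained model)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))

q = x @ W_q
k = x @ W_k

print(q.shape == k.shape)   # True: same sequence length, so L_q = L_k
print(np.allclose(q, k))    # False: the values differ
```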