# Ungraded lab: Attention - Issue with function DotProductAttention()

I’m unsure of this function’s correctness:

```python
import numpy as np
import scipy.special


def DotProductAttention(query, key, value, mask, scale=True):
    """Dot product self-attention.
    Args:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_v by d) where L_v = L_k
        mask (numpy.ndarray): attention-mask, gates attention, shape (L_q by L_k)
        scale (bool): whether to scale the dot product of the query and transposed key

    Returns:
        numpy.ndarray: Self-attention array for q, k, v arrays. (L_q by L_k)
    """
    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    if scale:
        depth = query.shape[-1]
    else:
        depth = 1

    # Calculate scaled query key dot product according to formula above
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth)

    # Apply the mask: positions where mask is False get a large negative score
    dots = np.where(mask, dots, np.full_like(dots, -1e9))

    # Softmax formula implementation
    # Use scipy.special.logsumexp of the masked dots to avoid overflow when exponentiating large numbers
    # Note: softmax = e^(dots - logsumexp(dots)) = e^dots / sumexp(dots)
    logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)

    # Take exponential of dots minus logsumexp to get softmax
    # Use np.exp()
    dots = np.exp(dots - logsumexp)

    # Multiply dots by value to get self-attention
    # Use np.matmul()
    attention = np.matmul(dots, value)

    return attention
```
1. Firstly, why does the docstring say the output will be of shape (L_q by L_k)? Should it not be (L_q by d)?
2. If the function is implementing self-attention, do we not expect L_q = L_k? This is further reinforced by the function following this (causal attention), where the mask is a square matrix.
3. Why does the mask have the additional (batch) dimension in the causal-attention special case implementation:
```python
def dot_product_self_attention(q, k, v, scale=True):
    """Masked dot product self attention.
    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    # Creates a matrix with ones below the diagonal and 0s above. It should have shape (1, mask_size, mask_size)
    # Use np.tril() - Lower triangle of an array and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype=np.bool_))

    return DotProductAttention(q, k, v, mask, scale=scale)
```
4. Lastly, do we expect Q = K for self-attention?

thanks.

These are good questions, and since nobody has answered them for a while, let me offer my take:

Yes, I think the docstring is wrong: the function returns an array of shape (L_q by d), not (L_q by L_k). I will submit it for correction.
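For what it's worth, here is a quick sketch (the shapes are my own illustrative choices) showing that the intermediate attention *weights* have shape (L_q, L_k), while the *returned* tensor has shape (L_q, d):

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative shapes: L_q = 3 queries, L_k = 5 keys, embedding depth d = 4.
rng = np.random.default_rng(0)
L_q, L_k, d = 3, 5, 4
query = rng.standard_normal((L_q, d))
key = rng.standard_normal((L_k, d))
value = rng.standard_normal((L_k, d))

# Scaled dot-product attention without a mask (same math as the lab function).
dots = query @ key.T / np.sqrt(d)                                 # (L_q, L_k): one score per query-key pair
weights = np.exp(dots - logsumexp(dots, axis=-1, keepdims=True))  # softmax over each row
attention = weights @ value                                       # (L_q, d): weighted sum of value rows

print(weights.shape)    # (3, 5) -- the (L_q, L_k) array of attention weights
print(attention.shape)  # (3, 4) -- the returned tensor is (L_q, d)
```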

I’m not sure I understand the question. By definition, self-attention uses the same input sequence for queries, keys, and values, so L_q = L_k = L_v.

However, the function `DotProductAttention` is not limited to self-attention; it can also be applied to attention between two different sequences, where L_q and L_k differ.

Here, if the mask is set up so that each position only “pays attention” to previous inputs, the result is masked (causal) self-attention. If you pass mask = None, the result is plain dot-product attention, and with scale=True it becomes scaled dot-product attention. (By the way, I dislike this kind of terminology in ML.)
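To make the three variants concrete, here is a small self-contained sketch. Note that `attention` below is my own helper with an explicit `mask is not None` branch (the lab function as written assumes a mask is always given); the math is otherwise the same:

```python
import numpy as np
from scipy.special import logsumexp

def attention(q, k, v, mask=None, scale=True):
    """Sketch of (scaled, optionally masked) dot-product attention."""
    depth = q.shape[-1] if scale else 1
    dots = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(depth)
    if mask is not None:  # my addition: skip masking when no mask is passed
        dots = np.where(mask, dots, np.full_like(dots, -1e9))
    weights = np.exp(dots - logsumexp(dots, axis=-1, keepdims=True))
    return np.matmul(weights, v)

L, d = 4, 3
rng = np.random.default_rng(1)
x = rng.standard_normal((L, d))

plain = attention(x, x, x)                                         # scaled dot-product attention
causal = attention(x, x, x, mask=np.tril(np.ones((L, L), bool)))   # masked (causal) self-attention
print(plain.shape, causal.shape)  # (4, 3) (4, 3)
```

With the causal mask, position 0 can only attend to itself, so the first output row equals the first value row.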

You are asking about the first dimension (the “1” in (1, mask_size, mask_size))? That leading axis is not strictly necessary here, since in self-attention every batch element uses the same mask, and NumPy broadcasting expands the size-1 axis across the batch. It would become relevant if different batch elements needed different masks.
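A tiny sketch of that broadcasting behaviour (the shapes are illustrative): a single (1, L, L) mask applies unchanged to every element of a batch of score matrices.

```python
import numpy as np

batch, L = 2, 4
# One causal mask with a leading batch axis of size 1 ...
mask = np.tril(np.ones((1, L, L), dtype=bool))
scores = np.zeros((batch, L, L))
# ... broadcasts over the batch axis, so both batch elements share it.
masked = np.where(mask, scores, -1e9)
print(masked.shape)  # (2, 4, 4)
```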

Again, the inputs are the same for self-attention, but the values of the Q and K tensors are most definitely different, because they are produced by different learned projections of that shared input.
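A minimal illustration of that last point, using made-up projection matrices in place of learned weights: the same input `x` feeds both projections, yet Q and K come out different.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 4, 3
x = rng.standard_normal((L, d))   # the same input sequence feeds both projections

# Stand-ins for learned projection matrices; in a trained model these differ.
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

Q = x @ W_q
K = x @ W_k
print(np.allclose(Q, K))  # False: same input, different projections
```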

Cheers