I’m unsure of this function’s correctness:
import numpy as np
import scipy.special


def DotProductAttention(query, key, value, mask, scale=True):
    """Dot product self-attention.
    Args:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_k by d) where L_v = L_k
        mask (numpy.ndarray): attention-mask, gates attention with shape (L_q by L_k)
        scale (bool): whether to scale the dot product of the query and transposed key
    Returns:
        numpy.ndarray: Self-attention array for q, k, v arrays. (L_q by L_k)
    """
    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"
    # Save the depth/dimension of the query embedding for scaling down the dot product
    if scale:
        depth = query.shape[-1]
    else:
        depth = 1
    # Calculate the scaled query-key dot product according to the attention formula
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth)
    # Apply the mask: positions where the mask is False get a large negative score
    if mask is not None:
        dots = np.where(mask, dots, np.full_like(dots, -1e9))
    # Softmax implementation
    # Use scipy.special.logsumexp on the masked scores to avoid numerical overflow when exponentiating large values
    # Note: softmax = e^(dots - logsumexp(dots)) = e^dots / sum(e^dots)
    logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)
    # Take the exponential of dots minus logsumexp to get the softmax (attention weights)
    dots = np.exp(dots - logsumexp)
    # Multiply the attention weights by value to get the self-attention output
    attention = np.matmul(dots, value)
    return attention
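For reference, here is a quick shape check I ran with toy arrays (just a sketch; the sizes and random values are arbitrary, and it assumes the imports above):

L_q, L_k, d = 3, 5, 4
rng = np.random.default_rng(0)
query = rng.normal(size=(L_q, d))
key = rng.normal(size=(L_k, d))
value = rng.normal(size=(L_k, d))
mask = np.ones((L_q, L_k), dtype=np.bool_)   # all-True mask, so nothing is gated
out = DotProductAttention(query, key, value, mask)
print(out.shape)  # prints (3, 4), i.e. (L_q, d) rather than (L_q, L_k)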
- Firstly, why does the docstring say the output will be of shape (L_q by L_k)? Should it not be (L_q by d), as the quick shape check above suggests?
- If the function is implementing self-attention, do we not expect L_q = L_k? This is further reinforced by the function that follows (causal attention), where the mask is a square matrix.
- Why does the mask have the additional dimension (the batch dimension, it seems) in the causal attention special case implementation below (a small broadcasting check follows the code):
def dot_product_self_attention(q, k, v, scale=True):
    """Masked dot product self-attention.
    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
        scale (bool): whether to scale the dot product of the query and transposed key.
    Returns:
        numpy.ndarray: masked dot product self-attention tensor.
    """
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]
    # Create a causal mask with ones on and below the diagonal and zeros above.
    # It has shape (1, mask_size, mask_size).
    # Use np.tril() (lower triangle of an array) and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype=np.bool_), k=0)
    return DotProductAttention(q, k, v, mask, scale=scale)
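For what it's worth, the leading 1 looks like it is there so the same mask broadcasts over a batch dimension. Here is the small check I tried (a sketch with made-up shapes, assuming batched q, k, v of shape (batch, L, d)):

batch, L, d = 2, 5, 4
rng = np.random.default_rng(1)
q = rng.normal(size=(batch, L, d))
k = rng.normal(size=(batch, L, d))
v = rng.normal(size=(batch, L, d))
out = dot_product_self_attention(q, k, v)
print(out.shape)  # (2, 5, 4): the (1, L, L) mask broadcasts against the (batch, L, L) scores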
- Lastly, do we expect Q = K for self-attention? My current understanding is sketched below.
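The way I picture it, in self-attention q, k and v all come from the same input sequence but through different learned projections, so they end up with the same length without being equal. The projection matrices W_q, W_k, W_v below are my own made-up names, purely for illustration:

L, d_model, d = 5, 8, 4
rng = np.random.default_rng(2)
x = rng.normal(size=(L, d_model))       # one input sequence
W_q = rng.normal(size=(d_model, d))     # hypothetical learned projections
W_k = rng.normal(size=(d_model, d))
W_v = rng.normal(size=(d_model, d))
q, k, v = x @ W_q, x @ W_k, x @ W_v     # same source and length, but q != k in general
print(np.allclose(q, k))                # False: L_q = L_k, yet Q != K
out = DotProductAttention(q, k, v, mask=None)
print(out.shape)                        # (5, 4), i.e. (L, d)

Is that the right way to think about it?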
thanks.