Ungraded lab: Attention - Issue with function DotProductAttention()

Mohammad_Atif_Khan · December 28, 2022, 6:42pm

I’m unsure of this function’s correctness:

import numpy as np
import scipy.special

def DotProductAttention(query, key, value, mask, scale=True):
    """Dot product self-attention.
    Args:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_k by d) where L_v = L_k
        mask (numpy.ndarray): attention-mask, gates attention with shape (L_q by L_k)
        scale (bool): whether to scale the dot product of the query and transposed key

    Returns:
        numpy.ndarray: Self-attention array for q, k, v arrays. (L_q by L_k)
    """

    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    if scale: 
        depth = query.shape[-1]
    else:
        depth = 1

    # Calculate scaled query key dot product according to formula above
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth) 
    
    # Apply the mask
    if mask is not None:
        dots = np.where(mask, dots, np.full_like(dots, -1e9)) 
    
    # Softmax formula implementation
    # Use scipy.special.logsumexp of dots for a numerically stable softmax
    # Note: softmax = e^(dots - logsumexp(dots)) = e^dots / sumexp(dots)
    logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)

    # Take exponential of dots minus logsumexp to get softmax
    # Use np.exp()
    dots = np.exp(dots - logsumexp)

    # Multiply dots by value to get self-attention
    # Use np.matmul()
    attention = np.matmul(dots, value)
    
    return attention

Firstly, why does the docstring say the output will be of shape (L_q by L_k)? Should it not be (L_q by d)?
If the function is implementing self-attention, do we not expect L_q = L_k? This is further reinforced by the function that follows it (causal attention), where the mask is a square matrix.
Why does the mask have an additional dimension (a batch dimension) in the causal attention special-case implementation:

def dot_product_self_attention(q, k, v, scale=True):
    """ Masked dot product self attention.
    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    # Creates a matrix with ones below the diagonal and 0s above. It should have shape (1, mask_size, mask_size)
    # Use np.tril() - Lower triangle of an array and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype=np.bool_), k=0)  
        
    return DotProductAttention(q, k, v, mask, scale=scale)

Lastly, do we expect Q=K for self-attention?

Thanks.

arvyzukai · January 12, 2023, 4:30pm

Hi @Mohammad_Atif_Khan

These are good questions and since nobody answered them for a while, let me offer my take:

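Yes, I think the docstring is wrong: the function returns the softmax weights (L_q by L_k) multiplied by value (L_k by d), so the output shape should be (L_q by d). I will submit it for correction.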
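To make the shape argument concrete, here is a minimal sketch that checks it numerically. `attention_sketch` is a hypothetical, NumPy-only stand-in for the lab's DotProductAttention (it uses a max-subtracted softmax instead of scipy.special.logsumexp), and the sizes L_q = 3, L_k = 5, d = 4 are arbitrary values chosen for illustration.

```python
import numpy as np

def attention_sketch(query, key, value, mask=None, scale=True):
    """Hypothetical compact stand-in for DotProductAttention (NumPy only)."""
    depth = query.shape[-1] if scale else 1
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth)
    if mask is not None:
        dots = np.where(mask, dots, -1e9)
    # Numerically stable softmax over the key axis
    dots = dots - dots.max(axis=-1, keepdims=True)
    weights = np.exp(dots)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Return the weights too, so both shapes can be inspected
    return weights, np.matmul(weights, value)

rng = np.random.default_rng(0)
L_q, L_k, d = 3, 5, 4  # arbitrary sizes, chosen so that L_q != L_k
q = rng.normal(size=(L_q, d))
k = rng.normal(size=(L_k, d))
v = rng.normal(size=(L_k, d))

weights, out = attention_sketch(q, k, v)
print(weights.shape)  # (3, 5): the attention weights are L_q by L_k
print(out.shape)      # (3, 4): the returned attention is L_q by d
```

Only the intermediate weight matrix has shape (L_q by L_k); the value multiplication maps it back to the embedding dimension.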

I’m not sure I understand the question. By definition, self-attention takes the same inputs, so L_q = L_k = L_v.

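The DotProductAttention function is not limited to self-attention, though, so it does not require L_q = L_k in general (for example, when the queries and keys come from different sequences).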

Here, if the mask is set so that each position only “pays attention” to itself and previous inputs, the result is causal (masked) self-attention. If you pass None as the mask, the result is plain dot-product attention, and with scale=True it becomes scaled dot-product attention. Btw, I hate this kind of terminology in ML.
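As a rough illustration of how the mask and scale arguments change the attention weights, the sketch below uses a hypothetical helper, `softmax_weights`, which mirrors only the score and softmax steps of DotProductAttention in plain NumPy. The input sizes are arbitrary.

```python
import numpy as np

def softmax_weights(q, k, mask=None, scale=True):
    """Hypothetical helper: attention weights only, mirroring the lab's logic."""
    depth = q.shape[-1] if scale else 1
    dots = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(depth)
    if mask is not None:
        dots = np.where(mask, dots, -1e9)
    dots = dots - dots.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(dots)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # one sequence: 4 positions, depth 8 (arbitrary)

# Lower-triangular mask: position i may look at positions 0..i only
causal = np.tril(np.ones((4, 4), dtype=np.bool_))

w_plain = softmax_weights(x, x, mask=None, scale=False)   # dot-product attention
w_scaled = softmax_weights(x, x, mask=None, scale=True)   # scaled dot-product attention
w_causal = softmax_weights(x, x, mask=causal, scale=True) # causal self-attention

# With the causal mask, every weight above the diagonal is zero
print(np.allclose(np.triu(w_causal, k=1), 0.0))  # True
```

The same function covers all three cases; only the arguments differ, which is why the naming can feel arbitrary.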

Are you asking about the first dimension (the “1” in (1, mask_size, mask_size))? It is not strictly necessary: in self-attention the same causal mask is used for every sequence, and NumPy broadcasting applies it across the batch either way. I guess the extra dimension becomes relevant when the batch elements have different masks.
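The broadcasting behaviour can be checked with a small sketch. The batch size, sequence length, and the per-example `lengths` array are made-up values for illustration; only np.tril, np.where, and standard broadcasting are used.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L = 2, 3  # arbitrary batch size and sequence length
dots = rng.normal(size=(B, L, L))  # stand-in for batched query-key scores

# Mask as built in the lab: shape (1, L, L); np.tril acts on the last two axes
mask_3d = np.tril(np.ones((1, L, L), dtype=np.bool_), k=0)
# Same mask without the leading dimension: shape (L, L)
mask_2d = np.tril(np.ones((L, L), dtype=np.bool_), k=0)

masked_3d = np.where(mask_3d, dots, -1e9)
masked_2d = np.where(mask_2d, dots, -1e9)
print(masked_3d.shape)                       # (2, 3, 3)
print(np.array_equal(masked_3d, masked_2d))  # True

# A per-example mask needs a real batch dimension, e.g. when sequences
# in the batch have different valid lengths (hypothetical values)
lengths = np.array([3, 2])
key_valid = np.arange(L)[None, :] < lengths[:, None]  # shape (B, L)
per_example = mask_2d[None, :, :] & key_valid[:, None, :]  # shape (B, L, L)
print(per_example.shape)  # (2, 3, 3)
```

Both mask shapes give identical results for a shared causal mask; the explicit batch axis only carries information once the masks differ per example.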

Again, the inputs are the same for self-attention, but the values of the Q and K tensors are most definitely different, since each is typically produced from that input by its own learned projection.
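A minimal sketch of that last point, assuming the usual Transformer setup in which queries and keys are separate linear projections of the same input. The names `W_q` and `W_k` and the random weights are illustrative assumptions, not taken from the lab.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, d = 4, 6, 3  # arbitrary sizes

x = rng.normal(size=(L, d_model))  # the single input shared by Q and K

# Hypothetical projection matrices (learned in a real model, random here)
W_q = rng.normal(size=(d_model, d))
W_k = rng.normal(size=(d_model, d))

q = x @ W_q
k = x @ W_k

print(q.shape == k.shape)  # True: same input length, so L_q = L_k
print(np.allclose(q, k))   # False: different projections, different values
```

So "the same inputs" refers to the sequence being attended over, not to Q and K being numerically equal.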

Cheers
