Prof Ng’s notation is a bit ambiguous. You have to watch the context to figure out whether a given d… value is a final gradient or merely a “Chain Rule” factor used to compute a final gradient. Only dW^{[l]} and db^{[l]} are full gradients, meaning that they are partial derivatives of J w.r.t. the parameter in question, so those are the only ones that are averaged over the samples. The rest are per-sample quantities (before averaging). In your particular example:

dAL = \displaystyle \frac {\partial L}{\partial AL}

So it is a vector of dimension 1 x m, with one entry per sample. Remember that J is the average of the loss values L across the samples, and the derivative of an average is the average of the derivatives. That is why the 1/m factor shows up only where dW^{[l]} and db^{[l]} are computed, not in dAL. Think about it for a sec and that should make sense.
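If it helps to see this numerically, here is a minimal NumPy sketch. It assumes a single sigmoid layer with cross-entropy loss, which is a simplification and not the course's assignment code; `sigmoid`, `cost`, and the variable names are hypothetical choices for illustration. It shows that dAL stays 1 x m with no averaging, and that the 1/m only enters at dW and db, which then match a finite-difference gradient of J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, Y):
    # J is the average of the per-sample losses L
    A = sigmoid(W @ X + b)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # shape (1, m)
    return losses.mean()

rng = np.random.default_rng(0)
n_x, m = 3, 5                      # assumed toy sizes
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.normal(size=(1, n_x)) * 0.1
b = 0.0

A = sigmoid(W @ X + b)

# Chain Rule factors: per-sample, no 1/m anywhere
dAL = -(Y / A - (1 - Y) / (1 - A))   # dL/dA, shape (1, m)
dZ = dAL * A * (1 - A)               # dL/dZ, shape (1, m)

# Full gradients of J: this is where the average over samples happens
dW = (dZ @ X.T) / m                  # shape (1, n_x)
db = dZ.sum() / m

# Finite-difference check of dJ/dW and dJ/db
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(n_x):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    dW_num[0, j] = (cost(W_plus, b, X, Y) - cost(W_minus, b, X, Y)) / (2 * eps)
db_num = (cost(W, b + eps, X, Y) - cost(W, b - eps, X, Y)) / (2 * eps)

print(dAL.shape, dW.shape)
print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```

If you divided dAL by m as well, the 1/m would be applied twice and dW would no longer agree with the numerical gradient.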