Andrew’s image is slightly misleading.
The height and width of an image are part of “r”, the resolution.
Depth “d” is the “depth of network layers”.
Width “w” is the number of channels created by multiple filters in convolutions.
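To make the three terms concrete, here is a minimal sketch in plain Python. The function `describe_network` and the toy architecture (3x3 convolutions, stride 1, "same" padding) are my own assumptions for illustration, not something from the lecture. It only tracks the shape of the feature maps so you can see which knob changes what.

```python
def describe_network(depth, width, resolution):
    """Return the (channels, height, width) shape after each conv layer.

    Hypothetical toy network: every layer is a 3x3 convolution with
    `width` filters, stride 1 and "same" padding, so the spatial size
    of the feature map is preserved from layer to layer.
    """
    img_h, img_w = resolution            # r: input image height and width
    shapes = []
    for _ in range(depth):               # d: number of stacked layers
        # w: each filter produces one output channel
        shapes.append((width, img_h, img_w))
    return shapes


base = describe_network(depth=2, width=16, resolution=(224, 224))
deeper = describe_network(depth=4, width=16, resolution=(224, 224))
wider = describe_network(depth=2, width=32, resolution=(224, 224))
larger = describe_network(depth=2, width=16, resolution=(256, 256))

print(len(base), len(deeper))    # number of layers changes with d
print(base[0][0], wider[0][0])   # number of channels changes with w
print(base[0][1:], larger[0][1:])  # image height/width changes with r
```

Note that "width" here never touches the image's width in pixels. That one lives inside the resolution.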
Here is another thread about the width.
Hope this helps.