Welcome to the community!
For your question A), there are actually two cases. One is traditional computer vision, which applies different types of “fixed” (hand-designed) filters to detect objects. The other is the deep learning approach, which uses a convolutional network; we will learn about it in Course 4 of this specialization. I think the second one is more intuitive and also a good way to understand deep learning.
Here is Andrew’s picture of a convolutional neural network.
In short, we use different types of filters to extract features from an image. As you can see, the size (height and width) gets smaller from layer to layer, while the depth (the number of channels) gets larger.
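To make the "smaller size, larger depth" pattern concrete, here is a minimal sketch that tracks the shape through a few layers. The layer list is hypothetical (I made it up for illustration; it is not the network in Andrew's picture). It only uses the standard output-size formula floor((n + 2p - f) / s) + 1.

```python
def conv_output_size(n: int, f: int, stride: int = 1, pad: int = 0) -> int:
    """Output height/width of a conv or pooling layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1


# Hypothetical layers, for illustration only:
# (name, filter size, stride, padding, output channels; None means pooling,
#  which keeps the number of channels unchanged)
layers = [
    ("conv1", 7, 2, 3, 64),
    ("pool1", 2, 2, 0, None),
    ("conv2", 3, 1, 1, 128),
    ("pool2", 2, 2, 0, None),
    ("conv3", 3, 1, 0, 256),
]

size, channels = 224, 3  # a 224x224x3 input image
shapes = [(size, size, channels)]
for name, f, stride, pad, out_channels in layers:
    size = conv_output_size(size, f, stride, pad)
    if out_channels is not None:
        channels = out_channels
    shapes.append((size, size, channels))
    print(f"{name}: {size}x{size}x{channels}")
```

With these made-up settings, height and width shrink step by step while the channel count grows.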
Then, think about using, say, a 5x5 filter. If this filter is applied to a 224x224x3 image, it can only focus on a very small portion of the image, such as edges in a small area. But if a filter of the same size is applied to a 26x26x256 feature map, it covers a much larger area of the original image, i.e., it can detect shapes.
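A rough back-of-the-envelope calculation shows the difference. This is a simplified sketch: it assumes each cell of the feature map corresponds to an equal slice of the input image and ignores the receptive field that earlier layers have already accumulated, so the real coverage is somewhat larger. The function name is my own.

```python
def approx_coverage(image_size: int, map_size: int, filter_size: int) -> float:
    """Approximate side length, in input pixels, spanned by a filter window.

    Assumes every cell of a map_size x map_size feature map corresponds to
    image_size / map_size pixels of the original image (a simplification).
    """
    pixels_per_cell = image_size / map_size
    return filter_size * pixels_per_cell


on_image = approx_coverage(224, 224, 5)    # 5x5 filter directly on the image
on_deep_map = approx_coverage(224, 26, 5)  # 5x5 filter on a 26x26 feature map

print(f"on the image:       about {on_image:.0f} of 224 pixels per side")
print(f"on the 26x26 map:   about {on_deep_map:.0f} of 224 pixels per side")
```

So the same 5x5 window sees about 5 pixels per side on the raw image, but roughly 43 pixels per side once the map has shrunk to 26x26, which is enough to contain a shape rather than just an edge.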
And so on. This paper visualizes what each of those layers actually focuses on, which will help you understand what is happening in each layer.
For your second question B), we should start with Andrew’s example.
The traditional approach (upper figure) consists of multiple steps. Each step is an independent task and can be optimized on its own. So there are multiple human interventions: a human fixes the goal of each step, i.e., the intermediate feature map. In end-to-end learning, on the other hand, there is no intermediate feature map designed by a human. As you know, each step in the traditional approach has some weights. But each step only focuses on a small task, so its number of weights is not so large.
But in the case of end-to-end learning, the network has to handle a combination of all the tasks, i.e., the number of combinations of weights becomes huge.
This type of large (deep) neural network is hard to get to converge, since there are too many combinations of weights (just like determining the coefficients of a multivariate polynomial).
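To get a feel for the polynomial analogy: a polynomial of degree d in n variables has C(n + d, d) coefficients to determine. The sketch below only illustrates how fast that count grows with the number of variables; a neural network is not a polynomial, so take it as an analogy, not as a weight count for any real network.

```python
from math import comb


def num_coefficients(num_variables: int, degree: int) -> int:
    """Number of coefficients of a general polynomial of the given total degree.

    Counts all monomials of degree <= `degree` in `num_variables` variables,
    which is C(num_variables + degree, degree).
    """
    return comb(num_variables + degree, degree)


# Example: degree 2 in 2 variables has 6 terms: 1, x, y, x^2, xy, y^2
print(num_coefficients(2, 2))

# Fixing the degree at 5 and increasing the number of variables
for n in (2, 10, 100):
    print(f"{n} variables, degree 5: {num_coefficients(n, 5)} coefficients")
```

The more coefficients there are to pin down, the more data points you need, which is the same intuition as for a deep network with many weights.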
To make this deep network converge, we need more data and more iterations, which require computational power. (In that sense, one of the reasons end-to-end deep learning is getting so much attention is the evolution of computational power, such as GPUs.)
Hope this helps some.