Thank you for your answers.
I agree with both your points; phrasing it as a “generalized feature” was not a good choice.
Please ignore the training part for now. Let me add an example to illustrate the intuition I have and why I think it makes sense:
Item Link:
https://www.coursera.org/learn/advanced-learning-algorithms/lecture/MsbrF/demand-prediction
Prof. Andrew’s idea in the slide above was that the network could ‘possibly’ learn intermediate quantities (let us call them hypothetical features) such as ‘affordability’, ‘awareness’, etc.
Let’s assume that, after model training, we take a single test sample x_test[i] in 4-D space. Initially, this sample is represented as a (4, 1) vector, where each dimension is a feature (assuming feature independence). After training we have a transformation matrix, here w[1] of shape (4, 3), which transforms this vector and gives a[1] of shape (3, 1). We do not care what each dimension of a[1] represents, but it approximately encodes (hence the error) the same vector x (4, 1), now as a[1] (3, 1) in 3-D, i.e. in n[l] dimensions. We can kind of say we are reducing (or ‘changing’) the number of dimensions necessary to represent the same data.
Small note: the number of hypothetical features in a layer = the number of nodes n[l] in that layer.
The same is then repeated for layer 2.
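To make the shape bookkeeping concrete, here is a minimal NumPy sketch of that layer-1 step. The input values and weights are made up purely for illustration (a trained network would have learned w[1] and b[1]); only the shapes matter:

```python
import numpy as np

# Sketch only: the numbers below are invented; a trained network would have
# learned W1 (standing in for w[1]) and b1. The shapes follow the post:
# x_test is (4, 1), W1 is (4, 3), so a1 = g(W1.T @ x_test + b1) is (3, 1).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One test sample with 4 features (names roughly as in the demand-prediction slide)
x_test = np.array([[199.0],   # price
                   [20.0],    # shipping cost
                   [5.0],     # marketing spend
                   [0.7]])    # material quality   -> shape (4, 1)

W1 = np.random.randn(4, 3)    # placeholder for the learned (4, 3) transformation
b1 = np.zeros((3, 1))

a1 = sigmoid(W1.T @ x_test + b1)   # (3, 4) @ (4, 1) + (3, 1) -> (3, 1)
print(a1.shape)   # (3, 1): the same sample, now described by 3 hypothetical features
```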
Point 1: Agreed, these hypothetical features don’t guarantee an optimum, but I am actually focused on the count n[l] of such features/dimensions.
Point 2: Agreed, these hypothetical features aren’t entirely interpretable in most cases. I used the word ‘entirely’ because we at least get some sense that the initial layers represent simple features (e.g. edges at different angles in the early layers of a face-recognition network) and the later layers more complex ones.
Let’s extend the intuition to convnets. In an object-detection problem, the image information is initially stored as intensity values represented in a small number of dimensions (i.e. the 3 RGB channels, and other similar representations). Later on, we represent the same information in some hypothetical features, whose count equals the number of filters applied (multiplied by the spatial resolution, of course). We perform max pooling and similar operations to, in a sense, reduce dimensions.
Then, when the volume computed by the conv layers is fed into the dense network, each entry in this hypothetical feature space is treated as a dimension in itself. This is a bit hard to explain in words, but it is quite interesting to imagine how the dimensions might be changing behind the scenes!
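Here is a small TensorFlow/Keras sketch of that; the layer sizes are arbitrary, picked only so the printed shapes are easy to follow:

```python
import tensorflow as tf

# Sketch only: arbitrary layer sizes, chosen to make the shape changes visible.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),              # (n_a, n_b, n_c) = (64, 64, 3)
    tf.keras.layers.Conv2D(8, (3, 3), padding="same",
                           activation="relu"),      # -> (64, 64, 8): n_f = 8 hypothetical features
    tf.keras.layers.MaxPooling2D((2, 2)),           # -> (32, 32, 8): spatial reduction
    tf.keras.layers.Flatten(),                      # -> (8192,): every entry becomes a dimension
    tf.keras.layers.Dense(10, activation="relu"),   # -> (10,)
])
model.summary()   # prints the output shape at every stage
```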
E.g. take an image with resolution (n_a, n_b, n_c), where n_a = width, n_b = height and n_c = number of channels. One intuition here: the total number of features is n_a * n_b * n_c. If n_f filters are applied, the same information can be represented in these n_f new hypothetical features (hence we get an output of shape (n_a, n_b, n_f), ignoring padding and stride effects). But here we are not reducing the dimensions, just changing their representation into new hypothetical features: from pixel intensity in the R, G and B channels to a kind of ‘pixel intensity’ in hypothetical dimensions like vertical edges, horizontal edges and so on (notice there are n_f such features/dimensions here). I say ‘kind of’ because after applying the filters the values are not representing intensity any more, I think.