(Multi)variate intuition (?)

So, granted, I realize I am still early in the game on all this-- Thus perhaps this topic arises later, but one question has been kind of bothering me:

The neural nets thus far, to put it in simple terms, seem like serial linear regression with ‘enhancements’.

In many other cases, though, you would not have just (and I am not sure how to label this correctly without confusing it with what Prof. Ng has taught, so I will use {} as my signifier) W^T X + b, but W{1}^T X1 + W{2}^T X2 + W{3}^T X3 + … + b.
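To make this a bit more concrete, here is a rough numpy-style sketch of the two forms I mean (the shapes and numbers are just made up for illustration):

```python
import numpy as np

# what the course shows: one weight vector over the whole input
x = np.array([1200.0, 2.0, 1.0])      # all features lumped into one vector
w = np.array([0.5, -0.1, 0.3])
b = 2.0
z_single = w @ x + b                  # W^T X + b

# what I am wondering about: a separate weight set per group of inputs
x1 = np.array([1200.0])               # group 1 of features
x2 = np.array([2.0, 1.0])             # group 2 of features
w1 = np.array([0.5])
w2 = np.array([-0.1, 0.3])
z_split = w1 @ x1 + w2 @ x2 + b       # W{1}^T X1 + W{2}^T X2 + b
```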

Or like this:

Of course this would make your model way more complicated-- though certainly someone has already tried this (and otherwise I'm not quite sure how you would, upfront, instantiate feature engineering)--

But, I guess my question is has anyone demonstrated that it is better to toss all your data into X, with just one weight set, and then just let the network ‘figure it out’ ?

Again, I just might have not reached that point in my studies yet where this is covered, but trying to think forward.

What you’re missing is that every layer includes a non-linear activation function.

So it isn’t just a chain of linear regressions.
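As a minimal sketch of what one layer actually computes (the activation choice here is just an example):

```python
import numpy as np

def dense_layer(a_prev, W, b, activation=np.tanh):
    """One layer: a linear step followed by a non-linear activation."""
    z = W @ a_prev + b        # the linear-regression-like part
    return activation(z)      # the non-linearity applied element-wise

# stacking two layers: the composite is no longer a linear function of x
rng = np.random.default_rng(0)
x  = rng.normal(size=3)
a1 = dense_layer(x,  rng.normal(size=(4, 3)), np.zeros(4))
a2 = dense_layer(a1, rng.normal(size=(1, 4)), np.zeros(1))
```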


@TMosh Yes, of course I understand that. Perhaps I do not always express things in the best way.

I guess what I am asking is why don’t we do a multivariate version of the same thing (?) And if not, why not (or what makes the network itself somehow ‘better’) ?

Aren’t your B1, B2, etc just weight values?

@TMosh, I mean yes, but at least in a 'classical' situation they are separable-- or, say, separable independent variables-- so you are increasing your degrees of freedom as you work through the equation.

Perhaps the hidden nodes in the network are, somehow, effectively performing the same thing (?).

Again this is an ‘I don’t know’ kind of question, which is why I’m asking.

But to be explicit, yes, your B1, B2, etc. are weight values, but the node at every layer is now dealing with multiple independent variables, and thus separate weight sets.

If you just ‘lump’ the data all on a single weight set you will lose that effect.

Though, as stated, I understand your network calculations would get really complicated.

Perhaps this is just a topic I haven’t gotten to yet, or someone must have already tried this and decided it doesn’t work that great-- So I’m curious as to why.

I’m struggling to understand the details of your idea.

If you are proposing using a different set of weights for every example in the data set, 1) how are you going to train it, and 2) how are you going to avoid overfitting?

@TMosh so ‘no, no, no’-- I mean I am relying heavily on my knowledge of traditional regression here.

I think I need some time to come up with a better example, but just off the cuff let's assume we are trying to predict house prices.

For this let's assume you decide you have two variables you consider important-- the price of the house, and then a classifier (let's say the color of the house, white = 0, black = 1, blue = 2)-- Essentially these are each 'independent' variables, but now you have two datasets heading into the network: X1 - House Price, X2 - House Color.

I am not suggesting a different set of weights for every example-- Rather, what I am saying is you have two sets of weights for all the data going in: W1 (at the first layer, for the price) and W2 (for the color). Now these are kind of 'competing' with one another as the network develops, and of course you adjust your loss and back-prop functions accordingly to account for the fact that you are dealing with two independent input variables.

In contrast, with what we've done so far, maybe you could 'sneak' this info in-- say the house is $100,000, and so you adjust the data to $100,000 for a white house, $100,001 for a black house, $100,002 for a blue house.

I mean maybe it would still figure it out… Yet it doesn’t strike me as the best way to do it-- Plus you are over-biasing your data in the process.

Or you could alternatively say your only X variable (for each example) is [house_price, color]. I guess I am just not sure why we can give up on the independence of the variables this way and it still works (well).
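To spell out the three ways of feeding in the house data that I am contrasting (all numbers invented):

```python
import numpy as np

# Option A: "sneak" the color into the price itself
x_sneaky = np.array([100001.0])              # $100,000 house with color code 1 folded in

# Option B: one input vector, one weight vector
x_vec = np.array([100000.0, 1.0])            # [house_price, color]
w     = np.array([0.5, 2000.0])
b     = 0.0
z_B   = w @ x_vec + b

# Option C: two separate input streams, each with its own weight set
x1, w1 = np.array([100000.0]), np.array([0.5])       # house price
x2, w2 = np.array([1.0]),      np.array([2000.0])    # color (white=0, black=1, blue=2)
z_C    = w1 @ x1 + w2 @ x2 + b
```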

Well, apparently the model can still learn regardless of whether the inputs are all independent or they’re not. Why is that surprising? Maybe it’s suboptimal in some sense, but the question is which is more practical: trying to figure out how to preprocess your data into genuinely independent inputs and then training the model or just letting the model figure it out.


Dear Paul,

Yeah, no, I agree, and as I play around more I might just try this idea out for myself.

Though, in a very distant way, this also reflects a concept I've long been considering-- I mean, I think we all know the human brain, especially in terms of, say, visual representation or fast learning, is way more efficient than AI.

In my mind that is not because we are 'born' with innate knowledge, but rather because our brains do come with a certain evolutionary structure.

Yet if you consider the way most of AI is trained today, the problem is structured as going right to the solution (i.e. ‘Cat ?’ ‘Not Cat ?’).

And though my thought might be more oriented towards 'older' ideas of AI-- that it would be a symbolic rather than an empirical enterprise-- in the end I don't think that view is totally off.

I know Prof. Ng himself strays away from this concept, and I agree, current structures of Neural Networks do not, at all, work like the brain.

I mean, as the famous Oliver Sacks brought up with the condition of prosopagnosia, we don't 'learn' faces the way networks do. We are born and just 'kind of know them'. Unfortunately, people with that condition just 'don't'. Visually, everything is there; they just can't put it together.

In the same way, eventually I actually don’t think the solution is just more and more data-- But rather providing the proper shape of the inputs (and perhaps this itself could be the product of yet another neural net ?).

Yet, what that training set would be, specific to each problem, honestly I have no idea yet. Nor is this really the question here I first asked-- I’m just trying to make sure I understand the ground theory at this point so I can trace its edges.

We’re into pretty deep (pun fully intended) waters here. Of course there are deep divisions among the practitioners in the field as to whether the current “massive data” based Deep Learning can create something that actually qualifies as AGI. And exactly as you say, it’s apparent that the way our actual brains learn things and encode knowledge is pretty clearly not equivalent to how artificial Deep Networks learn things. I’m not a neuroscientist, but my understanding from what I’ve read and heard is that even neuroscientists don’t yet have a model they can all agree on for how the brain actually works, even at the level of memory or recognizing pictures of cats. Let alone something like “consciousness”. What actually is it and what biological mechanisms give rise to it? How can you tell if a bat is conscious or not?

These are fascinating things to think about, but my personal reaction is that you’re getting a bit ahead of yourself here. My suggestion is just to relax a bit and listen and learn what Prof Ng has to show us here. Once you’ve absorbed all of that, you’ll be ready to figure out how you want to take it further. But each of us has our own preferred learning style, so you can do it your way.


No-- I agree, and honestly I am not all that impressed by this ‘hype’ around AGI, or even generative AI.

I am kind of a 'traditionalist'. Thus please note my original question was why we are running these serial regressions with, basically, only one \beta.

I mean if you did that with standard regression on an advanced data set, you’d probably end up with a pretty lousy model…

I don’t understand your point. As Tom said earlier in the conversation, each layer in the network is a non-linear function. We chain those together. “Composition of functions” is the mathematical term. When you compose non-linear functions, the non-linearity “compounds”. The nerd joke way to say it is that it gets “more non-linear”, but of course that’s absurd. But you can quantify it if you consider composing polynomial functions: the degree of the composite function is higher, right? I’m not saying we are using polynomial functions here. Note that I’m pretty sure we already had this conversation, right? It’s got a definite deja vu feel to it, but maybe that was a thread with someone else.

So the “magic” happens by adding more layers. In this way, the model can learn a decision boundary which is essentially arbitrarily complicated. There is the Universal Approximation Theorem, right? Of course that’s the classic mathematical “existence proof”: it gives you absolutely no help actually finding the function or putting any bounds on how expensive it is to compute.
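Just to make the polynomial point concrete, here is a tiny sketch (the particular polynomials are arbitrary):

```python
import numpy as np

# composing two degree-2 (non-linear) functions gives a degree-4 function:
# the non-linearity "compounds" as you stack layers
f = np.poly1d([1, 1, 0])    # f(x) = x^2 + x
g = np.poly1d([2, 0, 1])    # g(x) = 2x^2 + 1

h = g(f)                    # h(x) = g(f(x)); np.poly1d substitutes one polynomial into the other
print(h.order)              # 4
```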

So what are you talking about with being limited to “only one \beta”? If you are talking about the bias term, there is one in each layer, right?

I may be missing something here, but this is traditional multiple-term regression, right?

https://online.stat.psu.edu/stat462/node/132/

Yet here we substitute \beta for our weight vector, no? So I am not at all talking about our bias term.

And I think the conversation you had… was with someone else (?) But I will look up the 'Universal Approximation Theorem', as that is not something I am familiar with…

The deja vu part that I was referring to is not the UAT paragraph, but the previous bit about composing non-linear functions and how to consider the polynomial case as a way to see how composing non-linear functions gives you the ability to express more and more complex functions.

My day is about to get fully booked with an event we need to go to tonight, so I will not be able to read up on multiple term regression and how that applies here or doesn’t. I hope to have something useful to say by tomorrow midday (UTC -7 as of this morning).

Hello @Nevermnd,

Is the following what you have in mind? That there are two sets of weights \beta_1 and \beta_2 which accept the same set of inputs X?

Please feel free to draw a different flow chart if you have something different in mind :wink:

Cheers,
Raymond

@rmwkwok-- So your idea is interesting; but again, my whole point here is a question, not necessarily the proposition of a novel method.

For simplicity's sake I've included only a single node here to be clear, but I am basically wondering why we do this:

image

And not this:

I mean I am not sure (and I had to alter your numbers a bit for understanding). Perhaps researchers have just found the second method works ‘well enough’.

That is what I don’t know…

Hello @Nevermnd,

Thanks for the diagrams!!

No problem with that! At the very least, we have to be on the same page about what you are questioning. If the approach you are asking about is indeed novel, then whether to prove it out is driven by curiosity. If the approach is not novel, then we need to know that it is not novel. That's it!

It seems to me that your second diagram (the one you wonder why we are not doing) is the same as my original diagram without the second node. I erased the second node, put yours next to mine, and got the following:

I think the first thing is to clarify the use of symbols. Let’s examine them.

My X is composed of two x's. In our convention, we use the big X for a set of features, and the small x's for each individual feature. I have two features, house size and house color, so I call them x_1 and x_2 respectively. Each of them takes one number as its value.

You have two features too, but you used big X's for them. Such use may be confusing, because generally we use a small x for a feature and the big X for a set of features. If you refer to other online material, the big X there should be for a set of features too. For example, the below comes from the link you shared:

It has only one feature, so it is simpler than our diagrams, but the spirit is the same - small x for a feature, and the big X for the whole set of features. They use \beta_0 and \beta_1 as the symbols for weights, but in our diagrams, we call them b and w_1 instead.

If we follow the convention, I would change your symbols into:

and this makes the two diagrams the same.

We have one weight (your W_1, or your \beta_1, or my w_1) that is multiplied by the first feature (your X_1 or my x_1), plus another weight (your W_2, or your \beta_2, or my w_2) that is multiplied by the second feature (your X_2 or my x_2), plus a bias (you and I both call it b).

Essentially, your W_1 or your \beta_1 or my w_1 has one number as its value. Similarly, your W_2 or your \beta_2 or my w_2 also has one number as its value.
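A tiny numerical sketch of this, with arbitrary numbers, to show that writing the weights out separately or stacking them into one vector gives the same computation:

```python
import numpy as np

w1, w2, b = 0.5, 2000.0, 10.0        # one number each
x1, x2    = 100.0, 1.0               # e.g. house size and house color

z_separate = w1 * x1 + w2 * x2 + b   # weights written out one by one

w = np.array([w1, w2])               # the same weights stacked into one vector
x = np.array([x1, x2])
z_stacked = w @ x + b

print(np.isclose(z_separate, z_stacked))   # True
```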

So, @Nevermnd, after my careful examination, I think we have a different way of using symbols, but your diagram is not very different from mine. What do you think? If you think my interpretation of your symbols is not quite your way, can you point that out? If you think my interpretation is correct but some changes are needed to your diagram to show another difference, would you mind sharing a new diagram with the difference, using the symbols according to the convention? Following the convention helps all readers understand your ideas quickly!

Cheers,
Raymond

And since our two diagrams are indeed the same, my answer to your question of "why do this but not that" is that we ARE indeed doing that, and not this. We do NOT combine the two features into a single lumped input. We keep them separate, and call them small x_1 and small x_2, almost like your "that" diagram:

image

Cheers,
Raymond


@Nevermnd, this just comes up in my mind and in case it is the centre of your question, please let me know.

One thing that my diagram did not show is the training process. My training process is that my w_1, w_2, and b are trained at the same time, as one model.

I suppose your diagram was doing the same thing. I suppose you were not training a model with one of the two features as y = w_1x_1, then training another model with the other feature as y = w_2x_2, and then combining the two models into y = w_1x_1 + w_2x_2 + b with an extra b to be trained. Such a three-step training process is not what I am supposing you are thinking of.
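In code, the one-model training I have in mind looks roughly like this (a bare-bones gradient-descent sketch with made-up toy data, not course code):

```python
import numpy as np

# toy data: two features per example (e.g. size and color code) and a target
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 2.0]])
y = np.array([3.0, 6.0, 9.0])

w = np.zeros(2)      # w_1 and w_2 ...
b = 0.0              # ... and b, all trained together as one model

lr = 0.01
for _ in range(5000):
    pred = X @ w + b                   # w_1*x_1 + w_2*x_2 + b for every example
    err  = pred - y
    w   -= lr * (X.T @ err) / len(y)   # both weights updated in the same step
    b   -= lr * err.mean()

print(w, b)
```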

If you think there is some difference in the training process that you need to point out, then please share your sequence of the training process.

Cheers!
Raymond
