Representation of a neural network as one mathematical function

If I understood the “Why do we need activation functions?” video correctly, we can represent the function of an output node (layer k) as a combination of the output functions from the previous layer k-1, the output functions of layer k-1 as a combination of the output functions of layer k-2, and so on.

This way we could represent the model as one complex mathematical function with a lot of nested functions.
I found an article on the internet explaining this as well.

If this is the case, then what is the purpose of representing the model as a neural network? Is it only for readability, so that it is easier for us to think about the model, or is there something I am missing?

Even in code, we could write it differently and not stack the layers, right? (It would probably be very complex code, though.)

The point is that the full function from input to output that is represented by a Neural Network is the composition of all the functions that make up the individual layers. So, yes, it is a complex function. But you express it as the composition of all those individual functions. Think about this a bit more: how would you actually write that as a single function without the idea of “composition”? How could you “not stack the layers”? That doesn’t sound like it’s really going to help us understand what’s actually happening, right?

One way I have heard it described, which I found helpful in terms of visualization, is that you can think of the forward propagation as building up an onion one layer at a time. Note that you get two functions at each network layer: the linear combination followed by the non-linear activation. Then you wrap the loss function and the cost function as the final two outer layers of the “onion”. Now when you do back propagation, you can think of it as peeling the onion one layer at a time from the outside in …
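
To make the “onion” picture concrete, here is a minimal Python/numpy sketch (not from the course materials; the shapes, weight values and the choice of sigmoid are assumptions for illustration) of forward propagation as a composition of per-layer functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(a_prev, W, b):
    # one layer of the "onion": z = W @ a_prev + b, then a = g(z)
    return sigmoid(W @ a_prev + b)

def forward(x, params):
    a = x
    for W, b in params:   # wrap one more layer around the previous result
        a = layer(a, W, b)
    return a

# tiny example: 1 input -> 2 hidden units -> 1 output (made-up weights)
rng = np.random.default_rng(0)
params = [(rng.normal(size=(2, 1)), np.zeros((2, 1))),
          (rng.normal(size=(1, 2)), np.zeros((1, 1)))]
print(forward(np.array([[0.5]]), params))
```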


Hi, thanks for the reply.

So if I understood correctly, we represent the model as a neural network only so that we can better understand what is happening, because it is easier for us to reason about the model that way?
Is there any other benefit to representing it as a neural network compared to one complex function?

Btw, that’s an interesting description of a neural network :). It definitely helps with visualization.

It’s not just so that we can understand it better. That is literally “what it is”. Why would you want to write it in some different form? That was the point I was trying to make: what does it even mean to consider it as anything other than the composition of functions? That was the question I asked you: how would you actually write it as one “integrated” function? How would you express that? What would that look like?

So let’s say we have a neural network with one hidden layer that has two nodes and one node in the output layer, and all activation functions are linear.
Then I can represent the model like this: f(x) = w_21*(w_11*x + b_11) + w_22*(w_12*x + b_12) + b, where w_11 and w_12 are the weights in the activation functions of the hidden layer and w_21 and w_22 are the weights in the activation function of the output node.

In my head there are two different representations: one as a graph of nodes, and the other “inline”, as the function above. Both represent a composition of functions.
And when writing Python code I could write it inline like the one above, instead of first calculating the output of each hidden node separately and then feeding those values into the output node’s function.

So maybe the better question is: why wouldn’t I write it in code right away like this: f(x) = w_21*(w_11*x + b_11) + w_22*(w_12*x + b_12) + b?
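
For what it’s worth, here is a minimal sketch of both forms in Python (the weight values are made up); they compute exactly the same thing for this tiny all-linear network:

```python
# Hypothetical weights for the 1-input, 2-hidden-unit, 1-output network above
w_11, b_11 = 0.3, 0.1
w_12, b_12 = -0.7, 0.2
w_21, w_22, b = 1.5, -0.4, 0.05

def f_inline(x):
    # the "one big expression" form
    return w_21*(w_11*x + b_11) + w_22*(w_12*x + b_12) + b

def f_layered(x):
    # the same computation, one node at a time
    a_1 = w_11*x + b_11             # hidden node 1
    a_2 = w_12*x + b_12             # hidden node 2
    return w_21*a_1 + w_22*a_2 + b  # output node

assert abs(f_inline(2.0) - f_layered(2.0)) < 1e-12
```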

@ivanralevic

If this had indeed been the case, then you could have combined the functions in each node and arrived at a final linear equation, which would be the combination of the linear equations from all the neurons.

But the fact that the activation functions are NOT linear means that we cannot linearly combine them and arrive at the final simplified version f(x) = w_21*(w_11*x + b_11) + w_22*(w_12*x + b_12) + b that you have shown.

So, the catch here is the NON-LINEAR activation function at each neuron.
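
To see the collapse concretely, here is a small numpy sketch (with made-up weights; not part of the course materials) showing that composing two purely linear layers gives just another linear function:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=(2, 1))   # hidden layer
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))   # output layer

# the collapsed single linear layer: W = W2 W1, b_c = W2 b1 + b2
W, b_c = W2 @ W1, W2 @ b1 + b2

x = np.array([[0.7]])
print(np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b_c))    # True
```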

I could write it with sigmoid functions:

The mathematical representation would be:
f(x) = w_21*σ(w_11*x + b_11) + w_22*σ(w_12*x + b_12) + b, where σ(z) = 1/(1 + e^(-z))

and the Python code would be:
f(x) = w_21*(1/(1+e**(-(w_11*x + b_11)))) + w_22*(1/(1+e**(-(w_12*x + b_12)))) + b

This is correct, right?
Why wouldn’t I just write it like this in code (other than the obvious reason that it is too complex)?

You applied the sigmoid activation function for the 2 neurons in the hidden layer, but did not apply the sigmoid activation function at the output layer.

Now try applying it at the output layer and see how unwieldy it becomes. And even if you do manage to write it out, let’s not forget that this is the simple case of 1 input, 2 neurons in the hidden layer and one output. How about multiple inputs, multiple neurons in each hidden layer and multiple hidden layers?
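
To make that concrete, here is a minimal numpy sketch of the layered form (the layer sizes are made up and sigmoid is assumed everywhere); changing the network’s shape only means changing the `sizes` list, which the hand-written inline expression cannot do:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    a = x
    for W, b in params:       # same few lines, no matter how deep the network is
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(2)
sizes = [10, 64, 32, 16, 1]   # 10 inputs -> 3 hidden layers -> 1 output
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(size=(10, 1)), params).shape)   # (1, 1)
```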


Yes, this is a completely trivial case. 1 hidden layer with 2 neurons. Typical networks to solve real world problems frequently have hundreds of layers and millions of parameters or more.

You gave the answer yourself. I’d add one more thing:

  1. Yes, it’s too complex.
  2. But more to the point: What does it tell you that is more informative than the layered version?

Ok, understood. That was my question. So basically it is just easier for us to present it as layers and neurons.

Thanks for the answers!

Hello @ivanralevic ,

I think there are two more practical reasons to “modularize” NN layers:

  1. to make recurrent neural network implementations easy
  2. to save computation cost by reusing calculation results, in the spirit of dynamic programming

To illustrate these two reasons, we can look at the following diagram (found via Google) of the LSTM, a famous layer/module for recurrent neural networks:

[Image: diagram of the standard LSTM cell equations]

  1. h_{t} is the result of this LSTM (the last equation), and it is used as the input in the next time step for calculating h_{t+1}. Since the number of time steps is variable (it can be t=2 for this sample but t=100 for the next sample), it is not trivial to express the maths explicitly the way you did.

  2. h_{t-1} is used multiple times in the first 4 equations. “Modularizing” layers is a natural way to reuse h_{t-1} without calculating it 4 times.

Connecting each layer to one and only one next layer is what we always see in the MLS, but we can actually arrange layers in many more ways thanks to the concept of “modularization”.
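
For illustration, here is a minimal sketch of that reuse pattern, using a simplified vanilla RNN cell rather than an LSTM (the sizes and weights are made up): the cell module is called once per time step, and h is computed once and passed along instead of being re-derived inside every equation.

```python
import numpy as np

def rnn_cell(x_t, h_prev, Wx, Wh, b):
    # a simple recurrent cell, just to show the reuse pattern
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(3)
n_x, n_h = 4, 8
Wx = rng.normal(size=(n_h, n_x))
Wh = rng.normal(size=(n_h, n_h))
b = np.zeros((n_h, 1))

T = 100                      # the number of time steps can differ per sample
h = np.zeros((n_h, 1))       # h_0
for t in range(T):
    x_t = rng.normal(size=(n_x, 1))
    h = rnn_cell(x_t, h, Wx, Wh, b)   # same module reused at every time step
print(h.shape)               # (8, 1)
```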

Cheers,
Raymond


@rmwkwok
Hi Raymond,

Great explanation. Thank you very much. It makes sense.

Cheers
