Neural Network Optimization with Recursion

Hey everyone! I wanted to share a different optimization approach that uses a recursive function instead of a for-loop when working with the layers of a deep neural network.

So, basically, we build up a function call stack during forward propagation, compute and return dA[L] in the base case of the recursive function, and then simply backpropagate as the stack unwinds. This approach lets us use the function call stack as the cache for the variables needed to compute derivatives during backpropagation, instead of keeping an explicit list of cached variables for computing gradients.

The recursive function reduces the layers of abstraction required to implement vanilla deep neural networks with many hidden layers.
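
In rough terms, the idea looks something like the minimal sketch below. This is illustrative only, not the exact code in the repo: it assumes ReLU hidden layers, a sigmoid output layer, cross-entropy loss, and that `parameters[layer]` holds a `(W, b)` tuple for layers numbered 1 to L.

```python
import numpy as np

def propagate(A_prev, parameters, Y, layer=1):
    """One recursive forward/backward pass (illustrative sketch).

    Forward propagation happens on the way down the recursion; dZ[L] is
    computed in the base case; gradients are computed on the way back up,
    using the local variables of each stack frame as the "cache".
    """
    L = len(parameters)
    W, b = parameters[layer]
    Z = W @ A_prev + b
    m = Y.shape[1]

    if layer == L:
        # Base case: output layer.
        A = 1.0 / (1.0 + np.exp(-Z))              # sigmoid
        dZ = A - Y                                # sigmoid + cross-entropy shortcut
        grads = {}
    else:
        A = np.maximum(0, Z)                      # ReLU
        # Recurse: A_prev, Z and W stay alive in this frame and play the
        # role of the explicit caches list in the loop implementation.
        grads, dA = propagate(A, parameters, Y, layer + 1)
        dZ = dA * (Z > 0)                         # ReLU derivative

    grads["dW" + str(layer)] = (dZ @ A_prev.T) / m
    grads["db" + str(layer)] = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                            # handed back to layer - 1
    return grads, dA_prev
```

Forward propagation is the descent into the recursion and the derivatives are computed on the way back up, so there is no separate caches list to manage.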

This work is inspired by watching the DLS Course 1, Week 4 videos "The building blocks of deep neural networks" and "Forward and Backward Propagation." The code is here: GitHub - makasimba/Optimization-With-Recursion: Feed-forward neural network building blocks inspired by this discussion: https://community.deeplearning.ai/t/optimization-with-recursion/264465?u=simbv.

I suppose you have cross-checked your code to confirm that it works correctly.

I am not sure whether there is any significant downside to doing it your way, although it does seem to tie up some resources by holding a long stack of calls open. However, I have read your propagate function and it is an interesting idea!

I guess you purposely do the update with a loop after propagate, instead of doing the update inside your stack of calls? I mean, you wouldn't have needed to store the dW's if you had updated the parameters right away?

Also, is the speed difference something worth noticing?

Raymond

1 Like

Just a quick search:


[table comparing recursion and iteration; source link]

Maybe not all of them apply in your case, but it is worth checking rows #5 and #6 to see whether they apply to you and how significant they are.

However, I think your work and your implementation are already interesting in themselves. I think that's the point.

1 Like

It’s a really interesting and creative idea! You could argue that you are just substituting the increased stack size for the memory that would otherwise be allocated to the cache. But as the reference that Raymond gave us points out, there is more overhead in a stack frame than just the local variables of the function, so you probably do end up using a bit more memory per layer of the network with the recursive implementation. Then there’s the question of whether this impairs the generality of the code: since there is technically no limit on the number of layers in a network, the recursion could need to go arbitrarily deep, but every language environment places some limit on stack size to protect against infinite recursion (the recursive equivalent of an infinite loop). We will soon run into cases with networks with hundreds of layers. I have never checked to see what limits Python places on stack size.

Of course the other fundamental question is the conceptual clarity of the code. Does the recursive approach make things clearer? If it does, then maybe a little more memory usage is worth it. As the reference Raymond gave us also points out, recursion takes more CPU than iteration as well, and for training large models, resource issues are non-trivial. I have not yet looked at your code, so all my comments so far are just at the general level.

Thanks for the really interesting topic!

1 Like

I just read through your code and it does seem simpler than what we built in W4 A1 and A2, although I’m not sure it’s a “pure” comparison: your implementation is just the bare code, as opposed to the notebook, which includes a lot of other instructions and overhead. But there clearly are fewer layers of functions in your implementation. I haven’t thought too hard about it yet, but I think I see the reason that you implemented the “update” as a for loop called from optimize after the propagate step: the point is to make sure we preserve the same semantics as the for-loop version. We don’t want to update the parameters incrementally in the middle of a given iteration. Then again, maybe we just need to think more carefully about the order in which things happen, and perhaps it doesn’t matter if we do the updates incrementally while “unwinding” the recursion. More thought required …

So if the code is really simpler (as it seems to be), then we just have to confirm that the resource consumption issues are not a problem. Maybe try some side-by-side comparisons of the CPU and memory cost of training with the two methods.
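
For example, something along these lines would give a rough side-by-side reading. The `train_recursive` and `train_loop` names below are just placeholders for the two implementations’ entry points, and tracemalloc only tracks Python-level allocations, so treat the numbers as a rough guide:

```python
import time
import tracemalloc

def profile(train_fn, *args, **kwargs):
    """Return (result, elapsed seconds, peak traced memory in bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical entry points: substitute the real training functions.
# _, t_rec, mem_rec = profile(train_recursive, X, Y, layer_dims, num_iterations=1000)
# _, t_loop, mem_loop = profile(train_loop, X, Y, layer_dims, num_iterations=1000)
# print(f"recursive: {t_rec:.2f}s, peak {mem_rec / 1e6:.1f} MB")
# print(f"loop:      {t_loop:.2f}s, peak {mem_loop / 1e6:.1f} MB")
```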

Thank you for having this idea and carrying it through and sharing it. You are obviously a sophisticated python programmer. I also learned some new things about how to use the numpy PRNG functions by reading your code.

2 Likes

Hi Raymond. The biggest downside is MEMORY, and perhaps speed. The speed difference was surprisingly small the last time I (not so thoroughly) tested it on a small dataset, although I had initially assumed this implementation would be much slower: recursion is typically slower than looping, and we are making many more function calls, which comes with a lot of overhead, as your recursion-iteration comparison table shows. I do think that both memory and speed are definitely worth noting, especially for very deep networks.

Right, the weights for each layer can be updated right away as the function calls unstack. I updated them all at once at the end of each cycle because that follows the familiar flow: initialise parameters, forward prop, compute cost, back prop, and lastly update weights. I thought it would make things clearer if I just stuck with this logic. I’m not sure I succeeded on that front, though. The code certainly needs to be improved (or completely redone) to clarify things. Please feel free to add, modify or remove any confusing parts.

This implementation’s primary purpose is conceptual clarity. For pedagogical purposes, if the number of hidden layers remains small, then it shouldn’t be a big issue. I might be wrong. But as Paul nicely sums it up, if the conceptual clarity outweighs the downsides, then this might actually be worthwhile.

I see. @Simbv, I agree with all of those reasons and considerations, and if recursion sacrifices some code clarity, can we make up for that with more comments and some clear docstrings? Those are all options open to you. As a reader, I think your current version of the code is clear, and I don’t doubt that your next version (if any) will be just as clear :wink:

Cheers,
Raymond

PS: Just suggestions.

1 Like

Right, I did create the update function and call it from the optimize function to preserve semantics and hopefully make it easy for people to follow along. As you and Raymond have pointed out, the weights and biases for each layer can be updated on the fly as we backpropagate.

The modification is quite simple really. All we need to do is the following (a rough sketch appears after the list):

  1. Change the back_propagate function signature to incorporate the bias (which isn’t currently being passed in).
  2. Move the gradient descent code from the update function into the back_propagate function.
  3. Finally, delete the update function and the code within back_propagate that caches dW and db.
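
Roughly, the resulting back_propagate might look something like the sketch below (the names, activations, and parameter layout are assumptions for illustration, not the repo’s exact code). The one ordering detail that matters is computing dA_prev from the old W before the layer is overwritten, so the gradients match the loop version exactly:

```python
import numpy as np

def back_propagate(A_prev, parameters, Y, learning_rate, layer=1):
    """Recursive forward/backward pass that applies gradient descent to each
    layer as the call stack unwinds, so dW and db never need to be cached.
    Illustrative sketch: ReLU hidden layers, sigmoid output, cross-entropy,
    and parameters as a dict mapping layer index -> (W, b).
    """
    L = len(parameters)
    W, b = parameters[layer]
    Z = W @ A_prev + b
    m = Y.shape[1]

    if layer == L:
        A = 1.0 / (1.0 + np.exp(-Z))
        dZ = A - Y
    else:
        A = np.maximum(0, Z)
        dA = back_propagate(A, parameters, Y, learning_rate, layer + 1)
        dZ = dA * (Z > 0)

    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ        # use the *old* W before overwriting this layer
    parameters[layer] = (W - learning_rate * dW, b - learning_rate * db)
    return dA_prev
```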

:smiley: Working on simplifying the code and the documentation now.

1 Like

Memory and speed are definitely something to look out for as you and @rmwkwok have pointed out.

Then there is the recursion-depth issue. Prof Ng did mention in one of his lectures that Microsoft tested a network with more than 150 layers, so recursion depth is certainly something to think about. But apparently, and I wasn’t aware of this until now, Python has a maximum recursion depth of 1,000 :hushed: which can be increased! :flushed: What is the maximum recursion depth in Python, and how to increase it? - Stack Overflow (and why it’s dangerous to do so).
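
For reference, checking and raising the limit in CPython is just the following, with the caveat from that thread that each frame still consumes C stack space, so setting the limit too high can crash the interpreter instead of raising a RecursionError:

```python
import sys

print(sys.getrecursionlimit())   # 1000 by default on CPython

# Raise with care. A network with a few hundred layers still fits under the
# default limit anyway, since the recursion makes one call per layer.
sys.setrecursionlimit(3000)
```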

‘Bare’ is a really nice way of putting it. It was truly awful to begin with. I’m sorry about that. I just had the idea, got excited, and went on to implement and share it.

I have added documentation that I hope helps a bit.

I do think that there are probably better ways to structure, implement, and document this whole idea. I hope more people can take a look at the code and add their own ideas and perhaps produce better implementations.

Comparing the two implementations is a great idea.

And thank you for your kind words, Paul. The course assignments are great btw. :raised_hands:

2 Likes