Optimization with Recursion

Hi everyone. I wanted to share an alternative optimization implementation that uses recursion instead of a for-loop when going through the layers of a deep NN.

So basically, we create a function call stack on forward propagation, calculate and return dA[L] in the recursive function's base case, and then simply backpropagate as the frames unstack. This approach lets us use the function call stack as the cache for the variables we need to compute derivatives during backprop, instead of a list of tuples.
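
Here is a minimal sketch of the idea, just to show the shape of it without opening the repo. This is not the actual code in the repository: the names propagate, params and grads, the sigmoid-only layers, and the dictionary layout are my own simplifications.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def propagate(A_prev, Y, params, grads, l, L):
    """Forward-propagate through layer l, recurse to layer l+1, then
    back-propagate as the call stack unwinds. The locals of each frame
    (A_prev, Z, A) play the role of the per-layer cache."""
    W, b = params[f"W{l}"], params[f"b{l}"]
    Z = W @ A_prev + b
    A = sigmoid(Z)

    if l == L:
        # base case: gradient of the cross-entropy cost w.r.t. the output AL
        dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))
    else:
        # recurse into the next layer; its frame does the same thing
        dA = propagate(A, Y, params, grads, l + 1, L)

    # back-prop step for this layer, done while the frame unwinds
    m = A_prev.shape[1]
    dZ = dA * A * (1 - A)                          # sigmoid derivative
    grads[f"dW{l}"] = dZ @ A_prev.T / m
    grads[f"db{l}"] = np.sum(dZ, axis=1, keepdims=True) / m
    return W.T @ dZ                                # becomes dA of the previous layer

# toy usage: a 2-3-1 network on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
dims = [2, 3, 1]
params = {}
for l in range(1, len(dims)):
    params[f"W{l}"] = rng.standard_normal((dims[l], dims[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((dims[l], 1))
grads = {}
propagate(X, Y, params, grads, 1, len(dims) - 1)   # fills grads["dW1"] ... grads["db2"]
```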

The recursive function reduces the amount of code required to implement deep neural networks with many hidden layers, and makes it easier (I hope) to read, understand, and maintain the code.

This implementation was inspired by the W4 videos, "The Building Blocks of Deep Neural Networks" and "Forward and Backward Propagation". Here is the code: GitHub - makasimba/Optimization-With-Recursion (a repo containing code for building a simple feed-forward neural network from scratch).

I suppose you have cross-checked your code to make sure it works correctly.

I am not sure if there is any significant downside to doing it your way. It does seem to occupy some resources by holding up a long stack of calls that are waiting. However, I have read your propagate function and it is an interesting idea!

I guess you purposefully do the update with a loop after propagate, instead of doing the update in your stack of calls? I mean, you wouldn't have needed to store the dW's if you had updated them right away?

Also, is the speed difference something worth noticing?

Raymond

Just a quick search:


[image: a table comparing recursion and iteration (source)]

Maybe not all of them apply in your case, but rows #5 and #6 are worth checking, to see whether they apply here and how significant they are.

However, I think your work and your implementation are already interesting in themselves. I think that's the point.

It's a really interesting and creative idea! You could argue that you are just substituting the increased stack size for the memory that would otherwise be allocated to the cache. But as the reference that Raymond gave us points out, there is more overhead in a stack frame than just the local variables of the function. So you probably do end up using a bit more memory per layer of the network with the recursive implementation. Then there's the question of whether this impairs the generality of the code: since there is technically no limit on the number of layers in a network, the code may need to recurse arbitrarily deep, yet every language environment places some limit on stack size to protect against infinite recursion (the recursive equivalent of an infinite loop). We will soon run into cases with networks with hundreds of layers. I have never checked to see what limits Python places on stack size.

Of course the other fundamental question is the conceptual clarity of the code. Does the recursive approach make things clearer? If it does, then maybe a little more memory usage is worth it. As the reference Raymond gave us also points out, recursion takes more CPU than iteration as well. For training large models, resource issues are non-trivial. I have not yet looked at your code, so all my comments so far are just at the general level.

Thanks for the really interesting topic!

I just read through your code and it does seem simpler than what we built in W4 A1 and A2, although I'm not sure it's a "pure" comparison. Your implementation is just the bare code, as opposed to the notebook, which includes a lot of other instructions and overhead. But there clearly are fewer layers of functions in your implementation. I haven't thought too hard about it yet, but I think I see the reason that you implemented the "update" as a for loop called from optimize after the propagate step: the point is to make sure we preserve the same semantics as the for-loop version. We don't want to update the parameters incrementally in the middle of a given iteration. Then again, maybe we just need to think more carefully about the order in which things happen, and maybe it doesn't matter if we do the updates incrementally while we are "unwinding" the recursion. More thought required …

So if the code is really simpler (as it seems to be), then we just have to confirm that the resource consumption issues are not a problem. Maybe try some side-by-side comparisons of the CPU and memory cost of training with the two methods.
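
Something like this would be a reasonable starting point for that comparison; optimize_recursive and optimize_iterative here are just placeholder names for whatever the two training entry points end up being called:

```python
import time
import tracemalloc

def profile(train_fn, *args, **kwargs):
    """Return (wall-clock seconds, peak Python-level memory in MiB) for one call."""
    tracemalloc.start()
    t0 = time.perf_counter()
    train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # tracemalloc only sees allocations made through Python's allocator,
    # so treat the memory figure as a rough comparison, not an exact one.
    return elapsed, peak / 2**20

# hypothetical usage, assuming both implementations share the same signature:
# for name, fn in [("recursive", optimize_recursive), ("iterative", optimize_iterative)]:
#     seconds, peak_mib = profile(fn, X, Y, layer_dims, num_iterations=1000)
#     print(f"{name}: {seconds:.2f} s, peak {peak_mib:.1f} MiB")
```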

Thank you for having this idea, carrying it through, and sharing it. You are obviously a sophisticated Python programmer. I also learned some new things about how to use the numpy PRNG functions by reading your code.

Hi Raymond. The biggest downside is MEMORY, and perhaps speed. The speed difference was surprisingly small the last time I (not so thoroughly) tested it on a small dataset, although I had initially assumed that this implementation would be much slower: recursion is typically slower than looping in any case, and we are making a lot more function calls, which comes with a lot of overhead, as your recursion-iteration comparison table shows. I do think that both memory and speed are definitely worth noting, especially for very deep networks.

Right, the weights for each layer can be updated right away as the function calls unstack. I updated them all at once at the end of each cycle because it follows the familiar flow - initialise parameters, forward prop, compute cost, back prop, and lastly update weights. I thought it would make things clearer if I just stuck with this logic. I'm not sure I succeeded on that front, though. The code certainly needs to be improved (or completely redone) to clarify things. Please feel free to add, modify, or remove any confusing parts.

This implementation's primary purpose is conceptual clarity. For pedagogical purposes, if the number of hidden layers remains small, then it shouldn't be a big issue. I might be wrong. But as Paul nicely sums it up, if the conceptual clarity outweighs the downsides, then this might actually be worthwhile.

I see. @Simbv, I agree with all of those reasons and considerations, and if recursion might sacrifice some code clarity, then can we make up for that with more comments and some clear docstrings? I think those are all options for you. As a reader, I think your current version of the code is clear, and I don't doubt that your next version (if any) will be just as clear :wink:

Cheers,
Raymond

PS: Just suggestions.

Right, I did create the update function and call it from the optimize function to preserve semantics and hopefully make it easy for people to follow along. As you and Raymond have pointed out, the weights and biases for each layer can be updated on the fly as we backpropagate.

The modification is quite simple really. All we need to do is (rough sketch below):

  1. Change the back_propagate function signature to incorporate the bias (which isn't currently being passed in).
  2. Move the gradient descent code from the update function into the back_propagate function.
  3. Finally, delete the update function and the code within back_propagate that caches dW and db.
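
Roughly, and with the same caveats as before (simplified names, sigmoid-only layers, not the repo's actual code), the merged function might look something like this:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def back_propagate(A_prev, Y, params, l, L, learning_rate=0.01):
    """Forward through layer l, recurse, then update W[l] and b[l] in place
    while the call stack unwinds -- no dW/db cache and no separate update()."""
    W, b = params[f"W{l}"], params[f"b{l}"]
    Z = W @ A_prev + b
    A = sigmoid(Z)

    if l == L:
        # base case: gradient of the cross-entropy cost w.r.t. the output AL
        dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))
    else:
        dA = back_propagate(A, Y, params, l + 1, L, learning_rate)

    m = A_prev.shape[1]
    dZ = dA * A * (1 - A)                               # sigmoid derivative
    dA_prev = W.T @ dZ                                  # uses the pre-update W
    params[f"W{l}"] = W - learning_rate * (dZ @ A_prev.T / m)
    params[f"b{l}"] = b - learning_rate * (np.sum(dZ, axis=1, keepdims=True) / m)
    return dA_prev
```

One detail that I think answers Paul's question about ordering: dA_prev is computed from the local W, which still refers to the pre-update array, so the gradients that layer l-1 sees are the same as in the update-everything-at-the-end version, and the semantics of the loop implementation should be preserved.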

:smiley: Working on simplifying the code and the documentation now.

Memory and speed are definitely something to look out for as you and @rmwkwok have pointed out.

Then there is the recursion depth issue. Prof Ng did mention in one of his lectures that Microsoft tested a network with more than 150 layers, so recursion depth is certainly something to think about. But apparently, and I wasn't aware of this until now, Python has a default maximum recursion depth of 1,000 :hushed: which can be increased! :flushed: See "What is the maximum recursion depth in Python, and how to increase it?" on Stack Overflow (and why it's dangerous to do so).
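
For reference, CPython exposes the limit through the sys module. Raising it only lifts the interpreter's guard; the underlying OS thread stack can still overflow, which is the danger that the Stack Overflow answer warns about:

```python
import sys

print(sys.getrecursionlimit())   # 1000 by default in CPython

# Raise with care: this removes the interpreter's safety check,
# but the OS-level thread stack can still overflow.
sys.setrecursionlimit(3000)
```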

'Bare' is a really nice way of putting it. It was truly awful to begin with. I'm sorry about that. I just had the idea, got excited, and went on to implement and share it.

I have added documentation that I hope helps a bit.

I do think that there are probably better ways to structure, implement, and document this whole idea. I hope more people can take a look at the code and add their own ideas and perhaps produce better implementations.

Comparing the two implementations is a great idea.

And thank you for your kind words, Paul. The course assignments are great btw. :raised_hands:
