Neural Network Optimization with Recursion

Hey everyone! I wanted to share a different optimization approach that uses a recursive function instead of a for-loop when working with the layers of a deep neural network.

So, basically, we create a function call stack during forward propagation, compute and return dA[L] in the base case of the recursive function, and then simply backpropagate as the stack unwinds. This approach allows us to use the function call stack as a cache for the variables needed to compute derivatives during backpropagation instead of using a list of variables required to compute gradients.
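To make the idea concrete, here is a minimal sketch of what such a recursive `propagate` could look like. This is not the author's actual code; it assumes ReLU hidden layers, a sigmoid output layer with cross-entropy loss, and parameters stored as `params["W1"]`, `params["b1"]`, and so on. Each stack frame holds its own `A_prev`, `W`, and `Z`, which serve as the "cache" during the unwind:

```python
import numpy as np

def propagate(A_prev, Y, params, l=1):
    """Hypothetical sketch: forward pass on the way down the call
    stack, gradients computed as the recursion unwinds.

    Assumes ReLU hidden layers and a sigmoid output layer; layer l's
    parameters live in params[f"W{l}"] and params[f"b{l}"].
    Returns (dA_prev, grads) for the caller's layer.
    """
    L = len(params) // 2                 # number of layers
    W, b = params[f"W{l}"], params[f"b{l}"]
    Z = W @ A_prev + b                   # linear step for this layer
    m = Y.shape[1]                       # number of examples

    if l == L:                           # base case: output layer
        A = 1.0 / (1.0 + np.exp(-Z))     # sigmoid activation
        dZ = A - Y                       # sigmoid + cross-entropy shortcut
        grads = {}
    else:                                # hidden layer: recurse deeper
        A = np.maximum(0.0, Z)           # ReLU activation
        dA, grads = propagate(A, Y, params, l + 1)
        dZ = dA * (Z > 0)                # ReLU derivative, on the unwind

    # A_prev, W, Z are still in this frame: the call stack is the cache
    grads[f"dW{l}"] = (dZ @ A_prev.T) / m
    grads[f"db{l}"] = dZ.sum(axis=1, keepdims=True) / m
    return W.T @ dZ, grads               # dA for the previous layer
```

The base case computes the output-layer error, and every frame below it finishes its gradient computation as the stack unwinds, exactly as described above.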

The recursive function reduces the layers of abstraction required to implement vanilla deep neural networks with many hidden layers.

This work is inspired by the DLS Course 1, Week 4 videos "The building blocks of deep neural networks" and "Forward and Backward Propagation." The code is here: GitHub - makasimba/Optimization-With-Recursion (feed-forward neural network building blocks).

I suppose you have cross-validated that your code works correctly.

I am not sure whether there is any significant downside to doing it your way, though it does seem to occupy some resources by holding a long stack of calls open. In any case, I have read your propagate function and it is an interesting idea!

I guess you purposefully do the update with a loop after propagate, instead of doing the update within your stack of calls? I mean, you wouldn't have needed to store the dW's if you had updated them right away?

Also, is the speed difference something worth noticing?



Just a quick search turned up a recursion-vs-iteration comparison table. Maybe not all of its rows apply in your case, but it is worth checking rows #5 and #6 to see whether they apply and how significant they are.

However, I think your work and your implementation are already interesting in themselves. I think that's the point.


It's a really interesting and creative idea! You could argue that you are just substituting the increased stack size for the memory that would otherwise be allocated to the cache. But as the reference that Raymond gave us points out, there is more overhead in a stack frame than just the local variables of the function, so you probably do end up using a bit more memory per layer with the recursive implementation. Then there's the question of whether this impairs the generality of the code: since there is technically no limit on the number of layers in a network, the recursion can go arbitrarily deep, but every language environment places some limit on stack size to protect against infinite recursion (the recursive equivalent of an infinite loop). We will soon run into networks with hundreds of layers. I have never checked what limits Python places on stack depth.

Of course the other fundamental question is the conceptual clarity of the code. Does the recursive approach make things clearer? If it does, then maybe a little more memory usage is worth it. As the reference Raymond gave us also points out, recursion costs more CPU than iteration as well, and for training large models, resource issues are non-trivial. I have not yet looked at your code, so all my comments so far are at the general level.

Thanks for the really interesting topic!


I just read through your code, and it does seem simpler than what we built in W4 A1 and A2, although I'm not sure it's a "pure" comparison: your implementation is just the bare code, whereas the notebook includes a lot of other instructions and overhead. But there clearly are fewer layers of functions in your implementation. I haven't thought too hard about it yet, but I think I see why you implemented the "update" as a for loop called from optimize after the propagate step: the point is to preserve the same semantics as the for-loop version. We don't want to update the parameters incrementally in the middle of a given iteration. Then again, maybe we just need to think more carefully about the order in which things happen; it may not matter if we do the updates incrementally while "unwinding" the recursion. More thought required …

So if the code is really simpler (as it seems to be), then we just have to confirm that the resource consumption issues are not a problem. Maybe try some side-by-side comparisons of the CPU and memory cost of training with the two methods.

Thank you for having this idea, carrying it through, and sharing it. You are obviously a sophisticated Python programmer. I also learned some new things about how to use the NumPy PRNG functions by reading your code.


Hi Raymond. The biggest downside is memory, and perhaps speed. The speed difference was surprisingly small the last time I (not so thoroughly) tested it on a small dataset, although I had initially assumed this implementation would be much slower: recursion is typically slower than looping, and we are making many more function calls, which comes with a lot of overhead, as your recursion-iteration comparison table shows. I do think that both memory and speed are definitely worth noting, especially for very deep networks.

Right, the weights for each layer can be updated right away as the function calls unstack. I updated them all at once at the end of each cycle because it follows the familiar flow: initialise parameters, forward prop, compute cost, back prop, and lastly update weights. I thought it would make things clearer if I stuck with this logic, though I'm not sure I succeeded on that front. The code certainly needs to be improved (or completely redone) to clarify things. Please feel free to add, modify, or remove any confusing parts.

This implementation's primary purpose is conceptual clarity. For pedagogical purposes, if the number of hidden layers remains small, it shouldn't be a big issue. I might be wrong. But as Paul nicely sums it up: if the conceptual clarity outweighs the downsides, then this might actually be worthwhile.

I see. @Simbv, I agree with all of those reasons and considerations, and if recursion might sacrifice some code clarity, then can we make up for that with more comments and some clear docstrings? I think those are all options for you. As a reader, I think your current version of the code is clear, and I don't doubt that your next version (if any) will be just as clear :wink:


PS: Just suggestions.


Right, I did create the update function and call it from the optimize function to preserve semantics and hopefully make it easy for people to follow along. As you and Raymond have pointed out, the weights and biases for each layer can be updated on the fly as we backpropagate.

The modification is quite simple really. All we need to do is:

  1. Change the back_propagate function signature to accept the bias (which isn't currently being passed in).
  2. Move the gradient-descent code from the update function into the back_propagate function.
  3. Finally, delete the update function and the code within back_propagate that caches dW and db.
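After those three steps, the result could look roughly like this. Again a hypothetical sketch rather than the repo's actual code, under the same assumptions as before (ReLU hidden layers, sigmoid output, parameters in `params[f"W{l}"]` / `params[f"b{l}"]`), with a learning rate passed down the recursion:

```python
import numpy as np

def back_propagate(A_prev, Y, params, lr=0.01, l=1):
    """Hypothetical sketch: gradient-descent step applied in place as
    the recursion unwinds, so no dW/db cache is needed."""
    L = len(params) // 2
    W, b = params[f"W{l}"], params[f"b{l}"]
    Z = W @ A_prev + b
    m = Y.shape[1]

    if l == L:                        # base case: sigmoid output layer
        A = 1.0 / (1.0 + np.exp(-Z))
        dZ = A - Y                    # sigmoid + cross-entropy shortcut
    else:                             # hidden layer: ReLU, then recurse
        A = np.maximum(0.0, Z)
        dA = back_propagate(A, Y, params, lr, l + 1)
        dZ = dA * (Z > 0)             # ReLU derivative on the unwind

    dA_prev = W.T @ dZ                # compute with this layer's old W
    # Update immediately instead of caching dW and db:
    params[f"W{l}"] = W - lr * (dZ @ A_prev_grad(dZ, A_prev, m))
    params[f"b{l}"] = b - lr * dZ.sum(axis=1, keepdims=True) / m
    return dA_prev                    # only dA flows back up the stack

def A_prev_grad(dZ, A_prev, m):
    """Helper (hypothetical name): dW without the leading dZ factor
    already applied, i.e. A_prev.T / m."""
    return A_prev.T / m
```

One design point worth noting: `dA_prev` must be computed before the parameter update so the gradient flowing to the previous layer uses this layer's pre-update weights, matching the loop version's semantics.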

:smiley: Working on simplifying the code and the documentation now.


Memory and speed are definitely something to look out for as you and @rmwkwok have pointed out.

Then there is the recursion-depth issue. Prof Ng did mention in one of his lectures that Microsoft tested a network with more than 150 layers, so recursion depth is certainly something to think about. And apparently, although I wasn't aware of this until now, Python has a default maximum recursion depth of 1,000 :hushed: which can be increased! :flushed: See "What is the maximum recursion depth in Python, and how to increase it?" on Stack Overflow (and why it's dangerous to do so).
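For reference, the limit can be inspected and raised from within Python itself (the default is 1,000 in CPython; raising it too far trades a catchable `RecursionError` for the risk of a hard interpreter crash from a real stack overflow):

```python
import sys

# Inspect the current recursion limit (typically 1000 in CPython)
print(sys.getrecursionlimit())

# Raise it, e.g. for a very deep network. Use with care: past the OS
# stack size, deep recursion crashes the interpreter outright instead
# of raising a RecursionError.
sys.setrecursionlimit(5000)
```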

'Bare' is a really nice way of putting it. It was truly awful to begin with; I'm sorry about that. I just had the idea, got excited, and went on to implement and share it.

I have added documentation that I hope helps a bit.

I do think that there are probably better ways to structure, implement, and document this whole idea. I hope more people will take a look at the code, add their own ideas, and perhaps produce better implementations.

Comparing the two implementations is a great idea.

And thank you for your kind words, Paul. The course assignments are great btw. :raised_hands: