W2_A2_Optimal "nudge" dx given for each node in computational graphs

In the Week 2 video “Derivatives with a Computation Graph”, the professor set all “nudges” (for the different dependent variables) to the same arbitrary value da=db=dc=du=dv=0.001. Is there a generic algorithm to select an optimal node-specific “dx” (da, …, dv, etc.) value for each node of the computation graph? For example, which is better for “du”: 1e-1, 1e-3, 1e-100, 1e-100000, or another subjectively “small” positive value? Note that these optimal values depend on the floating point precision of the selected numerical data type and on the functions in the computation graph.

Here Prof Ng is just showing the general idea of how learning is applied. In practice those dx and other “d” values are “gradients”: the derivatives of the loss (cost) function w.r.t. the variable in question. So the “nudge” you get for each variable at each step is based on the direction that specific value needs to move to lower the cost, and by how much. All this is just calculus in action …
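To make “the nudge is the gradient” concrete, here is a toy sketch (my own example, not anything from the course) of gradient descent on a one-variable quadratic cost. The cost function and learning rate are arbitrary choices for illustration:

```python
# Toy illustration: each "nudge" to w is the analytic gradient of the
# cost, scaled by a learning rate, pointing downhill on the cost surface.
def cost(w):
    return (w - 3.0) ** 2          # J(w), minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # dJ/dw, the exact derivative

w = 0.0
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad(w)   # nudge w opposite the gradient

print(round(w, 6))                 # → 3.0
```

So the size and sign of each nudge come from the derivative itself, not from a hand-picked small constant.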

Thank you for your response. But my question was not theoretical (from math and calculus), but practical (from computer science and numerical differentiation). Does TensorFlow use exact (symbolic, closed-form) derivatives in its computation graphs, or does it use numerical approximations (numerical differentiation with finite differences), as was shown in the lecture? In the latter case, I would like to know how it computes an optimal “dx” (a small, finite positive step) for each node in its computation graph. Clearly, this “dx” should not be too big (especially if the second derivative is far from zero) and it should not be too small (to avoid loss of precision during “float” computations). What are the optimal values of “dx” (where the “x”'s vary from node to node)? Which factors does the optimal value of “dx” depend on at each node?
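The tradeoff described here (truncation error grows with h, rounding error grows as h shrinks) is easy to see in a quick experiment. A minimal sketch, using f(x) = exp(x) at x = 1 as an arbitrary test function in float64, where a common rule of thumb puts the near-optimal forward-difference step around the square root of machine epsilon:

```python
import math
import sys

# Forward-difference error for f(x) = exp(x) at x = 1.0, where the
# exact derivative is also exp(1.0).  Truncation error grows with h,
# cancellation/rounding error grows as h shrinks, so the total error
# is smallest at an intermediate h -- roughly sqrt(eps) for float64.
def fd_error(h, x=1.0):
    approx = (math.exp(x + h) - math.exp(x)) / h
    return abs(approx - math.exp(x))

eps = sys.float_info.epsilon          # ~2.2e-16 for float64
h_good = math.sqrt(eps)               # ~1.5e-8, near-optimal step
h_big, h_tiny = 1e-1, 1e-13           # too coarse / cancellation-prone

for h in (h_big, h_good, h_tiny):
    print(f"h = {h:.1e}  error = {fd_error(h):.2e}")
```

The exact optimum also depends on the magnitudes of f and f'' at the point in question, which is part of why a single universal “dx” does not exist.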

I have not actually looked into the details of the automatic differentiation logic in TensorFlow. My understanding is that the major packages (TF, PyTorch, Caffe etc.) do not actually use finite differences: they use automatic differentiation, which applies the chain rule to hand-coded analytic derivatives of each primitive operation, so the result is exact (up to floating point rounding) and no step size “dx” ever needs to be chosen. I’m also sure that they have some folks on staff who know enough Numerical Analysis to allow them to implement the algorithms in a way that works across as wide a range of cases as possible, while still being relatively efficient. I’m sure there are some interesting tradeoffs to be made there. They have coded the analytic derivatives of commonly used functions, e.g. sigmoid and cross entropy loss.
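For the curious, here is a toy sketch of the core idea behind automatic differentiation, using forward-mode dual numbers. This is my own minimal illustration, not how TF or PyTorch are actually implemented (they use reverse mode over a graph of operations), but it shows why no finite-difference step is involved: each primitive op carries its exact derivative rule.

```python
import math

class Dual:
    """Toy dual number (value, derivative) for forward-mode autodiff.
    Every primitive op propagates an exact derivative via the chain
    rule, so no finite-difference step size is ever chosen."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def exp(x):
    # chain rule for the exp primitive: d/dt exp(x(t)) = exp(x) * x'
    return Dual(math.exp(x.val), math.exp(x.val) * x.dot)

# d/dx of f(x) = x * exp(x) at x = 2, seeding with dx/dx = 1
x = Dual(2.0, 1.0)
y = x * exp(x)
print(y.dot)   # matches the analytic answer exp(2) * (1 + 2)
```

The derivative comes out accurate to rounding error, with no “which h is small enough?” question anywhere.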

Here’s one place to start: the PyTorch autograd documentation.

Here’s the equivalent for TensorFlow. If you want to know even more details, TF and PyTorch are open source. At least so I’ve heard. I have not actually tried to read any of that code.

Of course the other high level point to make here is that this question has nothing to do with Deep Learning, right? This is purely a Numerical Analysis question: what’s the best way to implement numerical differentiation? In order to be ML/DL engineers, we don’t need to write that code. Someone else has already written it for us. We don’t need to build a JIT compiler for python either. I’m sure there’s some really interesting work that went into that, but that’s the topic for a completely different course.


Thank you for the shared links and the ideas. I have just found an interesting discussion on the topic: Step size h in the incremental ratio approximation of the derivative. It may be handy for anyone who decides to implement a computational graph from scratch.

Thanks for sharing the link. We will talk about Gradient Checking in Course 2 of this series, which is a way to use a relatively crude form of numerical differentiation to confirm that our back prop code is correct. Here’s a thread that links to more of the math behind the one-sided vs two-sided finite differences that are mentioned in that section of the course.
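For reference, the two-sided (central) difference is the more accurate of the two: its truncation error is O(h²), versus O(h) for the one-sided formula. A small sketch, using sin(x) as an arbitrary test function:

```python
import math

def one_sided(f, x, h):
    return (f(x + h) - f(x)) / h                # truncation error O(h)

def two_sided(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)      # truncation error O(h^2)

x, h = 0.5, 1e-4
exact = math.cos(x)                             # d/dx sin(x) = cos(x)
err1 = abs(one_sided(math.sin, x, h) - exact)
err2 = abs(two_sided(math.sin, x, h) - exact)
print(err1, err2)   # the two-sided error is several orders smaller
```

That extra accuracy is why gradient checking typically uses the two-sided formula even though it costs two function evaluations per parameter.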

It might be worth a few words about Prof Ng’s pedagogical approach here. These courses are designed so that they do not require knowledge of calculus. He doesn’t show the derivations of the formulas for back propagation, but just presents them. There is plenty of material out there for people who have the math background to find these derivations.

Then for each type of neural network that we study (Fully Connected Nets here in Course 1 and Course 2, Convolutional Nets in Course 4, and Recurrent Nets in Course 5), Prof Ng will lead us through the construction of the core parts of the algorithm ourselves in python, including the basics of back propagation. When we’re writing it by hand in python, we do all the derivatives analytically and then just write the code for those. But as soon as we have mastered the core algorithm, he moves on to using the TensorFlow framework for building more complex solutions. That happens for the first time in Course 2 Week 3. After that, all the serious solutions are built using TF, so we no longer have to worry about the derivatives and everything is taken care of for us “under the covers” by the autodiff mechanisms in TF.
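As a concrete example of “doing the derivative analytically and then just writing the code for it”, here is a minimal sketch (my own, not the assignment code) for the sigmoid activation, whose derivative has the well-known closed form s′(z) = s(z)(1 − s(z)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_backward(z):
    # Analytic derivative used when writing back prop by hand:
    # s'(z) = s(z) * (1 - s(z)) -- no numeric step size needed.
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_backward(0.0))   # → 0.25
```

This is the kind of hand-derived formula that the frameworks later take care of for you.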

That’s how things work in real applications as well: the State of the Art these days is so complex that nobody can write all their own code in python. If you are a researcher at the leading edge who is literally creating new algorithms, then you’ll be working in python to prototype things and prove whether they work or not. As soon as you have something that’s proven to work, you publish the paper and then you or someone else writes the code for TF, PyTorch and all the other frameworks to implement your new concept, so that it becomes part of the new SOTA.