How does tf.GradientTape() work?

Hi,

I cannot understand how tf.GradientTape() works. This refers to the third week’s programming assignment “Tensorflow_introduction” in Course 2, in particular this part of the function “model”:

with tf.GradientTape() as tape:
    # 1. predict
    Z3 = forward_propagation(tf.transpose(minibatch_X), parameters)

    # 2. loss
    minibatch_cost = compute_cost(Z3, tf.transpose(minibatch_Y))

I have read the documentation. My understanding is that we should use tape.watch() to record the operations for backpropagation, but the code works without it. I am really lost here. Any help will be appreciated.

Henrikh

I have not previously looked into the matter, but here’s the second full sentence in the text that you see when you click the documentation link that you gave:

Trainable variables (created by tf.Variable or tf.compat.v1.get_variable, where trainable=True is default in both cases) are automatically watched.

So what implications does that have for the case at hand?
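
As a hint, here is a tiny sketch of the difference (the names W and x below are made up purely for illustration):

import tensorflow as tf

# A trainable tf.Variable is watched by the tape automatically.
W = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = W * W                  # no tape.watch() needed
print(tape.gradient(y, W))     # dy/dW = 2 * W = 6.0

# A plain tensor is only watched if you ask for it explicitly.
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
print(tape.gradient(y, x))     # 6.0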

Thanks. I think I have the answer now. I missed that sentence.

Indeed, a variable is trainable by default unless synchronization is set to ON_READ, in which case trainable defaults to False. I found this in the description of Args in the same documentation link.

For tf.VariableSynchronization, ON_READ indicates that the variable will be aggregated across devices when it is read (e.g. when checkpointing or when evaluating an op that uses the variable), as it is written here.
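
A quick way to check that default, I think, is just to construct the variables and look at the trainable attribute (a small experiment of mine, not from the assignment):

import tensorflow as tf

v = tf.Variable(1.0)
print(v.trainable)      # True -> automatically watched by GradientTape

temp = tf.Variable(1.0,
                   synchronization=tf.VariableSynchronization.ON_READ,
                   aggregation=tf.VariableAggregation.MEAN)
print(temp.trainable)   # False -> not watched unless trainable=True is passed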

My understanding of this aggregation behavior is that it is impossible to track a variable that is computed across multiple GPUs, for example. Do you think I am right about this?

Henrikh

I have not done any research on this point, but just on general principles that can’t be a true statement. That would mean that you simply can’t train a model using multiple GPUs, right? Training depends on gradients, so there must be a way to make that work. Well, maybe you can do it by managing the “distribution” of the computation at the level of your minibatches, so that any given minibatch is trained on one GPU. But then your minibatches would be processed in parallel, which is not really the way minibatch gradient descent normally works. You could then average all the different minibatch gradients and apply them to the “current” model that was the input to all those parallel minibatches. A little awkward, but maybe it would work.

If TF supports training on multiple GPUs, there has to be a way to make this work.

I think you need to dig deeper in the documentation. Let us know what you find! :nerd_face:

Very interesting. For sure I will dig deeper and will post here once I find something.

Thanks for your time!

Now that I think more about the scenario I suggested above, I think it works. After all, the gradients are averages across the samples in any case. Suppose you have 16 GPUs: even if TF didn’t natively support multiple GPUs, you could just split your training set into 16 equal-sized chunks.

You have 16 separate TF processes, each with its own private GPU, and one master thread to control everything. One epoch looks like this:

  1. Master broadcasts the current model to all 16 “trainer threads”
  2. Each thread computes the gradients for its chunk of the dataset.
  3. When the threads finish the training pass, the master collects the 16 averaged gradients and averages the averages. As long as all the batches are the same size, that works.
  4. The master applies the total averaged gradients to the model. (A rough code sketch of this loop follows the list.)
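
Here is a toy single-process sketch of that epoch, just to show the arithmetic of “averaging the averages” (the tiny linear model, the optimizer and the chunking are all made up; there is no real multi-GPU dispatch here):

import tensorflow as tf

# Toy stand-in for "the model": one weight matrix, least-squares cost.
w = tf.Variable(tf.zeros([3, 1]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def chunk_gradients(X_chunk, Y_chunk):
    # What each "trainer thread" would do on its own GPU.
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(tf.matmul(X_chunk, w) - Y_chunk))
    return tape.gradient(cost, [w])

# Fake dataset split into 16 equal-sized chunks.
X = tf.random.normal([160, 3])
Y = tf.random.normal([160, 1])
chunks = zip(tf.split(X, 16), tf.split(Y, 16))

# "Master" side of one epoch: gather the 16 per-chunk gradients,
# average the averages, and apply them to the current model.
all_grads = [chunk_gradients(Xc, Yc) for Xc, Yc in chunks]
averaged = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*all_grads)]
optimizer.apply_gradients(zip(averaged, [w]))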

But I just googled “does tensorflow support multiple GPUs” and got a bunch of hits. One is this page about distributed operations in TF.

I think in each of the 16 threads the variables (e.g. the weight matrices) should be traced. So these variables will be trainable during the execution of every thread, and the gradients will be evaluated with respect to these variables.
But there could be a situation where a specific variable is aggregated across devices and should not be trainable. I could not come up with such an example myself, but here, under VariableSynchronization, there is a code snippet for a temporary gradient:

# From the VariableSynchronization docs: a non-trainable ON_READ variable
# that is averaged across devices when it is read.
temp_grad = [tf.Variable([0.],
                         trainable=False,
                         synchronization=tf.VariableSynchronization.ON_READ,
                         aggregation=tf.VariableAggregation.MEAN)]

Also, here there is an example of how to use tf.distribute.Strategy. I think it is related to the link in your last message.
Perhaps it will help me understand better how distributed training works.
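
From a first reading, the basic pattern in that example seems to be something like the following (just a sketch; the small Keras model and the random data are placeholders that I made up):

import tensorflow as tf

# Variables created inside strategy.scope() are mirrored across the
# available GPUs; model.fit() then handles the per-replica gradient
# computation and the aggregation of the results.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
    model.compile(optimizer="sgd", loss="mse")

X = tf.random.normal([160, 3])
Y = tf.random.normal([160, 1])
model.fit(X, Y, batch_size=16, epochs=1)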

I will update this thread as soon as I have something new.

Thanks for your detailed explanation.

Thanks for following up on this. I just skimmed the first page of that link that I gave in my earlier reply and maybe it’s worth explicitly saying that I think the whole point of “distributed operations” in TF is exactly the scenario that I laid out in my hypothesis above about how this could work. You are only using the TF automatic gradient propagation mechanisms in each individual “single GPU” worker thread. Then at the global level, you just manually gather the gradients from the worker threads, average and then apply them to the global copy of the model. Then “rinse and repeat” …
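
If you want to see that in code, I believe the “custom training loop” version of that guide boils down to roughly the following (a sketch only; the tiny model, batch size and loss are mine, purely for illustration):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 16

with strategy.scope():
    # Tiny stand-in model; its variables are mirrored on every device.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def step_fn(x, y):
    # Each replica ("worker") runs this on its own slice of the global batch,
    # using an ordinary GradientTape, exactly as in the single-GPU case.
    with tf.GradientTape() as tape:
        per_example_loss = tf.reduce_mean(tf.square(model(x) - y), axis=-1)
        loss = tf.nn.compute_average_loss(per_example_loss,
                                          global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    # The strategy aggregates the per-replica gradients across devices
    # inside apply_gradients before the update happens.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_step(x, y):
    per_replica_loss = strategy.run(step_fn, args=(x, y))
    # Gather the per-replica results back on the "master" side.
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([160, 3]), tf.random.normal([160, 1])))
dataset = dataset.batch(GLOBAL_BATCH_SIZE)

for x, y in strategy.experimental_distribute_dataset(dataset):
    print(distributed_step(x, y).numpy())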

Let us know what more you learn on all this!

Regards,
Paul