Art Generation with Neural Style Transfer: C4 Week 4 Exercise 6 - train_step

I was experimenting with my own content and style pictures, but the error below showed up in train_step, even though earlier, when I ran the assigned exercise with the provided images, there was no error. Can anyone explain why?


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [37], in <cell line: 7>()
      1 ### you cannot edit this cell
      2 
      3 # You always must run the last cell before this one. You will get an error if not.
      5 generated_image = tf.Variable(generated_image)
----> 7 train_step_test(train_step, generated_image)

File /tf/W4A2/public_tests.py:89, in train_step_test(target, generated_image)
     87 print(J1)
     88 assert type(J1) == EagerTensor, f"Wrong type {type(J1)} != {EagerTensor}"
---> 89 assert np.isclose(J1, 25629.055, rtol=0.05), f"Unexpected cost for epoch 0: {J1} != {25629.055}"
     91 J2 = target(generated_image)
     92 print(J2)

AssertionError: Unexpected cost for epoch 0: 27707.96875 != 25629.055

The expected cost is hard-coded in the test for the specific images the notebook uses.

If you change the images, do you agree that the cost may be a different value? train_step computes J = 10 * J_content + 40 * J_style (you can see alpha=10, beta=40 in the traceback further down), so with different content and style images, J_content and J_style change, and the hard-coded assertion fails even when your train_step is correct.

I also have a problem with Exercise 6. This one throws an error seemingly before any student code is invoked, before execution arrives at train_step. Any guidance from the instructors/moderators would be appreciated:

ValueError: No gradients provided for any variable: (['Variable:0'],). Provided `grads_and_vars` is ((None, <tf.Variable 'Variable:0' shape=(1, 400, 400, 3) dtype=float32>),).

The entire traceback is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [30], in <cell line: 7>()
      1 ### you cannot edit this cell
      2 
      3 # You always must run the last cell before this one. You will get an error if not.
      5 generated_image = tf.Variable(generated_image)
----> 7 train_step_test(train_step, generated_image)

File /tf/W4A2/public_tests.py:86, in train_step_test(target, generated_image)
     82 def train_step_test(target, generated_image):
     83     generated_image = tf.Variable(generated_image)
---> 86     J1 = target(generated_image)
     87     print(J1)
     88     assert type(J1) == EagerTensor, f"Wrong type {type(J1)} != {EagerTensor}"

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File /tmp/__autograph_generated_file4vjy1r38.py:16, in outer_factory.<locals>.inner_factory.<locals>.tf__train_step(generated_image)
     14     J = ag__.converted_call(ag__.ld(total_cost), (ag__.ld(J_content), ag__.ld(J_style)), dict(alpha=10, beta=40), fscope)
     15 grad = ag__.converted_call(ag__.ld(tape).gradient, (ag__.ld(J), ag__.ld(generated_image)), None, fscope)
---> 16 ag__.converted_call(ag__.ld(optimizer).apply_gradients, ([(ag__.ld(grad), ag__.ld(generated_image))],), None, fscope)
     17 ag__.converted_call(ag__.ld(generated_image).assign, (ag__.converted_call(ag__.ld(clip_0_1), (ag__.ld(generated_image),), None, fscope),), None, fscope)
     18 try:

File /usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py:640, in OptimizerV2.apply_gradients(self, grads_and_vars, name, experimental_aggregate_gradients)
    599 def apply_gradients(self,
    600                     grads_and_vars,
    601                     name=None,
    602                     experimental_aggregate_gradients=True):
    603   """Apply gradients to variables.
    604 
    605   This is the second part of `minimize()`. It returns an `Operation` that
   (...)
    638     RuntimeError: If called in a cross-replica context.
    639   """
--> 640   grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
    641   var_list = [v for (_, v) in grads_and_vars]
    643   with tf.name_scope(self._name):
    644     # Create iteration if necessary.

File /usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/utils.py:73, in filter_empty_gradients(grads_and_vars)
     71 if not filtered:
     72   variable = ([v.name for _, v in grads_and_vars],)
---> 73   raise ValueError(f"No gradients provided for any variable: {variable}. "
     74                    f"Provided `grads_and_vars` is {grads_and_vars}.")
     75 if vars_with_empty_grads:
     76   logging.warning(
     77       ("Gradients do not exist for variables %s when minimizing the loss. "
     78        "If you're using `model.compile()`, did you forget to provide a `loss`"
     79        "argument?"),
     80       ([v.name for v in vars_with_empty_grads]))

ValueError: in user code:

    File "<ipython-input-29-76de60c7f5da>", line 32, in train_step  *
        optimizer.apply_gradients([(grad, generated_image)])
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 640, in apply_gradients  **
        grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/utils.py", line 73, in filter_empty_gradients
        raise ValueError(f"No gradients provided for any variable: {variable}. "

That “no gradients” message usually means that you have included a numpy function in the compute graph. If you are using TF with automatic gradients, all the functions need to be TF functions, because those are what carry the gradient logic. One simple case that has tripped me up in the past is transpose: if you write A.T, you get the numpy version of transpose. You need to use tf.transpose instead.
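
To make that concrete, here’s a minimal toy sketch of the failure mode (my own example, not the assignment’s code): any detour through numpy inside the tape produces a None gradient, which is exactly what apply_gradients then complains about.

```python
import tensorflow as tf

x = tf.Variable(tf.ones((2, 3)))

with tf.GradientTape() as tape:
    # Detour through numpy: .numpy() leaves TF's autodiff graph,
    # so the tape cannot trace the operation
    y_np = x.numpy().T
    loss_bad = tf.reduce_sum(tf.convert_to_tensor(y_np) ** 2)
print(tape.gradient(loss_bad, x))   # None -> "No gradients provided..."

with tf.GradientTape() as tape:
    # Pure TF op: stays on the tape, so gradients flow
    loss_good = tf.reduce_sum(tf.transpose(x) ** 2)
print(tape.gradient(loss_good, x))  # a real gradient tensor (2 * x)
```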

Thanks, Paul. This was the key to solving a vexing problem. In this case, it wasn’t caused by a transpose. I wish I could say more without giving away details of the exercise. Now it’s on to the Sequence Models course!

1 Like

It’s great news that you found the solution based on that suggestion. Was the issue actually a numpy operation in one of the functions? It would be interesting to know which function and which numpy operation, if you can describe that without showing the actual code.

The case in which I had previously seen this was in the C2 W3 assignment, with the transposes in compute_total_loss. I tried to construct a failing example in the Neural Style Transfer assignment by playing with the transposes in the gram_matrix function, but it ended up being too baroque, since all the operands are TF tensors from the get-go. The other thing I discovered in that exercise is that it’s hard to write TF code that works in both “eager” mode and the classic “graph” mode. It was unclear to me a priori, but it turns out that in this notebook the code is actually executed both ways. :weary_cat:

Hi Paul. The problem ultimately was that I needed to recast a Python-list input to a function as a tensor. It turns out that a function called by train_step preserves the input’s type in its return value: list in, list out, which left later functions choking because, as you said, a plain list has no gradient. This was a surprise, since I initially trusted a function that had already passed an earlier “assert” test. The second surprise was that tf.GradientTape interferes with breakpoints in editors like PyCharm, making it appear that the exercise failed outside the with tf.GradientTape() block, when it really failed inside it. Sometimes I download the exercise files and run the exercise in a full development environment, where I can set breakpoints and really see what’s going on.
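
For other students who hit the same thing, here’s a hypothetical sketch of the pattern (function names and shapes are made up, not the assignment’s): returning plain Python values detaches them from the tape, and recasting to a tensor restores the gradient chain.

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])

def layer_costs_bad(v):
    # List in, list out, with plain Python floats: every element
    # is detached from the gradient tape
    return [float(e) for e in tf.unstack(v)]

def layer_costs_good(v):
    # The fix: recast the list as a tensor so it stays on the tape
    return tf.convert_to_tensor(tf.unstack(v)) ** 2

with tf.GradientTape() as tape:
    J_bad = tf.constant(sum(layer_costs_bad(x)))
print(tape.gradient(J_bad, x))    # None -> "No gradients provided..."

with tf.GradientTape() as tape:
    J_good = tf.reduce_sum(layer_costs_good(x))
print(tape.gradient(J_good, x))   # tf.Tensor([2. 4. 6.], ...)
```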

2 Likes

Interesting! Thanks for the detailed explanation. Yes, that sounds like it must have been pretty challenging to sort out, even with the clue about non-TF operations.

I’ve never tried running a debugger on a TF function. Maybe that point about the breakpoints being misleading in train_step is caused by the fact that the training logic is not running in “eager” mode, but in the original TF “graph” mode. I hadn’t noticed previously that that was what they were doing there.
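
One debugging tip that may help here (a standard TF 2.x switch, not anything specific to this notebook): you can force functions decorated with @tf.function to run eagerly, so breakpoints inside train_step land where you expect.

```python
import tensorflow as tf

# Force @tf.function bodies to execute eagerly: slower, but a debugger
# can then step through train_step line by line
tf.config.run_functions_eagerly(True)

# ... step through train_step under the debugger here ...

tf.config.run_functions_eagerly(False)  # restore normal graph compilation
```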

I’m afraid that I don’t yet appreciate the difference between the eager and graph modes. More studying to do! I have to say, the exercises in these last two courses require a little more original thinking and trial-and-error than the first two. I wonder if that’s by accident or design.

They don’t really discuss “graph” mode here. In the original version of these courses (circa 2017), TF did not yet support “eager” mode, so they had no choice but to teach graph mode. In “graph” mode, you first define the compute graph and then execute it as a separate step. It takes a bit more code and it’s pretty clunky, but it’s more efficient at runtime. There was a major rewrite of DLS C2, C4 and C5 in April 2021 to upgrade to TF 2.x, and at that point they switched to using eager mode essentially everywhere.
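
For the curious, a minimal sketch of that two-step “define, then execute” workflow, using the TF1 compatibility API that still ships with TF 2.x:

```python
import tensorflow as tf

# Classic "graph" mode: build the compute graph first, run it later
tf.compat.v1.disable_eager_execution()

a = tf.compat.v1.placeholder(tf.float32, name="a")  # graph input
b = a * 2.0                       # adds a node; computes nothing yet

with tf.compat.v1.Session() as sess:
    print(sess.run(b, feed_dict={a: 3.0}))  # 6.0: execution happens here
```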

It’s probably not worth worrying about that much more at this point, but the TF documentation site is pretty epic and I’m sure they’ve got tutorials and explanations if you want more info on that mode.

Yes, the assignments get noticeably deeper as we go through the courses here, culminating in the Transformers section in Week 4 of DLS C5. There be Dragons! :weary_cat:

1 Like

Thanks for all the info! One more thought that might help other students. For those who want to run the exercises locally: C4 exercise one (face recognition) has a major TensorFlow version-compatibility problem between the online notebook’s TensorFlow 2.3.0 and later versions like 2.12.0. The pre-trained model for the exercise won’t load using the exercise code as-is; it seems some model-loading methods from 2.3.0 (and earlier) have since been deprecated. Maybe Google doesn’t have enough resources to ensure backward compatibility? :slight_smile:
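
If anyone wants to sidestep those deprecations rather than port the loading code, the simplest fix I know of is to pin the notebook’s TF version locally (the pin below reflects my setup, not official instructions):

```python
# In a fresh environment first: pip install tensorflow==2.3.0
import tensorflow as tf

# The face-recognition notebook's pre-trained model was saved under
# TF 2.3.0; newer releases may refuse to deserialize it
assert tf.__version__.startswith("2.3"), f"Got TF {tf.__version__}"
```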

Yes, coming from the C/Unix/Linux world, I’ve always found it a bit odd that in this whole python/ML/DL space, backwards compatibility of APIs just doesn’t seem to be “a thing”. Oh, well.

But perhaps because of that, there is an ecosystem for maintaining a bunch of specific version recipes in parallel. There may be more than one such system, but the one I’m familiar with is Anaconda/Conda. Here’s a thread that will get you started down the road of extracting the specific versions used in a given notebook and then duplicating that environment. The waters are pretty deep and there are too many possible situations for anyone here to maintain a globally valid set of instructions covering all cases, but you can use that thread as a starting point and then develop your own “chops” as you go. The problems literally never end, so you need to become self-supporting. StackExchange is your friend. :grinning_face_with_smiling_eyes:

2 Likes