I believe this post contains the answer to your question.

Note that:

Using a bigger model or a different architecture can be seen as a way of changing model capacity.

A hyperparameter is any parameter that we fix during the training process. In that sense, the number of layers is a hyperparameter that changes model capacity, the coefficient of a norm penalty is a hyperparameter that lets us increase or decrease regularization, etc.

An optimization algorithm definitely has an effect on learning, and its learning rate is probably the most important hyperparameter, but we do not consider it an instrument for reducing bias or variance. It determines how fast we converge to some solution.

I think I got confused because I thought orthogonalization meant taking actions to decrease bias (or variance) that would not affect variance (or bias). That came from the analogy Andrew mentioned, something like tuning the height or width of the TV picture without affecting the other.

But from the post you linked to, it seems we cannot be sure of addressing one problem without affecting the other. So it seems like there really isn’t orthogonalization in training ML models. At the beginning, we just have to address the bias problem first, then variance. And maybe after some preliminary results, we choose which problem to address by comparing avoidable bias and variance.

Is my understanding correct?

Also, I realized I spelled the title wrong. Is there a way to change it?

We just can’t do otherwise, because our judgement about high variance is based on an estimate of the difference between the train and dev set errors. That said, we need to achieve a low error on the train set first (that is, address a high bias problem).
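
To make the comparison concrete, here is a minimal sketch of how one might compare avoidable bias and variance from the error numbers. The function name `diagnose` and the `baseline_error` parameter (a stand-in for human-level or Bayes error) are my own assumptions for illustration, and the error values are made up.

```python
def diagnose(train_error, dev_error, baseline_error=0.0):
    """Compare avoidable bias and variance, given error rates in [0, 1].

    baseline_error is an assumed estimate of the best achievable error
    (for example human-level error); it is not something we measure here.
    """
    # Gap between what we achieve on the train set and the assumed baseline.
    avoidable_bias = train_error - baseline_error
    # Gap between dev and train error; only meaningful once train error is known.
    variance = dev_error - train_error
    # Suggest working on whichever gap is larger (ties go to bias).
    focus = "bias" if avoidable_bias >= variance else "variance"
    return avoidable_bias, variance, focus


# Made-up numbers: 8% train error, 10% dev error, 1% assumed baseline.
avoidable_bias, variance, focus = diagnose(0.08, 0.10, baseline_error=0.01)
print(focus)
```

With these numbers the train set gap is the larger one, so bias would be the first thing to work on. Note that the variance estimate depends on the train error, which is why the train set has to be looked at first.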

To my understanding, orthogonalization is just a technique that allows us to observe the effect of each of our actions in isolation.

For example, if we changed model capacity by adding new layers and also used additional features, we wouldn’t be able to tell which of these actions decreased bias. But if we only changed model capacity, we would be able to estimate the effect it caused. The same goes for additional features. That is the high-level idea behind orthogonalization.
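
A rough sketch of that "one change at a time" idea is below. The helper `one_at_a_time` and the config keys `num_layers` and `extra_features` are hypothetical names I chose for illustration; the training and evaluation step is left out on purpose.

```python
def one_at_a_time(baseline, changes):
    """Yield (changed_key, config) pairs, each differing from baseline in one key."""
    for key, value in changes.items():
        config = dict(baseline)  # copy so the baseline itself is never modified
        config[key] = value      # apply exactly one change
        yield key, config


# Hypothetical baseline setup and the two changes from the example above.
baseline = {"num_layers": 2, "extra_features": False}
changes = {"num_layers": 4, "extra_features": True}

variants = list(one_at_a_time(baseline, changes))
for name, config in variants:
    # Each config would be trained and evaluated separately and compared
    # against the baseline, so any change in error can be attributed to `name`.
    print(name, config)
```

Each variant differs from the baseline in a single setting, so whatever happens to the train error can be attributed to that one change.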