You can describe the data with a sin(t), cos(t) relation, see:

After transforming the data, they become linearly separable.
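
To make this concrete, here is a minimal sketch. It assumes the data are two noisy concentric rings, which is my guess at the shape and not something taken from your dataset; the radii, noise level, threshold and helper names (make_rings, to_polar) are made up for illustration. Converting (x, y) to polar coordinates (r, t) keeps two dimensions, and a single threshold on r then separates the classes.

```python
import math
import random

def make_rings(n_per_class=200, radii=(1.0, 3.0), noise=0.1, seed=0):
    """Hypothetical data: two noisy rings with x = r*cos(t), y = r*sin(t)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in enumerate(radii):
        for _ in range(n_per_class):
            t = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.uniform(-noise, noise)
            points.append((r * math.cos(t), r * math.sin(t)))
            labels.append(label)
    return points, labels

def to_polar(points):
    """Map (x, y) to (r, t); the dimensionality stays at two."""
    return [(math.hypot(x, y), math.atan2(y, x)) for x, y in points]

points, labels = make_rings()
polar = to_polar(points)

# In (r, t) space a constant threshold on r is a linear decision boundary.
threshold = 2.0  # assumed value, halfway between the two ring radii
predictions = [1 if r > threshold else 0 for r, _ in polar]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)
```

In the original (x, y) space no straight line separates the two rings; after the transformation the line r = 2 does.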

Another way: you can also solve a problem like this with polynomial approaches, see: "The Kernel Trick in Support Vector Classification" by Drew Wilimitis (Towards Data Science)
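
As a rough illustration of the polynomial idea (my own sketch, not code from the linked article), the example below lifts the same kind of hypothetical two-ring data into three dimensions with the degree-2 feature map (x^2, sqrt(2)*x*y, y^2). The weights and bias are chosen by hand for this toy data, not learned by an SVM. In the lifted space the plane x^2 + y^2 = 4 separates the classes.

```python
import math
import random

ring_rng = random.Random(0)

def ring(radius, n=200, noise=0.1):
    """Hypothetical data: one noisy ring around the origin."""
    pts = []
    for _ in range(n):
        t = ring_rng.uniform(0.0, 2.0 * math.pi)
        r = radius + ring_rng.uniform(-noise, noise)
        pts.append((r * math.cos(t), r * math.sin(t)))
    return pts

inner, outer = ring(1.0), ring(3.0)

def lift(x, y):
    """Degree-2 polynomial feature map; its dot product equals (p . q)^2."""
    return (x * x, math.sqrt(2.0) * x * y, y * y)

# Hand-picked linear classifier in the lifted space: x^2 + y^2 - 4 > 0.
w, b = (1.0, 0.0, 1.0), -4.0

def predict(x, y):
    score = sum(wi * fi for wi, fi in zip(w, lift(x, y))) + b
    return 1 if score > 0 else 0

correct = sum(predict(x, y) == 0 for x, y in inner)
correct += sum(predict(x, y) == 1 for x, y in outer)
lifted_accuracy = correct / (len(inner) + len(outer))
print(lifted_accuracy)
```

A kernel SVM does not build these features explicitly; it only evaluates the corresponding kernel. The trade-off against the polar transform is that the working space grows from two to three dimensions.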

When analysing the residuals, you do not want to see any systematic pattern. If the model did a bad job, you would see at least some pattern in the residuals instead of random “white noise”.
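
A small sketch of what such a pattern can look like, using made-up data: a straight line is fitted by ordinary least squares to points that actually follow y = x^2. The data and the helper fit_line are hypothetical. The residuals come out positive at both ends and negative in the middle, which is a systematic pattern and not white noise.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data with a curved relationship the linear model cannot capture.
xs = [float(x) for x in range(-10, 11)]
ys = [x * x for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Long runs of equal sign hint at structure the model has missed.
signs = "".join("+" if r > 0 else "-" for r in residuals)
sign_changes = sum(a != b for a, b in zip(signs, signs[1:]))
print(signs)
print(sign_changes)
```

With residuals that behave like white noise you would expect the sign to flip often and irregularly; here it flips only twice across 21 points.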

This can absolutely help in understanding the data.

Why do you think it would be “better” to add new features?

I think the transformation already makes the data linearly separable without increasing the dimensionality, i.e. in a space of minimal dimension. To me it seemed quite elegant this way. But many approaches solve the issue.

In general: the most suitable approach depends on the data as well as on the business problem you are solving. In practice it is often sufficient to find a solution that is just “good enough”.

Best regards

Christian