This question is related to computer vision only:
I am confused; I do not really understand what batch normalization does or how it does it over the batches. How does it work across the different mini-batches? And why does it actually help in the end? Would it also be possible not to normalize the data and still get good (or better) results, i.e. is it really necessary?
Sorry for the many questions, but I am confused. Thank you.
I suggest you watch the lectures again. Prof Ng explains all of that. Just as one example of the things you apparently missed: he explains that BN uses exponentially weighted averages, taken across the various minibatches, to compute the BN statistics (the mean and variance that are then used at test time).
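To make that idea concrete, here is a minimal NumPy sketch of a batch norm layer, assuming a simple fully connected case. The class name, `momentum`, and `eps` values are just illustrative choices, not code from the course notebooks: during training each minibatch is normalized with its own mean and variance, while exponentially weighted running averages of those statistics are accumulated and used instead at test time.

```python
import numpy as np

class BatchNorm:
    """Sketch of batch normalization with exponentially weighted running stats."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.momentum = momentum              # weight of the running average
        self.eps = eps                        # avoids division by zero
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            # statistics of the current minibatch (x has shape [batch, features])
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # exponentially weighted averages across minibatches
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # at test time, use the accumulated averages instead of batch statistics
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

The point of the running averages is that a single test example (or a tiny test batch) has no meaningful batch statistics of its own, so the network reuses the averages it accumulated over the training minibatches.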
But the high-level point here is that if you think BN doesn't help in a given case, it's always a perfectly valid experiment to leave it out and see whether you get better or worse results.