Hi, Himanshu.
I have not looked at your code, but one point to make is that perfectly correct code can get NaN for the cost with either sigmoid or softmax output if the activation value "saturates" to exactly 0 or 1: the cross entropy cost then involves log(0), which produces NaN. You can add some logic to your cost calculation to check for that case and avoid getting NaN. Here's a thread which discusses that.
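Just to illustrate the idea, here is a minimal sketch of one way to guard the cost against saturated activations, written for the sigmoid/binary case. The function name, the `eps` value, and the argument shapes are my own choices, not the assignment's code; for softmax you would keep only the `Y * np.log(...)` term.

```python
import numpy as np

def safe_cross_entropy_cost(AL, Y, eps=1e-8):
    """Cross entropy cost that avoids NaN when AL saturates to exactly 0 or 1.

    AL: sigmoid output activations, shape (1, m)
    Y:  true labels, same shape as AL
    eps: small constant (my choice) keeping the arguments of log() away from 0
    """
    m = Y.shape[1]
    # Clip the activations so we never evaluate log(0)
    AL_clipped = np.clip(AL, eps, 1.0 - eps)
    cost = -np.sum(Y * np.log(AL_clipped) + (1 - Y) * np.log(1 - AL_clipped)) / m
    return cost
```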
On the point about setting the random seed in every iteration, that's the way they have us do it in the assignments just for ease of grading, but I think it's a mistake to do that in a "real" system: that is not the intent of dropout. If you fix the seed on every iteration, the same neurons get dropped every time, so you've just built a fixed smaller network. The whole idea is that you want the behavior to be stochastic; if you just wanted a smaller network, you could have used a smaller network. Here's a thread which discusses that point.
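For the sake of illustration, here is a small sketch of inverted dropout the way it's meant to work, with the mask drawn fresh on every forward pass. This is not the assignment code; the function name and `keep_prob` default are my own. The key point is that the RNG is seeded once (or not at all) outside the training loop, so a different set of neurons is dropped each iteration.

```python
import numpy as np

rng = np.random.default_rng()  # seed once outside the loop, not every iteration

def dropout_forward(A, keep_prob=0.8):
    """Inverted dropout on activations A (illustrative sketch only)."""
    # Fresh random mask each call: different neurons are dropped every iteration
    D = rng.random(A.shape) < keep_prob
    # Zero out the dropped neurons and rescale so the expected activation is unchanged
    A_dropped = (A * D) / keep_prob
    return A_dropped, D
```

If you instead re-seeded the RNG inside the loop, `D` would come out identical on every iteration and the stochastic regularization effect would be lost.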