How to create a custom model to detect and count people in a video, similar to a YOLO model

I want to create a custom model that detects people in images and videos and counts them. Can anyone guide me on how to do that and explain it step by step? It would help me a lot.

Check out the Deep Learning Specialization, especially the courses on computer vision, and also the TensorFlow: Advanced Techniques Specialization. They have some examples of how to do it!
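
For the counting part of the question, here is a minimal sketch of what usually happens after the detector runs on one image or one video frame. It is not taken from those courses. It assumes your model returns a list of detections, each with a class label, a confidence score, and a box in `(x_min, y_min, x_max, y_max)` form. The `Detection` class, `count_people` function, and both threshold values are hypothetical names and defaults chosen for illustration, so adapt them to whatever your model actually outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    """Hypothetical container for one detector output (an assumption)."""
    label: str
    score: float
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_people(detections: List[Detection],
                 min_score: float = 0.5,
                 iou_threshold: float = 0.5) -> int:
    """Count 'person' detections in one frame.

    Keeps confident 'person' boxes, then drops boxes that heavily
    overlap a higher-scoring box (greedy non-maximum suppression),
    so the same person is not counted twice.
    """
    people = sorted(
        (d for d in detections
         if d.label == "person" and d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    kept: List[Detection] = []
    for d in people:
        if all(iou(d.box, k.box) < iou_threshold for k in kept):
            kept.append(d)
    return len(kept)
```

For a video, you would call `count_people` once per frame. Note that this gives a per-frame count; counting unique people across the whole video needs a tracking step on top, which this sketch does not cover.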