Lots of prior discussion of this topic. Here is one…
Think crisply about what the set of boxes output by the YOLO CNN represents, what non-max suppression accomplishes without considering classes, and what the inclusion of classes would change. What is the benefit of including class at that step? What is the cost? Let us know what you conclude.
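If it helps to experiment, here is a minimal pure-Python sketch that contrasts the two variants on a toy input. It is not the course's implementation: the function names (`iou`, `nms`, `nms_per_class`), the `(x1, y1, x2, y2)` box format, the 0.5 threshold, and the sample boxes are all assumptions made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that ignores class; returns kept indices, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def nms_per_class(boxes, scores, classes, iou_threshold=0.5):
    """Run NMS separately per class, so a box only suppresses same-class boxes."""
    keep = []
    for c in sorted(set(classes)):
        idx = [i for i, k in enumerate(classes) if k == c]
        kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_threshold)
        keep.extend(idx[j] for j in kept)
    return sorted(keep, key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: three heavily overlapping boxes plus one far away.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 11, 51, 51), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7, 0.6]
classes = [0, 0, 1, 2]

agnostic = nms(boxes, scores)                     # class ignored
per_class = nms_per_class(boxes, scores, classes) # class respected

print(agnostic)   # [0, 3]
print(per_class)  # [0, 2, 3]
```

In this toy case the class-aware version keeps box 2, which overlaps box 0 but carries a different class label, and it needs one NMS pass per class instead of a single pass. Try changing the classes and scores and see how that lines up with your answers to the questions above.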