Siamese Networks in object detection

Hi, I was wondering whether I could combine an object detector such as SSD MobileNetV2 with a Siamese network into one single model?

My approach so far has been to detect the objects first, crop the detections, and then feed the crops into the Siamese network. But that means two models plus a cropping step, which is not great for real-time applications.
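To make the structure concrete, here is a rough sketch of the shape of that two-stage pipeline. It is not the real code: `detect_objects` and `embed` are hypothetical stand-ins for the SSD MobileNetV2 detector and one Siamese branch, the image is a plain nested list, and the threshold is an arbitrary assumption.

```python
import math

def detect_objects(image):
    """Hypothetical stand-in for the detector (e.g. SSD MobileNetV2).
    Returns boxes as (top, left, bottom, right); here just two fixed regions."""
    h, w = len(image), len(image[0])
    return [(0, 0, h // 2, w // 2), (h // 2, w // 2, h, w)]

def crop(image, box):
    """Cut the region described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def embed(patch):
    """Hypothetical stand-in for one Siamese branch: patch -> small vector."""
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_detections(image, reference_patch, threshold=1.0):
    """Stage 1: detect. Then crop. Stage 2: embed each crop and compare."""
    ref = embed(reference_patch)
    results = []
    for box in detect_objects(image):          # model 1
        patch = crop(image, box)               # cropping step
        d = distance(embed(patch), ref)        # model 2, once per detection
        results.append((box, d, d < threshold))
    return results

# Toy 4x4 "image" and a 2x2 reference patch
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 5, 5],
         [0, 0, 5, 5]]
for box, d, is_match in match_detections(image, [[1, 1], [1, 1]]):
    print(box, round(d, 3), is_match)
```

The loop shows where the cost comes from: the second model runs once per detection, on top of the detector pass and the cropping.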

Any thoughts on that?