Training on Synthetic... Testing on Fuel

I’m working through a data challenge where the training set is entirely synthetic (90,000 records, with truth/labels) and the test set is collected, real-world data (10,000 records, no labels). There are no additional data to augment the training set. I’ve created a dev set out of the training set so I can measure performance (bias/variance).

The problem I’m having is that there is little challenge in reaching near-perfect performance on the training and dev sets (high precision and recall). However, when predictions are submitted for the test set, performance is very low (60%-65% accuracy). Most of the tweaking must be done with little to no feedback from the test data (limited number of submissions); a minimal sketch of my split/evaluation setup is below.

Any ideas on how best to identify where to tweak or generalize a binary classifier when training/dev performance appears to be optimized, but the model is clearly overfitting? (Note: there are around 30 features, 5 categorical and the rest quantitative.)
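To make the setup concrete, here is a minimal sketch of how I split off the dev set and score it; the file names, the `label` column, and the random-forest model are placeholders, not my actual pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Placeholder file/column names; the real challenge data differ.
train = pd.read_csv("synthetic_train.csv")   # 90,000 synthetic records with labels
test = pd.read_csv("real_test.csv")          # 10,000 real-world records, no labels

y = train["label"]
X = pd.get_dummies(train.drop(columns=["label"]))  # one-hot encode the ~5 categorical features

# Hold out a dev set from the synthetic training data for bias/variance checks.
X_tr, X_dev, y_tr, y_dev = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

# Both of these come back near-perfect, since train and dev share the synthetic distribution.
for name, X_, y_ in [("train", X_tr, y_tr), ("dev", X_dev, y_dev)]:
    pred = clf.predict(X_)
    print(f"{name}: acc={accuracy_score(y_, pred):.3f} "
          f"prec={precision_score(y_, pred):.3f} rec={recall_score(y_, pred):.3f}")

# Test predictions can only be scored by submitting them (limited submissions).
X_test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
test_pred = clf.predict(X_test)
```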