One possible way to spread 1 hour of car noise over 10,000 or more hours of clean audio, without the fear of overfitting to the noise, is as follows:
Step 1: Divide the 1-hour noise recording into 60 fragments of 1 minute each.
Step 2: Join those 60 fragments in random order to synthesize noise of longer duration. A single pass over all 60 fragments already has 60! (about 8 x 10^81) possible orderings.
This may not stop the model from learning to recognise the 60 individual fragments, but it will remove the effect of their sequential order, compared with simply repeating the same 1-hour recording 10,000 times.
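
The two steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python: the function names `split_into_fragments` and `shuffled_fragment_stream` are made up for this example, and the noise is a toy list of integers standing in for real audio samples. It assumes each pass uses every fragment once, in a fresh random order.

```python
import itertools
import random


def split_into_fragments(samples, n_fragments=60):
    """Step 1: cut the noise into n_fragments equal pieces (any remainder is dropped)."""
    frag_len = len(samples) // n_fragments
    if frag_len == 0:
        raise ValueError("not enough samples for the requested number of fragments")
    return [samples[i * frag_len:(i + 1) * frag_len] for i in range(n_fragments)]


def shuffled_fragment_stream(fragments, rng):
    """Step 2: yield fragments forever, reshuffling their order on every pass."""
    order = list(range(len(fragments)))
    while True:
        rng.shuffle(order)  # new random permutation for this pass
        for idx in order:
            yield fragments[idx]


# Toy stand-in for the 1-hour recording: 600 integer "samples" (hypothetical data).
noise = list(range(600))
fragments = split_into_fragments(noise, n_fragments=60)

rng = random.Random(0)  # seeded only to make the sketch reproducible
stream = shuffled_fragment_stream(fragments, rng)

# Take three passes' worth of fragments (180 fragments) as the synthesized noise.
track = [s for frag in itertools.islice(stream, 180) for s in frag]
```

Because the stream is a generator, the fragments can be pulled lazily while mixing with the clean audio, so the full 10,000 hours of noise never has to be held in memory at once.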
Any thoughts on that?