Hi guys,
I’m new to ML and have just finished the second course of the Machine Learning Specialization.
I’m looking to create a model that goes through a set of files and detects similar duplicates.
I have a question, though, and I’m not sure which is the better choice: should I create a separate model for each file type, or will a single model for all file types be just as good?
Hi @Assaf1
if I understand correctly, you want to check whether the files are identical content-wise, and if so, you would (manually?) review them and replace the dupes. Right?
Can you abstract away the different file formats?
If so: from my perspective, one model could make sense if you have well-structured data and good knowledge of the features that tell you whether records are unique or duplicates. As a simple example: if you know the data model well and e.g. have a unique identifier, it is easy to check for dupes by counting plus a rule-based judgement. Depending on your domain knowledge, you might derive good features that indicate a duplicate, e.g. by checking for identical content in your files, if this can be done in a computationally efficient way (otherwise by taking samples); see the sketch below.
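For the exact-duplicate case, a content hash can play the role of that unique identifier. Here is a minimal Python sketch of the rule-based idea (not a finished solution); the folder name `my_files` is only a placeholder:

```python
# Minimal sketch: rule-based exact-duplicate detection by hashing file contents.
# The folder name is a placeholder - adapt paths and filtering to your data.
import hashlib
from pathlib import Path
from collections import defaultdict

def find_exact_duplicates(folder: str) -> dict[str, list[Path]]:
    """Group files whose byte content is identical, using a SHA-256 hash as 'unique identifier'."""
    groups = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes that occur more than once, i.e. the duplicate groups
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_exact_duplicates("my_files").items():
        print(digest[:12], [str(p) for p in paths])
```

Note that a plain hash only catches byte-identical files; for “similar” duplicates with slightly different content you need the feature- or embedding-based approaches discussed next.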
I think if you have more unstructured data (like video snippets) and you want to check whether it is already part of your maintained data (like your video database), you could use similarity measures, such as distances between your embeddings (which you obtain after a transformation). If you want to apply signal-processing methods from control and system theory, this thread on calculating a convolution could also be interesting: How to Calculate the Convolution? - #2 by Christian_Simonis .
Here too, a model could work if you manage to get the data into an embedding space (transformed) and make your decision / classification there, e.g. as sketched below.
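As a rough illustration of the embedding route, the sketch below assumes you already have some transformation that maps each file to a fixed-length vector; the `embeddings` dictionary and the 0.95 threshold are placeholders you would have to produce and tune yourself:

```python
# Minimal sketch: flag near-duplicates via cosine similarity between embeddings.
# How the embeddings are computed is left open - this only shows the comparison step.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def near_duplicates(embeddings: dict[str, np.ndarray], threshold: float = 0.95):
    """Return pairs of file names whose embeddings are closer than the chosen threshold."""
    names = list(embeddings)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine_similarity(embeddings[names[i]], embeddings[names[j]]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs
```

This pairwise comparison is quadratic in the number of files, so for large collections you would again want to sample or use a more efficient lookup, in line with the efficiency point above.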
Best regards
Christian