I tried Hugging Face, Kaggle, and Google Dataset Search, and I found it really hard to find something specific. Can anybody share any tricks for how to do it effectively?
My current task for example:
I need datasets with contacts in any worldwide format (phones, emails, links, addresses). It is really hard to cover all the ways users can write phone numbers, for example: starting with +1, with spaces, commas, brackets, etc. Or, alternatively, an already trained model.
I know that I can generate random data through different libs, but I am sure that will not be enough. Users can type really different things, including forms we cannot even anticipate.
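To illustrate how many surface forms even one number can take, here is a stdlib-only sketch that renders a single 10-digit US-style number in several common notations (the format list is an illustrative assumption, not exhaustive, which is exactly the problem described above):

```python
import random

def format_variants(digits: str) -> list[str]:
    """Render one 10-digit US-style number in several common surface forms.

    The chosen formats are illustrative assumptions, not an exhaustive list.
    """
    a, b, c = digits[:3], digits[3:6], digits[6:]
    return [
        f"+1{a}{b}{c}",        # E.164-style, no separators
        f"+1 {a} {b} {c}",     # spaces as separators
        f"({a}) {b}-{c}",      # parentheses around the area code
        f"{a}.{b}.{c}",        # dots as separators
        f"{a}-{b}-{c}",        # dashes as separators
        f"1 ({a}) {b} {c}",    # leading country code without "+"
    ]

random.seed(0)
digits = "".join(random.choices("0123456789", k=10))
for variant in format_variants(digits):
    print(variant)
```

And real users will still invent notations that are not on any such list.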
This kind of data is protected by the GDPR and other data regulations around the world.
That might explain the unavailability of this kind of data.
An alternative is to use the Faker lib to generate it.
As I said, data typed by users themselves can come in formats other than those programmed into these libs. I need as many extended variants of the data as possible.
And what about links, for example? A matcher can really fail to differentiate a link from a typo, e.g. when we forget the space after the dot between two sentences.
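That ambiguity is easy to demonstrate with a naive pattern: a missing space after a sentence-final dot makes ordinary prose look like a domain (the regex below is a deliberately simplistic assumption, not a production URL matcher):

```python
import re

# Deliberately naive "domain" matcher: any word.word pair qualifies.
NAIVE_URL = re.compile(r"\b[\w-]+\.[A-Za-z]{2,}\b")

text = "We shipped it.Next week we visit example.com for details."
print(NAIVE_URL.findall(text))  # → ['it.Next', 'example.com']
```

The typo `it.Next` matches just like the real domain, which is why training data needs messy, human-typed examples and not only cleanly generated ones.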
This is something we must get used to. Collecting and processing data is the hard part of data science, my friend.
Many companies even pay a fortune for databases that they cannot assemble in their own data lakes.
This is one of the challenges we will face on our journey.