As part of my plan to contribute with Spanish machine learning projects, I prepared a dataset in which I tried to solve an issue that bothered me.
Most multilingual datasets using Wiktionary as their source, tend to use its translation system to fetch non English words, what causes a lot of words and definitions being discarded.
To solve it, I created a custom parser of the Spanish WIktionary dumps and used it to create a HuggingFace dataset: carloscapote/es.wiktionary.org.
I’d be happy to have some feedback as I got started with machine learning only a few months ago and this is my first machine learning project so far.