Using a closed-source multilingual embedding model to create parallel datasets for low-resource languages

A few months back, my team and I participated in the CohereAIHack, a hackathon organized specifically for Africans to try out the Cohere multilingual embedding API.

My team settled on using the multilingual embedding model to align sentences in one language (preferably a low-resource language) with their likely translations in English. The idea is that if we can crawl documents in both languages online (e.g. from news sites), we can embed every sentence and pair up the sentences whose embeddings are closest, since those are likely to be translations of each other. The resulting parallel data could be useful in other NLP use cases.
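To make the pairing step concrete, here is a minimal sketch. It assumes each sentence has already been turned into a vector by the multilingual embedding model, and it keeps a pair only when the two sentences are each other's nearest neighbour and their cosine similarity clears a cutoff. The function names, the mutual-best-match rule, and the 0.8 threshold are illustrative assumptions, not necessarily what the code in our repo does.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual best matches.

    src_vecs / tgt_vecs: lists of embedding vectors, one per sentence.
    threshold: hypothetical similarity cutoff; tune it on held-out pairs.
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Full similarity matrix: rows are source sentences, columns are targets.
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence, and best source for every target.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep the pair only if the choice is mutual and similar enough.
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs
```

In practice `src_vecs` would hold the embeddings of the low-resource-language sentences and `tgt_vecs` those of the English sentences, both obtained from the same multilingual model so that they live in one shared vector space.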

I invite you all to check out the project description here.
Also, here is the GitHub repo.
We recently submitted a paper on this to a conference, and I am interested in what the community thinks about the project and in the possible directions we could take it. Any comments and feedback are welcome.


Hausa to English! Wow! Sounds like a great initiative!
Have you tried contacting universities and African studies departments to tell them about this project? I’m sure some students and lecturers who have no idea it exists would find it super interesting!


Thank you very much for your thoughtful response and the valuable ideas you’ve shared. I’m pleased to provide the link to our paper, which details the work we’ve undertaken. This paper is a collaborative effort with esteemed researchers from two prominent Nigerian universities. I’m looking forward to any feedback or further discussion this might inspire.
