Using a closed-source multilingual embedding model to create parallel datasets for low-resource languages

lukmanaj · November 8, 2023, 1:54am

A few months back, my team and I participated in the CohereAIHack, a hackathon organized specifically for Africans to try out the Cohere multilingual embedding API.

My team settled on using the multilingual embedding model to align sentences in one language (preferably a low-resource language) with their likely translations in English. The idea is that if we can crawl documents in both languages online (e.g., from news sites), we can easily pair up sentences that are translations of each other. The resulting parallel data could also be useful in other NLP use cases.
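To make the idea concrete, here is a minimal sketch of the pairing step, not the exact code from our repo. It assumes every sentence has already been embedded into a shared multilingual space and keeps only pairs that are each other's nearest neighbour and exceed a similarity threshold. The names `align_sentences` and `toy_embed`, the mutual-nearest-neighbour rule, and the `threshold` value are all illustrative assumptions. `toy_embed` uses hand-made vectors as a stand-in for a real call to a multilingual embedding API.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(
    src: List[str],
    tgt: List[str],
    embed: Callable[[List[str]], List[Vector]],
    threshold: float = 0.8,  # assumed cut-off; tune on held-out data
) -> List[Tuple[str, str, float]]:
    """Pair source and target sentences that are mutual nearest neighbours."""
    if not src or not tgt:
        return []
    src_vecs = embed(src)
    tgt_vecs = embed(tgt)
    # sims[i][j] is the similarity of source sentence i to target sentence j.
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    best_tgt = [max(range(len(tgt)), key=lambda j: row[j]) for row in sims]
    best_src = [
        max(range(len(src)), key=lambda i: sims[i][j]) for j in range(len(tgt))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep the pair only if each side picks the other and the score is high.
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((src[i], tgt[j], sims[i][j]))
    return pairs


# Hand-made toy vectors standing in for real multilingual embeddings.
TOY_VECTORS = {
    "Na gode.": [0.90, 0.10, 0.00],
    "Thank you.": [0.88, 0.12, 0.05],
    "Ina kwana.": [0.10, 0.90, 0.00],
    "Good morning.": [0.12, 0.85, 0.10],
    "The market opens at nine.": [0.00, 0.10, 0.95],
}


def toy_embed(sentences: List[str]) -> List[Vector]:
    """Hypothetical embedder; replace with a real embedding API call."""
    return [TOY_VECTORS[s] for s in sentences]


hausa = ["Na gode.", "Ina kwana."]
english = ["Good morning.", "The market opens at nine.", "Thank you."]
pairs = align_sentences(hausa, english, toy_embed)
for src_sentence, tgt_sentence, score in pairs:
    print(f"{src_sentence} <-> {tgt_sentence} ({score:.3f})")
```

The mutual-nearest-neighbour check is there so that an English sentence with no Hausa counterpart (the market sentence in this toy data) is left unpaired instead of being forced onto its closest match.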

I invite you all to check out the project description here.
Also, here is the GitHub repo.
We recently submitted a paper on this to a conference, and I am interested in what the community thinks about the project and in the possible directions we could take it. Any comments and feedback are welcome.

beawal · November 22, 2023, 11:14am

Hausa to English! Wow! Sounds like a great initiative!
Have you tried contacting universities and African studies departments to let them know about this project? I’m sure some students and lecturers who have no clue that it exists would find it super interesting!

lukmanaj · November 22, 2023, 2:50pm

Thank you very much for your thoughtful response and the valuable ideas you’ve shared. I’m pleased to provide the link to our paper, which details the work we’ve undertaken. This paper is a collaborative effort with esteemed researchers from two prominent Nigerian universities. I’m looking forward to any feedback or further discussion this might inspire.
