Hi everyone,
I’m a 3rd-year B.Tech student working on an Automatic Speech Recognition (ASR) system for Telugu-English code-switched speech.
While researching, I came across a paper describing a large-scale Telugu-English code-switched corpus (200 hours, 400 speakers, with detailed annotations such as POS tags, language tags, and code-switch points).
I’ve already reached out to the author for access, but I wanted to ask here:
- Has anyone worked with Telugu-English code-switched datasets?
- Are there any publicly available corpora or alternatives I can use?
- Any suggestions for handling code-switching in ASR models (especially data scarcity)?
I would really appreciate any guidance or resources.
Thanks!