Hi:
I am trying to build a speech-to-speech system (take in audio from a microphone → convert speech to text → analyze the text → generate a response → convert text to speech). Has anyone tried something like this? Any tips on how to get started? I was planning to use Whisper for the transcription and OpenAI GPT-4o for the text analysis, but I would love tips on any other systems I should consider.
Thanks,
Hello,
I am just curious, what type of analysis do you plan to do on the text?
It's to simulate a conversation, for example by responding to a question.
Looking for any examples of similar projects that anyone may have done.
I don’t have an example because my work is mostly on computer vision, not speech or text, but why not use a speech model directly?
A few days ago I had a nice conversation with Microsoft Copilot, which takes audio as input and gives audio as output. If you use the API of such a model, there would be less for you to implement.
I don’t know whether Whisper or GPT-4o support audio-to-audio, but I am sure there are multimodal models that do. I suggest checking Hugging Face for that.
I hope that helps.
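For anyone who lands here looking for a starting point, here is a rough sketch of how the pipeline from the original question (speech to text → generate response → text to speech) could be wired together in Python. Everything in it is an assumption for illustration: the `SpeechToSpeech` class, the stage signatures, and the `fake_*` stand-ins are hypothetical and do not call Whisper, GPT-4o, or any real TTS engine. The idea is only to keep each stage behind a plain function so a real model can be swapped in later.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stage signatures (assumptions, not a real library API).
# In a real system these would wrap a speech recognizer, a chat model,
# and a text-to-speech engine respectively.
Transcriber = Callable[[bytes], str]                 # audio in  -> text
Responder = Callable[[List[Tuple[str, str]]], str]   # history   -> reply text
Synthesizer = Callable[[str], bytes]                 # reply text -> audio out


@dataclass
class SpeechToSpeech:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer
    # Conversation history as (role, text) pairs, so the responder
    # can see earlier turns when simulating a conversation.
    history: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn: audio in, audio out."""
        text = self.transcribe(audio_in)
        self.history.append(("user", text))
        reply = self.respond(self.history)
        self.history.append(("assistant", reply))
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any model or microphone.
def fake_transcribe(audio: bytes) -> str:
    # Pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_respond(history: List[Tuple[str, str]]) -> str:
    # Echo the latest user message instead of calling a language model.
    return "You said: " + history[-1][1]


def fake_synthesize(text: str) -> bytes:
    # Pretend the encoded text is the synthesized audio.
    return text.encode("utf-8")


bot = SpeechToSpeech(fake_transcribe, fake_respond, fake_synthesize)
```

Calling `bot.turn(audio_bytes)` runs one full turn. To move toward a real system, each `fake_*` function would be replaced by one that calls the model of your choice while keeping the same signature; a single audio-to-audio model, as suggested above, would replace all three stages at once.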