I’m starting to design a system similar to the Jarvis assistant from the Iron Man movies.
I will be the only client, so there is no complexity in handling concurrent requests.
So far, I’m looking to build multiple RAG images based on open-source LLMs, hosted in a private cloud.
My requests will be converted to text by speech-to-text on my edge devices, and my server’s responses converted back to speech by text-to-speech on the same devices.
I will have a system in between the RAG servers and the edge devices that spawns/switches RAGs depending on keywords (a routing sketch follows the examples). Example:
“Switch to physics context”
“Switch to python context”
“Switch to construction laws context”
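A minimal sketch of that keyword-based router, assuming each context runs as its own RAG endpoint (the context names and URLs are placeholders I made up, not part of the design):

```python
# Keyword-based context router (sketch). Switch commands change the
# active context; everything else is forwarded to the active endpoint.
CONTEXTS = {
    "physics": "http://rag-physics.internal:8000",
    "python": "http://rag-python.internal:8000",
    "construction laws": "http://rag-construction.internal:8000",
}

class ContextRouter:
    def __init__(self, default: str = "python"):
        self.active = default

    def handle(self, transcript: str) -> str | None:
        """Return the endpoint to forward to, or None if the
        transcript was a context-switch command."""
        text = transcript.lower().strip()
        if text.startswith("switch to") and text.endswith("context"):
            requested = text[len("switch to"):-len("context")].strip()
            if requested in CONTEXTS:
                self.active = requested
                return None  # command consumed, nothing to forward
        return CONTEXTS[self.active]
```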
Now I’m thinking of deploying this system on my phone and computer so I can talk to it everywhere, and potentially on IoT devices so it has access to the speakers in my home.
For the STT and TTS systems, I’m looking at Mozilla DeepSpeech.
For the RAG foundation LLMs, I’m looking at Llama-like models.
For the frontend application, I’m not sure yet, as I need something cross-platform that can handle the STT and TTS pipelines.
For the STT, I can easily fine-tune the model with labelled samples of my own voice.
Although it seems like quite a lot of work, it doesn’t seem impossible, and it doesn’t seem like costs will be that high. Even a small GPU should be able to handle a single client.
What are the potential roadblocks or things I’m missing here that could make this too hard or impossible for personal/one-person use?
I hit a couple of roadblocks.
I started with the STT part. DeepSpeech isn’t maintained anymore, so I went with OpenAI’s Whisper instead. But Whisper doesn’t have streaming capability out of the box, so I had to build that myself.
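For anyone hitting the same wall: pseudo-streaming is usually done by re-transcribing a growing/sliding audio buffer. A naive sketch, assuming the open-source openai-whisper package and a 16 kHz mono microphone via sounddevice (the chunk sizes and overlap are arbitrary starting points, not tuned values):

```python
import queue

import numpy as np
import sounddevice as sd
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small model keeps latency tolerable
audio_q = queue.Queue()

def on_audio(indata, frames, time, status):
    audio_q.put(indata[:, 0].copy())  # mono float32 chunks at 16 kHz

buffer = np.zeros(0, dtype=np.float32)
with sd.InputStream(samplerate=16000, channels=1, dtype="float32",
                    callback=on_audio):
    while True:
        buffer = np.concatenate([buffer, audio_q.get()])
        if len(buffer) >= 16000 * 3:  # re-transcribe every ~3 s of audio
            result = model.transcribe(buffer, fp16=False, language="en")
            print(result["text"])
            buffer = buffer[-16000:]  # keep ~1 s of overlap for context
```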
Currently I’m running the Whisper transcription on a server, and my client connects to it.
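The server side can stay very small. A sketch of the kind of endpoint I mean, assuming FastAPI and clients that upload 16 kHz mono WAV clips (the route name is arbitrary):

```python
# server.py -- run with: uvicorn server:app --host 0.0.0.0 --port 8000
import io

import soundfile as sf
import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = whisper.load_model("base")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    data, sr = sf.read(io.BytesIO(await file.read()), dtype="float32")
    if data.ndim > 1:  # downmix stereo to mono
        data = data.mean(axis=1)
    # Whisper assumes 16 kHz input when given a raw array.
    return {"text": model.transcribe(data, fp16=False)["text"]}
```

A client is then just `curl -F "file=@clip.wav" http://<server>:8000/transcribe` or the equivalent `requests.post`.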
Next I’m looking at sending the STT output to an LLM. I’ll probably start with ChatGPT just so I have something running quickly.
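That bridge is a few lines with the OpenAI SDK, and since llama.cpp and Ollama expose OpenAI-compatible endpoints, swapping in a self-hosted Llama later should mostly be a `base_url` change. The model name and system prompt below are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(transcript: str) -> str:
    """Forward a voice transcript to the chat model and return the reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system",
             "content": "You are a concise voice assistant; answer briefly."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content
```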
But I also realized I’ll need a wake-word mechanism and proper end-of-speech detection.
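Wake-word detection is usually handed to a dedicated engine (e.g. Picovoice Porcupine or openWakeWord), but end-of-speech can start out as plain silence detection. A rough energy-based sketch; the thresholds are guesses to tune per mic and room:

```python
import numpy as np

SILENCE_RMS = 0.01    # noise-floor threshold; tune per mic/room
SILENCE_CHUNKS = 25   # 25 chunks of 20 ms ~= 0.5 s of sustained silence

def is_end_of_speech(chunks: list[np.ndarray]) -> bool:
    """True once the last ~0.5 s of 20 ms audio chunks are all quiet."""
    if len(chunks) < SILENCE_CHUNKS:
        return False
    return all(np.sqrt(np.mean(c ** 2)) < SILENCE_RMS
               for c in chunks[-SILENCE_CHUNKS:])
```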
I’ve designed it so I can later take notes by voice, summarize them, and use them with Evernote; that’s where I’m heading.
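That flow would just reuse the LLM bridge. A sketch of what I have in mind, where `summarize` is any transcript-to-summary function (like the `ask_llm` above) and the paths are mine:

```python
from datetime import date
from pathlib import Path

def save_note(transcript: str, summarize) -> Path:
    """Summarize a voice transcript and write it to a dated markdown
    file, ready to be imported into Evernote later."""
    summary = summarize("Summarize these voice notes as bullet points:\n"
                        + transcript)
    path = Path.home() / "notes" / f"{date.today()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# Voice notes {date.today()}\n\n{summary}\n")
    return path
```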
But I fear the cost of the servers to run the LLM and STT/TTS will be too high for personal use.
I’m open to discussion, though I’m not sure the project is mature enough to collaborate on yet.
I’m also working on this. Currently I’m building it as a desktop application; my STT flow seems OK-ish and I’m working on making it better. I’ve already implemented the wake word.
For STT I’m using the Whisper large-v3 model, which has high accuracy and multilingual support.
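If VRAM or latency becomes a problem with large-v3, the faster-whisper (CTranslate2) port is a common drop-in. A load-and-transcribe sketch, assuming a CUDA GPU and a WAV file on disk:

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# large-v3 in float16 fits on a single consumer GPU with this backend.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("utterance.wav", vad_filter=True)
print(info.language, " ".join(s.text for s in segments))
```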
For the initial stage, I want my assistant to work as a Windows/OS service that stays up and running in the background, so that a simple wake word triggers it and our virtual buddy starts talking.
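At its core that service is just a listener loop. A sketch with hypothetical callbacks (`wake_word_detected`, `capture_utterance`, and `handle` are stand-ins for the real pipeline), which could then be wrapped as a Windows service with a tool like NSSM, or run as a systemd user unit on Linux:

```python
import time

def run_service(wake_word_detected, capture_utterance, handle):
    """Idle until the wake word fires, then capture one utterance and
    hand it to the STT -> LLM -> TTS pipeline."""
    while True:
        if wake_word_detected():         # e.g. a Porcupine/openWakeWord hit
            audio = capture_utterance()  # record until end-of-speech
            handle(audio)
        time.sleep(0.01)                 # avoid busy-waiting
```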
I’m open to any discussion on how we can make this cooler and more interesting.
Thank you