Creating Iron Man's Jarvis-like assistant

I’m starting to design a system similar to the Jarvis assistant from the Iron Man movies.

I will be the only client, so handling requests adds no complexity.
So far, I’m planning to build multiple RAG images based on open-source LLMs, hosted in a private cloud.
My requests will be converted by speech-to-text on my edge devices, and the server’s responses will be converted back to audio by text-to-speech on those same devices.
A system sitting between the RAG servers and the edge devices will spawn or switch RAG instances depending on keywords. Examples:
“Switch to physics context”
“Switch to python context”
“Switch to construction laws context”
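
A minimal sketch of that switching layer, assuming it receives the transcript as plain text from the STT step. The backend URLs, the `ContextRouter` class, and the exact "switch to ... context" phrasing are all hypothetical placeholders, not part of any existing library:

```python
import re

# Hypothetical context names mapped to hypothetical RAG backend URLs.
CONTEXTS = {
    "physics": "http://rag-physics.internal:8000",
    "python": "http://rag-python.internal:8000",
    "construction laws": "http://rag-construction.internal:8000",
}

# Matches commands such as "Switch to physics context" (case-insensitive,
# tolerating trailing punctuation that an STT engine may add).
SWITCH_PATTERN = re.compile(
    r"^\s*switch to (?P<name>.+?) context[\s.!]*$", re.IGNORECASE
)


class ContextRouter:
    """Tracks the active RAG context and routes transcripts to it."""

    def __init__(self, contexts, default):
        if default not in contexts:
            raise ValueError(f"unknown default context: {default!r}")
        self.contexts = contexts
        self.active = default

    def handle(self, transcript):
        """Return ("switch", name), ("unknown", name) or ("query", backend_url)."""
        match = SWITCH_PATTERN.match(transcript)
        if match:
            name = match.group("name").strip().lower()
            if name in self.contexts:
                self.active = name
                return ("switch", name)
            # A switch command for a context that does not exist.
            return ("unknown", name)
        # Anything else is a normal question for the active RAG backend.
        return ("query", self.contexts[self.active])
```

Exact keyword matching like this is brittle against transcription errors, so a real version would probably need fuzzy matching or a confirmation step.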

I am now thinking of deploying this system on my phone and computer so I can talk to it anywhere, and potentially on IoT devices so it can use the speakers in my home.

For the STT and TTS systems, I’m looking at Mozilla DeepSpeech.
For the RAG foundation LLMs, I’m looking at Llama-like models.
For the frontend application, I’m not sure, as I need something cross-platform that can handle the STT and TTS systems.
For the STT, I can easily fine-tune it with labelled samples of my own voice.

Although it seems like quite a lot of work, it doesn’t seem impossible. It also doesn’t seem that the costs will be that high: even a small GPU should be able to handle a single client.

What are the potential roadblocks, or things I’m missing, that could make this too hard or impossible for personal, one-person use?

I have built something like this.

Hi,
Is it possible to get a little more information? Maybe by DM?
Thanks

Hi @MX1000 @Dakshish, this sounds interesting. What’s the current state of your project? I’m willing to join, if that’s a possibility.

Thanks!

Hi,

I hit a couple of roadblocks.
I started with the STT part. DeepSpeech isn’t supported anymore, so I went with Whisper instead. But it doesn’t have streaming capability, so I had to build that myself.
Currently I’m running Whisper transcription on a server, and my client connects to it.
Next I’m looking at sending the STT output to an LLM; I will probably start with ChatGPT just so I have something running quickly.
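
One common way to approximate streaming with a non-streaming model is to buffer incoming audio on the server and transcribe it in fixed-size windows. Here is a rough sketch under that assumption; `ChunkedTranscriber` is a hypothetical name, and the `transcribe` callable is a stand-in for whatever actually runs the model on raw 16-bit PCM bytes:

```python
from typing import Callable, List


class ChunkedTranscriber:
    """Buffers raw PCM bytes and transcribes them window by window."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],  # hypothetical model wrapper
        sample_rate: int = 16000,
        sample_width: int = 2,  # bytes per sample (16-bit audio)
        window_seconds: float = 5.0,
    ):
        self.transcribe = transcribe
        self.window_bytes = int(sample_rate * sample_width * window_seconds)
        self.buffer = bytearray()

    def feed(self, chunk: bytes) -> List[str]:
        """Add audio from the client; return text for each full window."""
        self.buffer.extend(chunk)
        results = []
        while len(self.buffer) >= self.window_bytes:
            window = bytes(self.buffer[: self.window_bytes])
            del self.buffer[: self.window_bytes]
            results.append(self.transcribe(window))
        return results

    def flush(self) -> List[str]:
        """Transcribe whatever is left, e.g. when the utterance ends."""
        if not self.buffer:
            return []
        window = bytes(self.buffer)
        self.buffer.clear()
        return [self.transcribe(window)]
```

Fixed windows can cut a word in half at the boundary, so overlapping windows or cutting on silence would likely give better text at the cost of more complexity.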

But I also realized I’ll need a wake-word mechanism and proper EOS (end-of-speech) detection.
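
For the end-of-speech half, a very simple starting point is an energy threshold: treat the utterance as finished once speech has been heard and is followed by enough consecutive quiet frames. The sketch below assumes 16-bit mono PCM frames in native byte order; the threshold and frame counts are arbitrary placeholder values, and `EndOfSpeechDetector` is a hypothetical name. Wake-word detection is a different problem that would need a dedicated keyword-spotting model and is not covered here.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit PCM samples."""
    samples = array.array("h")
    # frombytes raises ValueError if the length is not a multiple of 2.
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EndOfSpeechDetector:
    """Signals end of speech after a run of quiet frames following speech."""

    def __init__(self, threshold: float = 500.0, silence_frames: int = 25):
        self.threshold = threshold            # placeholder, tune per microphone
        self.silence_frames = silence_frames  # quiet frames needed to stop
        self.heard_speech = False
        self.quiet_run = 0

    def update(self, frame: bytes) -> bool:
        """Feed one frame; return True when the utterance has ended."""
        if rms(frame) >= self.threshold:
            self.heard_speech = True
            self.quiet_run = 0
            return False
        if self.heard_speech:
            self.quiet_run += 1
            if self.quiet_run >= self.silence_frames:
                # Reset so the detector can be reused for the next utterance.
                self.heard_speech = False
                self.quiet_run = 0
                return True
        return False
```

A plain energy threshold will misfire in noisy rooms, so a proper voice activity detector is probably the next step once the basic loop works.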

I’ve designed it so that later on I can take notes by voice, summarize them, and use them with Evernote, so that’s where I’m heading.

But I fear the cost of servers to run the LLM and STT/TTS will be too high for personal use.

I’m open to discussion, though I’m not sure the project is mature enough for collaboration.