I’m starting to design a system similar to the Jarvis assistant from the Iron Man movies.
I will be the only client, so there is no complexity in handling concurrent requests.
So far, I’m looking to build multiple RAG images based on open-source LLMs, hosted in a private cloud.
My requests will be converted to text by speech-to-text on my edge devices, and my server’s responses converted back to speech by text-to-speech on the same devices.
I will have a system in between the RAG servers and the edge devices that spawns/switches RAGs depending on keywords (a routing sketch follows the examples). Example:
“Switch to physics context”
“Switch to python context”
“Switch to construction laws context”
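A minimal sketch of that keyword-based router, assuming each context runs as its own RAG endpoint (the context names and URLs are placeholders I made up, not part of the design):

```python
# Keyword-based context router (sketch). Switch commands change the
# active context; everything else is forwarded to the active endpoint.
CONTEXTS = {
    "physics": "http://rag-physics.internal:8000",
    "python": "http://rag-python.internal:8000",
    "construction laws": "http://rag-construction.internal:8000",
}

class ContextRouter:
    def __init__(self, default: str = "python"):
        self.active = default

    def handle(self, transcript: str) -> str | None:
        """Return the endpoint to forward to, or None if the
        transcript was a context-switch command."""
        text = transcript.lower().strip()
        if text.startswith("switch to") and text.endswith("context"):
            requested = text[len("switch to"):-len("context")].strip()
            if requested in CONTEXTS:
                self.active = requested
                return None  # command consumed, nothing to forward
        return CONTEXTS[self.active]
```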
Now I’m thinking of deploying this system on my phone and computer so I can talk to it everywhere, and potentially on IoT devices so it has access to the speakers in my home.
For the STT and TTS systems, I’m looking at Mozilla DeepSpeech.
For the RAG foundation LLMs, I’m looking at Llama-like models.
For the frontend application, I’m not sure yet, as I need something cross-platform that can handle the STT and TTS pipelines.
For the STT, I can easily fine-tune the model with labelled samples of my own voice.
Although it seems like quite a lot of work, it doesn’t seem impossible, and it doesn’t seem like costs will be that high. Even a small GPU should be able to handle a single client.
What are the potential roadblocks or things I’m missing here that could make this too hard or impossible for personal/one-person use?
I hit a couple of roadblocks.
I started with the STT part. DeepSpeech isn’t maintained anymore, so I went with OpenAI’s Whisper instead. But Whisper doesn’t have streaming capability out of the box, so I had to build that myself.
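For anyone hitting the same wall: pseudo-streaming is usually done by re-transcribing a growing/sliding audio buffer. A naive sketch, assuming the open-source openai-whisper package and a 16 kHz mono microphone via sounddevice (the chunk sizes and overlap are arbitrary starting points, not tuned values):

```python
import queue

import numpy as np
import sounddevice as sd
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small model keeps latency tolerable
audio_q = queue.Queue()

def on_audio(indata, frames, time, status):
    audio_q.put(indata[:, 0].copy())  # mono float32 chunks at 16 kHz

buffer = np.zeros(0, dtype=np.float32)
with sd.InputStream(samplerate=16000, channels=1, dtype="float32",
                    callback=on_audio):
    while True:
        buffer = np.concatenate([buffer, audio_q.get()])
        if len(buffer) >= 16000 * 3:  # re-transcribe every ~3 s of audio
            result = model.transcribe(buffer, fp16=False, language="en")
            print(result["text"])
            buffer = buffer[-16000:]  # keep ~1 s of overlap for context
```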
Currently I’m running the Whisper transcription on a server, and my client connects to it.
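The server side can stay very small. A sketch of the kind of endpoint I mean, assuming FastAPI and clients that upload 16 kHz mono WAV clips (the route name is arbitrary):

```python
# server.py -- run with: uvicorn server:app --host 0.0.0.0 --port 8000
import io

import soundfile as sf
import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = whisper.load_model("base")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    data, sr = sf.read(io.BytesIO(await file.read()), dtype="float32")
    if data.ndim > 1:  # downmix stereo to mono
        data = data.mean(axis=1)
    # Whisper assumes 16 kHz input when given a raw array.
    return {"text": model.transcribe(data, fp16=False)["text"]}
```

A client is then just `curl -F "file=@clip.wav" http://<server>:8000/transcribe` or the equivalent `requests.post`.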
Next I’m looking at sending the STT output to an LLM. I’ll probably start with ChatGPT just so I have something running quickly.
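That bridge is a few lines with the OpenAI SDK, and since llama.cpp and Ollama expose OpenAI-compatible endpoints, swapping in a self-hosted Llama later should mostly be a `base_url` change. The model name and system prompt below are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(transcript: str) -> str:
    """Forward a voice transcript to the chat model and return the reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system",
             "content": "You are a concise voice assistant; answer briefly."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content
```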
But I also realized I’ll need a wake-word mechanism and proper end-of-speech detection.
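Wake-word detection is usually handed to a dedicated engine (e.g. Picovoice Porcupine or openWakeWord), but end-of-speech can start out as plain silence detection. A rough energy-based sketch; the thresholds are guesses to tune per mic and room:

```python
import numpy as np

SILENCE_RMS = 0.01    # noise-floor threshold; tune per mic/room
SILENCE_CHUNKS = 25   # 25 chunks of 20 ms ~= 0.5 s of sustained silence

def is_end_of_speech(chunks: list[np.ndarray]) -> bool:
    """True once the last ~0.5 s of 20 ms audio chunks are all quiet."""
    if len(chunks) < SILENCE_CHUNKS:
        return False
    return all(np.sqrt(np.mean(c ** 2)) < SILENCE_RMS
               for c in chunks[-SILENCE_CHUNKS:])
```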
I’ve designed it so I can later take notes by voice, summarize them, and use them with Evernote; that’s where I’m heading.
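That flow would just reuse the LLM bridge. A sketch of what I have in mind, where `summarize` is any transcript-to-summary function (like the `ask_llm` above) and the paths are mine:

```python
from datetime import date
from pathlib import Path

def save_note(transcript: str, summarize) -> Path:
    """Summarize a voice transcript and write it to a dated markdown
    file, ready to be imported into Evernote later."""
    summary = summarize("Summarize these voice notes as bullet points:\n"
                        + transcript)
    path = Path.home() / "notes" / f"{date.today()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# Voice notes {date.today()}\n\n{summary}\n")
    return path
```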
But I fear the cost of the servers to run the LLM and STT/TTS will be too high for personal use.
I’m open to discussion, though I’m not sure the project is mature enough to collaborate on yet.
I’m also working on this. Currently I’m building it as a desktop application; my STT flow seems OK-ish and I’m working on making it better. I’ve already implemented the wake word.
For STT I’m using the Whisper large-v3 model, which has high accuracy and multilingual support.
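If VRAM or latency becomes a problem with large-v3, the faster-whisper (CTranslate2) port is a common drop-in. A load-and-transcribe sketch, assuming a CUDA GPU and a WAV file on disk:

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# large-v3 in float16 fits on a single consumer GPU with this backend.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("utterance.wav", vad_filter=True)
print(info.language, " ".join(s.text for s in segments))
```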
For the initial stage, I want my assistant to work as a Windows/OS service that stays up and running in the background, so that a simple wake word triggers it and our virtual buddy starts talking.
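At its core that service is just a listener loop. A sketch with hypothetical callbacks (`wake_word_detected`, `capture_utterance`, and `handle` are stand-ins for the real pipeline), which could then be wrapped as a Windows service with a tool like NSSM, or run as a systemd user unit on Linux:

```python
import time

def run_service(wake_word_detected, capture_utterance, handle):
    """Idle until the wake word fires, then capture one utterance and
    hand it to the STT -> LLM -> TTS pipeline."""
    while True:
        if wake_word_detected():         # e.g. a Porcupine/openWakeWord hit
            audio = capture_utterance()  # record until end-of-speech
            handle(audio)
        time.sleep(0.01)                 # avoid busy-waiting
```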
I’m open to any discussion on how we can make this cooler and more interesting.
Thank you