Please give your suggestions on this development.
Please look at Google Speech-to-Text for converting user speech to text, Text-to-Speech for providing user feedback, and LangChain tools for executing custom code.
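For reference, a minimal Google Cloud Speech-to-Text sketch, assuming the google-cloud-speech package is installed and GOOGLE_APPLICATION_CREDENTIALS is set; the file name and audio parameters are placeholders to adapt:

```python
from google.cloud import speech

client = speech.SpeechClient()

# Placeholder file: a 16 kHz mono LINEAR16 WAV recording.
with open("command.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries one or more alternatives; the first is the most likely.
    print(result.alternatives[0].transcript)
```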
You can use gTTS, Google's free text-to-speech API, together with a pre-trained LLM. But make sure your voice is clear to the interface and that you pronounce words correctly.
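A minimal gTTS sketch, assuming the gTTS package is installed; the reply text and output file name are placeholders:

```python
from gtts import gTTS

# Synthesize a spoken reply to an MP3 file. Playback is left to whatever
# audio library you prefer (e.g. playsound or pygame).
tts = gTTS("Hello, how can I help you today?", lang="en")
tts.save("reply.mp3")
```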
What drawbacks are there with Spanish language training?
There are a lot of offline speech-to-text and text-to-speech options these days. If you want to dig around in the nitty-gritty of speech-to-text, I would suggest you install Kaldi and play with the LibriSpeech examples - https://www.youtube.com/watch?v=tixkx-I1hM4. I have played with VOSK, which is based on Kaldi. Usually the only reason to actually train an STT model is if you are adding a new language; for the most part you just need to adapt an existing one. I am very interested in finding the least expensive way (in terms of resources) to add words to a Gr.fst and HCLr.fst grammar.
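For reference, a minimal VOSK recognition sketch along the lines of the official example, assuming a model directory downloaded from the VOSK site and a 16 kHz mono WAV file; both paths are placeholders:

```python
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("test.wav", "rb")
model = Model("model")  # path to an unpacked VOSK model directory
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio in chunks; print each finalized utterance as it completes.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])
```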
There are a few good options for text-to-speech. Flite and Festival work okay; Mimic3 and MaryTTS are better. There are other options out there that may work better for you, but I'm pretty happy with the CMU-Arctic voices, which will run on any of those.
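If it helps, this is roughly how one could drive Mimic3 from Python, on the assumption that the mimic3 CLI is on PATH and writes WAV audio to stdout (check mimic3 --help for your install); the voice key is one of the CMU-Arctic voices:

```python
import subprocess

def speak(text: str, voice: str = "en_US/cmu-arctic_low") -> None:
    # Assumption: `mimic3 --voice <voice> <text>` emits WAV on stdout.
    with open("speech.wav", "wb") as out:
        subprocess.run(
            ["mimic3", "--voice", voice, text],
            stdout=out,
            check=True,
        )

speak("Howdy doodly doo!")
```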
Hugging Face hosts a lot of offline LLMs in GGUF format. I'm playing with LangChain in Python and am pretty happy with it so far. I'm learning about long-term memory storage; right now I am saving my conversations to a SQLite database so my assistant remembers what we've been talking about. There is a feature called tools that looks useful for giving the LLM access to specific information, but I'm still trying to get those examples working with offline models.
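In case it's useful, a minimal sketch of that kind of SQLite memory, using only the standard library; the table and column names here are just my own choices:

```python
import sqlite3

conn = sqlite3.connect("assistant_memory.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS messages (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           role TEXT NOT NULL,        -- "user" or "assistant"
           content TEXT NOT NULL,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def remember(role: str, content: str) -> None:
    # Append one conversation turn to long-term storage.
    conn.execute(
        "INSERT INTO messages (role, content) VALUES (?, ?)", (role, content)
    )
    conn.commit()

def recall(limit: int = 20) -> list[tuple[str, str]]:
    # Return the most recent turns, oldest first, to rebuild the LLM prompt.
    rows = conn.execute(
        "SELECT role, content FROM messages ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return list(reversed(rows))

remember("user", "What were we talking about yesterday?")
print(recall())
```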
Now that all of this can run on a Raspberry Pi 5, I feel like K-9 and R2-D2 are right around the corner. I am currently working on building Talkie Toaster from Red Dwarf.
Hi. I created a Jarvis-like voicebot and deployed it online as a web app. It uses the JavaScript SpeechRecognition API to convert speech to text and the JavaScript SpeechSynthesis API to convert text to speech. Both of these tools are free. The voicebot is powered by ChatGPT via the API.
Web app:
GitHub:
I hope this project gives you some ideas.
I'm using Whisper for my vision-to-voice project. Are its disadvantages cost and speed compared to Google TTS? I'm just a really big fan of the realistic voice inflections Whisper offers; I haven't tried the gTTS API yet.
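For anyone curious about cost and latency, one option is running the open-source whisper package locally instead of calling the API. A minimal transcription sketch, assuming ffmpeg is installed; the model size and file name are placeholders:

```python
import whisper

# "base" trades accuracy for speed; larger models are slower but better.
model = whisper.load_model("base")
result = model.transcribe("recording.mp3")
print(result["text"])
```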
I really liked how fast and free it was. It didn't feel like I was providing instructions; it felt like a full-fledged conversation. You're right that it's a bit buggy on Windows, but that's okay. Awesome work.
I'm also working on something similar: a personable, scalable assistant with vision. But speed is crucial, and my Python code is struggling; there's about 5 to 10 seconds of latency. I'd love to share it with you if you have any advice.
Hi Jonnysowl. Thank you for your encouraging feedback.
Your idea for a voice assistant with vision reminds me of the Netflix K-drama “Start-Up”, where they create a voice assistant with vision called NoonGil. It helps visually impaired people in their everyday lives.
In the past I've built Python computer vision projects to control small robots using hand signals. I ran these on my old Raspberry Pi. The key to reducing latency was to use OpenCV together with Google MediaPipe for the image recognition. MediaPipe is fast because it's designed to run on a CPU. I haven't looked at MediaPipe recently, but I'm sure it's become even better since I last used it. This is one project example that may be useful, please refer to Exp_05:
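For a sense of the approach, here is a minimal OpenCV + MediaPipe Hands loop, assuming a webcam at index 0; the gesture-to-command logic would hang off the detected landmarks:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV captures BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for landmarks in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, landmarks, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```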
Regarding the Khuluma Voicebot, I think that if it was implemented in Python, then, by combining it with OpenAI function calls and a Raspberry Pi or Arduino, one could get it to do things in the real world just by talking to it - switch on a light, trigger an alarm, etc. The possibilities are endless. All that's needed is for the voice instruction to trigger an output voltage. Here's another simple example project:
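A rough sketch of that light-switch idea, assuming the openai Python package (v1 style) and RPi.GPIO on a Raspberry Pi; the pin number, model name, and tool schema are placeholders to adapt:

```python
import json

import RPi.GPIO as GPIO
from openai import OpenAI

LIGHT_PIN = 17  # placeholder BCM pin wired to a relay or LED
GPIO.setmode(GPIO.BCM)
GPIO.setup(LIGHT_PIN, GPIO.OUT)

def set_light(on: bool) -> str:
    # Drive the output voltage that switches the light.
    GPIO.output(LIGHT_PIN, GPIO.HIGH if on else GPIO.LOW)
    return f"light {'on' if on else 'off'}"

tools = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Switch the light on or off",
        "parameters": {
            "type": "object",
            "properties": {"on": {"type": "boolean"}},
            "required": ["on"],
        },
    },
}]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Please switch on the light."}],
    tools=tools,
)

# If the model chose to call the tool, execute it locally.
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "set_light":
        args = json.loads(call.function.arguments)
        print(set_light(args["on"]))
```

In a full assistant, the user message would come from the speech-to-text step, and the tool result would be spoken back via text-to-speech.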
I hope you find these thoughts helpful in your project.