
Show HN: I built a <400ms latency voice agent that runs on a 4GB VRAM GTX 1650

I built a voice agent platform for the robotics lab at my university, and it has already been cloned by 330+ people within 12 hours. I am a first-year CSE student, so I tried to figure out a way to actually run everything on my laptop. I am currently working on turning it into a fully edge-AI voice assistant, for 100% private and local control of my lab's robotics projects.

The interesting features are: 1> I used JSON RAG with real-time embeddings, so that for a few specs and bits of info we don't need to set up a whole pipeline.

I have already built a "Hierarchical Agentic RAG with Hybrid Search" (knowledge graph + vector search); you can view that on my profile.

I am actively trying to share as much as possible about it, but that project is tied to a huge set of files: 693k data points with pgvector + Postgres. Give it a visit and you will get a better idea.

2> I tried every sort of Whisper model: faster-whisper, turbo, anything you can think of, even with a self-written C++ engine. But the model architecture itself was hallucination-prone.

Then I moved to Parakeet TDT with Silero VAD, rather than Parakeet RNN, for better speed and optimisation. The repo has further details.

3> I processed a dataset from Anthropic RLHF through spaCy and GLiNER and converted it into a training dataset for fine-tuning Llama 3.2 3B.

I will attach the dataset if you need it, or upload it to Hugging Face if you want to use it yourself.

4> I attached phonetic correctors to the output of both Parakeet and Llama so that TTS works better.

5> I used SetFit to route the queries, plus confidence-based semantic search, to keep things as fast and accurate as possible.

6> I am using sherpa-onnx and have queued the TTS, STT and everything else. As an experiment, I have also got Llama generating the response and Kokoro processing it as a batch, with full sync working as well, all on my laptop.

7> On top of this, my frontend relies heavily on three.js and 3D view files, but I applied optimisations there, and it works perfectly with everything running together on the laptop.

8> I also applied glued interaction to the LLM: I implemented a FIFO of 5 interactions and store them for future fine-tuning and for adding phonetic words.
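
For anyone curious what the FIFO in point 8 can look like, here is a minimal Python sketch. It assumes a window of 5 turns that feeds the next prompt, plus a separate full log kept for later fine-tuning. The class name, field names and prompt format are made up for illustration and are not taken from the repo.

```python
from collections import deque


class InteractionMemory:
    """Hypothetical helper: short FIFO window for the prompt, full log for later use."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest turn automatically (FIFO).
        self.recent = deque(maxlen=window)
        # Every turn is also kept here, e.g. as raw material for fine-tuning.
        self.archive = []

    def add(self, user_text, reply_text):
        turn = {"user": user_text, "assistant": reply_text}
        self.recent.append(turn)
        self.archive.append(turn)

    def context(self):
        """Render the recent turns as plain text to prepend to the next prompt."""
        return "\n".join(
            f"User: {t['user']}\nAssistant: {t['assistant']}" for t in self.recent
        )


memory = InteractionMemory(window=5)
for i in range(7):
    memory.add(f"question {i}", f"answer {i}")
```

After seven turns the window holds only the last five, while the archive still has all seven. A fixed-size window like this keeps the prompt length bounded, which matters when the LLM shares 4GB of VRAM with everything else.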

Please give it a visit and let me know if I should learn something new.

One kind note: as an enthusiast spending so much energy on these things, I have taken help from AI for the .md files and for expanding or explaining the code, to make it more helpful for everyone.

I honestly didn't expect this to get much attention, but we hit ~330+ clones in the last 24 hours.

That unexpected load actually helped me find a few bugs in the setup script (specifically with the pgvector config on Windows), which I've just patched. If anyone else hits memory issues on 4GB cards, let me know. I'm actively optimizing the quantization now.
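
If you are sizing this against a 4GB card, some rough arithmetic on weight memory helps. The sketch below assumes a round 3 billion parameters and one uniform bit width for all weights. It counts weights only, so the KV cache, activations, and the STT and TTS models all come on top. The function name is hypothetical and the bit widths are just examples, not the settings used in the repo.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for model weights only, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


# Assumption: a round 3B parameters, used here only as an illustrative figure.
N_PARAMS = 3_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 16-bit weights alone would not fit in 4GB, while 4-bit weights leave room for the rest of the pipeline.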

5 hours ago by shubham-coder