Is there any model like this yet, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt) plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, yielding a continuous audio output stream? (Ideally it would pass non-speech parts of the input stream through unchanged; but those could also be handled in other ways, with traditional source-separation techniques, microphone arrays, etc.)
Though I suppose, for the use-case I'm thinking of (VTubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, with the target vocal timbre burned in during an expensive (but one-time) fine-tuning step.
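To make the interface I have in mind concrete, here's a minimal runnable sketch. Everything here is hypothetical — `PromptedVoiceConverter`, its energy-threshold VAD, and the identity "conversion" are stand-ins, not a real model or library; a real system would encode the prompt into a speaker embedding once, then resynthesize each incoming chunk with that timbre:

```python
import numpy as np


class PromptedVoiceConverter:
    """Hypothetical "speech + speech -> speech" modulator interface.

    The prompt is encoded once at construction time; input audio is then
    converted chunk by chunk, so the output can be produced as a
    continuous stream. All model internals here are placeholders.
    """

    def __init__(self, prompt_audio: np.ndarray, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        # Placeholder for a speaker encoder: a real model would derive
        # a timbre embedding from the prompt audio here.
        self.speaker_embedding = float(np.mean(prompt_audio))

    def is_speech(self, chunk: np.ndarray) -> bool:
        # Placeholder VAD: treat chunks above an energy threshold
        # as speech. A real system would use a proper voice-activity
        # detector or source separation.
        return float(np.mean(chunk ** 2)) > 1e-4

    def convert_chunk(self, chunk: np.ndarray) -> np.ndarray:
        if not self.is_speech(chunk):
            # Pass non-speech audio through unchanged.
            return chunk
        # Placeholder conversion: a real model would resynthesize the
        # chunk with the prompt voice's tone and timbre; the identity
        # transform just keeps this sketch runnable.
        return chunk

    def stream(self, chunks):
        # Lazily convert an iterable of chunks -> stream of chunks.
        for chunk in chunks:
            yield self.convert_chunk(chunk)


# Usage: one second of (silent) prompt audio, then three 20 ms chunks.
converter = PromptedVoiceConverter(np.zeros(16000))
rng = np.random.default_rng(0)
chunks = [rng.normal(0.0, 0.1, 320) for _ in range(3)]
outputs = list(converter.stream(chunks))
```

The single-stream variant from the paragraph above would just drop the `prompt_audio` argument, since the target timbre would be baked into the model weights by fine-tuning.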