Is there any model like this yet, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt) plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, yielding a continuous audio output stream? (Ideally it would pass non-speech parts of the input stream through unchanged; but those could also be handled in other ways, with traditional source-separation techniques, microphone arrays, etc.)
Though I suppose, for the use-case I'm thinking of (VTubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, with the target vocal timbre burned in during an expensive (but one-time) fine-tuning step.
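To make the interface I have in mind concrete, here's a minimal runnable sketch. Everything here is hypothetical — `PromptedVoiceConverter`, its energy-threshold VAD, and the identity "conversion" are stand-ins, not a real model or library; a real system would encode the prompt into a speaker embedding once, then resynthesize each incoming chunk with that timbre:

```python
import numpy as np


class PromptedVoiceConverter:
    """Hypothetical "speech + speech -> speech" modulator interface.

    The prompt is encoded once at construction time; input audio is then
    converted chunk by chunk, so the output can be produced as a
    continuous stream. All model internals here are placeholders.
    """

    def __init__(self, prompt_audio: np.ndarray, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        # Placeholder for a speaker encoder: a real model would derive
        # a timbre embedding from the prompt audio here.
        self.speaker_embedding = float(np.mean(prompt_audio))

    def is_speech(self, chunk: np.ndarray) -> bool:
        # Placeholder VAD: treat chunks above an energy threshold
        # as speech. A real system would use a proper voice-activity
        # detector or source separation.
        return float(np.mean(chunk ** 2)) > 1e-4

    def convert_chunk(self, chunk: np.ndarray) -> np.ndarray:
        if not self.is_speech(chunk):
            # Pass non-speech audio through unchanged.
            return chunk
        # Placeholder conversion: a real model would resynthesize the
        # chunk with the prompt voice's tone and timbre; the identity
        # transform just keeps this sketch runnable.
        return chunk

    def stream(self, chunks):
        # Lazily convert an iterable of chunks -> stream of chunks.
        for chunk in chunks:
            yield self.convert_chunk(chunk)


# Usage: one second of (silent) prompt audio, then three 20 ms chunks.
converter = PromptedVoiceConverter(np.zeros(16000))
rng = np.random.default_rng(0)
chunks = [rng.normal(0.0, 0.1, 320) for _ in range(3)]
outputs = list(converter.stream(chunks))
```

The single-stream variant from the paragraph above would just drop the `prompt_audio` argument, since the target timbre would be baked into the model weights by fine-tuning.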