Pick the right 8 seconds.
The cloning engine performs best on a single, clean, 8-second sample. Long is not better. Studio is not required. What is required: the speaker is talking like they would on a customer call — not a podcast, not a keynote, not a voicemail greeting.
- Mono, 16 kHz minimum. We can resample 48 kHz down; resampling 8 kHz up cannot restore detail that was never recorded.
- No music, no laughter, no second speaker, no echoey rooms.
- Speaker reads or speaks a sentence with a question and an affirmation. Cadence matters more than content.
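The measurable items on this checklist (mono, 16 kHz minimum, roughly 8 seconds) can be checked before a sample is submitted. The sketch below is illustrative, not our intake pipeline: it uses Python's standard `wave` module, assumes the sample is an uncompressed WAV file, and the function name `check_sample` and the half-second duration tolerance are assumptions. It cannot detect music, laughter, a second speaker, or echo.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # minimum sample rate from the checklist
TARGET_SECONDS = 8.0         # sample length recommended in this section
DURATION_TOLERANCE = 0.5     # assumption: the text does not state a tolerance


def check_sample(path: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if abs(seconds - TARGET_SECONDS) > DURATION_TOLERANCE:
        problems.append(f"duration {seconds:.1f} s is not close to {TARGET_SECONDS:.0f} s")
    return problems
```

An empty result only means the file format is acceptable; the cadence of the speaker still has to be judged by ear.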
Cadence, not pronunciation.
Once cloned, we tune three prosody dials: pace, pitch range, and pause density. The default Pro+ profile sits at +0% pace, ±2 semitones pitch range, and 220 ms inter-clause pause. Most brands shift one of those dials, not all three.
| Vertical | Typical adjustment |
| --- | --- |
| Healthcare | −5% pace · narrower pitch · longer pauses |
| Sales outbound | +8% pace · wider pitch · shorter pauses |
| Hospitality | 0% pace · medium pitch · brand-tone glossary |
| Public sector | −2% pace · narrower pitch · plain-language filter on |
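One way to picture the three dials is as a small settings record with per-vertical overrides. The sketch below is an assumption-laden illustration, not the product's configuration format: `ProsodyProfile`, `PACE_BY_VERTICAL`, and `profile_for` are hypothetical names. Only the pace offsets are encoded, because the table gives pitch and pause changes qualitatively ("narrower", "longer") and no numbers are invented for them here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProsodyProfile:
    pace_pct: float               # pace offset, in percent
    pitch_range_semitones: float  # plus/minus range, in semitones
    pause_ms: int                 # inter-clause pause, in milliseconds


# Default values as stated in the text.
DEFAULT = ProsodyProfile(pace_pct=0.0, pitch_range_semitones=2.0, pause_ms=220)

# Pace offsets taken from the table; the keys are hypothetical identifiers.
PACE_BY_VERTICAL = {
    "healthcare": -5.0,
    "sales_outbound": 8.0,
    "hospitality": 0.0,
    "public_sector": -2.0,
}


def profile_for(vertical: str) -> ProsodyProfile:
    """Start from the default and change only the pace dial."""
    return replace(DEFAULT, pace_pct=PACE_BY_VERTICAL[vertical])
```

Calling `profile_for("healthcare")` returns the default profile with pace set to -5%, leaving pitch range and pause untouched until someone chooses concrete values for them.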
A brand glossary.
For every brand we maintain a private glossary: pronunciations of company terms, partner names, product SKUs. It is what keeps "VoIP" from coming out as "Voov-itee", the first time and the fortieth.
Don't try to glossarise everything. Start with the ten terms whose misreading would embarrass you. Add the next ten only when you hear a problem in QA.
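A glossary of this kind can be thought of as a lookup applied to the script before synthesis. The following is a minimal sketch under stated assumptions: the entries, the respellings, and the function `apply_glossary` are all hypothetical, and a real engine may accept pronunciations in a different form than plain respelling.

```python
import re

# Hypothetical entries: term -> respelling handed to the synthesiser.
GLOSSARY = {
    "VoIP": "voyp",
    "SKU": "skew",
}


def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive glossary terms with their respellings."""
    if not glossary:
        return text
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: glossary[m.group(1)], text)
```

Matching is whole-word and case-sensitive on purpose: a short glossary should only fire on the exact terms that were reviewed, which fits the advice to start with ten entries and grow from QA findings.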
A human listening panel.
Before any new voice ships, a panel of three internal listeners hears ten generated calls in random order, against three baseline references. If the clone is not picked correctly at least 7 times out of 10, it does not ship; we re-record.
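The ship rule reduces to a simple threshold check. The sketch below assumes the 7-of-10 bar applies to each of the three listeners individually; the text does not say whether the count is per listener or pooled, so treat that as an assumption. The function name `clone_ships` is hypothetical.

```python
PASS_THRESHOLD = 7       # minimum correct picks, from the text
CALLS_PER_LISTENER = 10  # generated calls each listener hears, from the text


def clone_ships(correct_picks_per_listener: list[int]) -> bool:
    """True only if every listener picked the clone correctly at least 7 times.

    Assumption: the threshold is applied per listener, not to a pooled count.
    """
    if not correct_picks_per_listener:
        raise ValueError("at least one listener result is required")
    for picks in correct_picks_per_listener:
        if not 0 <= picks <= CALLS_PER_LISTENER:
            raise ValueError(f"picks must be between 0 and {CALLS_PER_LISTENER}")
    return all(picks >= PASS_THRESHOLD for picks in correct_picks_per_listener)
```

With results of 7, 9, and 10 the voice ships; with 7, 6, and 10 it goes back for re-recording.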
Voice drift over time.
Synthesis drifts. Engines update; what sounded right on Monday can sound flatter on Friday. We re-audit every cloned voice quarterly against the original reference and the QA panel. Drift correction usually means a 20-minute prosody tweak; rarely a re-clone.
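A quarterly cadence is easy to track with a due-date check. This is a sketch only: `audit_due` is a hypothetical helper, and "quarterly" is approximated as 91 days, which is an assumption rather than a stated rule.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 91 days.
AUDIT_INTERVAL = timedelta(days=91)


def audit_due(last_audit: date, today: date) -> bool:
    """True when at least one audit interval has passed since the last audit."""
    return today - last_audit >= AUDIT_INTERVAL
```

For example, a voice last audited on 1 January 2024 is flagged on 1 April 2024, since 91 days have elapsed.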