Will my voice ever be used to train other models?

No. Your voice clone is generated, stored, and used only for you. We don't pool it. You can delete it from the settings screen at any moment, and no copy remains on our servers.

What if I sound exhausted on the recording?

We ask you to read one calm paragraph at a moment of your choosing. The model learns your timbre, not your mood. The morning voice is steadier than the source recording, by design.

Why audio instead of just text?

Hearing your own voice from a year ahead is the strongest version of future-self continuity ever built into a consumer app. Text helps. Audio lands.

Is HerDay a replacement for therapy?

No. HerDay is a daily ritual, not clinical care. If you're carrying something heavy, please also speak with a therapist. We list a few we trust in the app.

Is the AI-written copy generic?

Every morning letter is generated against your intake answers, your tone preference, and the season you're in. You can rate any morning's letter, and we use that signal to tune yours, not anyone else's.

When can I start using HerDay?

We're in private beta. Waitlist members get first access in Q3, with a free month of voice cloning included.

How much of my voice does the model actually need?

Modern few-shot voice models are remarkably efficient. ElevenLabs' professional cloning recommends at least 30 minutes of high-quality audio for the best fidelity, but their instant model can produce a usable clone from about a minute of clean speech. Internally, we found 90 to 120 seconds is the sweet spot for affirmations specifically: long enough to capture your tone and breath, short enough that recording it doesn't feel like a project. The model isn't memorizing your sentences; it's learning the timbre, pitch range, and rhythm that make your voice recognizably yours.

Does the cloned voice actually sound like me, or like a smoothed-out impression?

Both, depending on what you're listening for. A good clone captures the parts of your voice the brain uses for self-recognition — timbre, pitch contour, vowel shapes — with high accuracy. What it doesn't capture is moment-to-moment variation: the tightness in your throat when you're anxious, the slight rasp when you're tired. For affirmations, the smoothing is generally a feature. You hear yourself on a steadier day than the one you're actually having, which is closer to the voice you'd want speaking to you anyway.

Where does the model live and who can access it?

In HerDay specifically, the model is generated server-side from your sample, used to render your daily affirmation audio, and tied to your account only. We do not pool voice models across users for training, we do not sell or share them, and you can delete your clone at any moment from the settings screen. The audio file delivered to your phone is just an MP3 — there is no model on the device, and no model anywhere we don't control. This is structurally different from public cloning tools where the model may be shared or accessible to other accounts.

Could someone use my cloned voice to impersonate me?

The fear is reasonable; the structural answer is that impersonation requires access to the model, and our architecture is designed so no one outside our infrastructure ever sees it. We don't expose the model as an API, we don't let anyone else render audio from it, and we don't store the underlying sample after the model is built. The realistic impersonation risks people worry about — deepfakes of phone calls, scams using a relative's voice — come from voice models built with access to your public audio (interviews, social posts), not from a sample you recorded into a quiet bedroom for your own use.

Will the clone feel uncanny when I first hear it?

Often, yes, for about ten seconds. The discomfort is acoustic, not psychological: you normally hear yourself partly through bone conduction, which adds lower frequencies as sound travels through your skull. A recording or a clone is closer to how others hear you, which is initially jarring. Most users in our beta described the feeling shifting between sessions one and three: from 'that's weird, that's me' to 'that's me, on a calmer day.' The brain adapts faster than you'd guess, because the self-recognition cue (timbre, pitch contour) is doing most of the work, and that part already matches.


Voice cloning for affirmations, explained — what's actually happening inside the model

A 60-second sample of your voice is enough for a modern model to render new sentences in your acoustic signature. Here's what voice cloning actually does, why it matters specifically for affirmations, and how we built it inside HerDay so the model never leaves our infrastructure.

Lena Hartwell · MSc Cognitive Science

Editorial lead · Science writer

Published May 31, 2026

Updated May 31, 2026

11 min read

Figure: Ninety seconds in. A year of mornings out.

The phrase voice cloning tends to land somewhere between deepfake and party trick in most people's heads. Both associations are unfortunate, because the underlying technique — and what it lets us do for self-talk specifically — is more interesting and considerably less dramatic than either framing suggests. This is the explainer I wish someone had handed me before we started building HerDay: what the model actually does, why ninety seconds of audio is enough, and what we had to decide about safety, ownership, and fidelity along the way.

Definition · Voice cloning (for affirmations)

The process of training a small, voice-specific model on a short audio sample of a single person so that the model can later render new written sentences in that person's acoustic signature. In a well-designed system, the sample stays private, the model is tied to one account, and the output is an audio file — not a way for anyone else to speak as you.

What voice cloning actually does

A modern voice-cloning system is, at its core, a small machine-learning model trained on the acoustic features that make a single human voice recognizable. The technical name for the family of techniques is few-shot speaker adaptation — a way of adjusting a large pretrained text-to-speech model toward a specific speaker using a tiny amount of new data.

What the model is learning is not your content — not your words, not your sentences, not anything you said in the sample. It is learning the parts of your voice that stay stable across what you say: your fundamental frequency range, your formant structure (the resonance shapes your vocal tract creates), your speaking rate, your characteristic pitch contour at the end of sentences, the way you handle vowel transitions. These features are remarkably stable across most contexts — which is why your friend recognizes you on a noisy phone call after one word.
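
To make one of those features concrete, here is a minimal sketch of estimating fundamental frequency from a short frame of audio using autocorrelation. It is an illustration only, not the model HerDay runs: the function name `estimate_f0`, the 75 to 400 Hz search range, and the synthetic test tone are all assumptions chosen for the example.

```python
import math


def estimate_f0(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of a mono frame by autocorrelation."""
    # Remove any constant offset so it does not dominate the correlation.
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]

    # Only search lags that correspond to plausible speaking pitches.
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(x) - 1)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # How well does the frame line up with a copy of itself shifted by `lag`?
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else None


# Synthetic stand-in for a voiced sound: a 200 Hz tone plus a weaker harmonic,
# 50 ms long at an 8 kHz sample rate (all values hypothetical).
SAMPLE_RATE = 8000
frame = [
    math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 400 * n / SAMPLE_RATE)
    for n in range(400)
]
f0 = estimate_f0(frame, SAMPLE_RATE)
```

A real system learns many such features at once rather than computing them by hand, but the point carries over: the number describes how the voice sounds, not what was said.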

Once the model has these features, it can be asked to produce new sentences it has never heard. The text-to-speech backbone handles the linguistic side (turning written words into phonemes and prosody). The speaker-adaptation layer renders those phonemes as you. The result is an audio file that, for the brain's self-recognition system, is functionally indistinguishable from a recording of you saying that exact sentence — even though you never did.

Figure: The model learns the shape, not the words.

The reason this matters for our purposes is that the brain uses voice for self-recognition almost reflexively. Functional MRI work has shown that hearing your own voice, even on a recording, activates self-referential networks in a way no other voice does (Kaplan 2008). A well-cloned voice preserves the cues this system relies on. The brain still files the audio under me.

Why this matters specifically for affirmations

Voice cloning is a general technology with many uses. Most of them — narration for audiobooks, dubbing, video voiceovers — have nothing to do with self-talk. The reason it matters here is narrower and worth saying clearly.

Daily affirmation practice runs into a specific bottleneck the moment you take it seriously. If you want the audio to be your voice (which the psychology literature suggests you should), you can either record every sentence yourself or accept that the practice will use someone else's voice. Recording every sentence yourself is unsustainable past the first week — partly because affirmations work best when phrased fresh to the morning you're actually having, partly because the morning you most need to hear something kind is precisely the morning your throat is too tight to say it.

Voice cloning resolves the bottleneck. You record once. The model renders new sentences indefinitely. The audio is, for self-recognition purposes, still you. The affirmation can be written this morning, addressed to the specific season you're in, delivered in your voice without you having to summon the steadiness to record it.

+30% more allocated to retirement savings by participants who felt connected to their future self via age-progressed imagery (Hershfield 2011). The mechanism, future-self continuity, is what voice can do at scale, daily, without requiring you to re-record.

The Hershfield finding above is from imagery, not audio. But the broader future-self continuity work suggests that voice carries a stronger self-identification signal than vision does. Vision activates self-recognition; voice activates self-recognition and identity-relevant audio processing and the narrative-self system that links past, present, and future memory. A daily, voice-rendered statement from a steadier version of yourself is, on the evidence, the most efficient delivery format we know how to build for the mechanism Hershfield described.

Figure: Ninety seconds of sample. A year of fresh language.

How small the sample really is

When we first started prototyping, we assumed users would need to record something close to fifteen minutes for the clone to feel like them. Two years of model improvement later, the number is closer to ninety seconds.

The shift happened because modern few-shot speaker-adaptation techniques no longer need to learn your voice from scratch. They start from a large pretrained TTS model that already knows what human voices in general sound like, and they fine-tune only the small subset of parameters that distinguish your voice from the average. This approach is often called speaker-conditioned synthesis, and it's the same family of techniques that lets ElevenLabs, Resemble.ai, and most consumer-facing voice tools deliver instant clones from short samples.
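
The idea of freezing the pretrained model and adjusting only a few speaker-specific parameters can be shown with a deliberately tiny toy. Everything below is hypothetical: the one-weight "backbone", the single `speaker_offset` parameter, and the numbers stand in for a real TTS model with millions of weights, and are not how any named product works.

```python
# Frozen "backbone": maps a made-up phoneme feature to an average-voice pitch in Hz.
# These two values play the role of pretrained weights and are never updated.
BACKBONE_WEIGHT = 20.0
BACKBONE_BIAS = 150.0


def synthesize(feature, speaker_offset):
    """Average-voice prediction shifted by one speaker-specific parameter."""
    return BACKBONE_WEIGHT * feature + BACKBONE_BIAS + speaker_offset


def adapt_speaker(examples, steps=200, learning_rate=0.1):
    """Fit only the speaker offset by gradient descent on mean squared error."""
    offset = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the offset alone.
        grad = sum(2 * (synthesize(x, offset) - y) for x, y in examples) / len(examples)
        offset -= learning_rate * grad
    return offset


# Three (feature, observed pitch) pairs from one hypothetical speaker whose
# voice sits 30 Hz above the backbone's average voice.
examples = [(0.0, 180.0), (1.0, 200.0), (2.0, 220.0)]
speaker_offset = adapt_speaker(examples)
```

Because the backbone already covers what voices have in common, a handful of examples is enough to pin down the one thing that differs. That is the intuition behind short samples, scaled down to a single number.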

What you record matters more than how long it is. The model wants varied prosody (questions, statements, slight emotional range) so it can learn how your voice moves between registers. We ask for ninety seconds to two minutes of reading a short, deliberately varied script (a paragraph of declarative sentences, two questions, one quietly emphatic line) in a quiet room. Anything past three minutes is diminishing returns. Anything under sixty seconds and the model becomes slightly more average-sounding: your timbre is captured, but your prosody softens.
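
The length guidance above, together with the 90 to 120 second sweet spot mentioned elsewhere in this piece, can be written down as a small lookup. This is a sketch under stated assumptions: the function name and all five labels are invented for illustration, and the in-between labels ("usable", "fine") are one reading of the gaps the text leaves open.

```python
def assess_sample_length(seconds):
    """Map a recording length in seconds to the rough guidance given in the text."""
    if seconds <= 0:
        raise ValueError("recording length must be positive")
    if seconds < 60:
        # Timbre is captured, but prosody tends to soften toward the average.
        return "too_short"
    if seconds < 90:
        # Assumed label: the text does not name this range explicitly.
        return "usable"
    if seconds <= 120:
        return "sweet_spot"
    if seconds <= 180:
        # Assumed label: longer than needed, but not yet wasted effort.
        return "fine"
    # Past three minutes the extra audio adds little.
    return "diminishing_returns"
```

For example, `assess_sample_length(100)` falls in the sweet spot, while `assess_sample_length(45)` is flagged as too short.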

The safety and ownership decisions

The fear most people bring to voice cloning, before they've spent any time with the technology, is that they'll lose control of their voice. This is the correct fear. It is also a fear that can be designed around if the people building the tool take it seriously.

Here is what we decided, on the record, when we built HerDay's clone:

The sample is destroyed once the model is built. The original audio you record is used to train the speaker-adaptation layer and then deleted. We do not need it after that. Keeping it would be a liability we don't want and a temptation we don't trust.
The model is tied to one account, on our infrastructure. It is not exposed via an API, not made available to other users, not pooled into any cross-user training. The model exists to render audio for you and only for you.
There is no model on your device. What ships to your phone is an MP3 — the rendered audio for that morning. Even if your phone is compromised, the model isn't on it.
You can delete the clone at any time. From the settings screen, with one tap, the model is destroyed on our infrastructure. After that, there is no model anywhere. We don't keep a backup. We don't keep a copy "for analytics."
We will never use your clone for anything you didn't initiate. No marketing audio. No demo reels. No "imagine your voice saying this product is great." The model exists for the morning ritual. That is the only thing it does.
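
The commitments above can be read as rules a clone service has to enforce. Here is a minimal in-memory sketch of those rules, assuming a hypothetical `VoiceCloneService` class; the names, the placeholder "model", and the fake audio bytes are all illustrative and do not describe HerDay's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceCloneService:
    """Toy stand-in for a server-side clone store (hypothetical)."""

    _models: dict = field(default_factory=dict)  # account_id -> opaque model

    def build(self, account_id: str, sample: bytearray) -> None:
        # Placeholder for training: derive an opaque value from the sample.
        self._models[account_id] = ("model", len(sample))
        # The raw sample is wiped once the model exists; nothing else keeps it.
        sample.clear()

    def render(self, requester_id: str, account_id: str, text: str) -> bytes:
        # A model renders audio only for the account that owns it.
        if requester_id != account_id:
            raise PermissionError("a model renders audio only for its owner")
        if account_id not in self._models:
            raise LookupError("no voice model exists for this account")
        # Only rendered audio leaves the service, never the model itself.
        return f"audio:{text}".encode("utf-8")

    def delete(self, account_id: str) -> None:
        # Deletion removes the only copy; there is no backup to fall back on.
        self._models.pop(account_id, None)
```

The design choice worth noticing is what the class does not offer: there is no method that returns the model, and no method that returns the sample.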

These are architecture-level commitments, not feature-list bullets. They are the reason this article can be written without weasel words. The realistic impersonation risks people read about in the news (deepfaked phone calls, voice scams using a relative's voice) come from models built with access to public audio. A model built from ninety seconds you recorded in your bedroom, that lives in our infrastructure tied to one account, that you can delete tonight, is a structurally different artifact.

What the clone can't do — and why that's a feature

There are things a clone of your voice cannot do, and it's worth naming them clearly so the technology isn't oversold.

A clone cannot capture the parts of your voice that change with your state. The clipped intake of breath before you cry, the slight tremor of running on three hours of sleep, the deliberate slowness of someone trying not to interrupt — these are state-dependent variations the model doesn't see in your sample and won't reproduce in its output. For most uses, this is a limitation. For affirmations, it is the central feature.

The voice your clone renders is, deliberately, your voice on the steady day you recorded the sample on. The morning you most need a kinder voice is the morning yours will be least kind. A clone trained on a calmer instance of you gives you yourself back — not the version of yourself currently negotiating with the day, but the version of yourself who has space to address her. This is what Hershfield's future-self continuity work, taken to its audio conclusion, looks like in practice: a steadier voice, in your acoustic signature, speaking to the version of you the morning is currently happening to.

The clone isn't a recording of you. It's a recording of the calmer version of you, on loan for the morning that needs her.
— what the model is for

What HerDay actually does with the clone

The mechanic, end to end, is straightforward. You record ninety to one hundred and twenty seconds during onboarding. The audio is uploaded over an encrypted connection, the model is built server-side, and the sample is deleted. From that point onward, your morning affirmation is written fresh each day, based on what you told us during intake about the season you're in, the inner critic patterns you noticed, and any specific moment you wanted addressed, and rendered through your voice model into a short audio file. You wake up to it. It addresses you by name. The phrasing is conditional where it needs to be: before render, we check every sentence against the Wood 2009 finding that overly positive self-statements can backfire for people with low self-esteem. It is thirty seconds long, or thereabouts.

The next morning, the audio is different. The voice is the same.

That is the entire pitch for voice cloning in self-talk. Not a clever AI trick, not a personalization gimmick — a sustainable way to deliver the audio version of a future-self statement, in the voice the brain most readily files under me, every morning, without you having to sit down and record it on a morning you don't have it in you.
