Voice entry is the one place your words leave your phone. Here’s exactly what happens to them.

Almost everything in Carlo is built so your data never leaves your control in a form anyone could read. Your transactions, notes, photos, and reflections are locked field by field with a key that only you hold, and I wrote about how that works in a separate piece. Voice entry is the exception to all of it, and this is the whole of what happens in the few seconds between tapping the button and seeing your transaction appear.

When you enter a transaction manually, the data never leaves your phone as anything but ciphertext. Voice can’t work that way. A recording of your voice has to be turned into words by something that understands speech, and those words have to be turned into a transaction by something that understands language. Neither of those things lives on your phone. They’re too large, and they change too often, to ride around in your pocket. So for a moment, when you speak instead of type, your words travel.

What actually happens

You tap the button and say “I spent eight dollars on coffee at Starbucks and I paid with the Starbucks app.” Your phone records a few seconds of audio and sends it to Carlo’s server. The server doesn’t keep it. It hands the audio to two specialists in turn, passes their answer back to you, and forgets the whole exchange.

The first specialist transcribes. It’s an open speech model called Whisper, running on a fast service called Groq, and its only job is to hear the audio and write down the words. It’s configured to keep nothing. The audio passes through, becomes text, and is not retained there.

The second specialist reads. It’s a small, fast version of Claude, the language model made by Anthropic, and it takes the plain sentence you spoke and works out the pieces: the amount, who you paid, how you paid, which category it belongs in. It hands back those structured fields and nothing else.
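For readers who think in code, "structured fields and nothing else" can be sketched roughly like this. It is only an illustration, not Carlo's actual code: the field names, the JSON reply shape, and the `parse_model_reply` helper are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ParsedTransaction:
    """Hypothetical shape of the fields handed back to the phone."""
    amount: Decimal
    payee: str
    payment_method: str
    category: str


def parse_model_reply(reply_json: str) -> ParsedTransaction:
    """Keep only the four expected fields; anything else in the reply is dropped.

    Assumes (for illustration) that the language model answers with a JSON object.
    """
    data = json.loads(reply_json)
    return ParsedTransaction(
        amount=Decimal(str(data["amount"])),
        payee=str(data["payee"]),
        payment_method=str(data["payment_method"]),
        category=str(data["category"]),
    )
```

The design point is that the card is built from a fixed set of named fields, so a stray transcript or comment in the reply has nowhere to go.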

Then your phone has a filled-in card waiting for you. Nothing has been saved yet. You can fix anything that’s wrong. When you tap Save, the transaction enters the same locked world as everything else in Carlo, encrypted with your key before it’s stored. By then the audio has already been deleted from Carlo’s server, wiped the moment the round trip finished, whether it succeeded or failed.
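The "wiped whether it succeeded or failed" guarantee is the kind of thing a try/finally block expresses well. Here is a minimal sketch under stated assumptions: the audio is held in a temporary file for the length of the request, and `transcribe` and `parse` are hypothetical stand-ins for the two specialists. None of these names come from Carlo's code.

```python
import os
import tempfile
from typing import Callable, Dict, List


def handle_voice_request(
    audio: bytes,
    hints: List[str],
    transcribe: Callable[[str, List[str]], str],  # hypothetical: audio path + hints -> text
    parse: Callable[[str], Dict[str, str]],       # hypothetical: text -> structured fields
) -> Dict[str, str]:
    """Run one round trip and remove the audio no matter how it ends."""
    fd, path = tempfile.mkstemp(suffix=".audio")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio)
        # The transcript lives only in this local variable; it is never written anywhere.
        transcript = transcribe(path, hints)
        return parse(transcript)
    finally:
        # Runs on success and on failure alike.
        if os.path.exists(path):
            os.remove(path)
```

Whatever the real mechanism is, the shape is the same: the cleanup sits where an error in either specialist cannot skip it.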

The part I won’t skip

There’s one more detail a careful reader would want. To help the transcriber hear you correctly, Carlo sends along a short list of your most frequent payees and your category names with the audio. It’s the difference between the service guessing at a sound and knowing that when you say the name of your coffee place, you mean the one you go to twice a week. Those names go out in readable form for the length of the request, and they come back into the locked world the moment you save.
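To make "a short list of your most frequent payees and your category names" concrete, here is one way such a list could be assembled. It is a sketch only: the `build_hints` function, the cutoff of ten payees, and the idea that the list is a flat sequence of names are assumptions for illustration, not a description of how Carlo builds or sends it.

```python
from collections import Counter
from typing import Iterable, List


def build_hints(payees: Iterable[str], categories: Iterable[str], top_n: int = 10) -> List[str]:
    """Return the top_n most frequent payees followed by the category names.

    `payees` is one entry per past transaction; duplicates are what make a payee frequent.
    """
    counts = Counter(payees)
    frequent = [name for name, _ in counts.most_common(top_n)]

    hints: List[str] = []
    seen = set()
    for name in frequent + list(categories):
        if name not in seen:  # avoid sending the same name twice
            seen.add(name)
            hints.append(name)
    return hints
```

Capping the list keeps the readable part of the request small: only the names most likely to be spoken go out, not the full history.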

That’s a real trade, and the word “encrypted” shouldn’t do quiet work it hasn’t earned. For the seconds your voice is being understood, some of your data is readable by the two services doing the understanding. That is the cost of voice, and it only applies when you tap the voice button. Type the transaction instead and none of it happens.

What stays behind is simpler to state. The audio is not kept on Carlo’s servers once the request ends. The transcript is never written to any table. There’s no log of the words you spoke sitting in a database somewhere with your name on it. The only record that the coffee happened is the transaction itself, encrypted like all the others. If you went looking through everything Carlo keeps about you, you would not find the sentence you said. You’d find the amount, the payee, and the category, all of it ciphertext, locked with your key like everything else.

What this protects, and what it doesn’t

Voice is the one feature where your data leaves the phone in a form a person could read. Two companies see a few seconds of it. Groq is set up to keep none of it. Anthropic doesn’t use what it sees to train its models, and holds it only briefly, for catching abuse, before deleting it. That’s good, and it isn’t the same as nothing.

If that boundary bothers you, you never have to cross it. Manual entry stays sealed end to end, the way the rest of Carlo does, and it always will. Nothing leaves your phone in readable form unless you tap the voice button to send it. Both doors are yours. I only want you to know which one you’re walking through.

The reason voice exists at all isn’t speed for its own sake. It’s that the easier it is to catch the moment you spent the money, the more likely you are to actually catch it, and catching it is the whole practice. A minute a day only works if the minute is light enough to keep. Voice makes it lighter.