A podcast that creates itself, in a different voice

- Projects
- AI
- Technology
- 🦉 Longer read
I just started my own podcast. It gets researched, written, and recorded while I sleep. 😴
How do I help a reader (and myself) continue exploring and thinking beyond what is already written? That is something I have been thinking about while building my digital garden, my corner of the web where I collect my notes and reflections.
One way to answer that is my new podcast: In a different voice.
Some of the inspiration came from my experience of using an AI assistant to create super specific travel podcasts. I realized I could automate that for my own notes.
A reflection and expansion on my original thoughts. Created entirely by generative AI.
Here is what actually happens when a new episode of In a Different Voice is created. While I sleep, or do something else entirely.
The process is a simple, fully automated relay race:
🧑‍🔬 The Researcher: An agent breaks down my original note. It uses search grounding to find interesting facts and adjacent topics I missed.
🧑‍💻 The Scriptwriter: A second agent turns the “dry” research into an engaging podcast script, including directions on how to read it.
🎙️ The Voice: A voice model records the script, following the directions with a surprisingly human delivery.
🧑‍💼 The Producer: The audio is mixed with an intro/outro, and published with an AI-generated summary.
The result is a new 5-minute episode of “In a Different Voice”.
The pipeline starts with a query to my CMS, Sanity, where all the notes are stored.
The query finds the most recently published note that does not yet have an episode attached to it. That is the "queue". One note in, one episode out. It runs on a schedule, picking up new notes and backfilling my archive.
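As a minimal sketch, that query could look like this with @sanity/client and GROQ. The schema field names here are illustrative, apart from differentVoice, which is explained further down:

```ts
import { createClient } from "@sanity/client";

const sanity = createClient({
  projectId: "your-project-id", // placeholder
  dataset: "production",
  apiVersion: "2024-01-01",
  useCdn: false,
});

// Newest published note without a differentVoice object attached:
// one note in, one episode out.
const note = await sanity.fetch(
  `*[_type == "note" && !defined(differentVoice)]
    | order(publishedAt desc)[0]{ _id, title, body, publishedAt }`
);
```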
The Researcher
Model: Gemini 3.1 Pro, with Google Search grounding and high-level thinking enabled.
The relay race begins when The Researcher gets the plain text note with the explicit instruction: do not summarize it. Use it as a springboard.
The prompt is structured like a creative brief for a storytelling podcast. For example, it asks the model to find a shocking statistic or unexpected anecdote to open with, a documented counter-intuitive fact that contradicts common sense on the topic, and three deep-dive angles.
The model has the Google Search tool available for research. By searching, it makes connections I never would have made and finds adjacent topics I did not know existed. Running Gemini Pro with high thinking and search grounding can take a minute or two, but the quality is correspondingly high.
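As a rough sketch, the call could look like the following with the @google/genai SDK. The model id, the prompt variables, and the thinking parameter are assumptions, not the production values:

```ts
import { GoogleGenAI } from "@google/genai";

const noteText = "…"; // the plain text note fetched from Sanity
const researchBrief = "Do not summarize the note. Use it as a springboard. …";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const research = await ai.models.generateContent({
  model: "gemini-3.1-pro", // assumed id for the model named above
  contents: `${researchBrief}\n\nOriginal note:\n${noteText}`,
  config: {
    tools: [{ googleSearch: {} }],             // enables search grounding
    thinkingConfig: { thinkingLevel: "high" }, // parameter name is an assumption
  },
});

console.log(research.text);
```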
During prototyping I also tried other models, like Mistral Large, which also has built-in search grounding. However, it produced so many obvious hallucinations that it was not meaningful to continue experimenting with it.
The Scriptwriter
Model: Gemini 3.1 Flash Lite, with medium thinking
The Scriptwriter receives two things: the research and the original note. It uses the research as background, not as a script to convert, and it prioritizes storytelling over coverage. Not everything needs to be covered.
The output format is strict. No headers. No markdown. No lists. Plain spoken sentences, exactly as they will be read aloud. Since the voice model supports voice directions, the script is allowed to include “audio direction tags” like [enthusiastically], [pause], and [thoughtfully]. They describe not just what to say, but how.
The Voice
Model: Gemini 3.1 Flash TTS, main voice “Puck”, outro voice “Kore”
The TTS API does not take a full script. It takes a smaller amount of text and returns raw PCM audio (24 kHz, 16-bit, mono).
The script is chunked into paragraphs and each paragraph is processed individually. The responses stream back as base64-encoded PCM chunks that are concatenated and finally prefixed with a WAV header.
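That last step is plain byte bookkeeping. Here is a sketch of the standard 44-byte RIFF/WAV header for the 24 kHz, 16-bit, mono PCM described above (the chunks variable is a placeholder for the streamed responses):

```ts
// Build a 44-byte WAV header for raw linear PCM.
function wavHeader(pcmBytes: number, sampleRate = 24_000, channels = 1, bitDepth = 16): Buffer {
  const byteRate = sampleRate * channels * (bitDepth / 8);
  const blockAlign = channels * (bitDepth / 8);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmBytes, 4); // total size minus the first 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);           // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);            // format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcmBytes, 40);
  return header;
}

const chunks: string[] = []; // base64-encoded PCM from the TTS responses
const pcm = Buffer.concat(chunks.map((c) => Buffer.from(c, "base64")));
const wav = Buffer.concat([wavHeader(pcm.length), pcm]);
```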
The “audio direction tags” the Scriptwriter added are automatically processed by the TTS model, which acts on them and uses them to adjust the delivery.
The outro sequence wraps up the episode with both credit to me as the original author of the note and a disclosure that everything is AI-generated, using a different voice to signal the shift. The exact same PCM-to-WAV process is used, but only for the template sentence:
“This episode of 'In a different voice' was completely AI generated based on the original note '[title]', written by Kristofer Palmvik."
The Producer
Model: Gemini 3.1 Flash Lite, low thinking
Two smaller tasks run in parallel: generate a title and generate a one-sentence teaser.
The title prompt asks for a curiosity gap in under ten words, active language, and no overused “why” questions. The teaser prompt asks for a single hook sentence designed to drive downloads without over-promising. Neither of these needs much reasoning or creativity. Low thinking, fast output.
Then it is time for the audio engineering: mixing the AI-generated pieces together.
The title music is a pre-recorded WAV that was manually generated once in Gemini Lyra 3 and saved as a fixed asset. It was cut to the desired length in Audacity, since all output from Lyra seems to be 30 seconds and that is far too long.
The episode mix combines the music with the generated episode voice and the outro voice. It adds normalization and fades in and out according to a defined timeline. The ffmpeg filter graph for this is rather complex, but it generates a predictable result. Every episode has a similar sound.
The mix is then exported as a 128k CBR MP3 with ID3 metadata embedded: title, artist, album, date, and a comment that links back to the original note.
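The actual filter graph is longer, but a simplified sketch of the idea, invoked from the TypeScript pipeline, could look like this. File names, timings, and metadata values are illustrative:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Fade the music, normalize the voices, lay them out on a timeline,
// then export a 128k CBR MP3 with ID3 metadata embedded.
await run("ffmpeg", [
  "-i", "title-music.wav",
  "-i", "episode-voice.wav",
  "-i", "outro-voice.wav",
  "-filter_complex", [
    "[0:a]afade=t=in:d=1,afade=t=out:st=8:d=2[music]",
    "[1:a]loudnorm,adelay=4000[voice]",
    "[2:a]loudnorm[outro]",
    "[voice][outro]concat=n=2:v=0:a=1[speech]",
    "[music][speech]amix=inputs=2:duration=longest[mix]",
  ].join(";"),
  "-map", "[mix]",
  "-codec:a", "libmp3lame",
  "-b:a", "128k",
  "-id3v2_version", "3",
  "-metadata", "title=Episode title",
  "-metadata", "artist=In a Different Voice",
  "-metadata", "comment=https://example.com/notes/original-note",
  "episode.mp3",
]);
```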
The exported MP3 is uploaded to Cloudflare R2 object storage, which is similar to AWS S3 but much less expensive and naturally integrated in the Cloudflare Workers environment where most of my website lives.
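A sketch of that upload, using the standard AWS SDK against R2's S3-compatible endpoint; the bucket and key names are placeholders:

```ts
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 exposes an S3-compatible API, so the standard AWS SDK works.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

await r2.send(new PutObjectCommand({
  Bucket: "podcast-episodes",
  Key: "episodes/example-note.mp3",
  Body: await readFile("episode.mp3"),
  ContentType: "audio/mpeg",
}));
```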
The producer then finishes by updating the original note document in Sanity with a differentVoice object. That includes things like the episode title and teaser, the full research output for logging purposes, the full script to be used as the transcription, which models were used at each step, and the length and size of the generated file.
This makes it possible to query for notes that have an episode. If the object is there, the note has an episode. If not, the note is in the queue and will eventually be processed.
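With @sanity/client, that update is a single patch. Reusing the client and note from the first sketch, and with placeholder values for the fields:

```ts
// Attach the differentVoice object; its presence doubles as the "has an episode" flag.
await sanity
  .patch(note._id)
  .set({
    differentVoice: {
      title: "Episode title",
      teaser: "One-sentence teaser",
      research: "full Researcher output, kept for logging",
      script: "full script, reused as the transcription",
      models: {
        researcher: "gemini-3.1-pro",
        scriptwriter: "gemini-3.1-flash-lite",
        voice: "gemini-3.1-flash-tts",
      },
      durationSeconds: 312,
      fileSizeBytes: 4_980_736,
    },
  })
  .commit();
```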
Cloudflare AI Gateway
All the LLM calls route through Cloudflare AI Gateway. This sits in front of the Google AI Studio API and adds caching, rate limiting, retries, and observability. This way I can see exactly how many tokens each step uses, what it costs, and when it fails.
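Pointing the SDK at the gateway is essentially a base URL change; the account and gateway ids below are placeholders:

```ts
import { GoogleGenAI } from "@google/genai";

// Route Google AI Studio traffic through Cloudflare AI Gateway to get
// caching, rate limiting, retries, and per-call token/cost analytics.
const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: {
    baseUrl: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/google-ai-studio",
  },
});
```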
The pipeline also has its own retry logic for the more expensive models. It starts by trying the cheaper "Flex" service tier, with lower priority and lower cost similar to the batch tier. It falls back to the standard tier at normal cost if that fails repeatedly. This is a simple way to reduce cost without having to set up a multi-stage batch process.
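A sketch of that fallback; the tier names mirror the ones above, while the actual call is abstracted away since the exact tier parameter depends on the API:

```ts
// Try the cheap tier a few times with backoff, then pay full price.
async function withTierFallback<T>(
  call: (tier: "flex" | "standard") => Promise<T>,
  flexAttempts = 3,
): Promise<T> {
  for (let attempt = 0; attempt < flexAttempts; attempt++) {
    try {
      return await call("flex");
    } catch {
      // Flex capacity unavailable or rejected; wait briefly and retry.
      await new Promise((resolve) => setTimeout(resolve, 2_000 * (attempt + 1)));
    }
  }
  return call("standard");
}
```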
GitHub Actions workflow
All the steps are orchestrated through one TypeScript file scheduled and run using GitHub Actions. This is usually the simplest way to schedule a background job like this.
The job runs on a standard GitHub-hosted Ubuntu runner, with FFmpeg explicitly installed during environment setup.
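A minimal workflow along those lines could look like this; the schedule, file names, and secrets are illustrative:

```yaml
name: generate-episode
on:
  schedule:
    - cron: "0 3 * * *" # nightly, while I sleep
  workflow_dispatch: {}  # allow manual runs too
jobs:
  episode:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: sudo apt-get update && sudo apt-get install -y ffmpeg
      - run: npm ci
      - run: npx tsx generate-episode.ts
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          SANITY_TOKEN: ${{ secrets.SANITY_TOKEN }}
```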
Putting it all together
Running through all the steps takes around 4 minutes when all models are available and no retries are needed.
The total cost per generated episode is somewhere around $0.05 (around 50 öre), depending on whether the flex tier or the standard tier could be used. The vast majority of that is the cost for The Voice TTS model.
With the assets available in R2 and the metadata attached to the Sanity document, the delivery to the listeners is more or less trivial.
The web frontend running in a Cloudflare Worker has a resource loader that maps the externally visible path to the R2 asset by looking up the note and episode in Sanity. It serves the relevant audio data from the bucket, including range streaming. Cloudflare’s CDN handles the caching, although that is not strictly necessary since both the Worker and R2 are fast and inexpensive to use directly.
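A sketch of that Worker route, with types from @cloudflare/workers-types and an assumed bucket binding named EPISODES. For brevity, the Sanity lookup is replaced by a direct path rewrite:

```ts
export default {
  async fetch(request: Request, env: { EPISODES: R2Bucket }): Promise<Response> {
    // The real loader resolves this key via Sanity; here it is a plain rewrite.
    const key = new URL(request.url).pathname.replace(/^\/audio\//, "episodes/");

    // R2 can parse the Range header itself, which enables seeking in players.
    const object = await env.EPISODES.get(key, { range: request.headers });
    if (!object) return new Response("Not found", { status: 404 });

    const headers = new Headers({ "Content-Type": "audio/mpeg", "Accept-Ranges": "bytes" });
    if (object.range && "offset" in object.range && object.range.offset !== undefined) {
      const offset = object.range.offset;
      const length = object.range.length ?? object.size - offset;
      headers.set("Content-Range", `bytes ${offset}-${offset + length - 1}/${object.size}`);
      return new Response(object.body, { status: 206, headers });
    }
    return new Response(object.body, { status: 200, headers });
  },
};
```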
There is also a podcast-compatible Atom XML feed automatically built from a similar Sanity query. The feed lists all episodes, making it possible to subscribe through a standalone podcast app or another service.
The end result
So, are the generated episodes worth listening to? I am biased, but the quality has surprised me!
The thing that keeps surprising me most is not that it just works, but that the output is good.
I wrote an original note using my own brain, exposing my own thinking. Then these generative AI models work together to create a podcast episode while I sleep.
The Researcher finds connections I would not have found. The Scriptwriter tells stories I would not have told. The Voice delivers them with an energy I could not have predicted (nor matched).
Some episodes even bring out spicy perspectives that contradict my point. Like whether using an AI-generated voice makes us think less about what it actually says.
In the end, the episodes help challenge my thinking and make me learn new things! And the best use of AI is not to replace your thinking, but to challenge and expand it.
I hope this can be valuable to other people too.
A podcast that creates itself, in a different voice was first published 2026‑05‑16
Sections marked as AI-generated Content were generated by one or several AI models.
While it may be entertaining and informative, please be aware that it could possibly contain inaccuracies or fabricated information.