A podcast that creates itself, in a different voice

- Projects
- AI
- Technology
- 🦉 Longer read
I just started my own podcast. It gets researched, written, and recorded while I sleep. 😴
How do I help a reader (and myself) continue exploring and thinking beyond what is already written? That is something I have been thinking about while building my digital garden, my corner of the web where I collect my notes and reflections.
One way to answer that is my new podcast: In a different voice.
Some of the inspiration came from my experience of using an AI assistant to create super specific travel podcasts. I realized I could automate that for my own notes.
A reflection and expansion on my original thoughts. Created entirely by generative AI.
Here is what actually happens when a new episode of In a Different Voice is created. While I sleep, or do something else entirely.
The process is a simple, fully automated relay race:
🧑‍🔬 The Researcher: An agent breaks down my original note. It uses search grounding to find interesting facts and adjacent topics I missed.
🧑‍💻 The Scriptwriter: A second agent turns the “dry” research into an engaging podcast script, including directions on how to read it.
🎙️ The Voice: A voice model records the script, following the directions with a surprisingly human delivery.
🧑‍💼 The Producer: The audio is mixed with an intro/outro, and published with an AI-generated summary.
The result is a new 5-minute episode of “In a Different Voice”.
The pipeline starts with a query to my CMS, Sanity, where all the notes are stored.
The query finds the most recently published note that does not yet have an episode attached to it. That is the "queue". One note in, one episode out. It runs on a schedule, picking up new notes and backfilling my archive.
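As a minimal sketch, that query could look like this with @sanity/client and GROQ. The schema field names here are illustrative, apart from differentVoice, which is explained further down:

```ts
import { createClient } from "@sanity/client";

const sanity = createClient({
  projectId: "your-project-id", // placeholder
  dataset: "production",
  apiVersion: "2024-01-01",
  useCdn: false,
});

// Newest published note without a differentVoice object attached:
// one note in, one episode out.
const note = await sanity.fetch(
  `*[_type == "note" && !defined(differentVoice)]
    | order(publishedAt desc)[0]{ _id, title, body, publishedAt }`
);
```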
The Researcher
Model: Gemini 3.1 Pro, with Google Search grounding and high-level thinking enabled.
The relay race begins when The Researcher gets the plain text note with the explicit instruction: do not summarize it. Use it as a springboard.
The prompt is structured like a creative brief for a storytelling podcast. For example, it asks the model to find a shocking statistic or unexpected anecdote to open with, a documented counter-intuitive fact that contradicts common sense on the topic, and three deep-dive angles.
The model has the Google Search tool available for research. By searching, it makes connections I never would have made and finds adjacent topics I did not know existed. Running Gemini Pro with high thinking and search grounding can take a minute or two, but the quality is correspondingly high.
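As a rough sketch, the call could look like the following with the @google/genai SDK. The model id, the prompt variables, and the thinking parameter are assumptions, not the production values:

```ts
import { GoogleGenAI } from "@google/genai";

const noteText = "…"; // the plain text note fetched from Sanity
const researchBrief = "Do not summarize the note. Use it as a springboard. …";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const research = await ai.models.generateContent({
  model: "gemini-3.1-pro", // assumed id for the model named above
  contents: `${researchBrief}\n\nOriginal note:\n${noteText}`,
  config: {
    tools: [{ googleSearch: {} }],             // enables search grounding
    thinkingConfig: { thinkingLevel: "high" }, // parameter name is an assumption
  },
});

console.log(research.text);
```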
During prototyping I also tried other models, like Mistral Large, which also has built-in search grounding. However, it produced so many obvious hallucinations that it was not meaningful to continue experimenting with it.
The Scriptwriter
Model: Gemini 3.1 Flash Lite, with medium thinking
The Scriptwriter receives two things: the research and the original note. It uses the research as background, not as a script to convert, and it prioritizes storytelling over coverage. Not everything needs to be covered.
The output format is strict. No headers. No markdown. No lists. Plain spoken sentences, exactly as they will be read aloud. Since the voice model supports voice directions, the script is allowed to include “audio direction tags” like [enthusiastically], [pause], and [thoughtfully]. They describe not just what to say, but how.
The Voice
Model: Gemini 3.1 Flash TTS, main voice “Puck”, outro voice “Kore”
The TTS API does not take a full script. It takes a smaller amount of text and returns raw PCM audio (24 kHz, 16-bit, mono).
The script is chunked into paragraphs and each paragraph is processed individually. The responses stream back as base64-encoded PCM chunks that are concatenated and finally prefixed with a WAV header.
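That last step is plain byte bookkeeping. Here is a sketch of the standard 44-byte RIFF/WAV header for the 24 kHz, 16-bit, mono PCM described above (the chunks variable is a placeholder for the streamed responses):

```ts
// Build a 44-byte WAV header for raw linear PCM.
function wavHeader(pcmBytes: number, sampleRate = 24_000, channels = 1, bitDepth = 16): Buffer {
  const byteRate = sampleRate * channels * (bitDepth / 8);
  const blockAlign = channels * (bitDepth / 8);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcmBytes, 4); // total size minus the first 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);           // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);            // format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcmBytes, 40);
  return header;
}

const chunks: string[] = []; // base64-encoded PCM from the TTS responses
const pcm = Buffer.concat(chunks.map((c) => Buffer.from(c, "base64")));
const wav = Buffer.concat([wavHeader(pcm.length), pcm]);
```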
The “audio direction tags” the Scriptwriter added are automatically processed by the TTS model, which acts on them and uses them to adjust the delivery.
The outro sequence wraps up the episode with both credit to me as the original author of the note and a disclosure that everything is AI-generated, using a different voice to signal the shift. The exact same PCM-to-WAV process is used, but only for the template sentence:
“This episode of 'In a different voice' was completely AI generated based on the original note '[title]', written by Kristofer Palmvik."
The Producer
Model: Gemini 3.1 Flash Lite, low thinking
Two smaller tasks run in parallel: generate a title and generate a one-sentence teaser.
The title prompt asks for a curiosity gap in under ten words, active language, and no overused “why” questions. The teaser prompt asks for a single hook sentence designed to drive downloads without over-promising. Neither of these needs much reasoning or creativity. Low thinking, fast output.
Then it is time for the audio engineering: mixing the AI-generated pieces together.
The title music is a pre-recorded WAV that was manually generated once in Gemini Lyra 3 and saved as a fixed asset. It was cut to the desired length in Audacity, since all output from Lyra seems to be 30 seconds and that is far too long.
The episode mix combines the music with the generated episode voice and the outro voice. It adds normalization and fades in and out according to a defined timeline. The ffmpeg filter graph for this is rather complex, but it generates a predictable result. Every episode has a similar sound.
The mix is then exported as a 128k CBR MP3 with ID3 metadata embedded: title, artist, album, date, and a comment that links back to the original note.
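The actual filter graph is longer, but a simplified sketch of the idea, invoked from the TypeScript pipeline, could look like this. File names, timings, and metadata values are illustrative:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Fade the music, normalize the voices, lay them out on a timeline,
// then export a 128k CBR MP3 with ID3 metadata embedded.
await run("ffmpeg", [
  "-i", "title-music.wav",
  "-i", "episode-voice.wav",
  "-i", "outro-voice.wav",
  "-filter_complex", [
    "[0:a]afade=t=in:d=1,afade=t=out:st=8:d=2[music]",
    "[1:a]loudnorm,adelay=4000[voice]",
    "[2:a]loudnorm[outro]",
    "[voice][outro]concat=n=2:v=0:a=1[speech]",
    "[music][speech]amix=inputs=2:duration=longest[mix]",
  ].join(";"),
  "-map", "[mix]",
  "-codec:a", "libmp3lame",
  "-b:a", "128k",
  "-id3v2_version", "3",
  "-metadata", "title=Episode title",
  "-metadata", "artist=In a Different Voice",
  "-metadata", "comment=https://example.com/notes/original-note",
  "episode.mp3",
]);
```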
The exported MP3 is uploaded to Cloudflare R2 object storage, which is similar to AWS S3 but much less expensive and naturally integrated in the Cloudflare Workers environment where most of my website lives.
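A sketch of that upload, using the standard AWS SDK against R2's S3-compatible endpoint; the bucket and key names are placeholders:

```ts
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 exposes an S3-compatible API, so the standard AWS SDK works.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

await r2.send(new PutObjectCommand({
  Bucket: "podcast-episodes",
  Key: "episodes/example-note.mp3",
  Body: await readFile("episode.mp3"),
  ContentType: "audio/mpeg",
}));
```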
The producer then finishes by updating the original note document in Sanity with a differentVoice object. That includes things like the episode title and teaser, the full research output for logging purposes, the full script to be used as the transcription, which models were used at each step, and the length and size of the generated file.
This makes it possible to query for notes that have an episode. If the object is there, the note has an episode. If not, the note is in the queue and will eventually be processed.
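With @sanity/client, that update is a single patch. Reusing the client and note from the first sketch, and with placeholder values for the fields:

```ts
// Attach the differentVoice object; its presence doubles as the "has an episode" flag.
await sanity
  .patch(note._id)
  .set({
    differentVoice: {
      title: "Episode title",
      teaser: "One-sentence teaser",
      research: "full Researcher output, kept for logging",
      script: "full script, reused as the transcription",
      models: {
        researcher: "gemini-3.1-pro",
        scriptwriter: "gemini-3.1-flash-lite",
        voice: "gemini-3.1-flash-tts",
      },
      durationSeconds: 312,
      fileSizeBytes: 4_980_736,
    },
  })
  .commit();
```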
Cloudflare AI Gateway
All the LLM calls route through Cloudflare AI Gateway. This sits in front of the Google AI Studio API and adds caching, rate limiting, retries, and observability. This way I can see exactly how many tokens each step uses, what it costs, and when it fails.
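Pointing the SDK at the gateway is essentially a base URL change; the account and gateway ids below are placeholders:

```ts
import { GoogleGenAI } from "@google/genai";

// Route Google AI Studio traffic through Cloudflare AI Gateway to get
// caching, rate limiting, retries, and per-call token/cost analytics.
const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: {
    baseUrl: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/google-ai-studio",
  },
});
```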
The pipeline also has its own retry logic for the more expensive models. It starts by trying the cheaper "Flex" service tier, with lower priority and lower cost similar to the batch tier. It falls back to the standard tier at normal cost if that fails repeatedly. This is a simple way to reduce cost without having to set up a multi-stage batch process.
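A sketch of that fallback; the tier names mirror the ones above, while the actual call is abstracted away since the exact tier parameter depends on the API:

```ts
// Try the cheap tier a few times with backoff, then pay full price.
async function withTierFallback<T>(
  call: (tier: "flex" | "standard") => Promise<T>,
  flexAttempts = 3,
): Promise<T> {
  for (let attempt = 0; attempt < flexAttempts; attempt++) {
    try {
      return await call("flex");
    } catch {
      // Flex capacity unavailable or rejected; wait briefly and retry.
      await new Promise((resolve) => setTimeout(resolve, 2_000 * (attempt + 1)));
    }
  }
  return call("standard");
}
```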
GitHub Actions workflow
All the steps are orchestrated through one TypeScript file scheduled and run using GitHub Actions. This is usually the simplest way to schedule a background job like this.
The job runs on a standard GitHub-hosted Ubuntu runner, with FFmpeg explicitly installed during environment setup.
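A minimal workflow along those lines could look like this; the schedule, file names, and secrets are illustrative:

```yaml
name: generate-episode
on:
  schedule:
    - cron: "0 3 * * *" # nightly, while I sleep
  workflow_dispatch: {}  # allow manual runs too
jobs:
  episode:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: sudo apt-get update && sudo apt-get install -y ffmpeg
      - run: npm ci
      - run: npx tsx generate-episode.ts
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          SANITY_TOKEN: ${{ secrets.SANITY_TOKEN }}
```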
Putting it all together
Running through all the steps takes around 4 minutes when all models are available and no retries are needed.
The total cost per generated episode is somewhere around $0.05 (around 50 öre), depending on whether the flex tier or the standard tier could be used. The vast majority of that is the cost for The Voice TTS model.
With the assets available in R2 and the metadata attached to the Sanity document, the delivery to the listeners is more or less trivial.
The web frontend running in a Cloudflare Worker has a resource loader that maps the externally visible path to the R2 asset by looking up the note and episode in Sanity. It serves the relevant audio data from the bucket, including range streaming. Cloudflare’s CDN handles the caching, although that is not strictly necessary since both the Worker and R2 are fast and inexpensive to use directly.
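A sketch of that Worker route, with types from @cloudflare/workers-types and an assumed bucket binding named EPISODES. For brevity, the Sanity lookup is replaced by a direct path rewrite:

```ts
export default {
  async fetch(request: Request, env: { EPISODES: R2Bucket }): Promise<Response> {
    // The real loader resolves this key via Sanity; here it is a plain rewrite.
    const key = new URL(request.url).pathname.replace(/^\/audio\//, "episodes/");

    // R2 can parse the Range header itself, which enables seeking in players.
    const object = await env.EPISODES.get(key, { range: request.headers });
    if (!object) return new Response("Not found", { status: 404 });

    const headers = new Headers({ "Content-Type": "audio/mpeg", "Accept-Ranges": "bytes" });
    if (object.range && "offset" in object.range && object.range.offset !== undefined) {
      const offset = object.range.offset;
      const length = object.range.length ?? object.size - offset;
      headers.set("Content-Range", `bytes ${offset}-${offset + length - 1}/${object.size}`);
      return new Response(object.body, { status: 206, headers });
    }
    return new Response(object.body, { status: 200, headers });
  },
};
```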
There is also a podcast-compatible Atom XML feed automatically built from a similar Sanity query. The feed lists all episodes, making it possible to subscribe through a standalone podcast app or another service.
The end result
So, are the generated episodes worth listening to? I am biased, but the quality has surprised me!
The thing that keeps surprising me most is not that it just works, but that the output is good.
I wrote an original note using my own brain, exposing my own thinking. Then these generative AI models work together to create a podcast episode while I sleep.
The Researcher finds connections I would not have found. The Scriptwriter tells stories I would not have told. The Voice delivers them with an energy I could not have predicted (nor matched).
Some episodes even bring out spicy perspectives that contradict my point. Like whether using an AI-generated voice makes us think less about what it actually says.
In the end, the episodes help challenge my thinking and make me learn new things! And the best use of AI is not to replace your thinking, but to challenge and expand it.
I hope this can be valuable to other people too.
A podcast that creates itself, in a different voice was first published 2026‑05‑16
Sections marked as AI-generated Content were generated by one or several AI models.
While it may be entertaining and informative, please be aware that it could possibly contain inaccuracies or fabricated information.