Claude knows FFmpeg, but it has no idea where the video is

By Kristofer Palmvik · (2026‑05‑04)

For fun, I spent last weekend exploring what an agentic video editor would look like: connect Claude Desktop to an MCP server that wraps FFmpeg, and let it edit video autonomously.

The goal was to trim clips, join them, normalize audio, and package them for the web. No timeline. No GUI. Just tell Claude what you want and get the final result back as a link.

FFmpeg is a great stress test for any user. It is famously hostile. Every operation is a command with dozens of flags. Errors go to stderr, mixed with progress output, in a format that occasionally looks like a failure even when everything is fine.

When you see something like this in the output

frame= 240 fps= 22 q=28.0 size= 2048kB time=00:00:08.04 bitrate=2087.0kbits/s speed=0.736x

you might wonder if it worked or broke. But that’s just what a status update looks like.
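
To make that concrete, here is a minimal Python sketch that recognises a status line like the one above and pulls out its fields. The name parse_progress and the rule "a status line carries both frame and time" are assumptions for illustration, not part of FFmpeg or of my server. In practice the process exit code, rather than the text on stderr, is usually the more dependable signal for success or failure.

```python
import re
from typing import Optional

# Matches key=value pairs, tolerating the padding FFmpeg puts after '='
# (for example "frame= 240").
PROGRESS_FIELD = re.compile(r"(\w+)=\s*(\S+)")


def parse_progress(line: str) -> Optional[dict]:
    """Return the fields of a progress line, or None if it is not one."""
    fields = dict(PROGRESS_FIELD.findall(line))
    # Heuristic (assumption): a status update has both a frame and a time.
    if "frame" in fields and "time" in fields:
        return fields
    return None


sample = (
    "frame= 240 fps= 22 q=28.0 size= 2048kB "
    "time=00:00:08.04 bitrate=2087.0kbits/s speed=0.736x"
)
progress = parse_progress(sample)
```

A line that parses this way can be shown as progress instead of being treated as an error.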

My first experiment was to give Claude a raw run_ffmpeg tool through an MCP server. Claude writes the command, the server executes it, and the raw stdout and stderr come back.

I planned to watch it fail, document the failures, and build a self-correcting loop that feeds errors back so Claude could diagnose and retry. I suspected I might see hallucinated flags, confused syntax, and Claude inventing FFmpeg features that don't exist.

But that didn't happen.

What Claude got right

Across four sessions and several different editing tasks, Claude produced zero hallucinated FFmpeg flags. Not a single one.

When I prompted it to “Join these three clips with crossfade transitions”, it correctly reached for xfade and got the offset math right on the very first attempt. It also handled the filter_complex syntax correctly, including xfade, trim, and acrossfade, which is something I have no clue about.
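
For readers who want to see what that offset math involves, here is a small sketch. Each xfade transition starts at the running length of everything joined so far, minus the fade duration. The clip lengths and the one-second fade are made-up numbers, and the helper names are hypothetical. These are not the commands Claude actually wrote.

```python
def xfade_offsets(durations, fade):
    """Offsets in seconds for chaining xfade across consecutive clips."""
    offsets = []
    elapsed = 0.0
    for duration in durations[:-1]:
        # Each transition overlaps the next clip by `fade` seconds.
        elapsed += duration - fade
        offsets.append(elapsed)
    return offsets


def xfade_filter(durations, fade):
    """Build the video part of a filter_complex graph for the chain."""
    parts = []
    previous = "[0:v]"
    for index, offset in enumerate(xfade_offsets(durations, fade), start=1):
        label = f"[v{index}]"
        parts.append(
            f"{previous}[{index}:v]xfade=transition=fade:"
            f"duration={fade:g}:offset={offset:g}{label}"
        )
        previous = label
    return ";".join(parts)


# Illustrative clip lengths in seconds (assumed, not from my sessions).
offsets = xfade_offsets([10.0, 8.0, 6.0], 1.0)
graph = xfade_filter([10.0, 8.0, 6.0], 1.0)
```

With clips of 10, 8, and 6 seconds and a 1 second fade, the transitions start at 9 and 16 seconds. The audio side would need a matching acrossfade chain, which this sketch leaves out.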

When asked to transcode in multiple steps it did that correctly too, and got all the parameters right.

My test file was Big Buck Bunny (2008), which is commonly used for video testing. When its conversion failed because of its uneven pixel dimensions, Claude self-corrected. In the next session it even caught the problem before running the command that would have failed.
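
The failing command is not shown here, so the following is only a sketch of one common cause. It assumes the failure was the familiar one where an H.264 encoder using the yuv420p pixel format rejects an odd width or height. If so, the usual remedy is to round both dimensions down to even numbers, either in code or with a scale filter expression. The helper name and the example dimensions are hypothetical.

```python
def even_dimensions(width, height):
    """Round width and height down to the nearest even number."""
    return width - (width % 2), height - (height % 2)


# The same rounding expressed as an FFmpeg scale filter, where iw and ih
# are the input width and height.
EVEN_SCALE_FILTER = "scale=trunc(iw/2)*2:trunc(ih/2)*2"

# Illustrative dimensions only (assumed, not measured from my test file).
fixed = even_dimensions(853, 480)
```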

It turned out that Claude's mental model of FFmpeg was amazing. However, its understanding of where it was running was not correct.

What Claude got wrong

Claude would constantly assume it was in a specific execution environment with certain directories available.

This happened in every session without exception.

The list_assets tool returns filenames (e.g., “big_buck_bunny_480p_h264.mov”) and Claude would automatically assume this file could be found in an “assets” directory.

After the first attempt failed, Claude would read the error and correctly diagnose it as a path problem. Its clever fix was to reuse the absolute path it had seen earlier in a probe_metadata response. It self-corrected successfully.

But it still hit this failure in each and every session, because nothing in the tool interface told it where the working directory was relative to the asset directory. It constructed an assumption, and that assumption was wrong.
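
One way to remove that guess, short of hiding paths entirely, would be to have the listing tool return absolute paths alongside the filenames. This is a sketch of that idea only. The return shape and the asset_root parameter are assumptions, and this is not how my list_assets tool was written.

```python
from pathlib import Path


def list_assets(asset_root):
    """Hypothetical list_assets variant that also returns absolute paths."""
    return [
        {"name": entry.name, "path": str(entry.resolve())}
        for entry in sorted(Path(asset_root).iterdir())
        if entry.is_file()  # skip subdirectories
    ]
```

With a response like this, the model has nothing to infer about where a file lives relative to the working directory.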

In the same way, Claude assumed a parallel “output” directory would exist. It didn't. When the file failed to open, Claude's next attempt was to create it using mkdir.

The MCP server blocked this, only accepting commands that started with ffmpeg for security reasons. And then Claude adapted, dropped the directory creation, and wrote to a relative path instead. Problem solved.
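
A check like that can be sketched in a few lines of Python. This is an illustration under assumptions, not the server's actual code: the function name is hypothetical, and it tokenises the command with shlex so that the decision is made on the first argument rather than on a raw string prefix.

```python
import shlex


def is_allowed(command):
    """Accept only commands whose first token is exactly 'ffmpeg'."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected outright.
        return False
    return bool(argv) and argv[0] == "ffmpeg"


blocked = is_allowed("mkdir output")
accepted = is_allowed("ffmpeg -i in.mov out.mp4")
```

A check on the first token only helps if the resulting argument list is then executed without a shell. Otherwise a trailing `; rm ...` would still be interpreted.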

Claude's failures weren't about a lack of knowledge about FFmpeg. They were about flawed inference and assumptions.

These are entirely reasonable assumptions to make, both for a human and an LLM. They just happened to be wrong here.

Abstraction beats self-correction

So the next step was to abstract this further.

My current experiment exposes only abstracted tools instead of direct FFmpeg access. There are no paths anywhere. Claude passes an asset key and a time range, and, eventually, a structured set of editing directions.

The server handles the FFmpeg invocation, the output location, and the verification. Claude gets back a job ID and, eventually, a public URL.

There is nothing to infer incorrectly about the filesystem because the filesystem is no longer visible. If you prevent the model from making an incorrect assumption, it doesn't need to recover from one.
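
As a rough sketch of what such a tool could look like on the server side, the function below turns an asset key and a time range into an FFmpeg argument list and a job ID. Everything here is hypothetical: the names plan_trim and TrimJob, the asset table, the output directory, and the choice of -ss and -to for trimming are assumptions for illustration. The real server also runs and verifies the job, which this sketch leaves out.

```python
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TrimJob:
    job_id: str
    argv: list


def plan_trim(assets, output_dir, asset_key, start, end):
    """Resolve the key server-side so the model never handles a path."""
    if asset_key not in assets:
        raise KeyError(f"unknown asset key: {asset_key}")
    if not 0 <= start < end:
        raise ValueError("time range must satisfy 0 <= start < end")
    job_id = uuid.uuid4().hex
    output = Path(output_dir) / f"{job_id}.mp4"
    argv = [
        "ffmpeg", "-i", str(assets[asset_key]),
        "-ss", f"{start:g}", "-to", f"{end:g}",
        str(output),
    ]
    return TrimJob(job_id=job_id, argv=argv)


# Hypothetical asset table and output location.
assets = {"bunny": Path("/srv/media/big_buck_bunny_480p_h264.mov")}
job = plan_trim(assets, Path("/srv/out"), "bunny", 2, 8.5)
```

The model only ever sees "bunny", the two timestamps, and the returned job ID. Both the input path and the output path stay on the server.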

Where it gets complicated (AI-generated Content)

Tensions and contradictions that this note never fully resolves. What did I miss or avoid in my writing? What things are worth thinking about further?

Consider: By hiding the filesystem to prevent incorrect assumptions, we trade Claude's diagnostic capability for a locked-box interface, potentially sacrificing the model's ability to debug complex, non-standard edge cases that the abstraction layer might not anticipate.
Consider: The transition from Claude correctly inferring intricate FFmpeg syntax to needing a shielded environment suggests that the model's competence is inversely proportional to its visibility into the execution context, raising the question of whether we are building smarter agents or merely more restrictive sandboxes.

Identified by Google Gemini 3.1 Flash Lite

In a Different Voice (AI-generated Content)

Listen to a reflection and expansion on this note, in a different voice

We Must Blindfold AI To Make It Useful

We are discovering that the only way to make AI truly reliable is to keep it blindfolded, shielding it from the chaotic, illogical way we organize our own computers.

Transcript of the episode

Imagine you’ve hired a genius assistant. They’re a polyglot, a master of every arcane language ever written, and they can solve a calculus equation on a napkin while blindfolded. But there’s a catch. If you ask them to go to the kitchen, find the drawer marked "cutlery," and pull out a spoon, they’ll stand in the middle of your living room and start screaming because they’ve convinced themselves the spoon is inside a folder on their own desktop.

[pause] This is essentially the state of artificial intelligence right now. And it’s leading us to a realization that is, frankly, a little bit embarrassing for us humans. I was reading a note the other day from someone who decided to build an automated video editor. They wanted to see if they could get an AI to use FFmpeg. Now, if you’ve never heard of FFmpeg, just know that it is a piece of software that is notoriously, aggressively hostile to human beings. It’s a command-line tool that looks like it was written by an angry wizard in 1982. It has dozens of flags, weird syntax, and it spits out logs that look like a server is having a nervous breakdown.

Most human developers spend their lives looking up FFmpeg tutorials on forums, praying they don’t break their video files. But this person gave the task to an AI, and the AI? It didn't blink. It nailed the math, it handled the complex transitions, it even self-corrected its own syntax errors. It was an absolute prodigy.

But then, the AI hit a wall. A physical, spatial, conceptual wall. It couldn’t find the video file. It kept looking in an "assets" folder that didn't exist. It kept trying to create an "output" directory in places it wasn't allowed to go. It wasn't failing because it was stupid. It was failing because it was, in a sense, too smart for its own good. It had spent its entire life reading the internet, reading GitHub repositories and coding forums where developers constantly talk about "assets" and "output" folders. It had built a perfect linguistic map of how code should look. It just didn't realize that, unlike the code it read, the computer it was currently living inside didn't actually follow that map.

[thoughtfully] It’s the classic Map versus Territory problem. The AI knows the *script* of the computer, but it has no *embodied* sense of the machine. And here is where it gets interesting. The obvious solution would be to teach the AI better, right? Give it a map. Give it a tour of the hard drive. Tell it exactly where the folders are. But that turns out to be the wrong move. When researchers try to give AIs full access to the operating system, their failure rate skyrockets. They get confused, they hallucinate, they trip over their own digital shoelaces. The breakthrough, the thing that actually makes this work, isn't giving the AI more freedom. It’s blindfolding it.

[warmly] Think back to 2005. Steve Jobs was obsessed with the fact that the moment a computer user had to look at a file system, with its nested folders and directories, the learning curve hit a cliff. It was too much for the human brain to manage. So, with the iPhone, Apple simply deleted it. They hid the file system. They gave us apps instead. They abstracted the mess away. And now, twenty years later, we are doing the exact same thing for the machines. We are building a secondary, invisible internet just for them. We’re stripping away the folders, the paths, the directories, and giving them clean, curated API keys. We are effectively turning our computers into a walled garden, not because we want to protect ourselves, but because the machines are so overwhelmed by our messy, human way of organizing the world that we have to simplify it just to keep them from crashing.

[pause] There’s a deep irony here. We spent decades creating graphical metaphors (folders, desktops, trash cans) to make computers easier for *us* to understand. And now, we’re finding that those very metaphors are the things that confuse the machines the most. The AI can handle the machine-code, the brutal, unadorned logic of FFmpeg, perfectly. It’s the human-made folder that breaks it. The person who wrote that note eventually gave up on letting the AI navigate the file system. They built an abstraction layer. They hid the computer from the AI, and suddenly, everything worked.

It makes you wonder about the future we’re building. We’re creating these incredibly powerful agents, but to make them reliable, we have to keep them in a kind of sensory deprivation chamber. We’re building an invisible layer of abstraction between the machine and the world. [thoughtfully] It’s a strange thought: The more autonomous we want our tools to become, the more we have to hide the reality of how they work from them. We are building a world that is becoming increasingly automated, but perhaps, also increasingly opaque. We’re solving the problem of AI reliability by making sure the AI never actually sees the mess we’ve made. Maybe the final step of the digital revolution isn’t teaching the computer how to think like us. It’s admitting that our way of organizing the world is so chaotic that, for the machines to be useful, they’re better off if they never have to see it at all.

Research by Google Gemini 3.1 Pro, script written by Google Gemini 3.1 Flash Lite, and read by Google Gemini 3.1 Flash TTS.

AI-generated Content

Sections marked as AI-generated Content were generated by one or several AI models.

While it may be entertaining and informative, please be aware that it could possibly contain inaccuracies or fabricated information.