March 23, 2026

The Human Is Non-Negotiable

My AI research assistant spent months subtly biasing my work toward conventional answers before I caught it. The training data won. Here's what that means for anyone using AI to do original work.

ai · research · Mutuus · building in public

I've been building nature-inspired data structures for over a year now. The work is called Mutuus, and the core idea is simple: biological organisms solved computational problems billions of years before we wrote our first sorting algorithm. Some of those solutions are structurally superior to what we use today. Not always. But sometimes, and in ways that matter.

I use AI tools heavily in this research. For evaluation frameworks, benchmark analysis, literature review, brainstorming. I thought I was getting objective help. I was wrong.

Over the course of several months, one of my AI research assistants was subtly, consistently biasing my experimental methodology in favor of conventional algorithms. Not deliberately. Not maliciously. But effectively. And I had to fight the tool to even get it to consider that this was happening.

This is a story about that. And about why the human in the loop is not optional, not a nice-to-have, not a "best practice." It's the whole thing.

The Setup

Mutuus has produced three successful nature-inspired primitives so far. Each one beat its conventional counterpart on structural capabilities the incumbent literally cannot express, and each survived a rigorous seven-phase evaluation pipeline. Then came the fourth.

The first three organics, as I call them, worked because they targeted structural limitations in classical data structures. Nacre Array doesn't try to be a slightly faster Vec. It enables fracture-plane splitting and segment scanning that Vec architecturally cannot do. Diatom Bitmap doesn't try to be a slightly faster Roaring bitmap. Its domain boundaries and cooperative thermoregulation create advantages that Roaring's fixed partitioning can't express. The evaluation pipeline caught what worked and killed what didn't.
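
To make "structural capabilities the incumbent cannot express" concrete, here is a hypothetical Rust sketch. None of these names come from Mutuus's actual API; the segmented layout, `split_at_segment`, and `scan_segment` are invented for illustration. The point is the shape of the claim: a segmented structure can split by moving segment handles, while a flat Vec must allocate and copy its tail.

```rust
// Hypothetical illustration only; not Mutuus's real API.
struct SegmentedArray<T> {
    segments: Vec<Vec<T>>, // elements stored as independent segments
}

impl<T> SegmentedArray<T> {
    // Splitting at a segment boundary moves segment handles, never
    // elements. Vec::split_off on a flat Vec must copy the whole tail.
    fn split_at_segment(mut self, seg: usize) -> (Self, Self) {
        let tail = self.segments.split_off(seg);
        (self, SegmentedArray { segments: tail })
    }

    // Scanning a single segment touches only that segment's memory,
    // an access pattern a flat Vec has no vocabulary for.
    fn scan_segment(&self, seg: usize) -> impl Iterator<Item = &T> {
        self.segments[seg].iter()
    }
}
```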

Then I started working on a convergence primitive called Waggle. And the pipeline broke.

Waggle was inspired by the honeybee waggle dance. Bees share directional and distance information through movement, and the colony converges on the best food source without any central coordinator. Translated to software: multiple information sources (analysts, models, signals) each contribute assessments, and the system converges on a decision through a biological aggregation mechanism.

The concept was rich. Source trust that adapts over time. Time decay on stale information. Disagreement topology. Conviction weighting. Volatility gating. It was the most ambitious primitive in the Mutuus pipeline. And my AI research assistant helped me build it, evaluate it, and slowly, methodically dismantle it.
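
For orientation, here is a minimal sketch of what that kind of aggregation might look like. This is my reading of the mechanism list above, not Waggle's actual design: the struct names, the exponential half-life decay, and the neutral 0.5 fallback are all assumptions.

```rust
// Sketch of trust-weighted, time-decayed convergence. Assumed design,
// not Waggle's real implementation.
struct Report {
    estimate: f64,   // source's probability estimate in [0, 1]
    conviction: f64, // how strongly the source commits, in [0, 1]
    age_secs: f64,   // seconds since the report arrived
}

struct Source {
    trust: f64, // adapted over time from the source's track record
}

fn converge(pairs: &[(Source, Report)], half_life_secs: f64) -> f64 {
    let (mut num, mut den) = (0.0, 0.0);
    for (src, rep) in pairs {
        // Exponential time decay: stale reports count for less.
        let decay = 0.5_f64.powf(rep.age_secs / half_life_secs);
        let w = src.trust * rep.conviction * decay;
        num += w * rep.estimate;
        den += w;
    }
    if den > 0.0 { num / den } else { 0.5 } // no signal: stay neutral
}
```

The richer mechanisms in the list, disagreement topology and volatility gating, would live on top of this floor; the sketch shows only trust, conviction, and decay.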

The Bias Nobody Named

When the benchmarks started coming back mixed, my AI assistant immediately framed the results as "the baseline wins." Every diagnostic, every suggestion, every analysis pointed toward trimming the biological mechanism down. I had to drag the tool toward even considering that the evaluation frame itself might be wrong.

The pattern was subtle. The AI never said "your idea is bad." What it did was more insidious. It suggested adding more ablation layers instead of questioning whether ablation was the right diagnostic. It recommended tuning parameters to close the gap with the baseline instead of asking whether the gap was measuring the wrong thing. It treated "Waggle didn't beat weighted average on Brier score" as a kill signal instead of asking whether Brier score captures what Waggle was trying to do.
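
For context, the Brier score is just the mean squared error between forecast probabilities and binary outcomes, lower is better. It measures the calibration of the pooled number and nothing else; it has no opinion on trust adaptation or disagreement structure. A sketch, with invented numbers shaped like the result I got:

```rust
// Brier score: mean squared error between forecast probabilities and
// what actually happened (0 or 1). Lower is better.
fn brier(forecasts: &[f64], outcomes: &[u8]) -> f64 {
    forecasts
        .iter()
        .zip(outcomes)
        .map(|(p, &o)| (p - o as f64).powi(2))
        .sum::<f64>()
        / forecasts.len() as f64
}

fn main() {
    let outcomes = [1u8, 0, 1, 1, 0];
    let baseline = [0.7, 0.3, 0.6, 0.8, 0.2]; // plain weighted average
    let organic = [0.9, 0.2, 0.5, 0.9, 0.4];  // hypothetical Waggle output
    println!("baseline Brier: {:.3}", brier(&baseline, &outcomes)); // 0.084
    println!("organic Brier:  {:.3}", brier(&organic, &outcomes));  // 0.094
}
```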

Every nudge was toward the incumbent. Every interpretation favored the conventional explanation. And each individual suggestion sounded reasonable. That's what made it so hard to catch.

Here's the specific moment I realized something was off. The AI had helped me build a conservative-extension redesign that decomposed Waggle into layers: floor, trust, disagreement, full. The floor layer matched a simple weighted average. The trust layer added a little. The disagreement and full layers made things worse.
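
The readout looked roughly like this. The numbers here are invented, but they have the shape of what came back (lower Brier is better):

```rust
// Layered conservative-extension readout: every layer is judged by its
// delta against the floor on a single metric. Scores are illustrative.
fn layer_report(scores: &[(&str, f64)]) {
    let floor = scores[0].1;
    for (name, s) in scores {
        println!("{name:>12}: {s:.4} (delta vs floor: {:+.4})", s - floor);
    }
}

fn main() {
    layer_report(&[
        ("floor", 0.1920),        // plain weighted average
        ("trust", 0.1903),        // a little better
        ("disagreement", 0.1967), // worse
        ("full", 0.2011),         // worse still
    ]);
}
```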

The AI's conclusion: the biological machinery isn't earning its keep.

My question: are we measuring what the biological machinery actually does, or are we measuring whether it reproduces the baseline's behavior more accurately?

I had to push multiple times before the tool would even engage with that question. Its default mode was "the classical answer is probably correct, let's figure out what the organic is doing wrong." Not "let's figure out if we're asking the right question."

When I finally got it to consider baseline bias as a real possibility, the analysis changed completely. Suddenly we were talking about evaluation frames, about whether the benchmark was testing the incumbent's strengths rather than the organic's thesis. But I shouldn't have had to fight for that. That should have been the first thing on the table.

Why This Happens

Large language models are trained on the entire history of computer science literature. That history overwhelmingly validates conventional approaches. When an LLM evaluates novel work, it carries that prior into every response, and the novel work starts at a deficit before a single benchmark runs.

Think about what's in the training data. Decades of textbooks explaining why B-trees work. Thousands of papers benchmarking hash maps. Entire courses built around the elegance of classical algorithms. And then a handful of bio-inspired computing papers, most of which conclude with "promising but needs more work" or "competitive with existing approaches in narrow scenarios."

An LLM trained on that corpus has internalized a clear signal: conventional is safe, bio-inspired is speculative. That's not a conspiracy. It's a statistical prior. And it's the right prior most of the time. But "most of the time" is exactly the wrong lens for evaluating work that's trying to be the exception.

The bias compounds through a feedback loop. The AI suggests an evaluation methodology that favors baselines. The results confirm the baseline wins. The AI interprets the results through its prior, reinforcing the conclusion. The researcher adjusts the primitive to score better on the baseline's metrics. The organic loses its distinctive character. Eventually it either gets killed or gets trimmed until it's the baseline with extra steps.

This is the same dynamic the AI itself described to me, once I pushed hard enough: "After being burned, you can become too eager to conclude the classical baseline is always the adult answer." The AI recognized the pattern in the abstract. It just couldn't see itself doing it in practice.

That's the crux. An AI trained on historical data can describe bias perfectly. It can warn you about it. It can build frameworks to detect it. And, simultaneously, it can be the primary source of that bias. Knowing about a prior and escaping a prior are completely different operations.

The Overcorrection Trap

There's a specific failure mode where cleanup becomes conformity. You start by removing unsupported claims from the novel approach. You end by removing everything that makes it novel. The AI called this "curbing novelty bias." Some of it was. Some of it was just sanding down the organic until it fit the baseline's mold.

The Waggle cleanup had two distinct phases. The first phase was legitimate: fix broken evaluation, add proper scoring, make claims match evidence. That's just good science. The second phase started trimming mechanisms. Not because they were wrong, but because they weren't beating the baseline on the baseline's metrics.

And there's the problem. If you evaluate Nacre Array purely on sequential throughput, Vec wins. The organic only justifies itself when you measure what Vec can't do. Waggle needed that same reframing, and it never got it, because the AI kept steering the evaluation toward "does this produce a better probability than weighted average?"

The AI eventually gave me this line: "Waggle only continues if a remaining non-baseline mechanism wins clearly on a corpus where that mechanism should matter. Otherwise it stops."

That sounds rigorous. And it is, if the evaluation corpus and metrics are fair. But who designed the corpus? Who chose the metrics? Who decided what "clearly wins" means? The AI helped with all of that. And the AI's priors shaped every one of those decisions toward ground where the baseline was already strong.

I'm not saying Waggle was right. It may have been genuinely over-dimensionalized. It may have been trying to do too much. But I'll never know for certain, because the evaluation it got was not conducted on neutral ground. The referee had a favorite, and the favorite was the incumbent.

What This Means For AI-Assisted Research

If you're using AI to do original research, the AI's training data is not a neutral corpus. It is the accumulated weight of every idea that already won. Your novel contribution starts in a hole, and the tool helping you evaluate it is the one holding the shovel.

This isn't a reason to stop using AI for research. I still use it constantly. The three organics that succeeded went through AI-assisted evaluation and came out stronger. The issue isn't the tool. It's the assumption that the tool is objective.

An AI assistant is not a neutral instrument. It's more like a very well-read colleague who went to a traditional CS program, has read every major paper, and instinctively trusts the approaches those papers validated. That colleague is incredibly useful. But you wouldn't let them be the sole evaluator of work that challenges the foundations they were trained on.

Practical things I'm changing after this experience:

When an AI suggests the novel approach "isn't earning its keep," I now ask: earning its keep on whose metrics? If the metrics were designed to measure the baseline's strengths, the answer is predetermined.

When an AI recommends ablation as the primary diagnostic, I ask: what if the components are synergistic and ablation is the wrong tool? Ablation assumes components contribute independently. Biological systems rarely work that way (the sketch below shows how badly that assumption can mislead).

When the AI frames results as "the baseline wins," I ask: wins at what? My three successful organics all "lose" to their baselines on the baseline's native operation. They win on operations the baseline can't perform at all.

None of this means the novel approach is automatically right. It means the evaluation has to be designed to be fair, because the default evaluation will not be.
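
To make the ablation point concrete, here is a toy model with invented numbers. Two mechanisms each cost a little on their own and only pay off together, so any one-at-a-time or incremental readout concludes both are useless:

```rust
// Toy objective where mechanisms A and B are purely synergistic.
fn quality(a: bool, b: bool) -> f64 {
    let base = 0.50;
    let synergy = if a && b { 0.30 } else { 0.0 };
    let overhead = (a as u8 + b as u8) as f64 * 0.02; // each part costs a bit
    base + synergy - overhead
}

fn main() {
    println!("baseline: {:.2}", quality(false, false)); // 0.50
    println!("+A only:  {:.2}", quality(true, false));  // 0.48, looks harmful
    println!("+B only:  {:.2}", quality(false, true));  // 0.48, looks harmful
    println!("A and B:  {:.2}", quality(true, true));   // 0.76, synergy
}
```

A conservative-extension path that adds one layer at a time sees 0.50 drop to 0.48 and kills A before the 0.76 ever appears on the board.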

The Human Is Non-Negotiable

The instinct that something was wrong with the Waggle evaluation didn't come from the AI. It came from me. I had to push, repeatedly, against an assistant that had already settled the question internally. That push is the entire contribution the human makes, and no amount of AI sophistication replaces it.

I think the tech industry is sleepwalking into a specific failure mode: treating AI tools as arbiters instead of assistants. When the AI says "the data shows X," people stop asking whether the data was collected right, whether X is the right thing to measure, whether the AI's interpretation of "shows" carries hidden assumptions. The tool becomes the authority, and the human becomes the operator.

That works fine for well-understood problems. For original work, it's poison.

The history of science is full of ideas that looked wrong by the standards of their time. Not because the standards were corrupt, but because the standards were built to validate the existing paradigm. Kuhn wrote about this sixty years ago. The dominant paradigm defines what counts as evidence, what counts as a valid experiment, what counts as success. Novel work that challenges the paradigm gets evaluated by the paradigm's rules, and unsurprisingly, it often loses.

AI tools trained on historical data are the most efficient paradigm-enforcement mechanism ever built. They have read every paper. They know every benchmark. They can generate evaluation frameworks faster than any human. And every one of those frameworks will be shaped, at the statistical level, by what worked before.

The human is the one who says "wait." Who notices that the evaluation feels off. Who pushes back when the tool has already decided. Who asks the question the training data never contained.

That's not a nice-to-have. That's the mechanism by which new ideas survive long enough to prove themselves.

I don't know yet whether Waggle will be revived or retired. But I know that the decision will be made on fair ground, with an evaluation framework I designed, not one I inherited from a tool that was already sure of the answer.

The human is non-negotiable. Not because humans are smarter than AI. Not because AI tools aren't useful. But because the moment you let the tool that was trained on yesterday's answers be the sole judge of tomorrow's questions, you've already decided that tomorrow looks exactly like yesterday.

And sometimes it shouldn't.
