Building your AI Engineering Scaffolding

Arjit Jaiswal
ai-engineering, vibe-coding, vibe-engineering, coding-agents

AI Software Engineers are here.

This past June, I led a training with that tagline. Yet just four months earlier, in February, I was a strong skeptic.

I should clarify – an AI Software Engineer is not an AI-enabled IDE. When I say AI SWE, I mean a tool that I can assign a task to, go work on something else, and eventually come back to a ready-for-review PR with the implementation, tests, and all CI checks passing.

By February, I’d been using the same tools everyone else had. I’d tried all of GitHub Copilot, Cursor, and Windsurf. I’d babysat sessions of Claude Code. I’d copy-pasted entire files into Claude/Gemini/ChatGPT more times than I’d like to admit.

Yet I’d had these persistent issues that caused persistent doubts:

  • They produce wrong answers
  • They don’t know how to run X, compile Y, test Z
  • They spiral down the wrong rabbit hole
  • They make the same mistakes over and over again
  • They can’t handle large codebases
  • They get confused the longer a task goes on
  • They don’t follow patterns from other parts of the codebase

These issues were occurring with tools that were operating under my supervision – how could an autonomous tool possibly navigate these?

What I didn’t internalize at the time was that all of these issues boil down to the same thing: guardrails. And thankfully, some very smart people have been working through exactly that. AI Engineering workflows today are almost entirely a navigation of those guardrails.

Organizations are starting to believe too. It takes just one or two good experiences for anyone, let alone an executive, to start dreaming about the possibilities. Backlogs that disappear. Tickets handled by the customer support team. UI polish becoming easier than ever. Fast migrations/framework upgrades. And with all that excitement, the bureaucracy around vendor decisions and organizational standards has begun.

A lot of stock is put into that vendor decision. Cost analyses are done, bake-offs are run, PR acceptance rates are analyzed, and vendor AEs come in to explain why their AI SWE is correct more often than anyone else’s.

Organizations are asking the wrong question

To me, that vendor decision effort is misguided. It’s like spending months debating which region to host your AWS resources in. Sure, you should put some thought into having your resources located close to your customers. But more importantly, you should be spending most of your time making sure that all your infrastructure-as-code, CI, application code, and monitoring is set up such that you can easily switch or add more regions as you go – or even new cloud providers entirely.

That flexibility is super important. Your customer base’s location heatmap may change or grow over time. They may have specific requirements for regions or cloud providers. Or a given region/provider may support a new service that your application needs a year or two from now.

All that’s to say, the initial choice is less important than laying a flexible foundation for the future. The same is true for AI SWEs – except requirements and the state of the art change every week, not every year.

What AI SWEs actually need

Listen, if I brought in an engineer tomorrow, gave them our main repo, and told them to complete tasks – all the while not giving them the ability to run the app locally, run tests, read docs, see GitHub, or ask anyone questions – and then I also reset their brain every time they started a new task – they’d be horrible!

All those issues with AI tools I mentioned in the intro? They’re no different than what a new engineer might face. If we want AI SWEs to act like real SWEs, we need to solve for the same things we would for a real engineer:

  • They need to be able to read documentation
  • They need to be able to run tests
  • They need to be able to run the app locally
  • They probably need seed data for their local app installation to make sense
  • They need to be able to run lint checking
  • They need to be able to read past commits
  • They need to be able to see and react to CI checks for the code they’re pushing up
  • They need to be able to receive feedback as they’re working and take it in stride
  • They need to be able to spin up review apps so their colleagues can click-test their work
  • They need to be able to read comments on their PRs, respond to them, and make necessary changes

And most importantly, they need to be able to learn! The “reset their brain” point above is the most important. Everyone you hire (ideally) learns from their mistakes. Learns from their PR reviews. Learns from reading the codebase. Learns from being course-corrected by their mentor as they’re working on a task.

AI Engineering simply cannot work if it does not iterate and learn. Guardrail building is exactly that: giving the agent the feedback loops it needs to iterate and the memory it needs to learn. These guardrails are enforced by roughly the same workflow across all of the AI SWE solutions:

  1. Load up a virtual remote environment (using instructions provided by a human)
  2. Plan out the task based on task analysis + prior knowledge
  3. Implement changes
  4. Iterate based on tests/linting/CI/PR feedback
  5. Write any new knowledge to a knowledge repository

The details like “how do you provide repo instructions for the virtual environment” and “where do you store knowledge” differ across solutions – but the key flow is the same.
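To make the shape of that loop concrete, here’s a minimal Python sketch. Every function, field, and file name in it is a hypothetical stand-in for a vendor-specific step; the structure of the loop is the point, not the stub bodies.

```python
# A minimal sketch of the shared workflow; every function here is a
# hypothetical stand-in for a vendor-specific step.
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    ci_green: bool = False
    open_feedback: list[str] = field(default_factory=list)


def provision_environment(setup_instructions: str) -> str:
    """Step 1: boot a remote VM/container per the human-provided instructions."""
    return "env-ready"


def plan(task: str, knowledge: list[str]) -> str:
    """Step 2: plan the task from task analysis plus prior knowledge."""
    return f"plan for: {task}"


def implement(plan_text: str, env: str) -> PullRequest:
    """Step 3: make the changes and open a PR."""
    return PullRequest()


def revise(pr: PullRequest) -> PullRequest:
    """Step 4: react to tests, lint, CI, and PR review feedback."""
    pr.ci_green = True
    pr.open_feedback.clear()
    return pr


def run_task(task: str, knowledge: list[str]) -> PullRequest:
    env = provision_environment("SETUP_INSTRUCTIONS.md")
    pr = implement(plan(task, knowledge), env)
    while not pr.ci_green or pr.open_feedback:
        pr = revise(pr)
    knowledge.append(f"lessons learned from: {task}")  # step 5
    return pr
```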

It just boils down to this: AI SWEs require the same feedback loops that humans do.

Coalescing to a flexible framework

In the five workflow steps laid out above, there are only two pieces that don’t always manifest as regular code changes: the remote environment setup and the knowledge repository.

Remote environment setup

Setting up a remote environment mostly boils down to (a) giving the remote machine access to necessary secrets and (b) telling the remote machine how to install & update dependencies so it can iterate.

The former may not be generalizable, but the latter really should be. There’s no reason I should have to separately specify linting and testing instructions across multiple AI SWEs. A simple one-file SETUP_INSTRUCTIONS.md should do the trick.

Whether it’s a Devin VM, a Cursor Background Agent, or a GitHub Action – that file should be able to tell every agent how to get their local environment up and running.

Recommendation 1: Setup Instructions should be a common file or set of files that every agent refers to.
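As a sketch of how that single file could drive every environment (the file name, the bash-fence convention, and the bootstrap script itself are my assumptions, not something any of these vendors define): each agent’s setup command invokes one tiny bootstrap that executes the shell blocks inside SETUP_INSTRUCTIONS.md, so install, lint, and test commands live in exactly one place.

```python
# Hypothetical bootstrap: every agent's environment config runs this one
# script, which executes the bash code fences in SETUP_INSTRUCTIONS.md in order.
import re
import subprocess
from pathlib import Path


def run_setup_instructions(path: str = "SETUP_INSTRUCTIONS.md") -> None:
    text = Path(path).read_text()
    # Pull out every fenced ```bash block, in document order.
    for block in re.findall(r"```bash\n(.*?)```", text, flags=re.DOTALL):
        print(f"+ running setup block:\n{block}")
        # bash -e stops at the first failing command, -x echoes each one
        subprocess.run(["bash", "-euxc", block], check=True)


if __name__ == "__main__":
    run_setup_instructions()
```

Each vendor’s environment config then contains a single line that runs this script, rather than its own copy of the commands.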

Knowledge seems suspiciously close to documentation

If there’s one thing that’s true across every organization, it’s that documentation is hard to maintain. And that’s natural, because humans are lazy, and it’s not immediately gratifying. You’ll hear “thanks for writing these docs” maybe twice a year, while being expected to write docs every single day.

Unfortunately, that’s led to AI SWEs stumbling through repositories. An AI SWE would love to read perfect docs and know exactly how to navigate a repository. Instead, they’re left with outdated documentation everywhere they look, leading to incorrect outputs and frustration on our end.

One thing these AI SWEs do, though, is create their own knowledge over time. Devin literally calls this Knowledge. Claude Code maintains a CLAUDE.md and/or a .claude/ directory. Cursor maintains a .cursor/rules/ directory.

And when you peek inside, you basically find that these tools are all maintaining their own documentation. They’ll update their priors when necessary, add new information when necessary, and over time continually update their understanding of your repo and your business.

…which leads me to…why are we letting these AI SWEs maintain their own siloed documentation that only they are directed to use?

It seems to me that AI SWEs not only create documentation, but they’re also much better than humans at doing so. And that documentation isn’t just useful to the AI SWE – it’s useful to humans too!

We should be intentionally taking advantage of this – we now have a way to automate one of the most painstaking parts of being a software engineer – but instead we’re letting AI SWEs hide it away in agent-specific instruction files.

Recommendation 2: Coerce all AI SWE tools to refer to, contribute to, and maintain the same knowledge documentation bank within the repository.
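One low-effort way to approach that, sketched below: make each tool’s agent-specific entry point a symlink into a single, human-readable knowledge directory checked into the repo. The shared docs/knowledge/ path and the script are my assumptions; CLAUDE.md and .cursor/rules/ are the conventions those tools use, and tools that store knowledge outside the repo (Devin’s Knowledge, for instance) would need their own sync step.

```python
# Hypothetical wiring: point each tool's agent-specific knowledge entry point
# at one shared, human-readable knowledge bank checked into the repo.
from pathlib import Path

SHARED = Path("docs/knowledge")               # assumed shared location
LINKS = {
    Path("CLAUDE.md"): SHARED / "index.md",   # Claude Code entry point
    Path(".cursor/rules"): SHARED / "rules",  # Cursor rules directory
}


def wire_up_knowledge_bank() -> None:
    (SHARED / "rules").mkdir(parents=True, exist_ok=True)
    (SHARED / "index.md").touch()
    for link, target in LINKS.items():
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.exists() or link.is_symlink():
            continue  # don't clobber existing knowledge; migrate it by hand instead
        link.symlink_to(target.resolve(), target_is_directory=target.is_dir())


if __name__ == "__main__":
    wire_up_knowledge_bank()
```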

You don’t need a vendor decision for AI SWEs

Most AI practitioners have figured out that it’s basically useless to try and keep up with which model is the “best” at a given task type anymore. Gemini/Claude/OpenAI pass each other every other week. I would never invest too much in a “bake-off” of these models because it’d be obsolete within a couple weeks.

Choosing an AI-enabled IDE feels similar. Does it matter if Cursor is 5% better than GitHub Copilot today? Does it matter if Windsurf is 10% better than Cursor tomorrow? What’s the point when they’re all being constantly updated?

Now, there’s one thing that actually makes choosing between these tools natural for an organization: they’re user-operated.

With IDEs for example, a SWE can’t reasonably perform the same task across 3 IDEs at the same time. They’re going to pick one and fall into that habit. Same with chat interfaces – users may occasionally try the same question across Claude/ChatGPT/Gemini to see which result they like, but they won’t do that forever. They’re going to choose one and eventually just keep using that one.

Given each user is going to choose one tool, it makes sense for an organization to just standardize and choose one solution to purchase with a bulk discount.

AI SWEs are different because their use is not single-threaded. Their autonomous nature means running multiple in parallel is quite reasonable – it does not literally take up 2x the developer time.

And so, it’s reasonable to actually structure the use of AI SWEs in a way that consistently allows your organization to keep up with the state of the art.

What if we could pit the AI SWEs against each other? What if we could consistently evaluate multiple solutions, and thereby improve our time-to-solution?

Elo-based system for AI SWE task completion and evaluation

Recommendation 3: Create an Elo-based system across all AI SWE solutions (a minimal sketch in code follows the steps below)

  1. Maintain an Elo-based ranking of AI SWEs (per-repo or org-wide).
  2. For every task, run both the top-ranking AI SWE and one random other AI SWE.
  3. Ask your engineers to review PRs for both and merge the winner.
  4. Instruct both AI SWEs to learn from the winner and loser on each task to improve knowledge.
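Here’s that minimal sketch. The agent names, starting rating, and K-factor are all illustrative; “winner” simply means the agent whose PR the reviewing engineer merged.

```python
# Minimal sketch of the pairing + rating loop. Agent names and K-factor are
# illustrative; "winner" = the agent whose PR the engineer merged.
import random

K = 32  # standard Elo K-factor; tune per org
ratings = {"devin": 1500.0, "cursor-background-agent": 1500.0, "claude-code": 1500.0}


def pick_pair() -> tuple[str, str]:
    """Top-ranked agent plus one randomly chosen challenger."""
    top = max(ratings, key=ratings.get)
    challenger = random.choice([agent for agent in ratings if agent != top])
    return top, challenger


def record_result(winner: str, loser: str) -> None:
    """Standard Elo update, applied after the engineer merges one of the two PRs."""
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)


# Per task: dispatch to both agents, let the engineer review both PRs, then
#   top, challenger = pick_pair()
#   record_result(winner=top, loser=challenger)  # or the other way around
```

A tiny script like this (or a bot that posts the current standings on each PR) is enough; the point is that the ranking is refreshed continuously by real tasks rather than by a one-time bake-off.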

This system accomplishes four things:

  1. Continually evaluates correctness of multiple AI SWEs over time
  2. Always includes the current SOTA AI SWE
  3. Provides the reviewing engineer multiple “attempts” to compare, which both broadens their perspective on what the possible solutions could be and gets them to a working solution faster if one AI SWE went down the wrong path
  4. Improves the creation of good knowledge, since both agents learn from the winning and losing PRs

Of course, there are downsides. These solutions are usage-based – meaning this system would cost roughly twice as much as just choosing one solution. And yes, engineers review more PRs. But you gain faster time-to-solution on a per-task basis, the flexibility to immediately leverage improvements from any AI SWE solution, no lengthy vendor sales cycles, and continuous improvement of a shared knowledge base that benefits your human SWEs too.

AI SWEs don’t have feelings – it’s okay to pit them against each other!

My point boils down to this – we should eliminate the esoteric parts of AI SWE workflows today (where they maintain their knowledge, how their remote envs are set up) and instead standardize those workflows so we can run multiple in parallel at any time.

The AI SWE that wins this year's bake-off will be obsolete by next quarter. The infrastructure you build to use all of them won't be.