I mean, I didn't mean specifically every agent, and I also did not mean as an experiment. I see real, at-scale uses for this: agents operating on their own networks, cross-team agents, my own agents on my own laptop, etc. Sharing context only gets more important the better these things get.
I do think Cloudflare probably institutes a similar manual review process as well. I have a handful of fairly vocal and supportive engineers I stay in contact with around https://plannotator.ai (there is an integrated code review surface that creates a feedback loop with your local agent).
> agents do a good job of looping over PR comments
This is the easy part. Most harnesses enable some sort of integration now, so you can actually create a smooth local experience around this as well - better code before it ships to more costly review or bloats PR threads.
> guided, educational code review tool
This is a bit tougher, and I find the main harness chat tends to work best. I learn better when I'm more engaged and aware of what I'm asking. It's easy to stick a code tour type of thing on a screen. It's hard to really nail the right attention and learning mechanism around it.
> If I had to roll out such a development process today, I’d make a standardized Markdown specification the new unit of knowledge for the software project. Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules. Those should be checked into the project repositories along with the implementing code. There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec. This specification, and not the code that materializes it, is what the team would need to understand, review, and be held accountable for.
The constant urge I have today is for some sort of spec or simpler facts to be continuously verified at any point in the development process; something agents would need to be aware of. I agree with the blog and think it's going to become a team sport to manage these requirements. I'm going to try this out by evolving my open source tool [1] (used to review specs and code) into a bit more of a collaborative & integrated plane for product specs/facts - https://plannotator.ai/workspaces/
What we really need is some kind of more detailed spec language that doesn't have edge cases, where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement. It'd be super helpful to have something more formal with no ambiguity, especially because the English language tends to be pretty ambiguous in general, which can result in spec problems.
I also tend to find especially that there's a lot of cruft in human-written spec languages, which makes them overly verbose once you really get into the details of how all of this works, so you could chop a lot of that out with a good spec language.
I nominate that we call this completely novel, evolving discipline: 'programming'
There are languages like Dafny that permit you to declare pre- and post-conditions for functions. Dafny in particular tries to automatically verify or disprove these claims with an SMT solver. It would be neat if LLMs could read a human-written contract and iterate on the implementation until it's provably correct. I imagine you'd have much higher confidence in the results using this technique, but I doubt that available models are trained appropriately for this use case.
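A rough sketch of the contract idea, in Python rather than Dafny. Note the swap: this checks the pre- and post-condition at runtime on each call; it does not prove anything ahead of time the way Dafny's SMT-backed verifier attempts to. The `contract` decorator and `integer_sqrt` are hypothetical names made up for illustration.

```python
from functools import wraps
from typing import Callable


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Hypothetical decorator: check `pre` on the arguments and
    `post` on (result, *arguments) every time the function is called."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args):
            if not pre(*args):
                raise ValueError(f"precondition failed for {fn.__name__}{args}")
            result = fn(*args)
            if not post(result, *args):
                raise AssertionError(
                    f"postcondition failed for {fn.__name__}{args} -> {result}"
                )
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda n: n >= 0,                              # input must be non-negative
    post=lambda r, n: r * r <= n < (r + 1) * (r + 1),  # r is the floor of sqrt(n)
)
def integer_sqrt(n: int) -> int:
    # The contract above is the human-written part; this body is the part
    # an agent could rewrite and retry until the checks stop failing.
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r
```

Calling `integer_sqrt(10)` returns `3`, and `integer_sqrt(-1)` raises `ValueError` before the body runs. A runtime check like this only covers the inputs you actually exercise, which is exactly the gap a verifier is meant to close.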
> where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement.
In ancient times we had tech to do exactly that: Programming languages and tests.
> What we really need is some kind of more detailed spec language that doesn't have edge cases, where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement.
That's theorem provers and they're awful for anything of any reasonable complexity.
That's any programming language, really [1]. Any website contains millions of "proofs"; not all of them are useful. Choosing what needs to be proven is hard. And the spectrum of languages/type systems and their usability as either is more explored nowadays than it used to be. If you don't like Coq, you can look at Agda. If Agda is too far for you, you can look at Haskell. If that's still impractical, there's Rust or F#, etc. The tradeoff between "convenient for expressing proofs" and "convenient for programming" has many options.
Shame us all for moving away from something so perfect, precise, and that "doesn't have edge cases."
Hey - if you invent a programming language that can be used in such a way and create guaranteed deterministic behavior based on expressed desires as simple as natural language - I'll pay a $200/m subscription for it.
As people are discovering, natural language is insufficiently precise to be able to specify edge cases. Any language precise enough to be formally verified against is a programming language.
We're going to end up speaking past each other - but generally I do agree with you and am not denouncing the importance of formal verification methods. I do think abstractions are going to dominate the human UX above them.
Is Claude good at working with XML prompts, or is XML good at convincing users to write more Claude-able specs? I am intensely skeptical that you could write an XML document describing a nontrivial web application in full detail, but I could easily imagine someone who thinks they have to write one stripping out important details because they don't really map to XML.
I haven't done it professionally, but my understanding is that this kind of work is much more in the second category, where you have to understand the closest approximation to what you want that the LLM can reliably produce or the training won't work at all.
I don’t know why, but whenever someone uses “insanely” or “shockingly” along with AI, I get the feeling they’re a bot or are writing based on a guideline! No offense, btw, I’m not saying you’re a bot.
I'm prepared to excise the word "genuinely" from my vocabulary after working with Claude.
One of my biggest fears with using AI at work is that I will subconsciously start talking and writing like a bot, despite making conscious efforts to do the opposite. Just like how when you read a lot of books by one author, their style infects your own writing style.
We've been through that so many times. First UML arrived (along with the ALM tool suites: IBM was trying to sell it, Borland was trying to sell it, all those fancy and expensive StarTeam, Caliber and Together soft), then BPML and its friends, Business Rule Management Systems (BRMS), Drools in the Java world, etc.
It all failed. For a simple reason, popularized by Joel Spolsky: if you want to create a specification that describes precisely what software is doing and how it is doing its job, then, well, you need to write that damn program using MS Word or Markdown, which is neither practical nor easy.
The new buzzword is "spec driven development". Maybe it will work this time, but I would not bet on that right now.
BTW: once we are at that point, it no longer makes sense to generate code in the programming languages we have today; an LLM can simply generate binaries, or at least some AST that will be directly translated to binary. In this way LISP would, eventually, take over the world!
I called it gates on mine. I loved Beads but it closed tasks without any validation steps. Beads also had other weird issues, so I made my own alternative. Weirdly enough, I think "Gates" is also used by other projects that took on the same challenge I did in mine.
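A minimal sketch of what a "gate" could look like, assuming a gate is just a named check that must pass before a task is allowed to close. This is not the API of Beads or of the tool described above; `Task`, `Gate`, and `close` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical shape: a gate is a (name, check) pair, where check() returns
# True when the validation step passes.
Gate = Tuple[str, Callable[[], bool]]


@dataclass
class Task:
    title: str
    gates: List[Gate] = field(default_factory=list)
    closed: bool = False

    def close(self) -> List[str]:
        """Run every gate; mark the task closed only if all of them pass.
        Returns the names of the gates that failed (empty list on success)."""
        failed = [name for name, check in self.gates if not check()]
        if not failed:
            self.closed = True
        return failed


# Stand-in checks; real gates would shell out to a test runner, linter, etc.
task = Task(
    "add login",
    gates=[("tests pass", lambda: True), ("lint clean", lambda: False)],
)
failed = task.close()  # the task stays open because "lint clean" failed
```

The point of returning the failed gate names instead of a bare boolean is that an agent gets something concrete to fix before it tries to close the task again.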
I’ve been considering this as well, and trying to get my colleagues to understand and start doing it. I use it to pretty decent effect in my vibe coded slop side projects.
In the new world of mostly-AI code that is mostly not going to be properly reviewed or understood by humans, one potential answer is increasingly robust manifestation, enforcement, and regeneration of the specs via the coding harness configuration, combined with good old-fashioned deterministic checks.
Taken to an extreme, the code doesn’t matter, it’s just another artifact generated by the specs, made manifest through the coding harness configuration and CI. If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.
> If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.
The critical insight is that this is not true. When people depend on your software, replacing it with an entirely different program satisfying all of your specs and configurations is a large, months-long project requiring substantial effort and coordination even after the new program is written. It seems to work in vibe coded side projects because you don't have those dependencies; if you got an angry email from a CEO saying that moving a critical button ruined their monthly review cycle, and demanding 7 days notice before you move any buttons going forwards, you'd just tell them no.
I was about to comment: "HTML creates too much friction after doing all sorts of visual explainers" ... thanks for articulating it well.
As a layer of abstraction, it also creates more requirements: need a browser, likely need includes/cdn libs to avoid bloat, all sorts of other things. Markdown is consumable, diffable, shareable in raw form - and you can add enrichment layers on top without much effort.
- a tiny DSL for rendering anything custom, where every markdown renderer potentially introduces its own unique bit of syntax that's not transferable (example: frontmatter in Obsidian where you can put tags, that's not vanilla markdown)
- a note taking / viewing app, of which we now have dozens, where moving notes from one app to another creates friction, because of the custom "enrichment" layer each of those apps have (example: any popular plugin in Obsidian, where your notes are now littered with that plugin's tags)
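To make the frontmatter example concrete, here is a small sketch that separates the frontmatter layer from the plain markdown body. It assumes the common convention of a block delimited by `---` lines at the very top of the note; that convention is not part of vanilla markdown, which is the portability problem described above. `split_frontmatter` is a hypothetical helper, not any app's API.

```python
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Return (frontmatter, body). Frontmatter is "" if the note has none.

    Assumes frontmatter is delimited by a line of `---` at the very top
    of the file and a second `---` line that closes it.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # unterminated block: treat the whole note as body


note = "---\ntags: [a, b]\n---\n# Title\nBody\n"
frontmatter, body = split_frontmatter(note)
```

Here `frontmatter` holds the `tags: [a, b]` line and `body` is the portable markdown. Every app-specific layer needs its own version of this stripping step before notes can move cleanly.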
HTML has this type of "enrichment" built-in.
Anyway, I am not trying to convince anyone. This is me working through this in my head. I have a large vault of Obsidian notes that I want to make more useful. And I figure, HTML is the standard-issue tool for producing beautiful-looking and functional text documents, so it's worth thinking about.
I think this tech (modern p2p) represents what agent-to-agent (a2a) should be built on.
Agents should be able to reach each other without each one hosting itself as an HTTP server.
Related prototypes:
https://github.com/eqtylab/agentbeam
https://github.com/eqtylab/real-a2a