Seems like the definition of success could skew things in a couple of ways. An engineer might use Claude to evaluate an approach and conclude it's not worth pursuing for technical reasons. There'd be no commit, so the session likely wouldn't count as a verified success, even though that was the right call.
And is "tests passing" a weaker signal than it looks? If I add new functionality without adding tests, the existing tests can all pass, but that doesn't tell you whether my new code actually works.
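To make that concrete, here's a minimal sketch of a green suite sitting next to broken new code. Everything in it (`apply_discount`, `apply_coupon`, `run_suite`) is a made-up name for illustration, not taken from any real codebase or from the study being discussed.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Existing function, covered by the existing test below."""
    return price_cents * (100 - percent) // 100


def apply_coupon(price_cents: int, coupon_cents: int) -> int:
    """New function added without a test (hypothetical).

    Bug: nothing stops the result from going below zero.
    """
    return price_cents - coupon_cents


def test_apply_discount() -> None:
    # The only test in the suite; it never touches apply_coupon.
    assert apply_discount(1000, 10) == 900


def run_suite() -> bool:
    """Run every known test and report whether all of them passed."""
    for test in (test_apply_discount,):
        try:
            test()
        except AssertionError:
            return False
    return True


suite_green = run_suite()                    # True: the suite passes
negative_price = apply_coupon(500, 800)      # -300: the new code is wrong
```

The suite reports success because it only exercises the old code path, so "tests pass" here says nothing about the new function.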
It's the same judgment as good system design, expressed as instructions rather than an architecture doc. What's the research's take on how teams that are strong at that front-loading compare to ones running agents more reactively?
What needs to be done? What counts as done? What must not be changed? When should the agent stop? What signals verify success?
The important question is how that expertise should be passed to the AI. Not just as more context, but in a form the agent can use before execution.
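One way to picture "a form the agent can use before execution" is a small structured brief with one field per question (what to do, what counts as done, what must not change, when to stop, what verifies success). This is only a sketch under my own assumptions: `TaskBrief` and its field names are hypothetical, not part of any agent tool's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical brief: one field per question to answer up front."""

    goal: str                                                   # What needs to be done?
    done_when: list[str] = field(default_factory=list)          # What counts as done?
    must_not_change: list[str] = field(default_factory=list)    # What must not be changed?
    stop_conditions: list[str] = field(default_factory=list)    # When should the agent stop?
    verification: list[str] = field(default_factory=list)       # What signals verify success?

    def missing(self) -> list[str]:
        """Return the names of the questions this brief leaves unanswered."""
        gaps = [] if self.goal.strip() else ["goal"]
        for name in ("done_when", "must_not_change", "stop_conditions", "verification"):
            if not getattr(self, name):
                gaps.append(name)
        return gaps

    def render(self) -> str:
        """Format the brief as plain text to hand to the agent before it starts."""
        sections = [
            ("Goal", [self.goal]),
            ("Done when", self.done_when),
            ("Must not change", self.must_not_change),
            ("Stop if", self.stop_conditions),
            ("Verify with", self.verification),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


brief = TaskBrief(
    goal="Add retry with backoff to the HTTP client",
    done_when=["New retry tests pass", "Existing tests still pass"],
    must_not_change=["Public function signatures"],
    stop_conditions=["Stop and ask if the fix needs a new dependency"],
)
unanswered = brief.missing()  # ["verification"]
```

The point of `missing()` is that a gap shows up before the agent runs, rather than being discovered after a session that "succeeded" against an unstated definition of success.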