There is a lower bar (and it gets lower over time), but IME the config you're describing is still below it.
Qwen/Gemma in the 27/35B range @fp8 are better than gemini-2.5 but worse than gemini-3.1. You can run DS4-flash @fp8 on two DGX Sparks, and things keep getting better. DiffusionGemma came out recently with 4x token gen speeds.
tl;dr - the models you appear to be trying are too small or too heavily quantized
You can do this in opencode and pi (haven't used) by defining your own agents or overriding the built-in ones. In your primary agent you can disable all tools and give it good instructions for how to delegate.
I imagine most harnesses have a way to do this today; if yours doesn't, get a new one. OpenCode, e.g., is highly customizable. Claude and VS Code both support a ton as well, including custom agents (though it's unclear whether you can create custom top-level agents in claude-code).
Thanks, but those don't deterministically prevent the main loop from using tools. Unless I'm wrong, that's just prompting the main agent on when to use specialized sub agents.
You can configure tools, thinking, permissions et al. on a per-agent basis in the frontmatter or via config (which they use in the examples). Either location is valid; I'm not sure about the merging order.
The main agent would be very different, basically an orchestrator. You are "loop engineering" it and turning off everything for this main agent besides the ability to run subagents.
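To make the "deterministic vs. just prompting" distinction concrete, here is a rough sketch of what harness-side enforcement looks like in general. This is not OpenCode's or Claude Code's actual API; `AgentSpec`, `TOOLS`, `dispatch` and the tool names are all made up for illustration. The point is only that the allowlist is checked in code at dispatch time, so the orchestrator cannot call anything but `delegate` no matter what the model emits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


class ToolNotAllowed(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


@dataclass(frozen=True)
class AgentSpec:
    # Hypothetical per-agent config: a name plus the tools it may call.
    name: str
    allowed_tools: FrozenSet[str]


# Hypothetical tool registry; real tools would do real work.
TOOLS: Dict[str, Callable[[str], str]] = {
    "delegate": lambda task: f"subagent result for: {task}",
    "bash": lambda cmd: f"ran: {cmd}",
    "edit": lambda path: f"edited: {path}",
}

# The orchestrator can only hand work to subagents; workers get the real tools.
ORCHESTRATOR = AgentSpec("orchestrator", frozenset({"delegate"}))
WORKER = AgentSpec("worker", frozenset({"bash", "edit"}))


def dispatch(agent: AgentSpec, tool: str, arg: str) -> str:
    """Run a tool call only if the agent's allowlist permits it."""
    if tool not in agent.allowed_tools:
        raise ToolNotAllowed(f"{agent.name} may not call {tool}")
    return TOOLS[tool](arg)
```

With this shape, `dispatch(ORCHESTRATOR, "bash", "ls")` raises instead of running, while `dispatch(ORCHESTRATOR, "delegate", "fix the bug")` goes through. Whether a given harness enforces per-agent tool settings this way is something to check in its docs.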
They all give slightly different results; you can dedup / fuse them with heuristics or another agent.
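For the heuristic route, a minimal sketch using only the Python stdlib: drop any result that is a near-duplicate of one already kept. The `dedup` and `normalize` helpers and the 0.85 threshold are my own illustrative choices, not something from any harness, and the threshold would need tuning on real outputs.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def dedup(results: List[str], threshold: float = 0.85) -> List[str]:
    """Keep the first of each group of near-identical results, in order."""
    kept: List[str] = []
    for candidate in results:
        norm = normalize(candidate)
        is_duplicate = any(
            SequenceMatcher(None, norm, normalize(seen)).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept


findings = [
    "Null check missing in parse()",
    "null check  missing in parse()",
    "Race condition in cache refresh",
]
unique = dedup(findings)
```

String similarity only catches results that are worded alike. Findings that say the same thing in different words are where a second agent doing the merge earns its keep.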