Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram
Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.
Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!
One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.
There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!
Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.
What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.
>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.
That is quite literally what I have setup :)
I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time
>Do you think the choice of quantization matters that much for other models
It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.
Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.
It isn’t though, I’ve run both through a bunch of coding evals. You nearly certainly didn’t have the right sampling parameters or quantised the KV cache?
Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark
I tried a bunch of stuff with step 3.5 and step 3.7 maybe not as much as you. Could you tell me what parameters and launched you’re using ? Antirez ds4 flash q2-q4 works almost out of the box for me
To be fair: if you're happy with ds4 then IMO stick with it!
Step 3.7 is notably better than 3.5
1. Use the official StepFun GGUF, IQ4_XS - theirs is better tuned in my experience than the other quants
2. Temp 1.0 top_p 0.95 sampling parameters for reasoning/agentic coding
3. It's really quite important that you don't quantise the KV cache: it made a surprising amount of difference to the looping and over thinking I found, at least for the quantised version of the model. I'm using the full F16 for K, and Q8 for V
4. Note that it now supports `reasoning_effort: low|medium|high` in your chat_template_kwargs; this is super useful :)
For an example of how closely this is monitored see the Oklo fossil reactors[1]
The proportion of fissile isotopes being mined was off by a fraction of a percent, which caused the French government to launch an investigation. It turns out that millions of years ago the site had formed a natural fission reactor which depleted some of the fissile isotopes
You need highly educated individuals, a massive amount of energy expenditure, a massive facility to house your centrifuges, and an active mine to dig up nuclear materials.
It isn't impossible to keep such a secret, but practically it would be incredibly difficult just through the energy requirements and mining scale which would be hard to hide without anybody asking what exactly are you mining and processing.
Don't need much area, depends on the concentration of radioactives. I have a small mine that's just a pegmatite body about the size of a house which produces almost marble-sized chunks of a thorium-uranium mixed metamict mineral (I suspect samarskite but Raman and XRD can't give any ID,) you'd barely notice it from a private airplane's typical flying height, however you could dig the entirety of it up and you'd have enough unprocessed uranium for some real fun.
It requires very large, high powered centrifuges and tons of uranium. Requires an infrastructure project that is visible from space, even underground. And projects that large are difficult to keep secret anyway.
you're not supposed to spell it out loud. next thing you'll be saying that a gun type nuclear bomb is easier to build than an implosion type nuclear bomb, and then we'll all be off to the races. I mean camps I mean wait shit.
Any large and well resourced enough entity that is interested in building a nuclear weapon already knows how difficult it is to enrich uranium to purity levels necessary for a weapon. It's not exactly a secret.
You need enough people to work on it that some information will leak, and the facilities needed to build nuclear power are pretty big (uranium refinement, etc.), big enough to be visible on satellite footage. Mostly the first point.
My guess would be that sales of the high-tech gear you need, like Uranium centrifuges, are strongly sales/export controlled. Probably someone would also notice if you start mining Uranium ore.
The majority of concerts lack sufficient demand for scalpers to make money. It's only the Taylor Swifts and Beyonces whose ticket values exceed the sticker price.
Apple’s prioritization of Apple Music on their HomePod turns me off it a bit. Could help guide users more to alternatives but would reduce services sales.
Meh, I’m being kinda unfair b/c the experience is gonna be better. Shame Spotify forces streaming from phone (YouTube Music can run on HomePod itself like Apple Music). YouTube Music via HomePod might play the audio from a music video instead of playing the real song, so does make sense to shuttle normies to the Apple service, but guess I don’t find the situation perfect.
reply