More

electronsoup · 2026-06-15T19:39:50 1781552390

Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram

electronsoup · 2026-06-15T19:35:23 1781552123

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

girvo · 2026-06-15T22:43:22 1781563402

Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!

gwerbin · 2026-06-16T12:21:11 1781612471

Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

girvo · 2026-06-16T22:04:16 1781647456

>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

That is quite literally what I have setup :)

I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time

>Do you think the choice of quantization matters that much for other models

It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.

Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.

rhdunn · 2026-06-16T14:12:39 1781619159

I use promptfoo for evaluation. I'm experimenting with tests for my workflow/use cases.

I have a custom assert for loop/repeat detection that works well:

    def count_repeats(text: str, length: int) -> int:
        n = len(text)
        pattern = text[n - length : n]
        count = 1 # Include the end of the string as matching the substring.

        text = text[: -length]
        while text.endswith(pattern):
            text = text[: -length]
            count = count + 1

        return count


    def repeats(output: str, context: dict[str, any]) -> bool|float|dict[str, any]:
        threshold = context.get('config', {}).get('threshold', 3)
        count = 0
        length = 0

        for n in range(1, (len(output) // 2) + 1):
            n_count = count_repeats(output, n)
            if n_count > count:
                count = n_count
                length = n

        if count >= threshold:
            return { 'pass': True, 'score': 1.0, 'reason': f'Output repeats {count} times with length {length}.' }
        else:
            return { 'pass': False, 'score': 0.0, 'reason': f'Output doesn\'t repeat {threshold} or more times.' }


    def no_repeats(output: str, context) -> dict[str, any]:
        result = repeats(output, context)
        result['pass'] = not result['pass']
        result['score'] = 1.0 - result['score']
        return result

Just add it to your promptfooconfig.yaml:

    defaultTest:
      assert:
        - # ----- The output doesn't repeat/get stuck in a loop.
          type: python
          value: file://asserts/repeat.py:no_repeats

ttoinou · 2026-06-16T09:22:59 1781601779

I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better !

girvo · 2026-06-16T12:01:13 1781611273

It isn’t though, I’ve run both through a bunch of coding evals. You nearly certainly didn’t have the right sampling parameters or quantised the KV cache?

Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark

ttoinou · 2026-06-16T12:19:47 1781612387

I tried a bunch of stuff with step 3.5 and step 3.7 maybe not as much as you. Could you tell me what parameters and launched you’re using ? Antirez ds4 flash q2-q4 works almost out of the box for me

girvo · 2026-06-16T22:10:36 1781647836

To be fair: if you're happy with ds4 then IMO stick with it!

Step 3.7 is notably better than 3.5

1. Use the official StepFun GGUF, IQ4_XS - theirs is better tuned in my experience than the other quants

2. Temp 1.0 top_p 0.95 sampling parameters for reasoning/agentic coding

3. It's really quite important that you don't quantise the KV cache: it made a surprising amount of difference to the looping and over thinking I found, at least for the quantised version of the model. I'm using the full F16 for K, and Q8 for V

4. Note that it now supports `reasoning_effort: low|medium|high` in your chat_template_kwargs; this is super useful :)

ttoinou · 2026-06-17T15:49:37 1781711377

Thank you so much !

electronsoup · 2026-06-12T17:39:07 1781285947

> in secret is impossible without the whole world knowing.

I'm curious about why this is

Outside of an actual test detonation, presumably this could all happen in a secure place?

why_at · 2026-06-12T18:27:10 1781288830

For an example of how closely this is monitored see the Oklo fossil reactors[1]

The proportion of fissile isotopes being mined was off by a fraction of a percent, which caused the French government to launch an investigation. It turns out that millions of years ago the site had formed a natural fission reactor which depleted some of the fissile isotopes

[1]https://en.wikipedia.org/wiki/Natural_nuclear_fission_reacto...

AngryData · 2026-06-12T18:19:43 1781288383

You need highly educated individuals, a massive amount of energy expenditure, a massive facility to house your centrifuges, and an active mine to dig up nuclear materials.

It isn't impossible to keep such a secret, but practically it would be incredibly difficult just through the energy requirements and mining scale which would be hard to hide without anybody asking what exactly are you mining and processing.

lightedman · 2026-06-12T18:46:43 1781290003

"mining scale"

Don't need much area, depends on the concentration of radioactives. I have a small mine that's just a pegmatite body about the size of a house which produces almost marble-sized chunks of a thorium-uranium mixed metamict mineral (I suspect samarskite but Raman and XRD can't give any ID,) you'd barely notice it from a private airplane's typical flying height, however you could dig the entirety of it up and you'd have enough unprocessed uranium for some real fun.

literalAardvark · 2026-06-13T05:27:30 1781328450

You could only somehow sell it. If you tried to enrich that you'd get flagged so fast your head would spin.

daveguy · 2026-06-12T17:42:48 1781286168

It requires very large, high powered centrifuges and tons of uranium. Requires an infrastructure project that is visible from space, even underground. And projects that large are difficult to keep secret anyway.

fragmede · 2026-06-12T17:47:14 1781286434

you're not supposed to spell it out loud. next thing you'll be saying that a gun type nuclear bomb is easier to build than an implosion type nuclear bomb, and then we'll all be off to the races. I mean camps I mean wait shit.

daveguy · 2026-06-12T18:21:19 1781288479

Any large and well resourced enough entity that is interested in building a nuclear weapon already knows how difficult it is to enrich uranium to purity levels necessary for a weapon. It's not exactly a secret.

odo1242 · 2026-06-12T17:44:10 1781286250

You need enough people to work on it that some information will leak, and the facilities needed to build nuclear power are pretty big (uranium refinement, etc.), big enough to be visible on satellite footage. Mostly the first point.

microtonal · 2026-06-12T17:46:38 1781286398

My guess would be that sales of the high-tech gear you need, like Uranium centrifuges, are strongly sales/export controlled. Probably someone would also notice if you start mining Uranium ore.

Aspos · 2026-06-12T21:06:14 1781298374

Centrifuges dont need to be mechanically sophisticated and, frankly, do not require tech which did not exist in the 50es.

15155 · 2026-06-12T17:42:28 1781286148

Espionage.

electronsoup · 2026-06-12T16:10:39 1781280639

They didn't say they had never traveled south though

electronsoup · 2026-06-11T22:09:14 1781215754

If you put that behind an API, you could sell the service much like the AI providers

satvikpendem · 2026-06-11T22:13:39 1781216019

And then get sued for fraud and go under, like Builder.ai

fragmede · 2026-06-11T22:26:54 1781216814

What if, and I know this is utterly batshit insane to suggest, but what if we don't lie about what we're doing?

electronsoup · 2026-06-04T23:05:37 1780614337

Why this is not a PR for llama.cpp

electronsoup · 2026-05-22T20:09:12 1779480552

> likely not well thought out

Or it has been, and cruelty is the point

electronsoup · 2026-05-21T23:59:54 1779407994

Surely this will get arbitraged like anything else, where fans who get picks will onsell tickets

appplication · 2026-05-22T01:30:03 1779413403

You can also make non-transferable tickets. If it’s a decent discount for a specific intended person it makes sense.

linkregister · 2026-05-22T01:00:17 1779411617

The majority of concerts lack sufficient demand for scalpers to make money. It's only the Taylor Swifts and Beyonces whose ticket values exceed the sticker price.

electronsoup · 2026-05-21T23:58:07 1779407887

So now we need to run farms of spotify accounts playing songs to get our concert tickets?

electronsoup · 2026-05-21T18:52:15 1779389535

You may need to move on to other services like Apple Music

Barbing · 2026-05-21T20:23:44 1779395024

Apple’s prioritization of Apple Music on their HomePod turns me off it a bit. Could help guide users more to alternatives but would reduce services sales.

Meh, I’m being kinda unfair b/c the experience is gonna be better. Shame Spotify forces streaming from phone (YouTube Music can run on HomePod itself like Apple Music). YouTube Music via HomePod might play the audio from a music video instead of playing the real song, so does make sense to shuttle normies to the Apple service, but guess I don’t find the situation perfect.