Hacker Newsnew | past | comments | ask | show | jobs | submit | electronsoup's commentslogin

Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off


Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!


Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.


>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

That is quite literally what I have setup :)

I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time

>Do you think the choice of quantization matters that much for other models

It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.

Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.


I use promptfoo for evaluation. I'm experimenting with tests for my workflow/use cases.

I have a custom assert for loop/repeat detection that works well:

    def count_repeats(text: str, length: int) -> int:
        n = len(text)
        pattern = text[n - length : n]
        count = 1 # Include the end of the string as matching the substring.

        text = text[: -length]
        while text.endswith(pattern):
            text = text[: -length]
            count = count + 1

        return count


    def repeats(output: str, context: dict[str, any]) -> bool|float|dict[str, any]:
        threshold = context.get('config', {}).get('threshold', 3)
        count = 0
        length = 0

        for n in range(1, (len(output) // 2) + 1):
            n_count = count_repeats(output, n)
            if n_count > count:
                count = n_count
                length = n

        if count >= threshold:
            return { 'pass': True, 'score': 1.0, 'reason': f'Output repeats {count} times with length {length}.' }
        else:
            return { 'pass': False, 'score': 0.0, 'reason': f'Output doesn\'t repeat {threshold} or more times.' }


    def no_repeats(output: str, context) -> dict[str, any]:
        result = repeats(output, context)
        result['pass'] = not result['pass']
        result['score'] = 1.0 - result['score']
        return result
Just add it to your promptfooconfig.yaml:

    defaultTest:
      assert:
        - # ----- The output doesn't repeat/get stuck in a loop.
          type: python
          value: file://asserts/repeat.py:no_repeats

I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better !

It isn’t though, I’ve run both through a bunch of coding evals. You nearly certainly didn’t have the right sampling parameters or quantised the KV cache?

Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark


I tried a bunch of stuff with step 3.5 and step 3.7 maybe not as much as you. Could you tell me what parameters and launched you’re using ? Antirez ds4 flash q2-q4 works almost out of the box for me

To be fair: if you're happy with ds4 then IMO stick with it!

Step 3.7 is notably better than 3.5

1. Use the official StepFun GGUF, IQ4_XS - theirs is better tuned in my experience than the other quants

2. Temp 1.0 top_p 0.95 sampling parameters for reasoning/agentic coding

3. It's really quite important that you don't quantise the KV cache: it made a surprising amount of difference to the looping and over thinking I found, at least for the quantised version of the model. I'm using the full F16 for K, and Q8 for V

4. Note that it now supports `reasoning_effort: low|medium|high` in your chat_template_kwargs; this is super useful :)


> in secret is impossible without the whole world knowing.

I'm curious about why this is

Outside of an actual test detonation, presumably this could all happen in a secure place?


For an example of how closely this is monitored see the Oklo fossil reactors[1]

The proportion of fissile isotopes being mined was off by a fraction of a percent, which caused the French government to launch an investigation. It turns out that millions of years ago the site had formed a natural fission reactor which depleted some of the fissile isotopes

[1]https://en.wikipedia.org/wiki/Natural_nuclear_fission_reacto...


You need highly educated individuals, a massive amount of energy expenditure, a massive facility to house your centrifuges, and an active mine to dig up nuclear materials.

It isn't impossible to keep such a secret, but practically it would be incredibly difficult just through the energy requirements and mining scale which would be hard to hide without anybody asking what exactly are you mining and processing.


"mining scale"

Don't need much area, depends on the concentration of radioactives. I have a small mine that's just a pegmatite body about the size of a house which produces almost marble-sized chunks of a thorium-uranium mixed metamict mineral (I suspect samarskite but Raman and XRD can't give any ID,) you'd barely notice it from a private airplane's typical flying height, however you could dig the entirety of it up and you'd have enough unprocessed uranium for some real fun.


You could only somehow sell it. If you tried to enrich that you'd get flagged so fast your head would spin.

It requires very large, high powered centrifuges and tons of uranium. Requires an infrastructure project that is visible from space, even underground. And projects that large are difficult to keep secret anyway.

you're not supposed to spell it out loud. next thing you'll be saying that a gun type nuclear bomb is easier to build than an implosion type nuclear bomb, and then we'll all be off to the races. I mean camps I mean wait shit.

Any large and well resourced enough entity that is interested in building a nuclear weapon already knows how difficult it is to enrich uranium to purity levels necessary for a weapon. It's not exactly a secret.

You need enough people to work on it that some information will leak, and the facilities needed to build nuclear power are pretty big (uranium refinement, etc.), big enough to be visible on satellite footage. Mostly the first point.

My guess would be that sales of the high-tech gear you need, like Uranium centrifuges, are strongly sales/export controlled. Probably someone would also notice if you start mining Uranium ore.

Centrifuges dont need to be mechanically sophisticated and, frankly, do not require tech which did not exist in the 50es.

Espionage.

They didn't say they had never traveled south though

If you put that behind an API, you could sell the service much like the AI providers

And then get sued for fraud and go under, like Builder.ai

What if, and I know this is utterly batshit insane to suggest, but what if we don't lie about what we're doing?

Why this is not a PR for llama.cpp

> likely not well thought out

Or it has been, and cruelty is the point


Surely this will get arbitraged like anything else, where fans who get picks will onsell tickets


You can also make non-transferable tickets. If it’s a decent discount for a specific intended person it makes sense.


The majority of concerts lack sufficient demand for scalpers to make money. It's only the Taylor Swifts and Beyonces whose ticket values exceed the sticker price.


So now we need to run farms of spotify accounts playing songs to get our concert tickets?


You may need to move on to other services like Apple Music


Apple’s prioritization of Apple Music on their HomePod turns me off it a bit. Could help guide users more to alternatives but would reduce services sales.

Meh, I’m being kinda unfair b/c the experience is gonna be better. Shame Spotify forces streaming from phone (YouTube Music can run on HomePod itself like Apple Music). YouTube Music via HomePod might play the audio from a music video instead of playing the real song, so does make sense to shuttle normies to the Apple service, but guess I don’t find the situation perfect.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: