Hacker News | mcyc's comments

Lichess has a checkmate captcha that I think is cute.

It requires you to solve a mate-in-one puzzle to, e.g., post on the forums.

(Sorry, I don't have a better link; I couldn't find anything non-technical about it.)

https://www.reddit.com/r/chess/comments/q19wgq/til_lichess_d...


Because computers turned out to be so bad at chess? :)

Reverse captcha: only robots can reprove one of the Euler problems on the fly? Statistically speaking we can round the people who can into the outlier group, right?

That's actually interesting:

Like when games detect aimbots, they don't ban people, but put them in an aimbot bracket, so everyone you play with is a cheater.

Provide a captcha that is hard for an unaided human to solve but trivial for an AI (or a human using one), and transparently separate the two groups into two communities.


Yeah, it is interesting to me that it is coming from the _city_'s government. I've seen sovereign AI things at the country level, but this is the first municipal one I have seen.

smart cookies. wish more people thought better of governance and its workers.

You are right about most tokenizers being heavily biased towards English, but the situation is not so bad for Portuguese. Here are some results on the Goldfish corpus [1] with a few different tokenizers. This measures #subwords in tokenized corpus / #characters in corpus, so lower is better.

```
Llama3
english, 0.216
portuguese, 0.285
italian, 0.287
greek, 0.592
```

```
Gemma4
english, 0.219
portuguese, 0.246
italian, 0.249
greek, 0.537
```

```
Kimi2.6
english, 0.214
portuguese, 0.310
italian, 0.308
greek, 0.716
```

Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).
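For anyone who wants to compute a ratio like this themselves, here is a minimal sketch. It assumes the metric is reported as subwords per character (which is what values around 0.2 for English imply, i.e. roughly 4 to 5 characters per subword). `toy_tokenize` is a hypothetical stand-in that chops words into pieces of up to 4 characters; it is not any of the tokenizers above, so swap in a real tokenizer's encode function to get meaningful numbers.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split on whitespace, then chop each
    word into chunks of at most 4 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


def subwords_per_char(corpus: Iterable[str],
                      tokenize: Callable[[str], List[str]]) -> float:
    """Total number of subwords divided by total number of characters.
    Lower values mean the tokenizer compresses the text better."""
    n_chars = 0
    n_subwords = 0
    for line in corpus:
        n_chars += len(line)
        n_subwords += len(tokenize(line))
    return n_subwords / n_chars


corpus = ["tokenizers are fun", "abcd efgh"]
ratio = subwords_per_char(corpus, toy_tokenize)
print(f"{ratio:.3f}")
```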

On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
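A rough sketch of the "extension" half of that recipe, under stated assumptions: the vocab and embedding table are plain Python structures, `greedy_tokenize` is a hypothetical stand-in for the old tokenizer, and each new token's embedding is initialized as the mean of the embeddings of the pieces the old tokenizer would have produced. That initialization is one common heuristic, not necessarily what the work in [2] does, and the continual pretraining step is not shown.

```python
from typing import Dict, List, Tuple


def greedy_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Hypothetical stand-in for the old tokenizer: greedy longest match."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r} with the old vocab")
    return pieces


def extend_vocab(old_vocab: Dict[str, int],
                 old_embeddings: List[List[float]],
                 new_tokens: List[str]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return an extended copy of the vocab and embedding table. Each new row
    is the mean of the rows for the token's old-tokenizer pieces."""
    vocab = dict(old_vocab)
    embeddings = [row[:] for row in old_embeddings]
    dim = len(old_embeddings[0])
    for tok in new_tokens:
        if tok in vocab:
            continue  # already present, nothing to add
        ids = [old_vocab[p] for p in greedy_tokenize(tok, old_vocab)]
        mean = [sum(old_embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]
        vocab[tok] = len(embeddings)
        embeddings.append(mean)
    return vocab, embeddings


old_vocab = {"ob": 0, "ri": 1, "ga": 2, "do": 3}
old_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 2.0]]
new_vocab, new_embeddings = extend_vocab(old_vocab, old_embeddings, ["obrigado"])
```

After this, "obrigado" is a single token whose embedding starts at the average of its four old pieces, and the extra training only has to refine it rather than learn it from a random start.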

So I think that continual pretraining for a large base model would have probably been fine for this case with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.

-----------------------

[1]: https://huggingface.co/datasets/goldfish-models/fish-food

[2]: https://arxiv.org/abs/2404.17790


> tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch.

This is very interesting, I didn't know that! Thanks for the links!


This is a fantastic guide! I did a lot of work on structured generation for my PhD. Here are a few other pointers for people who might be interested:

Some libraries:

- Outlines, a nice library for structured generation
  - https://github.com/dottxt-ai/outlines
- Guidance (already covered by FlyingLawnmower in this thread), another nice library
  - https://github.com/guidance-ai/guidance
- XGrammar, a less-featureful but really well optimized constrained generation library
  - https://github.com/mlc-ai/xgrammar
  - This one has a lot of cool technical aspects that make it an interesting project

Some papers:

- Efficient Guided Generation for Large Language Models
  - By the outlines authors, probably the first real LLM constrained generation paper
  - https://arxiv.org/abs/2307.09702
- Automata-based constraints for language model decoding
  - A much more technical paper about constrained generation and implementation
  - https://arxiv.org/abs/2407.08103
- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation
  - A bit of self-promotion. We show where constrained generation can go wrong and discuss some techniques for the practitioner
  - https://openreview.net/pdf?id=DFybOGeGDS

Some blog posts:

- Fast, High-Fidelity LLM Decoding with Regex Constraints
  - Discusses adhering to the canonical tokenization (i.e., not just the constraint, but also what would be produced by the tokenizer)
  - https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html
- Coalescence: making LLM inference 5x faster
  - Also from the outlines team
  - This is about skipping inference during constrained generation if you know there is only one valid token (common in the canonical tokenization setting)
  - https://blog.dottxt.ai/coalescence.html
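To give a flavor of the automata-based approach and of the "skip inference when only one token is valid" idea, here is a toy sketch. Everything in it is an assumption for illustration: the DFA is just a trie over two accepted strings, the vocab is a handful of string tokens, and `score` is a fake stand-in for a model call. None of this is the API of the libraries above, and it stops at the first accepting state, which only works because neither accepted string is a prefix of the other.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Transitions = Dict[Tuple[int, str], int]


def build_dfa(strings: List[str]) -> Tuple[Transitions, Set[int]]:
    """Build a trie-shaped character-level DFA accepting exactly `strings`.
    State 0 is the start state."""
    transitions: Transitions = {}
    accepting: Set[int] = set()
    next_state = 1
    for s in strings:
        state = 0
        for ch in s:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_state
                next_state += 1
            state = transitions[(state, ch)]
        accepting.add(state)
    return transitions, accepting


def step(transitions: Transitions, state: int, token: str) -> Optional[int]:
    """Walk the DFA over the token's characters; None means the token is invalid here."""
    current: Optional[int] = state
    for ch in token:
        current = transitions.get((current, ch))
        if current is None:
            return None
    return current


def constrained_decode(vocab: List[str], transitions: Transitions, accepting: Set[int],
                       score: Callable[[List[str], str], float]) -> Tuple[List[str], int]:
    """Greedy constrained decoding. Returns the tokens and how many times the
    (fake) model had to be called."""
    state, output, model_calls = 0, [], 0
    while state not in accepting:
        allowed = []
        for tok in vocab:
            nxt = step(transitions, state, tok)
            if nxt is not None:
                allowed.append((tok, nxt))
        if not allowed:
            raise RuntimeError("no valid token from this state")
        if len(allowed) == 1:
            tok, state = allowed[0]  # forced token: skip the model call
        else:
            model_calls += 1
            tok, state = max(allowed, key=lambda pair: score(output, pair[0]))
        output.append(tok)
    return output, model_calls


transitions, accepting = build_dfa(['{"ok": true}', '{"ok": false}'])
vocab = ['{"', "ok", '": ', "true", "false", "}", "maybe"]
fake_score = lambda prefix, tok: 1.0 if tok == "true" else 0.0  # pretend model prefers "true"
tokens, model_calls = constrained_decode(vocab, transitions, accepting, fake_score)
```

In this toy setup the model is only consulted once (to pick between "true" and "false"); the other four tokens are forced by the constraint.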


Hello, the part about canonical filtering in https://openreview.net/pdf?id=DFybOGeGDS doesn't seem to try to account for pretokenization. For example, if you receive " 天天中彩票APP" in o200k, it means there has to be a lowercase letter within the span of letters, and while tokens like (4 spaces) may be pairwise compatible with tokens like "123" according to the BPE merge rules, the pretokenizer would split the span of spaces to give (3 spaces), " ", "123" instead. Are you aware of any work that does actual canonical generation for models with this kind of pretokenization regex?
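To make the whitespace example concrete, here is a small sketch. The pattern is a simplified, hypothetical stand-in for a GPT-style pretokenization regex (ASCII-only, written for Python's built-in `re`); the real o200k pattern is more involved, but it shares the two whitespace alternatives that cause the split described above.

```python
import re
from typing import List

# Simplified stand-in, NOT the real o200k pattern:
#   " ?[A-Za-z]+"  a word with an optional leading space
#   "\d{1,3}"      up to three digits, with no leading space allowed
#   "\s+(?!\S)"    whitespace that is not directly followed by non-whitespace
#   "\s+"          any remaining whitespace
PRETOKENIZE = re.compile(r" ?[A-Za-z]+|\d{1,3}|\s+(?!\S)|\s+")


def pretokenize(text: str) -> List[str]:
    """Split text into chunks; BPE merges would then apply only within each chunk."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("    123")  # four spaces followed by digits
print(chunks)
```

With this pattern the four spaces are never kept together in front of the digits: the lookahead forces the run to be cut before its last space, so a four-space token followed by "123" cannot be the canonical tokenization even if the merge rules alone would allow it.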


> Here are a few other pointers

Proceeds to list all the libraries already listed in the guide.


I've never fully understood where Outlines fits in the stack. Is it a way to create a structured output API similar to the ones big providers have? Have you looked at something like BAML?


What a gold mine!

Automata-based constraints is fun.


This is a nice attitude. I think HN is overall pretty nice for geeking out and also hearing other people geek out, but there is still a strain of elitism (not like StackExchange thankfully) and so I'm happy to see comments like this.


> there is still a strain of elitism

Those types can't help themselves so patterns emerge and usernames become recognizable after a while. There are some people who I just don't bother engaging with any more. Of course, those experiences are my own and maybe not the same experience as others.


You can cross whitespace boundaries by setting the flag `--split_by_whitespace` to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...


Just a minor nit: SentencePiece is a library, not a tokenization algorithm. It implements two tokenization algorithms, Unigram and BPE.

BPE builds its vocabulary from the bottom up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it.

The details of UnigramLM are here https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is Section 3.2.

Basically, it just selects all substrings that appear in the corpus up to a certain length (and then maybe trims it a little by discarding rare substrings or something to reduce the initial size a bit and make things faster).
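A naive sketch of that seeding step, just to show the idea. The function name, the `max_len` and `top_k` parameters, and the "always keep single characters" choice are my assumptions for illustration; this brute-force counting is not how SentencePiece implements it (the paper mentions a suffix-array approach).

```python
from collections import Counter
from typing import Iterable, List


def seed_vocab(corpus: Iterable[str], max_len: int = 3, top_k: int = 3) -> List[str]:
    """Count every substring of length 1..max_len, keep all single characters,
    and keep the top_k most frequent longer substrings."""
    counts: Counter = Counter()
    for word in corpus:
        for i in range(len(word)):
            for j in range(i + 1, min(i + max_len, len(word)) + 1):
                counts[word[i:j]] += 1

    # Single characters are always kept so that every word stays tokenizable.
    chars = sorted(s for s in counts if len(s) == 1)

    # Longer substrings: most frequent first, ties broken alphabetically.
    longer = sorted((s for s in counts if len(s) > 1),
                    key=lambda s: (-counts[s], s))
    return chars + longer[:top_k]


seed = seed_vocab(["low", "lower", "lowest"], max_len=3, top_k=3)
print(seed)
```

Unigram training would then start from a (much larger) seed like this and repeatedly prune the pieces that contribute least to the corpus likelihood.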


If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?

Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array algorithm isn't a top-k algorithm.


People may also be interested in Pynini [1], a Python wrapper (+ a lot of additional ease-of-use functionality) of OpenFst [2] (a really great library for transducers).

There are some good tutorials in the form of homework assignments (from like Johns Hopkins and some others) that go through Pynini use cases.

[1] https://www.openfst.org/twiki/bin/view/GRM/Pynini

[2] https://www.openfst.org/


You might be interested in the Recurse Center (https://www.recurse.com/) and the experiences of people who have gone through it (they heavily encourage blogging about your time there so there is lots to read).

Note: I am not affiliated with the Recurse Center, just a big fan.


It's two problems:

1) the sequence length increases too much. Idk what the average token length is for Llama, but imagine it's like 5+ bytes. Using individual bytes as tokens immediately makes the context 5x longer which is super bad for inference speed and memory requirements (since attention inference is quadratic in the length of the sequence).

2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they definitely often have some standalone meaning where embeddings make sense.
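Back-of-the-envelope numbers for point 1, as a sketch. The 5 bytes per token is the made-up figure from above, not a measured value for Llama, and the model of cost is deliberately crude: full self-attention over the sequence scales with the square of its length, while the KV cache grows linearly.

```python
from typing import Dict


def byte_level_overhead(n_tokens: int, bytes_per_token: float) -> Dict[str, float]:
    """Compare a subword sequence of n_tokens to the same text as raw bytes."""
    n_bytes = n_tokens * bytes_per_token
    length_ratio = n_bytes / n_tokens
    return {
        "sequence_length": n_bytes,                  # positions the model must process
        "length_ratio": length_ratio,                # how much longer the context gets
        "attention_cost_ratio": length_ratio ** 2,   # quadratic in sequence length
        "kv_cache_ratio": length_ratio,              # cache memory grows linearly
    }


overhead = byte_level_overhead(n_tokens=1000, bytes_per_token=5)
print(overhead)
```

So under these assumptions a 1000-token prompt becomes 5000 positions, with roughly 25x the attention compute and 5x the KV cache.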

I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].

Figure 4 is a really good example of why byte-level models are wasting computation. Once part of a word is generated, most of the remaining bytes are assigned basically probability 1. But a byte-level model would still have to spend time decoding them. With a subword-level model most of these easy-to-decode bytes would be packed together in a single token so you don't have to decode them individually.

When model APIs bill by the token, this is an important consideration.

[1]: https://arxiv.org/abs/2412.09871


Thank you very much for the thorough reply! I highly appreciate it.

