A quick intro to writing a parser with Treetop

barrkel · on Oct 22, 2010

This page would be more readable if there wasn't a huge black bar overlaid across the middle of it.

aarongough · on Oct 22, 2010

Would you be able to tell me what your screen resolution is?

I decided to go with the bottom-attached menu because it makes navigation painless. On the resolutions I've tested the site with, the difference between the above-menu area and the menu area made reading seem fairly natural.

adbge · on Oct 22, 2010

Even at 1680x1050, I find attached menus really grating. I think it's the lost screen real estate because the larger they are, the more irritating I find them. On the other hand, Twitter went ahead and implemented a top attached menu, so maybe I'm in the minority.

Anyways, back on topic, do you (or any HNers) have any recommendations on how to get started with formal grammars and parsers? Is there a canonical introductory text?

silentbicycle · on Oct 22, 2010

Niklaus Wirth's _Compiler Construction_ (free online, http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf) has a good intro to parsing, though for better or worse it's skewed towards recursive-descent parsing and has a "hit the ground running / focus on theory later" style.

After that, you could dig deeper with Andrew Appel's _Modern Compiler Implementation in {ML,C}_ books. The ML one is better, IMHO. Those cover other methods (LL, LALR, SLR, etc.) in greater detail, and I'd also recommend either for learning compilers in a heartbeat. (Appel's _Compiling with Continuations_ is also excellent, but doesn't cover parsing.)

Following that, Dick Grune's _Parsing Techniques: A Practical Guide_ (http://www.few.vu.nl/~dick/PTAPG.html) is a good reference...and thorough. If you're reading it to learn the basics, it might seem a bit dry, but I think it's at a sweet spot between depth of coverage and deference to the extensive bibliography. I have the second edition; the first is free online. Not sure about the differences, but the coverage of fundamentals probably hasn't changed much. Also, while the other two are compiler books with chapters on parsing, this one is 100% parsing, and gets to a lot of interesting parsing algorithms (e.g. Earley parsing) that don't usually get much love in compiler texts.

Some people will also recommend the Dragon book, but I think those three will be more helpful. I haven't read the new edition, but the old seems drier & less thorough than the Grune book, less direct than the Wirth book, and less modern than the Appel book.

Also: For learning lex and yacc, the intros in the _4.4BSD Programmer's Supplementary Documents_ ("PSD", included with OpenBSD and probably the other BSDs, and not hard to find online) are hard to beat. The O'Reilly _Lex and Yacc_ book somehow manages to be roughly ten times as long yet less informative.

And if you have full control over the syntax used, S-expressions (Lisp), RPN (Forth), or Lua/JSON will let you dodge the issue of parsing entirely.
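
To illustrate how little parsing RPN needs, here is a minimal sketch in Python. The function name `eval_rpn` and the four-operator table are assumptions made for illustration, not anything from the article or from a real Forth. Splitting on whitespace is the whole "parser", and a stack does the rest.

```python
import operator

# Hypothetical operator table for illustration; a real Forth has far more words.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_rpn(source):
    """Evaluate a whitespace-separated RPN expression such as '3 4 + 2 *'."""
    stack = []
    for token in source.split():  # this split is the entire lexer/parser
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))  # anything else is assumed to be a number
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

print(eval_rpn("3 4 + 2 *"))  # prints 14.0
```

There is no grammar, no precedence table, and no lookahead, which is the sense in which the syntax sidesteps parsing.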

adbge · on Oct 22, 2010

That is likely the most thorough, informative response I have ever received. I almost qualified that with "on the internet", but then I realized, hell, it's probably more informative than any response I've ever received anywhere.

I recently saw a news piece about a philanthropist helping people in poor African villages by simply building wells. The tribesmen were so grateful, one said "I wish he lives one hundred and fifty years so that he can continue helping people like us."

Well, Scott, thank you. I hope you live one hundred and fifty years.

silentbicycle · on Oct 22, 2010

There are some real gems in the archives here. For starters, see http://news.ycombinator.com/item?id=835020 . More generally, try googling site:news.ycombinator.com keywords.

And, thanks. :)

aarongough · on Oct 22, 2010

I think the path to learning about parsers is different for everyone. Personally I started by reading docs for parser generators like YACC and Bison, then just started messing around until I felt I understood what was going on.

Picking a grammar that is fairly small is a definite help at the early stages, which is why I chose S-Expressions for this post.

EDIT: getting started with lexers first may actually be an easier way to start. PEGs like in the article combine a lexer and parser, but understanding each separately is important.
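
As a rough sketch of what "each separately" can look like, here is a hand-written lexer and parser for S-expressions in Python. This is not the article's Treetop grammar; the names `TOKEN_RE`, `tokenize` and `parse` are assumptions for illustration, and atoms are simply left as strings.

```python
import re

# Token pattern: a single paren, or a run of characters that is
# neither whitespace nor a paren.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")

def tokenize(source):
    """Lexer: turn text into a flat list of tokens, discarding whitespace."""
    return TOKEN_RE.findall(source)

def parse(tokens):
    """Parser: build nested lists from the token list (consumes the list)."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ')'
        return items
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return token  # atoms stay as strings in this sketch

tokens = tokenize("(define (square x) (* x x))")
print(tokens)
print(parse(tokens))  # prints ['define', ['square', 'x'], ['*', 'x', 'x']]
```

The lexer knows nothing about nesting and the parser never sees whitespace, so each half can be understood and tested on its own.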

silentbicycle · on Oct 22, 2010

In my experience*, PEGs are really good for basic parsing (from basics to a few steps beyond advanced Perl regex hackery), but aren't really a complete substitute for other formal parsing methods. Having separate lexing and parsing phases can make things cleaner, because the grammar no longer has to keep track of purely lexical details like whitespace. PEGs are simple and easy to build interactively, though, so they occupy a very useful middle ground. I'm surprised they aren't more popular.

* I haven't used Treetop or PEG.js, but I've done a lot with Lua's LPEG (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html), which is based on the same formalism.
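
To make the whitespace point concrete, here is a small PEG-style sketch in Python. It uses hand-rolled combinators, not LPEG's or Treetop's API, and every name in it (`regex`, `seq`, `choice`, `many`, `token`, `parse`) is an assumption for illustration. Note how `ws` has to be threaded through the grammar by hand because there is no separate lexer.

```python
import re

# Each parser is a function (text, pos) -> (value, new_pos), or None on failure.

def regex(pattern):
    compiled = re.compile(pattern)
    def parser(text, pos):
        m = compiled.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parser

def seq(*parsers):
    """Match every parser in order; fail if any one fails."""
    def parser(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parser

def choice(*parsers):
    """PEG ordered choice: take the first alternative that matches."""
    def parser(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parser

def many(p):
    """Greedy repetition (PEG '*'); p must consume input when it succeeds."""
    def parser(text, pos):
        values = []
        while True:
            result = p(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parser

ws = regex(r"\s*")         # whitespace is part of the grammar itself
atom = regex(r"[^\s()]+")

def token(p):
    """Wrap p so that it also swallows trailing whitespace."""
    def parser(text, pos):
        result = seq(p, ws)(text, pos)
        return None if result is None else (result[0][0], result[1])
    return parser

def expr(text, pos):
    # A plain function, so the grammar can refer to itself recursively.
    return choice(list_, token(atom))(text, pos)

def list_(text, pos):
    result = seq(token(regex(r"\(")), many(expr), token(regex(r"\)")))(text, pos)
    return None if result is None else (result[0][1], result[1])

def parse(text):
    result = seq(ws, expr)(text, 0)
    if result is None or result[1] != len(text):
        raise SyntaxError("parse error")
    return result[0][1]

print(parse("(define (square x) (* x x))"))
```

With a separate lexer the `token` wrapper and `ws` rule would disappear from the grammar, which is the cleanliness trade-off described above.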

aarongough · on Oct 22, 2010

My experience with really large scale grammars is extremely limited so it's good to hear your thoughts on that!

My main issue so far with PEGs is that the generators seem to produce parsers that are relatively slow. That being said I need to experiment with PEGs in languages other than Ruby. Thanks!

silentbicycle · on Oct 22, 2010

LPEG is actually quite fast. (Often as fast as or faster than well-tuned regexp implementations.)

In general, Lua is usually faster than Ruby and JS because the language itself seems to have been engineered with a very clear understanding of where the language could be expressive/dynamic without unnecessarily sacrificing efficiency, and it's had a long time to mature. LPEG was written by one of the primary Lua authors, and it shows.

If you're geeked about PEGs, I highly recommend reading the paper and then the source. There are some details specific to the Lua C API, but it shouldn't be that hard to port or understand in isolation.

hsmyers · on Oct 22, 2010

At 1440x900 the bar is annoyingly in my face. I'd prefer that it be at the top. Don't mind the small amount that shows on the 'other' side of the bar, just expect menus to be at top--- perhaps if it wasn't so thick it wouldn't be so bothersome? Good article even with the bar...

wwortiz · on Oct 22, 2010

I'm at 1366x768 and your bar is insane with the amount of space it takes up. (That 768 is also taken up by browser tabs and such things, as well as a taskbar.)

Navigation is fine when it is on the top or bottom, or even both, of the page and if you really want something to scroll with the content please choose vertical navigation in a sidebar as that at least is only mildly annoying.

zach · on Oct 22, 2010

The line peeking out below the navigation bar is distracting at best. It's better than a diagonal bar, though.

febeling · on Oct 22, 2010

This example uses Treetop. Citrus is the same variety of parser, a PEG (parsing expression grammar), but instead of generating code it defines Ruby code dynamically. It was really a pleasure to work with. The difference is that you don't need a preliminary compile step (in a rake file, e.g.), which I like better for a language as dynamic as Ruby.

http://github.com/mjijackson/citrus

aarongough · on Oct 22, 2010

The intermediate compilation step is actually optional when using Treetop. You can see in the code in the article that Treetop is directly loading and interpreting the grammar at runtime. (Of course there is always going to be some compilation 'behind the scenes', but there's no need for it to be explicit.)

Citrus looks interesting! From what I can see, the PEG syntax used by Citrus is very similar to Treetop's. I'll definitely check it out more later; I'm particularly interested in the performance difference between the two.

febeling · on Oct 22, 2010

In fact, you're right, it doesn't necessarily write a source file. I missed that last time when I looked. But Treetop will still create ruby code, write it into a string and then eval that. I didn't find that approach really elegant.
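
The two styles can be contrasted with a toy sketch in Python. This is neither Treetop's nor Citrus's actual implementation, and both function names are hypothetical: one builder writes source text into a string and evaluates it, the other constructs the same matcher directly as a closure.

```python
def literal_via_codegen(word):
    """Write matcher source into a string, then exec it (generate-and-eval style)."""
    source = (
        "def match(text, pos=0):\n"
        f"    return pos + {len(word)} if text.startswith({word!r}, pos) else None\n"
    )
    namespace = {}
    exec(source, namespace)  # the generated code only exists as a string until here
    return namespace["match"]

def literal_via_closure(word):
    """Build the same matcher directly, with no intermediate source string."""
    def match(text, pos=0):
        return pos + len(word) if text.startswith(word, pos) else None
    return match

for build in (literal_via_codegen, literal_via_closure):
    match = build("define")
    print(match("define x"), match("lambda x"))  # prints "6 None" both times
```

Both matchers behave identically; the difference is only in how they come into existence, which is a matter of taste and of how easy the result is to debug.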

aarongough · on Oct 22, 2010

I do agree to an extent. I'd be interested to see if Citrus is faster... I'll write up a test in a week or two and we shall see!

chipsy · on Oct 22, 2010

My favorite right now is PEG.js: http://pegjs.majda.cz/

RickHull · on Oct 22, 2010

ctrl-f grammer

Really?

aarongough · on Oct 22, 2010

Fixed. It must have been a late night!