Scanners are harder than parsers

I’ve been writing parsers for various projects. Writing a parser (at least with a simple recursive descent parser) typically involves writing a scanner to perform lexical analysis: turning a stream of characters into a stream of tokens. In contrast to a parser takes a stream of tokens and builds a tree.

Intuitively, it seems like an algorithm that converts a sequence to a sequence should be easier to write than an algorithm that converts a sequence into a tree, but annecdotally it really seems like most of the fiddly subtle bugs all belong to the scanner. Off by one errors in various places, accurately counting the visible characters and lines, handling newlines (thanks Windows), etc. Parser bugs tend to be obvious in comparison.