fix: Insert spaces for skipped tokens in get_text_from_interval #40
jonhoo wants to merge 1 commit into
When a grammar uses `WS -> skip`, whitespace tokens are completely discarded from the lexer output. `get_text_from_interval` previously concatenated adjacent tokens with no separator, producing unreadable text like `@betarecord` instead of `@beta record`.

This matters because `get_text_from_interval` is used by the default error strategy to build the "input" portion of error messages like `no viable alternative at input '...'`. Without separators, these messages are confusing and hard to parse.

The fix tracks each token's `get_stop()` position and inserts a single space when the next token's `get_start()` exceeds `prev_stop + 1`, indicating a positional gap from skipped content. When tokens are positionally adjacent (e.g. `(a+4)`), no space is inserted.

Importantly, this does not affect grammars that use `WS -> channel(HIDDEN)` instead of `skip`. Hidden-channel tokens remain in the buffer and are included by `get_text_from_interval` natively; their positions are contiguous with their neighbors, so the gap detection never fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
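To make the gap rule concrete, here is a minimal, self-contained sketch of the idea described above. It is not the code from this PR: `Tok`, `tok`, and `text_from_tokens` are hypothetical stand-ins for the runtime's token type and for `get_text_from_interval`, and the `start`/`stop` fields are assumed to be inclusive character indices, mirroring what `get_start()` and `get_stop()` are described as returning.

```rust
/// Hypothetical stand-in for a lexer token (not the runtime's real type).
/// `start` and `stop` are assumed to be inclusive character indices.
#[derive(Debug, Clone)]
struct Tok {
    text: &'static str,
    start: isize,
    stop: isize,
}

/// Small constructor to keep the examples short.
fn tok(text: &'static str, start: isize, stop: isize) -> Tok {
    Tok { text, start, stop }
}

/// Concatenate token texts, inserting one space wherever there is a
/// positional gap between the previous token's stop and this token's start.
fn text_from_tokens(tokens: &[Tok]) -> String {
    let mut out = String::new();
    let mut prev_stop: Option<isize> = None;
    for t in tokens {
        if let Some(p) = prev_stop {
            // A gap means something (e.g. skipped whitespace) was dropped.
            if t.start > p + 1 {
                out.push(' ');
            }
        }
        out.push_str(t.text);
        prev_stop = Some(t.stop);
    }
    out
}

fn main() {
    // Input "@beta record" lexed with `WS -> skip`: index 5 has no token.
    let skipped = [tok("@", 0, 0), tok("beta", 1, 4), tok("record", 6, 11)];
    println!("{}", text_from_tokens(&skipped)); // prints "@beta record"

    // Input "(a+4)": every token is adjacent, so no space is inserted.
    let adjacent = [
        tok("(", 0, 0),
        tok("a", 1, 1),
        tok("+", 2, 2),
        tok("4", 3, 3),
        tok(")", 4, 4),
    ];
    println!("{}", text_from_tokens(&adjacent)); // prints "(a+4)"

    // Same input with `WS -> channel(HIDDEN)`: the whitespace token is still
    // in the buffer, positions are contiguous, and the gap check never fires.
    let hidden = [
        tok("@", 0, 0),
        tok("beta", 1, 4),
        tok(" ", 5, 5),
        tok("record", 6, 11),
    ];
    println!("{}", text_from_tokens(&hidden)); // prints "@beta record"
}
```

Note that in this sketch a gap of any width collapses to a single space, which matches the "inserts a single space" wording above; whether that is the right trade-off for error messages is part of what the discussion below is about.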
Worth noting that the Java version does not do this, and as a result produces sad-to-read outputs like
If the default is not sufficient, you can always implement your own
I may be misunderstanding the actual offending code path here, and while the I'll be honest, I do find concatenating in that way weird as well... but at this stage I'd rather err on the side of what the Java version does, if only by default. If I get this right, in this case, there are a few other ways to get to proper reporting as well.
@alexsnaps I'm not sure I follow. You mean not using
Maybe I'm misunderstanding or making too many assumptions here. Details... and build something like this: But I guess the issue is that Now related to the actual issue, because of the above, I wonder if it shouldn't have been a list of tokens, rather than concatenating them. Again, I'd stick to what the Java version does (if only because ANTLR is a fairly mature, widely used library), but we could maybe add a config option or something to change that behavior on demand? But I'm unsure whether the complexity is worth it given that, as @rrevenantt said, you could "just" provide your own
Again, no grammar expert myself, but it feels wrong for
I'll defer to you both as the experts here; this could well be a case of "the grammar is to blame here", but it's also not clear to me that I'm still not entirely clear on what the "implement the