fix: Insert spaces for skipped tokens in `get_text_from_interval` by jonhoo · Pull Request #40 · antlr4rust/antlr4

jonhoo · 2026-02-09T18:26:11Z

When a grammar uses WS -> skip, whitespace tokens are completely discarded from the lexer output. get_text_from_interval previously concatenated adjacent tokens with no separator, producing unreadable text like @betarecord instead of @beta record.

This matters because get_text_from_interval is used by the default error strategy to build the "input" portion of error messages like no viable alternative at input '...'. Without separators, these messages are confusing and hard to parse.

The fix tracks each token's get_stop() position and inserts a single space when the next token's get_start() exceeds prev_stop + 1, indicating a positional gap from skipped content. When tokens are positionally adjacent (e.g. (a+4)), no space is inserted.

Importantly, this does not affect grammars that use WS -> channel(HIDDEN) instead of skip. Hidden-channel tokens remain in the buffer and are included by get_text_from_interval natively — their positions are contiguous with neighbors, so the gap detection never fires.

When a grammar uses `WS -> skip`, whitespace tokens are completely discarded from the lexer output. `get_text_from_interval` previously concatenated adjacent tokens with no separator, producing unreadable text like `@betarecord` instead of `@beta record`. This matters because `get_text_from_interval` is used by the default error strategy to build the "input" portion of error messages like `no viable alternative at input '...'`. Without separators, these messages are confusing and hard to parse. The fix tracks each token's `get_stop()` position and inserts a single space when the next token's `get_start()` exceeds `prev_stop + 1`, indicating a positional gap from skipped content. When tokens are positionally adjacent (e.g. `(a+4)`), no space is inserted. Importantly, this does not affect grammars that use `WS -> channel(HIDDEN)` instead of `skip`. Hidden-channel tokens remain in the buffer and are included by `get_text_from_interval` natively — their positions are contiguous with neighbors, so the gap detection never fires. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jonhoo · 2026-02-09T18:26:58Z

Worth noting that the Java version does not do this, and so as a result produces sad-to-read outputs like @betarecord that the program using the generated parser cannot disentangle.

rrevenantt · 2026-02-11T12:57:44Z

If default is not sufficient, you can always implement your own TokenStream, can't you?

alexsnaps · 2026-02-12T10:19:55Z

I may be misunderstanding the actual offending code path here, and while the msg would indeed be weird, you should probably use the Some(NoAltErr) and the associated offending_token_index to report the error straight into the actual parsed input instead. WS could be many things aside from and pointing the user back to this would be the clearest error reporting anyways, no?

I'll be honest, I do find the concatenating in that way weird as well... but at this stage I'd rather err on the side of what the Java version does, if only by default. If I get this right, in this case, there are a few other ways to get to a proper reporting as well.

jonhoo · 2026-02-28T15:34:44Z

@alexsnaps I'm not sure I follow — you mean not using get_text_from_interval and instead just doing the underlying logic myself? I suppose I could do that, although it's unfortunate to have to replicate the logic that lives in there in each user.

alexsnaps · 2026-03-01T11:01:34Z

Maybe I'm misunderstanding or doing too many assumptions here.
But is it so that you want to use get_text_from_interval to build an error message, yes? The problem is that WS could be way more than "just" space... Or anything different entirely. All that to say, that what get_text_from_interval could be very different from what the user's input could have been. I'd ideally point back to the user's input and build a "snippet" from that.

Details

... and build something like this:

ERROR: <input>:1:7: Syntax error: extraneous input 'b' expecting <EOF>
| *@a | b
| ......^

But I guess the issue is that 'b' in ...input 'b' expect..., right? Not the 2nd & 3rd lines 🤔

Now related to the actual issue, because of the above, I wonder if it shouldn't have been a list of tokens, over concatenating them. Again, I'd stick to what the Java version does (if only, because antlr is fairly mature library, widely used), but we could add a config or something to change that behavior on demand maybe? But unsure whether the complexity is worth it given, as @rrevenantt said, you could "just" provide your own TokenStream to achieve whatever goals your particular use case would benefit from.

alexsnaps · 2026-03-01T11:05:31Z

Again, no grammar expert myself, but

When a grammar uses WS -> skip, whitespace tokens are completely discarded from the lexer output. get_text_from_interval previously concatenated adjacent tokens with no separator, producing unreadable text like @betarecord instead of @beta record.

it feel wrong for WS -> skip if @beta record is to be a valid token over @betarecord or @beta record, no? I know this isn't your grammar iirc right from the stream, but... if that's a sketchy requirement to begin with, it feels wrong to adapt the library to that. But again, not a grammar expert by no means. But from the explanation above, it does feel... awkward (but I could really well be missing something here).

jonhoo · 2026-03-14T17:08:46Z

I'll defer to you both as the experts here — this could well be a case of "the grammar is to blame here", but it's also not clear to me that WS -> skip should change what gets output by get_text compared to WS -> hidden? I think the position of "we should match the Java version" is probably fine here, and so if anything maybe this should be filed as a "can we make this better" to the Java implementation.

I'm still not entirely clear on what the "implement the TokenStream yourself" version would look like, but admittedly I haven't dug that deeply into ANTLR itself.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Insert spaces for skipped tokens in `get_text_from_interval`#40

fix: Insert spaces for skipped tokens in `get_text_from_interval`#40
jonhoo wants to merge 1 commit into
antlr4rust:rust-targetfrom
jonhoo:fix/token-text-spacing

jonhoo commented Feb 9, 2026

Uh oh!

jonhoo commented Feb 9, 2026

Uh oh!

rrevenantt commented Feb 11, 2026

Uh oh!

alexsnaps commented Feb 12, 2026

Uh oh!

jonhoo commented Feb 28, 2026

Uh oh!

alexsnaps commented Mar 1, 2026

Uh oh!

alexsnaps commented Mar 1, 2026 •

edited

Loading

Uh oh!

jonhoo commented Mar 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jonhoo commented Feb 9, 2026

Uh oh!

jonhoo commented Feb 9, 2026

Uh oh!

rrevenantt commented Feb 11, 2026

Uh oh!

alexsnaps commented Feb 12, 2026

Uh oh!

jonhoo commented Feb 28, 2026

Uh oh!

alexsnaps commented Mar 1, 2026

Uh oh!

alexsnaps commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonhoo commented Mar 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alexsnaps commented Mar 1, 2026 •

edited

Loading