Skip to content

fix: Insert spaces for skipped tokens in get_text_from_interval#40

Open
jonhoo wants to merge 1 commit into
antlr4rust:rust-targetfrom
jonhoo:fix/token-text-spacing
Open

fix: Insert spaces for skipped tokens in get_text_from_interval#40
jonhoo wants to merge 1 commit into
antlr4rust:rust-targetfrom
jonhoo:fix/token-text-spacing

Conversation

@jonhoo

@jonhoo jonhoo commented Feb 9, 2026

Copy link
Copy Markdown

When a grammar uses WS -> skip, whitespace tokens are completely discarded from the lexer output. get_text_from_interval previously concatenated adjacent tokens with no separator, producing unreadable text like @betarecord instead of @beta record.

This matters because get_text_from_interval is used by the default error strategy to build the "input" portion of error messages like no viable alternative at input '...'. Without separators, these messages are confusing and hard to parse.

The fix tracks each token's get_stop() position and inserts a single space when the next token's get_start() exceeds prev_stop + 1, indicating a positional gap from skipped content. When tokens are positionally adjacent (e.g. (a+4)), no space is inserted.

Importantly, this does not affect grammars that use WS -> channel(HIDDEN) instead of skip. Hidden-channel tokens remain in the buffer and are included by get_text_from_interval natively — their positions are contiguous with neighbors, so the gap detection never fires.

When a grammar uses `WS -> skip`, whitespace tokens are completely
discarded from the lexer output. `get_text_from_interval` previously
concatenated adjacent tokens with no separator, producing unreadable
text like `@betarecord` instead of `@beta record`.

This matters because `get_text_from_interval` is used by the default
error strategy to build the "input" portion of error messages like
`no viable alternative at input '...'`. Without separators, these
messages are confusing and hard to parse.

The fix tracks each token's `get_stop()` position and inserts a
single space when the next token's `get_start()` exceeds
`prev_stop + 1`, indicating a positional gap from skipped content.
When tokens are positionally adjacent (e.g. `(a+4)`), no space is
inserted.

Importantly, this does not affect grammars that use
`WS -> channel(HIDDEN)` instead of `skip`. Hidden-channel tokens
remain in the buffer and are included by `get_text_from_interval`
natively — their positions are contiguous with neighbors, so the
gap detection never fires.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jonhoo

jonhoo commented Feb 9, 2026

Copy link
Copy Markdown
Author

Worth noting that the Java version does not do this, and so as a result produces sad-to-read outputs like @betarecord that the program using the generated parser cannot disentangle.

@rrevenantt

Copy link
Copy Markdown

If default is not sufficient, you can always implement your own TokenStream, can't you?

@alexsnaps

Copy link
Copy Markdown
Member

I may be misunderstanding the actual offending code path here, and while the msg would indeed be weird, you should probably use the Some(NoAltErr) and the associated offending_token_index to report the error straight into the actual parsed input instead. WS could be many things aside from and pointing the user back to this would be the clearest error reporting anyways, no?

I'll be honest, I do find the concatenating in that way weird as well... but at this stage I'd rather err on the side of what the Java version does, if only by default. If I get this right, in this case, there are a few other ways to get to a proper reporting as well.

@jonhoo

jonhoo commented Feb 28, 2026

Copy link
Copy Markdown
Author

@alexsnaps I'm not sure I follow — you mean not using get_text_from_interval and instead just doing the underlying logic myself? I suppose I could do that, although it's unfortunate to have to replicate the logic that lives in there in each user.

@alexsnaps

Copy link
Copy Markdown
Member

Maybe I'm misunderstanding or doing too many assumptions here.
But is it so that you want to use get_text_from_interval to build an error message, yes? The problem is that WS could be way more than "just" space... Or anything different entirely. All that to say, that what get_text_from_interval could be very different from what the user's input could have been. I'd ideally point back to the user's input and build a "snippet" from that.

Details

... and build something like this:

ERROR: <input>:1:7: Syntax error: extraneous input 'b' expecting <EOF>
| *@a | b
| ......^

But I guess the issue is that 'b' in ...input 'b' expect..., right? Not the 2nd & 3rd lines 🤔

Now related to the actual issue, because of the above, I wonder if it shouldn't have been a list of tokens, over concatenating them. Again, I'd stick to what the Java version does (if only, because antlr is fairly mature library, widely used), but we could add a config or something to change that behavior on demand maybe? But unsure whether the complexity is worth it given, as @rrevenantt said, you could "just" provide your own TokenStream to achieve whatever goals your particular use case would benefit from.

@alexsnaps

alexsnaps commented Mar 1, 2026

Copy link
Copy Markdown
Member

Again, no grammar expert myself, but

When a grammar uses WS -> skip, whitespace tokens are completely discarded from the lexer output. get_text_from_interval previously concatenated adjacent tokens with no separator, producing unreadable text like @betarecord instead of @beta record.

it feel wrong for WS -> skip if @beta record is to be a valid token over @betarecord or @beta record, no? I know this isn't your grammar iirc right from the stream, but... if that's a sketchy requirement to begin with, it feels wrong to adapt the library to that. But again, not a grammar expert by no means. But from the explanation above, it does feel... awkward (but I could really well be missing something here).

@jonhoo

jonhoo commented Mar 14, 2026

Copy link
Copy Markdown
Author

I'll defer to you both as the experts here — this could well be a case of "the grammar is to blame here", but it's also not clear to me that WS -> skip should change what gets output by get_text compared to WS -> hidden? I think the position of "we should match the Java version" is probably fine here, and so if anything maybe this should be filed as a "can we make this better" to the Java implementation.

I'm still not entirely clear on what the "implement the TokenStream yourself" version would look like, but admittedly I haven't dug that deeply into ANTLR itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants