ParseCST should accept ASTParseOptions (or equivalent) to reach lexkit escape hatches #3

@fredxfred

Summary

metaparser.ParseCST always calls
lexkit.ParseAST(src, gd.GetName(), startRule, v1gd) (see cst.go:54),
which funnels through ParseASTWithOptions(..., nil) and therefore
uses lexkit.EBNFParseOptions() unconditionally. That means callers
of v2 cannot register:

  • TokenMatchers — for per-production custom tokenisation (needed
    to recover XML attribute values, text content, string literals,
    etc.)
  • IsLexical — whitespace-sensitive productions. EBNFParseOptions
    hardcodes IsLexical: func(string) bool { return false }, so
    v2 grammars always run in syntactic mode.
  • Preprocessor — input transformation (e.g. Go's semicolon
    insertion; for DOCX I had to do this myself in a separate Go file).

Concrete case

I tried to use gluon v2 to parse word/document.xml from DOCX files.
In XML, whitespace inside element content is significant (it's the
paragraph's text). But every matchTerminal call in syntactic mode
starts with skipWSAndComments() (parse_ast.go:387), which erases
the whitespace before it can be captured. I ended up preprocessing
XML into a whitespace-separated token stream in a companion Go file,
discarding text bytes entirely.

If IsLexical were reachable, I could mark text_body as lexical and
let the parser see the raw bytes. If TokenMatchers were reachable,
I could register a matcher for text_body that reads up to the next
<. Neither path is available through v2 today.
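
To make the matcher idea concrete, here is a minimal sketch of the logic such a text_body matcher would need. The function name and its (src, pos) signature are my own assumptions for illustration; I have not checked what shape lexkit's TokenMatchers actually expects, so treat this as the scanning rule only, not as code that plugs into lexkit as written.

```go
package main

import (
	"fmt"
	"strings"
)

// matchTextBody is a hypothetical matcher for a text_body production.
// It consumes raw bytes starting at pos up to (not including) the next
// '<' and returns the end offset. Whitespace is preserved because
// nothing is skipped before matching. An empty match reports ok=false.
func matchTextBody(src string, pos int) (end int, ok bool) {
	if pos < 0 || pos >= len(src) {
		return pos, false
	}
	i := strings.IndexByte(src[pos:], '<')
	switch {
	case i < 0:
		return len(src), true // text runs to the end of the input
	case i == 0:
		return pos, false // next byte is markup, so there is no text here
	default:
		return pos + i, true
	}
}

func main() {
	src := "<w:t>  two  spaces </w:t>"
	start := strings.Index(src, ">") + 1
	end, ok := matchTextBody(src, start)
	fmt.Printf("%q %v\n", src[start:end], ok) // prints "  two  spaces " true
}
```

The leading and trailing spaces survive, which is exactly what the syntactic-mode whitespace skip currently makes impossible.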

Proposal

Add an options-bearing overload to both the RPC and the pure-Go entry
point, e.g.:

// metaparser/cst.go
func ParseCSTWithOptions(req *pb.CstRequest, opts *lexkit.ASTParseOptions) (*pb.ASTDescriptor, error)

or a CstRequest field that carries serialized options (matcher
function registry stays Go-only; IsLexical can be a list of
production names).
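
For the serialized variant, the list of production names could be turned into the func(string) bool that IsLexical already uses. A rough sketch, assuming the names arrive as a plain []string (the helper name lexicalPredicate and the example production attr_value are made up for illustration):

```go
package main

import "fmt"

// lexicalPredicate builds an IsLexical-style predicate from a list of
// production names, e.g. as it might be carried in a request field.
// The helper name is hypothetical; only the func(string) bool shape
// comes from the existing EBNFParseOptions default.
func lexicalPredicate(names []string) func(string) bool {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return func(rule string) bool {
		_, ok := set[rule]
		return ok
	}
}

func main() {
	isLexical := lexicalPredicate([]string{"text_body", "attr_value"})
	fmt.Println(isLexical("text_body"), isLexical("element")) // prints true false
}
```

An empty or missing list yields a predicate that always returns false, which matches today's behaviour.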

Either shape would unlock markup-language grammars without forcing
callers to drop down to lexkit directly (which re-introduces the v1
dependency v2 is trying to make invisible).
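
Either way, the existing entry point can stay backward compatible by delegating with nil options. The sketch below uses local stand-in types rather than the real metaparser and lexkit packages, so every name in it is an assumption; it only models the IsLexical field and returns the selected mode instead of parsing anything.

```go
package main

import "fmt"

// parseOptions is a stand-in for lexkit.ASTParseOptions. Only the
// IsLexical field is modelled; the real struct is not reproduced here.
type parseOptions struct {
	IsLexical func(rule string) bool
}

// resolve fills in defaults without mutating the caller's options.
// The default mirrors the "always syntactic" behaviour described above.
func resolve(opts *parseOptions) parseOptions {
	resolved := parseOptions{IsLexical: func(string) bool { return false }}
	if opts != nil && opts.IsLexical != nil {
		resolved.IsLexical = opts.IsLexical
	}
	return resolved
}

// parseCSTWithOptions is a hypothetical stand-in for the proposed
// overload. It reports which mode the start rule would run in.
func parseCSTWithOptions(startRule string, opts *parseOptions) string {
	o := resolve(opts)
	if o.IsLexical(startRule) {
		return "lexical"
	}
	return "syntactic"
}

// parseCST keeps the current signature and behaviour by passing nil.
func parseCST(startRule string) string {
	return parseCSTWithOptions(startRule, nil)
}

func main() {
	opts := &parseOptions{IsLexical: func(r string) bool { return r == "text_body" }}
	fmt.Println(parseCST("text_body"))                  // prints syntactic
	fmt.Println(parseCSTWithOptions("text_body", opts)) // prints lexical
}
```

Existing callers would see no change, and only callers that pass options opt in to the new behaviour.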

Full context

#2: experience report with the broader picture.
