Summary
metaparser.ParseCST always calls
lexkit.ParseAST(src, gd.GetName(), startRule, v1gd) (see cst.go:54),
which funnels through ParseASTWithOptions(..., nil) and therefore
uses lexkit.EBNFParseOptions() unconditionally. That means callers
of v2 cannot register:
- TokenMatchers: per-production custom tokenisation (needed to
  recover XML attribute values, text content, string literals, etc.)
- IsLexical: whitespace-sensitive productions. EBNFParseOptions
  hardcodes IsLexical: func(string) bool { return false }, so
  v2 grammars always run in syntactic mode.
- Preprocessor: input transformation (e.g. Go's semicolon
  insertion; for DOCX I had to do this myself in a separate Go file).
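To make the nil-options fallback concrete, here is a minimal sketch of the behaviour described above. The parseOptions type is a hypothetical stand-in for lexkit.ASTParseOptions: only the three field names come from this issue, and the field types (other than IsLexical's func(string) bool) are my assumptions, not the real lexkit definitions.

```go
package main

import "fmt"

// parseOptions is a hypothetical stand-in for lexkit.ASTParseOptions.
// The field names come from the issue; the TokenMatchers and Preprocessor
// types are guesses made only so the sketch compiles.
type parseOptions struct {
	TokenMatchers map[string]func(src string, pos int) (end int, ok bool)
	IsLexical     func(production string) bool
	Preprocessor  func(src string) string
}

// defaultOptions mimics what the issue reports about EBNFParseOptions:
// no production is ever treated as lexical.
func defaultOptions() *parseOptions {
	return &parseOptions{IsLexical: func(string) bool { return false }}
}

// resolve mirrors the "nil means defaults" fallback: passing nil leaves the
// caller with no way to influence the options.
func resolve(opts *parseOptions) *parseOptions {
	if opts == nil {
		return defaultOptions()
	}
	return opts
}

func main() {
	// What ParseCST effectively does today: always pass nil.
	fmt.Println(resolve(nil).IsLexical("text_body")) // false

	// What a caller would like to be able to pass instead.
	custom := &parseOptions{IsLexical: func(p string) bool { return p == "text_body" }}
	fmt.Println(resolve(custom).IsLexical("text_body")) // true
}
```

Because ParseCST never exposes the options argument, only the first path is reachable from v2.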
Concrete case
I tried to use gluon v2 to parse word/document.xml from DOCX files.
In XML, whitespace inside element content is significant (it's the
paragraph's text). But every matchTerminal call in syntactic mode
starts with skipWSAndComments() (parse_ast.go:387), which discards
the whitespace before it can be captured. I ended up
preprocessing XML into a whitespace-separated token stream in a
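companion Go file, discarding text bytes entirely:
- gluon/tokenize.go: https://github.com/accretional/proto-docx/blob/main/gluon/tokenize.go
- gluon/xml.ebnf: https://github.com/accretional/proto-docx/blob/main/gluon/xml.ebnf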
If IsLexical were reachable, I could mark text_body as lexical and
let the parser see the raw bytes. If TokenMatchers were reachable,
I could register a matcher for text_body that reads up to the next
<. Neither path is available through v2 today.
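For illustration, the text_body matcher I have in mind could look roughly like this. The function name and the (src, pos) -> (end, ok) signature are hypothetical; I have not checked what shape lexkit's TokenMatchers actually expects, so treat this as a sketch of the idea rather than something that plugs in as-is.

```go
package main

import (
	"fmt"
	"strings"
)

// matchTextBody is a hypothetical token matcher. Starting at pos, it consumes
// raw bytes up to (but not including) the next '<' and returns the end offset.
// Whitespace is preserved because nothing is skipped before matching.
// It reports ok=false for empty text so a parser could try other alternatives.
func matchTextBody(src string, pos int) (end int, ok bool) {
	if pos < 0 || pos >= len(src) {
		return pos, false
	}
	i := strings.IndexByte(src[pos:], '<')
	if i < 0 {
		return len(src), true // text runs to the end of the input
	}
	if i == 0 {
		return pos, false // a tag starts here, so there is no text to match
	}
	return pos + i, true
}

func main() {
	src := "<w:t>  two  spaces </w:t>"
	start := strings.IndexByte(src, '>') + 1
	end, ok := matchTextBody(src, start)
	fmt.Printf("%q %v\n", src[start:end], ok) // "  two  spaces " true
}
```

The leading and trailing spaces survive here, which is exactly what the syntactic-mode whitespace skip throws away.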
Proposal
Add an options-bearing overload to both the RPC and the pure-Go entry
point, e.g.:
// metaparser/cst.go
func ParseCSTWithOptions(req *pb.CstRequest, opts *lexkit.ASTParseOptions) (*pb.ASTDescriptor, error)
or a CstRequest field that carries serialized options (matcher
function registry stays Go-only; IsLexical can be a list of
production names).
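A rough sketch of the "list of production names" idea: the names travel in the request as plain strings, and the server turns them into the func(string) bool predicate that IsLexical has today. The helper name lexicalFromNames and the attr_value production are made up for this example.

```go
package main

import "fmt"

// lexicalFromNames converts a serializable list of production names into a
// predicate with the func(string) bool shape quoted above for IsLexical.
// The helper itself is hypothetical and not part of lexkit or metaparser.
func lexicalFromNames(names []string) func(string) bool {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return func(production string) bool {
		_, ok := set[production]
		return ok
	}
}

func main() {
	// These names would come from a repeated string field on CstRequest.
	isLexical := lexicalFromNames([]string{"text_body", "attr_value"})
	fmt.Println(isLexical("text_body"), isLexical("element")) // true false
}
```

An empty or missing list yields a predicate that always returns false, which matches the current default, so existing callers would not see a behaviour change.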
Either shape would unlock markup-language grammars without forcing
callers to drop down to lexkit directly (which re-introduces the v1
dependency v2 is trying to make invisible).
Full context
#2: experience report with the broader picture.