ccombinator is a simple parser combinator library for C written in C23.
Notable features:
- Top-down recursive descent LL-Parser with backtracking.
- Parser memory is managed through reference-counting through the
cc_retainandcc_releasefunctions instead of garbage collection. - Parsers get compiled down to a stack-based internal representation, which then gets interpreted. Therefore, there is no internal recursion.
- Parser results are lazy, they get evaluated only, if they are really needed.
- Parsers can be automatically generated by creating a grammar from regular or bnf expressions. (Not implemented yet.)
- Full UTF-8 input support.
- No (visible) internal library state. Most functions (except IO-operations like
cc_open/cc_close) are pure.
This library is inspired by orangeduck's mpc library.
There are however stark differences in the implementation and design choices that went into ccombinator compared to mpc (e.g. the stack-based IR, full lazy-evaluation, ...).
Parsers can be constructed using various terminal and combinatorial parsers:
// parse a simple variable identifier:
struct cc_parser *alpha = cc_or(2, cc_range('a', 'z'), cc_range('A', 'Z'));
struct cc_parser *digit = cc_range('0', '9');
struct cc_parser *underscore = cc_char('_');
struct cc_parser *p = cc_and(3,
cc_fold_concat,
cc_or(2, cc_retain(alpha), cc_retain(underscore)),
cc_many(cc_fold_concat, cc_or(3, alpha, digit, underscore))
);The source text can be provided in either a UTF-8 encoded string literal or file:
// string input:
struct cc_source *s = cc_string_source(u8"uint64_t");
// file input:
struct cc_source *s = cc_open("/path/to/your/file");The source can then be parsed using the constructed parser:
struct cc_result r;
cc_parse(s, p, &r);
if(r.err) { // syntax error
cc_err_print(r.err);
cc_err_free(r.err);
}
else { // successful
printf("parse result: %s\n", (char*) r.out);
}More examples can be found in the examples directory.
Building:
- Clone this repository:
$ git clone https://github.com/spydr06/ccombinator
$ cd ccombinator- Run the Makefile
$ makeSee make help for more information.
Installing:
# make install
Pull requests are welcome. For major changes, please open an issue first for discussion. Make sure to update tests as appropriate.
This project (as all of my projects) is written without any help of AI. To keep it that way, I will not accept any AI-written or assisted pull requests.
Please note that this library is in early development. Issues can occur and are expected.
Memory management is done through reference counting. The core functions are:
-
Incrementing reference count:
struct cc_parser *cc_retain(struct cc_parser *p);
-
Decrementing reference count: If the count reaches
0, the parser and its contained data is freed automatically.struct cc_parser *cc_release(struct cc_parser *p)
-
A parser can be copied to a new location while preserving the individual reference counts:
void cc_parser_copy(struct cc_parser *d, const struct cc_parser *s);
After parser creation, the default reference count is 1.
Functions that accept a parameter of type struct cc_parser * consume a reference to the parser, so be sure to always increment the reference-count as needed.
-
Parsers are run on an input source using
cc_parse. This attempts to parse the source using the provided parser.int cc_parse(const struct cc_source *s, struct cc_parser *p, struct cc_result *r);
After parsing,
rreturns the parsing result as either a value inr.outor an error report inr.err. -
If you just want to check, if a source is in the language of a parser, and do not care about the return value,
cc_matchesruns the parserpon the input stringinand returnsCC_MATCH_OK,CC_MATCH_NOMATCHor a negative errno value on error:int cc_matches(const char8_t *in, struct cc_parser *p, struct cc_error **e);
-
Errors can be either printed directly or converted to a formatted error string:
char *cc_err_string(struct cc_error *e); int cc_err_print(struct cc_error *e); int cc_err_fprint(struct cc_error *e, FILE *f);
-
Errors need to be manually freed:
void cc_err_free(struct cc_error *e);
-
Error Generation:
struct cc_error *cc_error(const char *failure); struct cc_error *cc_errorf(const char *fmt, ...);
-
Adds an
expected '...'-field:struct cc_error *cc_add_expected(struct cc_error *e, const char *exp); struct cc_error *cc_add_expectedf(struct cc_error *e, const char *fmt, ...);
-
Sets the
at '...'-field:struct cc_error *cc_with_received(struct cc_error *e, char32_t received);
-
Sets the error location:
struct cc_error *cc_with_filename(struct cc_error *e, const char *filename); struct cc_error *cc_with_location(struct cc_error *e, struct cc_location loc);
-
-
Results:
Values of type
struct cc_resultrepresent a successful return value, if theerrfield isNULL, or contain an error message inerrotherwise.- Constructors:
struct cc_result cc_ok(void *out); struct cc_result cc_err(struct cc_error *err);
- Constructors:
Functions returning int (e.g. cc_parse), return 0 on success or a non-zero errno value on failure.
The following functsion construct primitve terminal parsers.
If an error occurrs, NULL is returned and errno set accordingly.
-
Matches any character except end of file (EOF):
struct cc_parser *cc_any(void);
-
Matches the given UTF-8 string exactly:
struct cc_parser *cc_string(const char8_t *s);
sis expected to stay allocated the entire lifetime of the parser. -
Matches the given UTF-32 character:
struct cc_parser *cc_char(char32_t c);
-
Matches a character in the range of
lotohi(inclusive):struct cc_parser *cc_range(char32_t lo, char32_t hi);
-
Matches a character occurring in/exactly once/not in the given
charsset:struct cc_parser *cc_anyof(const char32_t chars[]); struct cc_parser *cc_oneof(const char32_t chars[]); struct cc_parser *cc_noneof(const char32_t chars[]);
charsis aNULL-terminated UTF-32 character array. AU"..."-string may be used.charsis expected to stay allocated the entire lifetime of the parser. -
Matches any character, for which the function
freturns a non-zero value:struct cc_parser *cc_match(int (*f)(char32_t));
-
Matches the end of file:
struct cc_parser *cc_eof(void);
-
Matches the start of file:
struct cc_parser *cc_sof(void);
-
Matches any whitespace character (
' ','\n','\t','\v'):struct cc_parser *cc_whitespace(void);
-
Matches any blank character:
struct cc_parser *cc_blank(void);
-
Matches the newline (
'\n') character. Equal tocc_char('\n'):struct cc_parser *cc_newline(void);
-
Matches the tab (
'\t') character. Equal tocc_char('\t'):struct cc_parser *cc_tab(void);
-
Matches a digit character (
'0'to'9'):struct cc_parser *cc_digit(void);
-
Matches a hexadecimal digit character (
'0'to'9','a'to'f','A'to'F'):struct cc_parser *cc_hexdigit(void);
-
Matches an octal digit character (
'0'to'7'):struct cc_parser *cc_octdigit(void);
-
Matches an alphabetical character:
struct cc_parser *cc_alpha(void);
-
Matches a lower-case character:
struct cc_parser *cc_lower(void);
-
Matches an upper-case character:
struct cc_parser *cc_upper(void);
-
Matches an underscore (
'_') character:struct cc_parser *cc_underscore(void);
-
Matches an alphanumerical character. Equal to
cc_or(2, cc_digit(), cc_alpha()):struct cc_parser *cc_aplhanum(void);
-
Always succeeds without consuming input:
struct cc_parser *cc_pass(void);
-
Fails with the error message
e:struct cc_parser *cc_fail(const char *e);
-
Fails with a formatted error message:
CC_format_printf(1) struct cc_parser *cc_failf(const char *fmt, ...);
-
Always succeeds and returns the given value:
struct cc_parser *cc_lift(cc_lift_t lf); struct cc_parser *cc_lift_val(void *val);
-
Always succeeds and returns a copy of the current parser location. This copy must be freed manually:
struct cc_parser *cc_location(void);
-
Looks up and returns a parser by its binding
name. Fails ifnameis not a known binding:struct cc_parser *cc_lookup(const char *name);
-
Defines a binding
nameto the parseraand executes parserpin this context:struct cc_parser *cc_bind(const char *name, struct cc_parser *a, struct cc_parser *p);
The following functions construct parser combinators that take in other parsers. All arguments of type struct cc_parser* consume a reference, so all parsers passed must have at least a reference count of 1.
-
Runs parser
pand forwards its return value on success. Ifpfails, an erroreis produced:struct cc_parser *cc_expect(struct cc_parser *p, const char *e);
-
Runs parser
pand forwards its return value on success. Ifpfails, a formatted error is produced:CC_format_printf(2) struct cc_parser *cc_expectf(struct cc_parser *p, const char *fmt, ...);
-
Applies a function
fto the return value ofp, ifpsucceeds:struct cc_parser *cc_apply(struct cc_parser *p, cc_apply_t f);
-
Succeeds if parser
pfails and fails ifpsucceeds. No error messages are generated:struct cc_parser *cc_not(struct cc_parser* p);
-
Runs 2 parsers in sequence. Similar to
cc_and(2, ...):struct cc_parser *cc_seq(cc_fold_t f, struct cc_parser *a, struct cc_parser *b);
-
Runs one of the two parsers. Similar to
cc_or(2, ...):struct cc_parser *cc_either(struct cc_parser *a, struct cc_parser *b);
-
Runs
nparsers in sequence. On success, the results are combined using the folding functionf:struct cc_parser *cc_and(unsigned n, cc_fold_t f, ...); struct cc_parser *cc_andv(unsigned n, cc_fold_t f, struct cc_parser **ps);
-
Checks
nparsers and returns the result of the first succeeding parser. Fails if no parser succeeds:struct cc_parser *cc_or(unsigned n, ...); struct cc_parser *cc_orv(unsigned n, struct cc_parser **ps);
-
Runs parser
pzero or more times in sequence untilpfails. Parser results are combined using the folding functionf:struct cc_parser *cc_many(cc_fold_t f, struct cc_parser *p);
-
Runs parser
auntil parserendsucceeds. Parser results are combined using the folding functionf:struct cc_parser *cc_many_until(cc_fold_t f, struct cc_parser *a, struct cc_parser *end);
-
Runs parser
pexactlyntimes in sequence. Parser results are combined using the folding functionf:struct cc_parser *cc_count(unsigned n, cc_fold_t f, struct cc_parser *p);
-
Runs parser
pexactlynor more times in sequence untilpfails. Parser results are combined using the folding functionf:struct cc_parser *cc_least(unsigned n, cc_fold_t f, struct cc_parser *p);
-
Runs parser
aand a sequence of parsersop,puntilopfails. Parser results mimic the form [aopaop...a] and are combined using the folding functionfif anyopparser succeeded. Otherwise, the result of the firstpparser is returned directly:struct cc_parser *cc_chain(cc_fold_t f, struct cc_parser *a, struct cc_parser *op);
-
Runs parser
aonce and a sequence of parserop. Parser results mimic the form [aopop...op] and are combined using the folding functionfif anyopparser succeeded. Otherwise, the result of the firstpparser is returned directly:struct cc_parser *cc_postfix(cc_fold_t f, struct cc_parser *a, struct cc_parser *op);
-
Runs parser
s,aandein sequence, while only generating values fora. Similar tocc_and(3, cc_fold_middle, cc_noreturn(s), a, cc_noreturn(e)):struct cc_parser *cc_between(struct cc_parser *s, struct cc_parser *a, struct cc_parser *e);
-
Attempts to parse any number of whitespace, then
aand then any number of whitespace again. Similar tocc_and(3, cc_fold_middle, cc_many(cc_whitespace()), a, cc_many(cc_whitespace())):struct cc_parser *cc_token(struct cc_parser *a);
-
Attempts to run parser
pand always succeeds:struct cc_parser *cc_maybe(struct cc_parser *p);
-
Constructs a recursive parser from the fixpoint-function
f.pcan be used to pass additional arguments tof:struct cc_parser *cc_fix(cc_fix_t f, void *p);
-
Temporarily disables return value generation for the parser
p. This can reduce the memory footprint of the parser. This causeslift,fold,applyand other passed callback functions inpand descendants not to be called. This flag is set automatically by combinators taking in folding functions whenNULLis passed:struct cc_parser *cc_noreturn(struct cc_parser *p);
-
Temporarily disables error report generation for the parser
p. This can reduce the memory footprint of the parser:struct cc_parser *cc_noerror(struct cc_parser *p);
-
Enables the
FREE_DATAflag on the parserp. Whenpgets deleted, the user data associated with it is passed tofree:struct cc_parser *cc_free_data(struct cc_parser *p);
A cc_grammar represents a list of named cc_parsers (Rules).
-
Returns a parser with the name
namein the grammarg:struct cc_parser *cc_parser_by_name(const struct cc_grammar *g, const char *name);
If no such parser is found,
NULLis returned. -
Grammars need to be freed manually:
void cc_grammar_free(struct cc_grammar *g);
This also calls
cc_releaseon every contained parser. -
Gets a single rule (as an entry-point for parsing):
struct cc_parser *cc_rule(const struct cc_grammar *g, const char *name);
The following functions aid constructing a parser by supplying a regular expression:
struct cc_parser *cc_regex_from(const struct cc_source *s, struct cc_error **e);
struct cc_parser *cc_regex(const char8_t *re, struct cc_error **e);Supported regular expressions:
- Any symbol
. - End-of-file (EOF) symbol
$ - Start-of-file (SOF) symbol
^ - Character classes
\a,\d,\D,\l,\s,\S,\u,\w,\W,\x - Posix-like character classes
[:alpha:],[:digit:], ... - Character selections
[a-zA-z0-9_] - Negated character selections
[^a-zA-Z] - Groupings
(...) - Quantifiers
*,+,? - Alternations
a|b|c
Grammars are generated from an input source in an extended Backus-Naur Form.
struct cc_grammar *cc_bnf_from(const struct cc_source *s, const struct cc_action actions[], struct cc_error **e);
struct cc_grammar *cc_bnf(const char8_t *s, const struct cc_action actions[], struct cc_error **e);-
Actions:
Actions are used to reference callback functions from within the BNF code. As they come in various types, multiple constructors exist:
struct cc_action cc_action_value(const char *name, void *value); struct cc_action cc_action_fold(const char *name, cc_fold_t fold); struct cc_action cc_action_apply(const char *name, cc_apply_t apply); struct cc_action cc_action_lift(const char *name, cc_lift_t lift); struct cc_action cc_action_match(const char *name, cc_match_t match);
When passing actions to
cc_bnf/cc_bnf_from, be sure to terminate the array withCC_NULL_ACTION(). It is best to use theCC_ACTIONS(...)macro for this.
Supported expressions:
- Rule Defintions:
myrule = ... ;(top-level) - Terminals:
"foo",'bar', ... - Lift/value actions:
@action(generatecc_lift(action)/cc_lift_val(action)parsers) - Matching actions:
@action(generatescc_match(action)) - Groupings:
( ... ) - Function applications:
@action(...)(generatescc_apply(..., action)) - Optionals:
[ ... ] - Repetitions:
{ ... } - Repetitions with folding action:
@action{ ... }(generatescc_many(action, ...)) - Exceptions:
a - b(match everything in a and not in b) - Concatenations:
a , b , ... - Concatenations with folding action:
@action: a, b, ... - Alternations:
a | b | ...
To keep this library simple, only input formatted as UTF-8 is supported.
-
Constructs a
cc_sourceform an UTF-8 string:struct cc_source *cc_string_source(const char8_t *s); struct cc_source *cc_nstring_source(const char8_t *s, size_t n);
-
Frees a
cc_sourceand closes all associated IO objects:int cc_close(struct cc_source *s);
- Dumps a parser tree into a
FILE-streamf(default:stderr)int cc_debug_dump(struct cc_parser *p); int cc_debug_fdump(struct cc_parser *p, FILE *f);
-
Gets the version string of the ccombinator library:
const char *cc_version(void);
-
Gets the major version component in numerical form:
int cc_version_major(void);
-
Gets the minor version component in numerical form:
int cc_version_minor(void);
ccombinator is licensed under the MIT License. See ./LICENSE for more information.