Normalize diacritics in JSON schema by vincent-botbol · Pull Request #1029 · CatalaLang/catala

vincent-botbol · 2026-05-11T15:09:20Z

Currently, this fails if two fields/cases would be normalized to the same string. Should we rename those to fresh identifiers instead? This would complicate the logic quite a lot.
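To make the failure case concrete, here is a minimal Python sketch of diacritic normalization with clash detection. It is not the PR's implementation (the compiler is written in OCaml); the `normalize` rule (NFD decomposition, then dropping combining marks) and both function names are assumptions chosen for illustration.

```python
import unicodedata


def normalize(name: str) -> str:
    """Strip combining marks after NFD decomposition (assumed normalization rule)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_fields(names: list[str]) -> dict[str, str]:
    """Map each original name to its normalized form, failing on a clash."""
    seen: dict[str, str] = {}    # normalized name -> original name
    result: dict[str, str] = {}  # original name -> normalized name
    for name in names:
        norm = normalize(name)
        if norm in seen and seen[norm] != name:
            raise ValueError(
                f"{seen[norm]!r} and {name!r} both normalize to {norm!r}"
            )
        seen[norm] = name
        result[name] = norm
    return result
```

With this rule, a structure declaring both `age` and `âge` is rejected, which is the situation the question above is about.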

Also, while working on this, I noticed that the JSON values generated by the different backends are not consistent with this new schema. Incidentally, they were not consistent with the previous schema either (even for the outputs). Making them consistent will need a bit of future work.

denismerigoux · 2026-05-12T07:57:30Z

I thought these renaming and name-clash problems were solved in a general fashion thanks to @AltGr's Renaming module?

https://github.com/CatalaLang/catala/blob/master/compiler/shared_ast/renaming.mli

But you're right, if each backend has its own renaming rules they won't agree on a unique name that can be re-exported in JSON Schema. This is getting more complicated than planned, but I feel it's one of these cases where we have to go the extra mile to make sure everything works out perfectly... I suggest we discuss this next week and leave this PR as is in the meantime? Thanks Vincent!

AltGr · 2026-05-12T08:39:11Z

The Renaming module is highly customisable, and tuned to each backend. This way we can do minimal renaming in the backends that have proper scoping/shadowing, while still avoiding clashes in the backends that don't (also dodging the keywords in each backend). But that also means the internal constructors and field names don't necessarily match between backends.

The idea when printing the values to the user was to use the original Catala name, which was consistent, readable and avoided clashes. At this point it's not a design flaw, but a bug in some backends which don't use the correct printing function — that part should not be hard to fix, everything is in place. But since we are making other changes...

However, at that point I noticed that the JSON standard allowed arbitrary identifiers, so I thought it was a good idea to leverage this and use the same original Catala source idents. This is where #1017 and this PR become important, as some user-level tools have much stricter restrictions on the JSON they accept.

We discussed with @vincent-botbol yesterday and could see a few ways to solve this reliably, but they all have drawbacks:

- renaming the constructors, struct names and fields with a specific renaming function (maybe taken from Renaming, but without the backend-specific configuration; anyway this probably will have to be done earlier than backend-specific renaming). Then we can either:
  - use these names within the user and JSON printers; this is probably the simplest, but a regression on the user printer which will only see the mangled names (fields α, β would be renamed to e.g. x, x__2)
  - change the runtime-type representations in all backends to store two names (original and normalised) for everything instead of one, and choose the proper one for user vs. JSON printing
- implement the normalisation directly in the JSON printing functions of each backend. May be difficult, and we'll need to ensure they all follow the schema, but we get both printers without additional info. We'll need to rely on this when we implement reading Catala types from JSON as well.
- somehow have the backends access a separate JSON schema to retrieve the correct field mappings (??)
- fix the current bugs and keep our JSON with the original names; it might be simpler after all to post-process this JSON and its schema for normalisation when it needs to be used in non standard-compliant tools.
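For the first option, the `x, x__2` example suggests disambiguating clashing names with a numeric suffix. A rough Python sketch of that idea follows; the function name `fresh_names` and the exact `__n` suffix scheme are assumptions for illustration, not what the Renaming module does.

```python
def fresh_names(normalized: list[str]) -> list[str]:
    """Disambiguate clashing names with a `__n` suffix, in order of appearance."""
    used: set[str] = set()
    result: list[str] = []
    for name in normalized:
        candidate, n = name, 1
        # Bump the suffix until the candidate is unused, so a generated name
        # can never collide with a name that was already emitted.
        while candidate in used:
            n += 1
            candidate = f"{name}__{n}"
        used.add(candidate)
        result.append(candidate)
    return result
```

Note that the output depends on the order in which names are visited, so every backend and the schema generator would have to traverse declarations in the same order to agree on the result.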

vincent-botbol · 2026-05-12T10:08:30Z

> I suggest we discuss this next week and leave this PR as is in the meantime? Thanks Vincent!

Indeed, we continued the discussion this morning, and the simplest course of action would be to add the normalized form to each backend runtime's types as well. Each type would thus contain the original Catala name and its normalized version, which went through the same renaming process for every backend. This way we would maintain consistency. The (not yet existing) deserialization in the backends is another matter though; this deserves a proper discussion, yes.
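As a rough illustration of storing both names, here is a Python sketch in which a runtime field descriptor carries the original and normalized names and each printer picks the one it needs. `FieldName`, `print_user` and `print_json` are hypothetical names, and the user-facing output format is invented; the actual runtime-type representations differ per backend.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldName:
    original: str    # name as written in the Catala source
    normalized: str  # name exposed in JSON and in the JSON schema


def print_user(fields: list[tuple[FieldName, object]]) -> str:
    """User-facing printer: keeps the readable original names."""
    return "{ " + "; ".join(f"{f.original} = {v!r}" for f, v in fields) + " }"


def print_json(fields: list[tuple[FieldName, object]]) -> str:
    """JSON printer: uses the normalized names only."""
    return json.dumps({f.normalized: v for f, v in fields})
```

For example, `print_json([(FieldName("âge", "age"), 42)])` yields `{"age": 42}`, while `print_user` on the same value keeps `âge`.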

Let's put it on hold for now.

Add json schema diactric tests

a31a2ad

github-project-automation Bot added this to Catala - language & tooling May 11, 2026

Normalize JSON field names

4b9a0b8

vincent-botbol force-pushed the normalize-diacritics branch from fa7db96 to 4b9a0b8 May 11, 2026 15:29

denismerigoux assigned vincent-botbol May 12, 2026

denismerigoux moved this to In Progress in Catala - language & tooling May 12, 2026

vincent-botbol marked this pull request as draft May 12, 2026 10:02

vincent-botbol wants to merge 2 commits into master from normalize-diacritics


3 participants
