Normalize diacritics in JSON schema #1029

Draft
vincent-botbol wants to merge 2 commits into master from normalize-diacritics


@vincent-botbol

Contributor

Fixes #1017

Currently, this fails if two fields/cases would be normalized to the same string. Should we rename those to fresh identifiers instead? This would complicate the logic quite a lot.
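For illustration, the "fail on clash" behaviour described above can be sketched as follows. This is a minimal Python sketch, not the PR's code (the Catala compiler is written in OCaml); the function names `normalize` and `normalize_fields` are hypothetical, and it assumes normalization means stripping combining marks after NFD decomposition. Characters without a canonical decomposition (such as `ø`) pass through unchanged.

```python
import unicodedata


def normalize(name: str) -> str:
    """Strip diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_fields(fields: list[str]) -> dict[str, str]:
    """Map each original field name to its normalized form.

    Raises ValueError when two distinct fields collapse to the same
    string (hypothetical stand-in for the failure mentioned above).
    """
    seen: dict[str, str] = {}  # normalized name -> original name
    mapping: dict[str, str] = {}
    for field in fields:
        norm = normalize(field)
        if norm in seen and seen[norm] != field:
            raise ValueError(
                f"{seen[norm]!r} and {field!r} both normalize to {norm!r}"
            )
        seen[norm] = field
        mapping[field] = norm
    return mapping
```

With this sketch, `normalize_fields(["âge", "nom"])` maps `âge` to `age`, while a structure declaring both `age` and `âge` is rejected.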

Also, while working on this, I noticed that the JSON values generated by the different backends are not consistent with this new schema. Incidentally, they were not consistent with the previous schema either (even for the outputs). Making them consistent will need a bit of future work.

@denismerigoux

Contributor

I thought these renaming and name-clash problems were solved in a general fashion thanks to @AltGr's Renaming module?

https://github.com/CatalaLang/catala/blob/master/compiler/shared_ast/renaming.mli

But you're right, if each backend has its own renaming rules they won't agree on a unique name that can be re-exported in JSON Schema. This is getting more complicated than planned, but I feel it's one of those cases where we have to go the extra mile to make sure everything works out perfectly... I suggest we discuss this next week and leave this PR as is in the meantime? Thanks Vincent!

@AltGr

AltGr commented May 12, 2026

Contributor

The Renaming module is highly customisable, and tuned to each backend. This way we can do minimal renaming in the backends that have proper scoping/shadowing, while still avoiding clashes in the backends that don't (also dodging the keywords in each backend). But that also means the internal constructors and field names don't necessarily match between backends.

The idea when printing the values to the user was to use the original Catala name, which was consistent, readable and avoided clashes. At this point it's not a design flaw, but a bug in some backends which don't use the correct printing function — that part should not be hard to fix, everything is in place. But since we are making other changes...

However, at that point I noticed that the JSON standard allows arbitrary identifiers, so I thought it was a good idea to leverage this and use the same original Catala source idents. This is where #1017 and this PR become relevant, as some user-level tools have much stricter restrictions on the JSON they accept.

We discussed with @vincent-botbol yesterday and could see a few ways to solve this reliably, but they all have drawbacks:

  • renaming the constructors, struct names and fields with a specific renaming function (maybe taken from Renaming, but without the backend-specific configuration; anyway this probably will have to be done earlier than backend-specific renaming). Then we can either:
    • use these names within the user and JSON printers; this is probably the simplest, but it is a regression for the user printer, which will only see the mangled names (fields α, β would be renamed to e.g. x, x__2)
    • change the runtime-type representations in all backends to store two names (original and normalised) for everything instead of one, and choose the proper one for user vs. JSON printing
  • implement the normalisation directly in the JSON printing functions of each backend. May be difficult, and we'll need to ensure they all follow the schema, but we get both printers without additional info. We'll need to rely on this when we implement reading Catala types from JSON as well.
  • somehow have the backends access a separate JSON schema to retrieve the correct field mappings (??)
  • fix the current bugs and keep our JSON with the original names; it might be simpler after all to post-process this JSON and its schema for normalisation when it needs to be used in non-standard-compliant tools.
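As a rough illustration of the last option (post-processing), here is a Python sketch that rewrites the object keys of an already-decoded JSON value, falling back to `__2`, `__3`, ... suffixes on clashes. The helper names (`normalize`, `fresh`, `normalize_keys`) and the suffix scheme are assumptions made for this example, not an existing Catala tool. Names are assigned in sorted key order so the result does not depend on the order in which a backend emits its fields.

```python
import unicodedata


def normalize(name: str) -> str:
    """Strip diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fresh(name: str, taken: set[str]) -> str:
    """Return `name`, or `name__2`, `name__3`, ... if already taken."""
    candidate, n = name, 1
    while candidate in taken:
        n += 1
        candidate = f"{name}__{n}"
    return candidate


def normalize_keys(value):
    """Recursively normalize the object keys of a decoded JSON value."""
    if isinstance(value, dict):
        renamed: dict[str, str] = {}
        taken: set[str] = set()
        # Assign names in sorted order so the mapping is deterministic.
        for key in sorted(value):
            new = fresh(normalize(key), taken)
            taken.add(new)
            renamed[key] = new
        # Rebuild the object in its original key order.
        return {renamed[k]: normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value
```

One caveat of this sketch: the mapping is computed from the keys present in each object, so a value that omits an optional field could be renamed differently from the schema. A real tool would more likely compute the mapping once from the schema and apply it to the values.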

@vincent-botbol marked this pull request as draft May 12, 2026 10:02
@vincent-botbol

Contributor Author

> I suggest we discuss this next week and leave this PR as is in the meantime? Thanks Vincent!

Indeed, we continued the discussion this morning, and the simplest course of action would be to add the normalized form to each backend's runtime types as well. The runtime types would then contain both the original Catala name and its normalized version, which goes through the same renaming process for every backend. This way we would maintain consistency. The (not yet existing) deserialization in the backends is another matter though; it deserves a proper discussion, yes.
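A minimal sketch of that idea, in Python for brevity (the real runtimes are per-backend, and `FieldName`, `print_user` and `to_json_object` are hypothetical names): each field carries both names, the user printer reads the original one and the JSON printer reads the normalized one.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldName:
    original: str    # name as written in the Catala source, e.g. "âge"
    normalized: str  # name exposed in the JSON schema, e.g. "age"


def print_user(fields: dict[FieldName, object]) -> str:
    """User-facing printer: keeps the readable original names."""
    return ", ".join(f"{name.original} = {v}" for name, v in fields.items())


def to_json_object(fields: dict[FieldName, object]) -> dict[str, object]:
    """JSON printer: uses the normalized names shared by all backends."""
    return {name.normalized: v for name, v in fields.items()}


person = {FieldName("âge", "age"): 42}
user_text = print_user(person)                   # uses "âge"
json_text = json.dumps(to_json_object(person))   # uses "age"
```

Because the normalized name would be computed once, before any backend-specific renaming, every backend would emit the same JSON keys while user-facing output keeps the source identifiers.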

Let's put it on hold for now.


Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

JSON inputs and schema should rename variable with diacritics

3 participants