Skip to content

Portugues#2150

Draft
dantee-e wants to merge 56 commits into
Automattic:masterfrom
dantee-e:portugues
Draft

Portugues#2150
dantee-e wants to merge 56 commits into
Automattic:masterfrom
dantee-e:portugues

Conversation

@dantee-e

@dantee-e dantee-e commented Nov 7, 2025

Copy link
Copy Markdown

Description

Adding support to the portuguese language to Harper. It is not even close to being operational.
For now, added only superficial alterations:

  • New portuguese dictionary and annotations with new rules for stress in words, which is important when determining the plural for words in the language.
  • Created new enum for the languages supported.
  • New parser for plain portuguese. Changed tiny things on it, removing checks that are wrong for the language. The function lex_token now has another variable that receives the language.
  • Some temporary helper files have been added to make it easier to build the portuguese dictionary
  • Started progress on the expansion of verbs from infinitive to their conjugated forms. Thanks to Radamés Manosso for his wonderful Portuguese verb conjugator in excel, amazing feat of skill. I had no idea excel could be used for this kind of thing

Demo

Nothing yet

How Has This Been Tested?

No tests yet

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

If any maintainer sees this, I apologize for not keeping the commit history clean using the conventional commits practices. If there's a way to correct this besides squashing them all into one commit when merging, let me know. I make many commits just as a tiny checkpoint, so I don't usually have the patience to make it nice.

Dante Ramacciotti added 4 commits November 6, 2025 12:48
No processo de incluir ox parox e proparoxitonas
Helper file ox-paros-proparox that classifies the words based on the
stress
Adding derive default to the language enum
@dantee-e dantee-e marked this pull request as draft November 7, 2025 15:11
Dante Ramacciotti added 25 commits November 7, 2025 17:18
It invokes different dictionaries depending on what language you pass to
it
Refactored a bit of the main function to add global arguments for
language and dialect. Not yet decided how to deal with dialects for
other languages than English
Added both masculine and feminine genders. Also added a neutral gender
so that expansions to other languages is made easier.
Now the replace can have a condition (for example, checks if it's a noun
or a verb before replacing)
Now Dialect is a trait that can be used to implement other Dialects. The
old Dialect struct is now called EnglishDialect, and has the same
functionalities as before. The only caveat is that now one must include
also the traits to be able to use the functions
Put the new properties in the wrong place. It was the }. Just a
misplaced }.
Documentation had the wrong use of the Dialog trait
Now DictWordMetadata takes a DialectFlagsEnum instead of a
EnglishDialect. There are some adaptations that needed to be made, for
example creating the Dialect trait, and adding to this trait a function
that gets most used dialects from document using a provided language
(LanguageFamily, which is a variant of the enum that doesn't have the
dialect contained inside it). This  function should not be used for any
of the specific DialectFlags implementations, only for the
DialectFlagsEnum.

On the counterpart, the variant of the function that doesn't receive any
languages should only be used by the specific implementations, which
should be progressively more private, being replaced in general by the
enum.

By the way, the original idea was using in the DictWordMetadata struct a
Box<dyn DialogFlag>, but in the end the trait is not dyn compatible,
hence this workaround
Adapting what needs to be adapted from Language to LanguageFamily
Basically EnglishDialect::American =>
Language::English(EnglishDialect::American) everywhere that uses
LintGroup::new_curated
@dantee-e

dantee-e commented Feb 9, 2026

Copy link
Copy Markdown
Author

Oh sorry I didn't see this when you first posted.

Thanks for the answer! Don't worry at all.
I checked out the post you mentioned and answered about the progress I had, just to give them some perspective on how it's going.

We picked up a great new contributor a week or two ago who's been active in some reasonably deep parts of the code.

This is really great to hear, I hope the project gets a lot of traction when we release the structure for other languages. Whenever I have something a bit more solid I'll get in contact with the more active contributors to do some planning.

@hippietrail

Copy link
Copy Markdown
Collaborator

In case you are not aware, there is now also #3402 as an attempt at adding German support.
Perhaps you can compare implementation ideas or combine your efforts, etc.?

@dantee-e

Copy link
Copy Markdown
Author

Thanks for bringing my attention to it! I took a look at the code, and it's not terrible but it's terribly disorganized. There are some okay things there, but I'll once again ask if it is possible for you or @elijah-potter to check the code I wrote, so that we can merge it and other efforts for other languages are more structured. What I saw there, and I don't want to come off as mean, but it was completely unsustainable, and some structural choices assumed clearly that German would be the only other language to be integrated into Harper.

Of course my code is also not perfect, but at least for most of the languages using the Latin alphabet it is enough and expandable (of course, that's as far as I can tell. I only speak Portuguese, English, interm. German and bit of Spanish and Italian). The other initiatives of expanding harper are typically very specific and local, and I imagine a whole lot of effort (or tokens) is being wasted this way. I am currently in the middle of writing my bachelor thesis, so I can't devote that much attention to this project (regardless of how much I like it and need it), but if my fork is greenlit structure wise I can spend one or two afternoons ironing out whatever bugs are left (there is one right now annoying the fuck out of me) and merging the last few commits so that it can be successfully merged into harper master.

Thanks again,
Dante

@hippietrail

hippietrail commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Thanks for bringing my attention to it! I took a look at the code, and it's not terrible but it's terribly disorganized. There are some okay things there, but I'll once again ask if it is possible for you or @elijah-potter to check the code I wrote, so that we can merge it and other efforts for other languages are more structured.

Unfortunately I just don't know too many parts of the codebase it touches to have confidence reviewing it. I might try to have a look anyway later today.

What I saw there, and I don't want to come off as mean, but it was completely unsustainable, and some structural choices assumed clearly that German would be the only other language to be integrated into Harper.

That's the kind of thing I foresee in any attempt to tackle this. My attempt to avoid it when I tried a Spanish one ages ago stalled because I tried to take everything into account and I got bogged down, especially in all the parts of the codebase where I didn't know how too many things worked.

Of course my code is also not perfect, but at least for most of the languages using the Latin alphabet it is enough and expandable (of course, that's as far as I can tell. I only speak Portuguese, English, interm. German and bit of Spanish and Italian). The other initiatives of expanding harper are typically very specific and local, and I imagine a whole lot of effort (or tokens) is being wasted this way.

Yes we started to get a little influx of AI code. Of much more wildly varying quality than I expected. Some worse than I thought was possible, and some better than what I've managed so far.

I am currently in the middle of writing my bachelor thesis, so I can't devote that much attention to this project

Understood! Thanks for the detailed reply!

@hippietrail hippietrail requested a review from elijah-potter May 20, 2026 04:24
@Dronakurl

Copy link
Copy Markdown
Contributor

I'm aware of PR #2150. The Portuguese PR takes a different approach that makes breaking changes to core types like `Dialect` (changing it from an enum to a trait) and modifies `FstDictionary::curated()` to require a `LanguageFamily` parameter. This breaks compilation of the CLI and requires refactoring throughout the codebase.

Our approach incorporates some of the Portuguese PR's functionality but maintains backward compatibility by adding new functionality without modifying existing types, using a modular `language/` directory structure. It compiles, works today with German, and demonstrates extensibility to Portuguese and other languages without breaking existing code.

@dantee-e

Copy link
Copy Markdown
Author

My friend, @Dronakurl, the code you wrote added German basically as another dialect. Even your answer to this thread is AI generated. You made no separate annotations file, no separation of language from what I can see (besides the language directory, but I'm talking code wise), and you didn't even mention the Language enum, which makes me think you didn't really look at my code.

My changes are breaking changes, that you got right. Yours, however, are not appropriate for merging. It is too simplistic an approach for a problem that is not as simple. Thankfully, as stated somewhere else, very few other crates use harper directly, instead opting for harper-ls, which has all it's functionality preserved with both our codes (or at least I think so, I didn't look at this part in your code).

Finally, "our approach"? Who's we? Only you contributed to the repository, are you referencing claude here? Or is it referencing you?

I apologize for any aggression, but I don't really like when people even answer messages with AI.

@Dronakurl

Copy link
Copy Markdown
Contributor

I apologize, I was carried away and was overly enthusiastic about AI (I read the message before posting it, but you are right about calling me out). You are right about the Dialect approach. It works (at least for German, I am using it right now within helix), but it is ugly. I will try to come back with an approach which follows the Language enum approach.

@dantee-e

Copy link
Copy Markdown
Author

@Dronakurl no, I apologize. I was a bit too aggressive. Nevertheless, I saw some of your code and some parts of it would be good additions to my own code, especially the weir rules for German. As @hippietrail suggested, I think we could join efforts, if you would like to. I tried making my code so that extensions to new languages could be done, I don't know how easily though. If you'd like to take a look and try your hand at it, I would love to get some pull requests to it. This way we could both move forwards instead of having competing implementations.

If you have any doubts about how some things work in my part of the code, be sure to send me a message and I'll answer as quickly as possible. Looking forward to working with you!

@hippietrail

Copy link
Copy Markdown
Collaborator

My friend, @Dronakurl, the code you wrote added German basically as another dialect. Even your answer to this thread is AI generated. You made no separate annotations file, no separation of language from what I can see (besides the language directory, but I'm talking code wise), and you didn't even mention the Language enum, which makes me think you didn't really look at my code.

My changes are breaking changes, that you got right. Yours, however, are not appropriate for merging. It is too simplistic an approach for a problem that is not as simple. Thankfully, as stated somewhere else, very few other crates use harper directly, instead opting for harper-ls, which has all it's functionality preserved with both our codes (or at least I think so, I didn't look at this part in your code).

I wouldn't worry about breaking changes for something this big. We do breaking changes for major version bumps, as happened recently. I can appreciate both extending the meaning of Dialect to mean a combination of language + dialect but I think it would become fragile especially when dealing with regionalisms. My best second language, Spanish, has tons of them, especially in slang, and I know German does and would be shocked if Portuguese didn't (-:

Finally, "our approach"? Who's we? Only you contributed to the repository, are you referencing claude here? Or is it referencing you?

I actually also tend to use "us"/"our" when I've used an agent to an extent where I can't personally vouch for 100% of the code. So from that angle it's a bit more honest.

I apologize for any aggression, but I don't really like when people even answer messages with AI.

We might have to think about some way of indicating when comments are from AIs. I try to engage them and direct them so far but do so in a different way than I would with a human.

@Dronakurl

Dronakurl commented May 24, 2026

Copy link
Copy Markdown
Contributor

@dantee-e : The first thing I did, was to port your architecture onto the current master branch (version 2.2.1). There were some conflicts, so I could not just rebase it easily. Thus, I smushed it all together in one commit on top of the master branch. Then I added the German language support on top of it, together with a modular setup, so that it is extensible to support more languages.

I have not reviewed it yet, but just that you know that I am trying to make it work.

https://github.com/Dronakurl/harper/tree/language-support-new

@dantee-e

Copy link
Copy Markdown
Author

@Dronakurl Nice! I just remembered it now, if possible, try working on top of the merge-main branch. I was stupid and forgot to merge it back into the portuguese branch, so its quite a bit ahead of the others

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants