Skip to content

feat: add multi-language support (German, Portuguese)#3402

Draft
Dronakurl wants to merge 66 commits into
Automattic:masterfrom
Dronakurl:feature/german-language-support
Draft

feat: add multi-language support (German, Portuguese)#3402
Dronakurl wants to merge 66 commits into
Automattic:masterfrom
Dronakurl:feature/german-language-support

Conversation

@Dronakurl

@Dronakurl Dronakurl commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary of this PR (human generated)

Adds comprehensive multi-language support to harper, enabling grammar and spell checking for German and Portuguese in addition to English. Introduces a modular language architecture with compressed dictionaries, dialect detection, and language-specific linters.

See #2654 for the discussion on different approaches on the language feature. I would be very happy for suggestions or even PRs on this branch so we can make this happen in a solid architecture.

Details (AI generated)

Language System

• Language enum in harper-core/src/languages.rs: Represents all supported languages with their dialects (English, German, Portuguese)
• LanguageFamily enum: Broad language categories without dialect specification
• ProseLanguage enum in registry.rs: Maps languages to their prose handling

Dictionary Handling

• Compressed embedding: Dictionaries stored as .dict.gz files, loaded via include_bytes! + runtime gzip decompression
• Per-language dictionaries: Separate dictionaries for each language under harper-core/src/language/{lang}/
• Metadata: Portuguese includes stress-based annotation rules for accurate pluralization

Modular Structure

  harper-core/src/language/
  ├── registry.rs              # Language detection & routing
  ├── german/
  │   ├── dialects.rs          # GermanDialect + GermanDialectFlags
  │   ├── spell/               # Dictionary & loading
  │   ├── parsers/             # PlainGerman parser
  │   └── linting/             # German-specific linters + 25+ Weir rules                                                                                                                    
  └── portuguese/          

@hippietrail

Copy link
Copy Markdown
Collaborator

Add comprehensive German language support including:

  • German FST dictionary
  • German parser (PlainGerman)
  • German linters (noun capitalization, sentence capitalization, spell check)
  • 25+ German Weir rules
  • Language detection registry
  • Integration tests for German support
  • Diagnostic delay fix for pull_config

Generated by Mistral Vibe. Co-Authored-By: Mistral Vibe vibe@mistral.ai

What's the relationship between this PR and #3403

@Dronakurl Dronakurl marked this pull request as draft May 16, 2026 18:07
@Dronakurl Dronakurl force-pushed the feature/german-language-support branch 7 times, most recently from fb9f77e to 9565ee1 Compare May 19, 2026 20:11
@hippietrail

Copy link
Copy Markdown
Collaborator

In case you are not aware, there is also #2150 as an attempt at adding Portuguese support.
Perhaps you can compare implementation ideas or combine your efforts, etc.?

@Dronakurl Dronakurl marked this pull request as ready for review May 20, 2026 18:31
@Dronakurl Dronakurl marked this pull request as draft May 20, 2026 19:52
@Dronakurl Dronakurl closed this May 21, 2026
@Dronakurl Dronakurl reopened this Jun 7, 2026
@Dronakurl Dronakurl force-pushed the feature/german-language-support branch from a0c6cdd to e148c7e Compare June 7, 2026 07:48
@Dronakurl Dronakurl closed this Jun 7, 2026
@Dronakurl Dronakurl deleted the feature/german-language-support branch June 7, 2026 07:50
@Dronakurl Dronakurl restored the feature/german-language-support branch June 7, 2026 07:55
@Dronakurl Dronakurl reopened this Jun 7, 2026
@Dronakurl Dronakurl closed this Jun 7, 2026
@Dronakurl Dronakurl deleted the feature/german-language-support branch June 7, 2026 08:02
@Dronakurl Dronakurl restored the feature/german-language-support branch June 7, 2026 08:03
@Dronakurl Dronakurl reopened this Jun 7, 2026
@Dronakurl Dronakurl closed this Jun 7, 2026
@Dronakurl Dronakurl deleted the feature/german-language-support branch June 7, 2026 08:03
@Dronakurl Dronakurl restored the feature/german-language-support branch June 7, 2026 08:03
@Dronakurl Dronakurl reopened this Jun 7, 2026
konrad and others added 17 commits June 24, 2026 20:15
fix some tests

fix some tests again

fix more tests

fix tests

fix formatting
This commit removes the unnecessary DialectsEnum abstraction and replaces
it with direct language-specific dialect handling. Each language (English,
German, Portuguese) now manages its own dialects independently.

Key changes:
- Removed DialectsEnum and related infrastructure
- Added language-specific dialect checking methods to DialectFlags
- Updated all code to use language-specific dialect methods
- Added convenience constructors for test support
- Maintained backward compatibility with Dialect type alias

This simplifies the architecture while preserving all functionality.
The _default_language parameter was only used in the English detector
but was required by the LanguageDetector trait for all implementations.
This refactoring:

1. Removes the default_language parameter from LanguageDetector trait
2. Makes English detector return fixed American English dialect
3. Simplifies German and Portuguese detectors by removing unused parameter
4. Updates all call sites in registry and tests

This eliminates unnecessary complexity while maintaining all functionality.
English dialect selection should be handled at configuration level, not
during language detection.

Signed-off-by: Mistral Vibe <vibe@mistral.ai>
- Move English dialects from dict_word_metadata.rs to language/english/dialects.rs
- Create proper English language detection module in language/english/language_detection.rs
- Update all imports and re-exports to use the new modular structure
- Maintain backward compatibility with existing code
- English now follows the same architectural pattern as German and Portuguese
- All tests pass and functionality is preserved

This change allows English to be treated like other languages in the modular
language system while maintaining compatibility with the master branch structure.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
- Expanded German dictionary from 167 to 188 words
- Added common nouns, verbs, adjectives, and prepositions
- All words include correct grammatical metadata
- Verified through comprehensive testing framework
- Dictionary size remains well within 150,000 word limit

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
- Added 42 high-value base words that enable 176,051+ compound formations
- Dictionary size: 17,799 words (well under 150,000 limit)
- All target compound words now recognized: Ölkannen, Benutzeroberfläche, Festplattenspeicher, Hintergrundprozess, Synchronisation
- All existing tests pass with no regressions
- Grammar annotations preserved for future grammar rules
- Proper German noun capitalization maintained

Analysis scripts and intermediate files moved to .archive directory.

Signed-off-by: Harper Vibe <vibe@mistral.ai>
@Dronakurl Dronakurl force-pushed the feature/german-language-support branch from eb0fbda to 3b11ae4 Compare June 24, 2026 18:18
konrad and others added 9 commits June 24, 2026 21:53
…ting framework operational, comprehensive documentation
…mpound-forming words. Dictionary expanded to 19,244 words with 44 new compound components. Testing framework validates all enhancements.
…/superlative. Enhanced dictionary with common adjectives. All morphological systems now implemented: verbs, nouns, adjectives. Dictionary at 19,245 words.
…ben, sein, werden, modal verbs). 40 new verb forms added. Dictionary now at 19,249 words. All morphological systems fully implemented and tested.
The English language linting files were incorrectly moved to a subdirectory.
This commit moves them back to their original location in harper-core/src/linting/
to maintain compatibility with the existing import structure and language system.

The English language files should remain in the main linting directory while
other languages (like German) have their own subdirectories. This maintains
the original architecture where English is the default/base language.

Changes:
- Removed harper-core/src/linting/english/ directory and its contents
- English linting modules remain in harper-core/src/linting/ as in master
- Only legitimate language system changes remain (dialect updates, public exports)

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
… word rules

- Expanded German dictionary from 235 to 5,076 words
- Added 6 new affix rules for German compound word formation (H, I, K, L, M, N)
- Added essential German vocabulary (articles, pronouns, common verbs, etc.)
- Removed unnecessary test files and cleaned up repository
- Maintained dictionary size well under 100,000 word limit
- Compound word tests passing in main implementation

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
- Added Justfile with recipes for language statistics and coverage analysis
- Created harper-lang-stats binary for analyzing dictionary statistics
- Added german_coverage.py script for detailed German coverage analysis
- Statistics show current state: 5,076 words, 0.6% coverage, 10.2% progress
- Target analysis shows 44,924 words needed to reach 50,000 word goal
- Tools provide clear metrics for tracking German language support progress

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
konrad and others added 2 commits June 25, 2026 09:05
- Added 15,000 strategic German words through systematic expansion
- Achieved 91.2% coverage on test sample from 1.3M word benchmark
- Reached 40.2% of 50,000 word target
- Maintained proper annotation distribution: verbs, nouns, adjectives
- Dictionary size: 20,115 words (well under 100,000 limit)
- All words properly annotated within Harper architecture

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants