Python-native Arabic morphology engine powered by NumPy.
PySarf performs root extraction, morphological pattern identification, segmentation, and stemming for Arabic text. It uses a rule-based hypothesis-and-rank algorithm with vectorized NumPy batch processing -- no ML models, no Java dependencies, no subprocess calls.
- Root extraction -- derive the trilateral/quadrilateral root from any Arabic word
- Pattern identification -- identify the morphological pattern (wazn/وزن) of a word
- Segmentation -- split words into prefixes + stem + suffixes
- Stemming -- fast greedy affix removal
- Batch processing -- vectorized NumPy pipeline for processing word lists
- Gulf dialect -- built-in support for Gulf Arabic normalization and lexical mappings
- Pure Python -- `pip install` and go. Single dependency: NumPy

Install with pip:

    pip install pysarf

For development:

    git clone https://github.com/Rashidbm/pysarf.git
    cd pysarf
    pip install -e ".[dev]"

Quick start:

    from pysarf import PySarf

    sarf = PySarf()

    # Full morphological analysis
    result = sarf.analyze("المكتبات")
    print(result.root)      # كتب
    print(result.pattern)   # مفعل
    print(result.stem)      # مكتب
    print(result.prefixes)  # ('ال',)
    print(result.suffixes)  # ('ات',)
    print(result.score)     # 0.85 (confidence 0.0-1.0)

Creating an analyzer:

    from pysarf import PySarf

    # Standard MSA analyzer
    sarf = PySarf()

    # Gulf dialect analyzer
    sarf_gulf = PySarf(dialect="gulf")

    # Custom data directory
    sarf_custom = PySarf(data_dir="/path/to/data")

Parameters:

- `dialect` -- `"msa"` (default) for Modern Standard Arabic, `"gulf"` for Gulf Arabic
- `data_dir` -- optional path to custom data files (defaults to bundled data)


`analyze` -- full morphological analysis: root extraction, pattern matching, and segmentation.

    result = sarf.analyze("يكتبون")
    result.root          # "كتب"
    result.root_letters  # ("ك", "ت", "ب")
    result.pattern       # "يفعلون" (Arabic pattern)
    result.pattern_name  # "yaFCaLuun" (transliterated)
    result.stem          # "كتب"
    result.prefixes      # ("ي",)
    result.suffixes      # ("ون",)
    result.score         # 0.8 (confidence 0.0-1.0)
    result.is_oov_guess  # False

`extract_root` -- extract just the root. Returns `None` if unknown.

    sarf.extract_root("استخراج")  # "خرج"
    sarf.extract_root("مكتبة")    # "كتب"
    sarf.extract_root("hello")    # None

`stem` -- fast greedy stemming (affix removal only, no root validation).

    sarf.stem("وسيكتبون")  # "كتب"
    sarf.stem("المدرسة")   # "مدرس"

`segment` -- segment a word into its morphological components.

    seg = sarf.segment("وبالمدرسة")
    seg.prefix_segments  # ("و", "ب", "ال")
    seg.stem             # "مدرس"
    seg.suffix_segments  # ("ة",)
    seg.segments         # ("و", "ب", "ال", "مدرس", "ة")

`identify_pattern` -- identify the morphological pattern (wazn) of a word.

    sarf.identify_pattern("كاتب")   # "فاعل"
    sarf.identify_pattern("مكتوب")  # "مفعول"

All batch methods accept a `list[str]` and use vectorized NumPy processing for lists of 10+ words.

    results = sarf.analyze_batch(["كاتب", "مدرسة", "يدرسون"])
    len(results)      # 3
    results.roots     # ["كتب", "درس", "درس"]
    results.patterns  # ["فاعل", "مفعل", ...]
    results.stems     # ["كاتب", "مدرس", "درس"]
    results.scores    # [0.8, 0.85, 0.75]

    # Index into individual results
    result = results[0]  # AnalysisResult for "كاتب"

    roots = sarf.extract_roots_batch(["كتاب", "مدرسة", "جميل"])
    # ["كتب", "درس", "جمل"]

    stems = sarf.stem_batch(["المكتبات", "يدرسون", "كاتبة"])
    # ["مكتب", "درس", "كاتب"]

    segments = sarf.segment_batch(["والكاتب", "بالمدرسة"])
    # List of SegmentResult objects

`AnalysisResult` fields:

| Field | Type | Description |
|---|---|---|
| `word` | `str` | Original input word |
| `root` | `str \| None` | Extracted root (e.g., "كتب") |
| `root_letters` | `tuple[str, ...] \| None` | Root as individual letters |
| `pattern` | `str \| None` | Arabic pattern (e.g., "فاعل") |
| `pattern_name` | `str \| None` | Transliterated pattern (e.g., "FaaCiL") |
| `stem` | `str` | Stem after affix removal |
| `prefixes` | `tuple[str, ...]` | Stripped prefixes, in order |
| `suffixes` | `tuple[str, ...]` | Stripped suffixes, in order |
| `score` | `float` | Confidence score (0.0 - 1.0) |
| `is_oov_guess` | `bool` | True if root not found in database |


`SegmentResult` fields:

| Field | Type | Description |
|---|---|---|
| `word` | `str` | Original input word |
| `segments` | `tuple[str, ...]` | All segments in order |
| `prefix_segments` | `tuple[str, ...]` | Prefix segments only |
| `stem` | `str` | The stem segment |
| `suffix_segments` | `tuple[str, ...]` | Suffix segments only |

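
The relationship between these fields can be sketched in a few lines. `SegmentSketch` is a hypothetical stand-in written for this illustration, not PySarf's actual class. It only assumes, based on the `segment` example, that `segments` is the prefix segments, then the stem, then the suffix segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentSketch:
    """Hypothetical stand-in for SegmentResult (not PySarf's own class)."""

    word: str
    prefix_segments: tuple
    stem: str
    suffix_segments: tuple

    @property
    def segments(self):
        # Assumption: all segments = prefixes, then the stem, then suffixes.
        return self.prefix_segments + (self.stem,) + self.suffix_segments


# Values copied from the segment("وبالمدرسة") example.
seg = SegmentSketch(
    word="وبالمدرسة",
    prefix_segments=("و", "ب", "ال"),
    stem="مدرس",
    suffix_segments=("ة",),
)
print(seg.segments)
```

For this word, joining the segments gives back the input. That need not hold in general, since normalization may change letters before segmentation.
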

Fields of the object returned by `analyze_batch`:

| Field | Type | Description |
|---|---|---|
| `words` | `list[str]` | Original input words |
| `roots` | `list[str \| None]` | Extracted roots |
| `patterns` | `list[str \| None]` | Pattern names |
| `stems` | `list[str]` | Stems |
| `scores` | `list[float]` | Confidence scores |

Supports `len()` and indexing (`results[i]` returns an `AnalysisResult`).

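
A container with this shape (column-style lists plus `len()` and indexing) could look roughly like the sketch below. `ResultSketch` and `BatchSketch` are hypothetical names used only here; PySarf's real classes may be built differently, for example on top of NumPy arrays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical per-word record with a subset of the AnalysisResult fields."""

    word: str
    root: Optional[str]
    pattern: Optional[str]
    stem: str
    score: float


class BatchSketch:
    """Hypothetical batch container: column views plus len() and indexing."""

    def __init__(self, results):
        self._results = list(results)

    def __len__(self):
        return len(self._results)

    def __getitem__(self, index):
        # Indexing returns the per-word record, as results[i] does in the README.
        return self._results[index]

    @property
    def words(self):
        return [r.word for r in self._results]

    @property
    def roots(self):
        return [r.root for r in self._results]

    @property
    def scores(self):
        return [r.score for r in self._results]


# Values copied from the analyze_batch example, first two words only.
batch = BatchSketch([
    ResultSketch("كاتب", "كتب", "فاعل", "كاتب", 0.8),
    ResultSketch("مدرسة", "درس", "مفعل", "مدرس", 0.85),
])
print(len(batch), batch.roots, batch[0].pattern)
```
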
Gulf Arabic support includes character normalization and lexical mappings:

    sarf_gulf = PySarf(dialect="gulf")

    # Gulf-specific characters normalized to MSA
    # پ → ب, چ → ج, گ/ک → ك, ی → ي

    # Lexical mappings (Gulf → MSA equivalents)
    result = sarf_gulf.analyze("وين")  # Gulf "where" → analyzed via MSA mapping

Root extraction accuracy, benchmarked against two standard Arabic corpora:

| Dataset | Words | Accuracy |
|---|---|---|
| Quranic Arabic Corpus | 14,316 | 97.2% |
| Arabic Digital Humanities | 2,064 | 89.3% |
PySarf combines a rule-based hypothesis-and-rank algorithm with corpus-verified correction tables. It uses no machine learning models -- accuracy comes from linguistic rules, a root database of 9,520 entries, root frequency weighting, and 1,601 corpus-verified word-level overrides.
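
The table does not state how accuracy is scored. The sketch below assumes plain exact-match accuracy over (word, gold root) pairs. `root_accuracy` and the toy data are hypothetical and are not taken from `benchmarks/bench_accuracy.py`.

```python
def root_accuracy(predict, gold):
    """Share of (word, root) pairs whose predicted root equals the gold root."""
    if not gold:
        raise ValueError("gold must contain at least one (word, root) pair")
    correct = sum(1 for word, root in gold if predict(word) == root)
    return correct / len(gold)


# Toy predictor standing in for a real analyzer. The words reuse examples
# from this README; the last prediction is deliberately wrong.
toy_predictions = {"مكتبة": "كتب", "استخراج": "خرج", "جميل": "جلم"}
gold = [("مكتبة", "كتب"), ("استخراج", "خرج"), ("جميل", "جمل")]

accuracy = root_accuracy(toy_predictions.get, gold)
print(f"{accuracy:.1%}")
```

With a real analyzer, `predict` could be a method such as `sarf.extract_root`, and `gold` would come from the annotated corpus.
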
To run the benchmarks yourself:

    python benchmarks/bench_accuracy.py

PySarf uses a hypothesis-and-rank algorithm:
- Normalize -- strip diacritics, expand shadda, normalize alef variants
- Segment -- generate all valid prefix/suffix stripping hypotheses
- Match patterns -- for each stem hypothesis, match against 60 morphological patterns using vectorized NumPy broadcasting
- Extract roots -- extract candidate roots from pattern slots
- Validate -- check each candidate root against a database of 9,520 Arabic roots
- Transform -- try weak-letter substitutions for hollow, defective, and assimilated roots
- Score and rank -- score candidates by pattern frequency, root validity, segmentation confidence, and morphological features
- Return best -- return the highest-scoring candidate (or an OOV guess if no valid root found)
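
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not PySarf's implementation. The affix lists, root set, pattern list, and scoring values are made up for the example. It uses simple loops instead of vectorized NumPy broadcasting and skips the weak-letter transform step. The Gulf character map follows the mapping listed in the Gulf dialect example.

```python
import re

# Toy stand-ins for the bundled data (assumptions for illustration only).
PREFIXES = ("", "و", "ال", "وال")
SUFFIXES = ("", "ة", "ات")
ROOTS = {"كتب", "درس"}
PATTERNS = ("فعل", "فاعل", "مفعل", "مفعول")
SLOTS = "فعل"  # in a pattern, these three letters mark the root slots

# Short vowels, tanwin, sukun, and tatweel (shadda is handled separately).
MARKS = re.compile("[\u064B-\u0650\u0652\u0640]")
SHADDA = re.compile("(.)\u0651")
# Alef with hamza above/below and alef with madda -> bare alef.
ALEF_MAP = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Gulf letters: peh -> beh, tcheh -> jeem, gaf/keheh -> kaf, Farsi yeh -> yeh.
GULF_MAP = str.maketrans({
    "\u067E": "\u0628", "\u0686": "\u062C",
    "\u06AF": "\u0643", "\u06A9": "\u0643", "\u06CC": "\u064A",
})


def normalize(word, dialect="msa"):
    """Strip diacritics, expand shadda, and normalize alef variants."""
    word = MARKS.sub("", word)
    word = SHADDA.sub(r"\1\1", word)  # a letter with shadda becomes a doubled letter
    word = word.translate(ALEF_MAP)
    if dialect == "gulf":
        word = word.translate(GULF_MAP)
    return word


def segmentations(word):
    """Yield every (prefix, stem, suffix) split that leaves a stem of 3+ letters."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            stem_end = len(word) - len(suffix)
            if (word.startswith(prefix) and word.endswith(suffix)
                    and stem_end - len(prefix) >= 3):
                yield prefix, word[len(prefix):stem_end], suffix


def match_pattern(stem, pattern):
    """Return the letters found in the root slots, or None if the stem does not fit."""
    if len(stem) != len(pattern):
        return None
    root = []
    for letter, slot in zip(stem, pattern):
        if slot in SLOTS:
            root.append(letter)
        elif letter != slot:  # literal pattern letters must match exactly
            return None
    return "".join(root)


def analyze(word, dialect="msa"):
    """Return the best (score, root, pattern, prefix, stem, suffix), or None."""
    candidates = []
    for prefix, stem, suffix in segmentations(normalize(word, dialect)):
        for pattern in PATTERNS:
            root = match_pattern(stem, pattern)
            if root is not None:
                # Toy score: known roots outrank unknown ones (not PySarf's scoring).
                score = 1.0 if root in ROOTS else 0.3
                candidates.append((score, root, pattern, prefix, stem, suffix))
    return max(candidates, key=lambda c: c[0]) if candidates else None


print(analyze("المكتبات"))
```

On `"المكتبات"` the sketch picks the same root, pattern, stem, and affixes as the quick-start example. The score is the toy value, not PySarf's confidence.
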
PySarf ships with bundled linguistic data:
| Resource | Count | Source |
|---|---|---|
| Trilateral roots | 6,385 | arabic-roots (Taha Zerrouki) |
| Quadrilateral roots | 3,135 | arabic-roots |
| Broken plural maps | 5,628 | Arramooz dictionary |
| Morphological patterns | 60 | Lengths 3-8 |

Requirements:

- Python >= 3.10
- NumPy >= 1.24

License: MIT