Describe the bug
Compiling the Pomsky expression [word] for the Python flavor produces \w. But Python's \w deviates from the Unicode definition of a word character:
- It matches the Letter (Lm, Lt, Lu, Ll, Lo) general categories instead of the Alphabetic property.
- It matches code points with a Numeric_Type of Digit, Decimal, or Numeric, but it should match just the Decimal_Number (Nd) general category.
- It doesn't match the Mark (Mn, Mc, Me) general categories, nor Connector_Punctuation (Pc), except for the underscore _.
- It doesn't match characters with the Join_Control property (U+200C, U+200D).
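The points above can be checked directly against Python's re module. This is a minimal sketch; the sample code points are my own illustrative picks (one per bullet), not taken from the Unicode spec text, and the helper name matches_word is hypothetical.

```python
import re
import unicodedata

# One illustrative code point per bullet above (assumed examples).
samples = {
    "alphabetic_not_letter": "\u24b6",  # CIRCLED LATIN CAPITAL LETTER A: category So, but Alphabetic
    "numeric_not_nd": "\u00bd",         # VULGAR FRACTION ONE HALF: category No, Numeric_Type=Numeric
    "mark": "\u093f",                   # DEVANAGARI VOWEL SIGN I: category Mc
    "connector_punctuation": "\u203f",  # UNDERTIE: category Pc
    "join_control": "\u200c",           # ZERO WIDTH NON-JOINER: Join_Control
}


def matches_word(ch: str) -> bool:
    """Return True if Python's re treats the single character as \\w."""
    return re.fullmatch(r"\w", ch) is not None


results = {name: matches_word(ch) for name, ch in samples.items()}

for name, ch in samples.items():
    category = unicodedata.category(ch)
    print(f"{name:24} U+{ord(ch):04X} {category} -> {results[name]}")
```

A spec-conforming \w would match all of these except the fraction; Python's re does the opposite.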
To Reproduce
Run pomsky -f python '[word]+'
Run regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"
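The same reproduction can be run with plain Python, without regex-test. This is a sketch using only the standard library; the test string is the one from the command above (the Hindi word "हिन्दी").

```python
import re

# Same input as the regex-test command: Devanagari letters interleaved with marks.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# \w+ stops at every vowel sign and virama because Python's \w excludes marks.
parts = re.findall(r"\w+", text)
print(parts)  # ['\u0939', '\u0928', '\u0926'], i.e. three separate one-letter matches

# The word as a whole is therefore not matched.
whole = re.fullmatch(r"\w+", text)
print(whole)  # None
```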
Expected behavior
Note that Python's re module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.
Therefore, [word] should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce [a-zA-Z0-9_].
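To illustrate the ASCII-only case: with the re.ASCII flag, Python's \w is documented to behave like [a-zA-Z0-9_]. The sketch below checks that claim over every code point; it says nothing about how Pomsky itself would emit the pattern.

```python
import re
import sys

ascii_word = re.compile(r"\w", re.ASCII)   # \w with Unicode matching disabled
explicit = re.compile(r"[a-zA-Z0-9_]")     # the proposed ASCII-only output

# Collect every code point on which the two patterns disagree.
mismatches = [
    cp
    for cp in range(sys.maxunicode + 1)
    if bool(ascii_word.fullmatch(chr(cp))) != bool(explicit.fullmatch(chr(cp)))
]

print(len(mismatches))  # 0
```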
This is not a satisfactory solution, however, because it makes it impossible to match non-ASCII word characters. Some people may find \w useful even though it is incorrect and matches only a subset of word characters. That is why another Python flavor should be added, targeting the third-party regex module, which has much better Unicode support.
Alternatives
Add a nonstandard_unicode mode, so that \w can be used in flavors where \w matches some, but not all, non-ASCII word characters (i.e. Python and .NET).
Related
python/cpython#44795