I'm trying to annotate some Unicode strings, but the following examples throw errors.
Case 1: Passing Unicode strings.
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'
print(nlp.annotate(text))
throws
AssertionError    Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 11 assert isinstance(text, str)
AssertionError:
because in Python 2 the string is of type 'unicode', not 'str'.
Case 2: Passing encoded Unicode strings:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'.encode('utf-8')
print(nlp.annotate(text))
throws
UnicodeDecodeError    Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 25 data = text.encode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
because the string has already been encoded: in Python 2, calling encode() on a byte string first decodes it implicitly as ASCII, which fails on the non-ASCII byte 0xc3.
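The second failure can be reproduced without pycorenlp. This is only a sketch: it performs the ASCII decode explicitly, which is the step Python 2 does implicitly before encoding a byte string, so it should behave the same in Python 2 and Python 3. The variable names are just for illustration.

```python
# u'K\xf6ln' is u'Köln'; the escape avoids needing a source-encoding header in Python 2.
encoded = u'K\xf6ln is a city in Germany.'.encode('utf-8')

try:
    # Python 2 does this decode implicitly when encode() is called on a byte string.
    encoded.decode('ascii')
    decode_failed = False
    message = ''
except UnicodeDecodeError as exc:
    decode_failed = True
    message = str(exc)

print(decode_failed)
print(message)
```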
These two lines of code in the error messages were both introduced in #6 in May 2016 to fix some Unicode issues.
However, it seems the explicit encoding in line 25 is no longer required: if it is removed, case 2 works perfectly (in both Python 2 and Python 3).
Note also that encoding issues were fixed in CoreNLP in October 2016 (stanfordnlp/CoreNLP#270).
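For illustration, here is a minimal sketch of how both cases could be accepted, as an alternative to simply dropping the explicit encoding. This is not pycorenlp's code: `to_utf8_bytes` is a hypothetical helper, and I'm assuming the request body should be UTF-8 bytes.

```python
def to_utf8_bytes(text):
    """Return UTF-8 bytes for a text string or an already-encoded byte string."""
    if isinstance(text, bytes):
        # Already encoded (Python 2 'str' or Python 3 'bytes'): pass through unchanged.
        return text
    # Text string (Python 2 'unicode' or Python 3 'str'): encode explicitly as UTF-8.
    return text.encode('utf-8')


# Case 1 (text) and case 2 (pre-encoded bytes) yield the same payload.
case1 = to_utf8_bytes(u'K\xf6ln is a city in Germany.')
case2 = to_utf8_bytes(u'K\xf6ln is a city in Germany.'.encode('utf-8'))
print(case1 == case2)
```

Since `bytes` is an alias for `str` in Python 2, the same check should cover both interpreter versions.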