I am using javalang to tokenize files which include Unicode escape sequences. These are correctly tokenized as strings, but the item.value is not handled cleanly. Consider the 2 cases below:
Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');
In both cases, item.value is identical, and I get an exception if I try to write item.value to a file. I can catch the error and print the value successfully in Python like this:
    if token_type == 'String':
        try:
            outfile.write(item.value)
        except UnicodeEncodeError:
            # Fall back to an escaped form when the output encoding cannot represent the character.
            outfile.write(item.value.encode('unicode-escape').decode('utf-8'))
However, the Python code above prints the same value for Case 1 and Case 2. I suspect the proper fix is to use raw strings for String token values internal to javalang. Below is an example of raw strings solving the problem.
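The original example is not reproduced in this thread. As a rough illustration only (not the author's code and not javalang's internals), the sketch below uses plain Python strings to show why the two cases become indistinguishable once the escape has been decoded, and why a raw value keeps them apart. The names `case1_raw`, `case2_literal`, `fallback`, and `can_encode` are hypothetical.

```python
# Case 1 as it appears in the Java source: backslash, 'u', and four hex digits.
# A raw string keeps those six characters as-is.
case1_raw = r"'\u2026'"

# Case 2 as it appears in the Java source: the literal U+2026 character.
# Python decodes the escape in a regular string into that single character.
case2_literal = "'\u2026'"

# The raw form and the literal form are different strings.
assert case1_raw != case2_literal

# Decoding the escape up front collapses Case 1 into Case 2,
# which matches the "identical item.value" symptom described above.
decoded_case1 = case1_raw.encode('ascii').decode('unicode-escape')
assert decoded_case1 == case2_literal


def fallback(value):
    """The workaround from the issue: re-escape non-ASCII characters."""
    return value.encode('unicode-escape').decode('utf-8')


def can_encode(value, encoding):
    """Return True if value can be written using the given encoding."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The workaround turns the literal character back into an escape, so the
# output looks like Case 1 no matter which case was in the source.
assert fallback(case2_literal) == case1_raw

# The decoded value cannot be written with an ASCII-only encoding,
# but it can be written as UTF-8; the raw value is pure ASCII.
assert not can_encode(case2_literal, 'ascii')
assert can_encode(case2_literal, 'utf-8')
assert can_encode(case1_raw, 'ascii')
```

If only the write error matters, opening the output file with `encoding='utf-8'` may be enough. Telling the two cases apart still requires the token value to keep the original escape text.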
Good discovery, thanks.
Regards,
Steve
From: chenzimin
Sent: Wednesday, February 6, 2019 2:31 AM
To: c2nes/javalang
Cc: Steve Kommrusch; Author
Subject: Re: [c2nes/javalang] String values don't properly handle unicode escapes (#58)
Hi Steve,
If you comment out this line, https://github.com/c2nes/javalang/blob/7a4af7f5136dd4f4f4b1846b3872f5688429e5db/javalang/tokenizer.py#L489, the unicode string will be stored as a raw string and not converted to characters.
And one more benefit is that the position will also be correct for files containing unicode. I found this when I tried to debug the position error.
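The position effect can be sketched without javalang. The snippet below assumes a hypothetical pre-processing step, `decode_unicode_escapes`, that replaces each `\uXXXX` sequence with its character before tokenizing. It is a simplification for illustration (it ignores Java's rule about preceding backslashes) and is not the code at the linked line.

```python
import re

# Simplified matcher for Java unicode escapes: a backslash, one or more 'u',
# and four hex digits. Assumption: no escaped backslash precedes the sequence.
UNICODE_ESCAPE = re.compile(r'\\u+([0-9a-fA-F]{4})')


def decode_unicode_escapes(source):
    """Replace each \\uXXXX escape with the character it denotes."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


# One line of Java source, kept raw so the escape stays six characters long.
line = r'String s = "\u2026"; int x = 1;'
decoded = decode_unicode_escapes(line)

# Column of the 'int' keyword in the original file versus the decoded text.
col_raw = line.index('int')
col_decoded = decoded.index('int')

# Each decoded escape shortens the text by five characters, so every token
# after it is reported five columns too early relative to the file on disk.
shift = col_raw - col_decoded
print(col_raw, col_decoded, shift)
```

Tokenizing the undecoded text keeps reported columns aligned with the file, which is consistent with the benefit described above.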
jose linked a pull request on Mar 25, 2021 that will close this issue.