-
-
Notifications
You must be signed in to change notification settings - Fork 86
Fix #139: Handle non-ASCII characters correctly when annotating text #143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
LGTM, needs CHANGELOG entry |
Good point, I'll add one in a bit. |
| incoming.write(data[i]); | ||
| for (int i = 0; i < s.length(); i++) { | ||
| // We write one UTF-16 code unit at a time so that our counting methods can handle multibyte characters correctly. | ||
| byte[] chars = String.valueOf(s.charAt(i)).getBytes(charset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this some more, why do we even use charset here? The text is already a String, and we only use AnsiOutputStream for its calls to emitHtml in the emitter and discard the bytes it writes to the underlying stream. We'd still need to fix this bug no matter what encoding we used here, buy maybe clearer to hard-code UTF-8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, maybe we need iterate through the Unicode code points rather than the UTF-16 code units in case any characters are not in the BMP? I tried adding emoji to the test case, but it looks like the emoji is converted into the replacement character for unmappable characters somewhere along the line before annotate is called (not sure where), so I don't think it's possible to handle it correctly here (although maybe I was doing something wrong in my test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this entire (src/main/) patch could be simplified to
-byte[] data = s.getBytes(charset);
+byte[] data = s.getBytes("US-ASCII");plus deleting charset and the code that sets it. After all, we do not really care what any of the >U+007F characters in the log are for this purpose—we are throwing out the output and just looking for ESC and stuff, so if we send some (byte) '?' through it will not matter.
(Alternately, do not bother getting a byte[] at all, just loop over chars in s and write one byte for each—either the same, if ASCII, or something arbitrary like '?' otherwise.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it seemed easiest to just switch to using US-ASCII so I did that and added an explanatory comment. Edit: Switched to going char-by-char in ffb5045 so that non-BMP characters are handled correctly.
| private static final long serialVersionUID = 1; | ||
|
|
||
| private final String colorMapName; | ||
| private final String charset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this could cause issues if someone upgraded the plugin to the version with this fix and then downgraded to 0.6.0, but it's not clear to me what the lifecycle of a serialized ConsoleAnnotator would be to know for sure if it would matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is only serialized while a console page for a running build is being held open, so I do not think you need to care.
| * We only use AnsiHtmlOutputStream for its calls to Emitter.emitHtml when it encounters ANSI escape | ||
| * sequences. The output of the stream will be discarded. We encode the String using US_ASCII because | ||
| * all ANSI escape sequences can be represented in ASCII using a single byte, and any non-ASCII | ||
| * characters will be replaced with '?', which is good, since it means that byte offsets will match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for commenting out of nowhere, but will that work with non-BMP characters, like emojis?
In Java, "😀".length() is 2 but "😀".getBytes("US-ASCII").length seems to be 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it wouldn't, but see #143 (comment) where my initial tests seemed to show that emojis are already mangled by the time they get to this method. I will double-check whether that is the case, but probably safer to switch to going char-by-char so that this would work for non-BMP characters assuming everything else worked as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, non-BMP characters were broken by the earlier method because markup text was getting inserted in between the two code units of a surrogate pair in some cases. After ffb5045 they should work correctly. Here's a screenshot of running the following pipeline with that commit:
ansiColor {
echo '\033[94;1m[ INFO ] 😀😀Wooh\033[0m😀NotCoolored'
}
Thanks for the comment!
|
@jglick can I leave all this to you to merge/release/comment? I haven't been using Jenkins lately ;) |
jglick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release it I guess!

Fixes #139.
The cause of the issue is that we were using byte indices as
Stringindices, which does not work when theStringcontains characters that require than one byte in the target encoding. To fix the issue, this PR keeps track of the number of multi-byte characters we've seen and how many bytes they require, and uses that information to convert the byte count obtained fromincoming.getCount()into an index into the original text.I originally worked on this in #137 but extracted it here so it can be reviewed independently from the less-important issues in that PR.