Fix #139: Handle non-ASCII characters correctly when annotating text #143

dwnusbaum · 2018-12-04T19:15:06Z

Fixes #139.

The cause of the issue is that we were using byte indices as String indices, which does not work when the String contains characters that require than one byte in the target encoding. To fix the issue, this PR keeps track of the number of multi-byte characters we've seen and how many bytes they require, and uses that information to convert the byte count obtained from incoming.getCount() into an index into the original text.

I originally worked on this in #137 but extracted it here so it can be reviewed independently from the less-important issues in that PR.

dblock · 2018-12-05T00:54:58Z

LGTM, needs CHANGELOG entry

dwnusbaum · 2018-12-05T13:50:00Z

needs CHANGELOG entry

Good point, I'll add one in a bit.

dwnusbaum · 2018-12-05T16:27:53Z

src/main/java/hudson/plugins/ansicolor/ColorConsoleAnnotator.java

-                    incoming.write(data[i]);
+                for (int i = 0; i < s.length(); i++) {
+                    // We write one UTF-16 code unit at a time so that our counting methods can handle multibyte characters correctly.
+                    byte[] chars = String.valueOf(s.charAt(i)).getBytes(charset);


Thinking about this some more, why do we even use charset here? The text is already a String, and we only use AnsiOutputStream for its calls to emitHtml in the emitter and discard the bytes it writes to the underlying stream. We'd still need to fix this bug no matter what encoding we used here, buy maybe clearer to hard-code UTF-8?

Also, maybe we need iterate through the Unicode code points rather than the UTF-16 code units in case any characters are not in the BMP? I tried adding emoji to the test case, but it looks like the emoji is converted into the replacement character for unmappable characters somewhere along the line before annotate is called (not sure where), so I don't think it's possible to handle it correctly here (although maybe I was doing something wrong in my test).

I suspect this entire (src/main/) patch could be simplified to

-byte[] data = s.getBytes(charset); +byte[] data = s.getBytes("US-ASCII");

plus deleting charset and the code that sets it. After all, we do not really care what any of the >U+007F characters in the log are for this purpose—we are throwing out the output and just looking for ESC and stuff, so if we send some (byte) '?' through it will not matter.

(Alternately, do not bother getting a byte[] at all, just loop over chars in s and write one byte for each—either the same, if ASCII, or something arbitrary like '?' otherwise.)

~~Ok, it seemed easiest to just switch to using US-ASCII so I did that and added an explanatory comment.~~ Edit: Switched to going char-by-char in ffb5045 so that non-BMP characters are handled correctly.

dwnusbaum · 2018-12-11T15:42:07Z

src/main/java/hudson/plugins/ansicolor/ColorConsoleAnnotator.java

    private static final long serialVersionUID = 1;

    private final String colorMapName;
-    private final String charset;


I guess this could cause issues if someone upgraded the plugin to the version with this fix and then downgraded to 0.6.0, but it's not clear to me what the lifecycle of a serialized ConsoleAnnotator would be to know for sure if it would matter.

It is only serialized while a console page for a running build is being held open, so I do not think you need to care.

vincentisambart · 2018-12-12T03:04:06Z

src/main/java/hudson/plugins/ansicolor/ColorConsoleAnnotator.java

+                 * We only use AnsiHtmlOutputStream for its calls to Emitter.emitHtml when it encounters ANSI escape
+                 * sequences. The output of the stream will be discarded. We encode the String using US_ASCII because
+                 * all ANSI escape sequences can be represented in ASCII using a single byte, and any non-ASCII
+                 * characters will be replaced with '?', which is good, since it means that byte offsets will match


Sorry for commenting out of nowhere, but will that work with non-BMP characters, like emojis?
In Java, "😀".length() is 2 but "😀".getBytes("US-ASCII").length seems to be 1.

No, it wouldn't, but see #143 (comment) where my initial tests seemed to show that emojis are already mangled by the time they get to this method. I will double-check whether that is the case, but probably safer to switch to going char-by-char so that this would work for non-BMP characters assuming everything else worked as well.

Ok, non-BMP characters were broken by the earlier method because markup text was getting inserted in between the two code units of a surrogate pair in some cases. After ffb5045 they should work correctly. Here's a screenshot of running the following pipeline with that commit:

ansiColor { echo '\033[94;1m[ INFO ] 😀😀Wooh\033[0m😀NotCoolored' }

Thanks for the comment!

dblock · 2018-12-16T22:46:04Z

@jglick can I leave all this to you to merge/release/comment?

I haven't been using Jenkins lately ;)

jglick

Release it I guess!

Handle non-ascii characters in annotations correctly

3506cfb

dwnusbaum mentioned this pull request Dec 4, 2018

java.lang.IndexOutOfBoundsException with v0.6.0 #139

Closed

dwnusbaum requested a review from jglick December 4, 2018 19:17

dwnusbaum changed the title ~~Fix #139: Handle non-ASCII characters in annotations correctly~~ Fix #139: Handle non-ASCII characters correctly when annotating text Dec 4, 2018

Update changelog

cc93232

dwnusbaum commented Dec 5, 2018

View reviewed changes

dwnusbaum added 2 commits December 11, 2018 10:37

Simplify fix by removing charset and always using ASCII encoding

e324c82

Restore comment

0ff2f82

dwnusbaum commented Dec 11, 2018

View reviewed changes

vincentisambart reviewed Dec 12, 2018

View reviewed changes

dwnusbaum added 2 commits December 12, 2018 10:38

Support non-BMP characters in annotated text correctly

ffb5045

Improve comment explaining character handling in ColorConsoleAnnotator

bee95aa

dblock assigned jglick Dec 16, 2018

jglick approved these changes Jan 3, 2019

View reviewed changes

jglick mentioned this pull request Jan 3, 2019

Allow escape sequences to span multiple lines and support default fg/bg colors #137

Merged

dwnusbaum merged commit 6abfdd7 into jenkinsci:master Jan 4, 2019

dwnusbaum deleted the non-ascii-fix branch January 4, 2019 16:21

dwnusbaum mentioned this pull request Dec 9, 2019

Ansicolor 0.6 is corrupting input for warnings-ng for MSBuild #154

Closed

Uh oh!

Fix #139: Handle non-ASCII characters correctly when annotating text #143

Fix #139: Handle non-ASCII characters correctly when annotating text #143

Uh oh!

Conversation

dwnusbaum commented Dec 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dblock commented Dec 5, 2018

Uh oh!

dwnusbaum commented Dec 5, 2018

Uh oh!

dwnusbaum Dec 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dwnusbaum Dec 6, 2018

Choose a reason for hiding this comment

Uh oh!

jglick Dec 10, 2018

Choose a reason for hiding this comment

Uh oh!

dwnusbaum Dec 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dwnusbaum Dec 11, 2018

Choose a reason for hiding this comment

Uh oh!

jglick Jan 3, 2019

Choose a reason for hiding this comment

Uh oh!

vincentisambart Dec 12, 2018

Choose a reason for hiding this comment

Uh oh!

dwnusbaum Dec 12, 2018

Choose a reason for hiding this comment

Uh oh!

dwnusbaum Dec 12, 2018

Choose a reason for hiding this comment

Uh oh!

dblock commented Dec 16, 2018

Uh oh!

jglick left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dwnusbaum commented Dec 4, 2018 •

edited

Loading

dwnusbaum Dec 5, 2018 •

edited

Loading

dwnusbaum Dec 11, 2018 •

edited

Loading