Skip to content

Conversation

@dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Dec 4, 2018

Fixes #139.

The cause of the issue is that we were using byte indices as String indices, which does not work when the String contains characters that require than one byte in the target encoding. To fix the issue, this PR keeps track of the number of multi-byte characters we've seen and how many bytes they require, and uses that information to convert the byte count obtained from incoming.getCount() into an index into the original text.

I originally worked on this in #137 but extracted it here so it can be reviewed independently from the less-important issues in that PR.

@dwnusbaum dwnusbaum requested a review from jglick December 4, 2018 19:17
@dwnusbaum dwnusbaum changed the title Fix #139: Handle non-ASCII characters in annotations correctly Fix #139: Handle non-ASCII characters correctly when annotating text Dec 4, 2018
@dblock
Copy link
Member

dblock commented Dec 5, 2018

LGTM, needs CHANGELOG entry

@dwnusbaum
Copy link
Member Author

needs CHANGELOG entry

Good point, I'll add one in a bit.

incoming.write(data[i]);
for (int i = 0; i < s.length(); i++) {
// We write one UTF-16 code unit at a time so that our counting methods can handle multibyte characters correctly.
byte[] chars = String.valueOf(s.charAt(i)).getBytes(charset);
Copy link
Member Author

@dwnusbaum dwnusbaum Dec 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more, why do we even use charset here? The text is already a String, and we only use AnsiOutputStream for its calls to emitHtml in the emitter and discard the bytes it writes to the underlying stream. We'd still need to fix this bug no matter what encoding we used here, buy maybe clearer to hard-code UTF-8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe we need iterate through the Unicode code points rather than the UTF-16 code units in case any characters are not in the BMP? I tried adding emoji to the test case, but it looks like the emoji is converted into the replacement character for unmappable characters somewhere along the line before annotate is called (not sure where), so I don't think it's possible to handle it correctly here (although maybe I was doing something wrong in my test).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this entire (src/main/) patch could be simplified to

-byte[] data = s.getBytes(charset);
+byte[] data = s.getBytes("US-ASCII");

plus deleting charset and the code that sets it. After all, we do not really care what any of the >U+007F characters in the log are for this purpose—we are throwing out the output and just looking for ESC and stuff, so if we send some (byte) '?' through it will not matter.

(Alternately, do not bother getting a byte[] at all, just loop over chars in s and write one byte for each—either the same, if ASCII, or something arbitrary like '?' otherwise.)

Copy link
Member Author

@dwnusbaum dwnusbaum Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it seemed easiest to just switch to using US-ASCII so I did that and added an explanatory comment. Edit: Switched to going char-by-char in ffb5045 so that non-BMP characters are handled correctly.

private static final long serialVersionUID = 1;

private final String colorMapName;
private final String charset;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could cause issues if someone upgraded the plugin to the version with this fix and then downgraded to 0.6.0, but it's not clear to me what the lifecycle of a serialized ConsoleAnnotator would be to know for sure if it would matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only serialized while a console page for a running build is being held open, so I do not think you need to care.

* We only use AnsiHtmlOutputStream for its calls to Emitter.emitHtml when it encounters ANSI escape
* sequences. The output of the stream will be discarded. We encode the String using US_ASCII because
* all ANSI escape sequences can be represented in ASCII using a single byte, and any non-ASCII
* characters will be replaced with '?', which is good, since it means that byte offsets will match

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for commenting out of nowhere, but will that work with non-BMP characters, like emojis?
In Java, "😀".length() is 2 but "😀".getBytes("US-ASCII").length seems to be 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it wouldn't, but see #143 (comment) where my initial tests seemed to show that emojis are already mangled by the time they get to this method. I will double-check whether that is the case, but probably safer to switch to going char-by-char so that this would work for non-BMP characters assuming everything else worked as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, non-BMP characters were broken by the earlier method because markup text was getting inserted in between the two code units of a surrogate pair in some cases. After ffb5045 they should work correctly. Here's a screenshot of running the following pipeline with that commit:

ansiColor {
    echo '\033[94;1m[ INFO ] 😀😀Wooh\033[0m😀NotCoolored'
} 

screen shot 2018-12-12 at 10 36 37

Thanks for the comment!

@dblock
Copy link
Member

dblock commented Dec 16, 2018

@jglick can I leave all this to you to merge/release/comment?

I haven't been using Jenkins lately ;)

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Release it I guess!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

java.lang.IndexOutOfBoundsException with v0.6.0

4 participants